This exercise has an Airflow job, airflow_code.py, which creates three possible DAGs:
setup - used to set up the Sqoop jobs and the database. Trigger it manually with:
airflow trigger_dag setup
schedule - used to schedule the data transfer to Hive and the report generation. This DAG runs daily and does not have to be triggered manually; however, the setup DAG must be run before it.
cleanup - used to clean up the Sqoop jobs, database tables and HDFS folders. Trigger it manually with:
airflow trigger_dag cleanup
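
A minimal sketch of how airflow_code.py might declare the three DAGs, assuming the Airflow 1.x API implied by the trigger_dag commands above; the default_args, start date, and shell commands are illustrative placeholders, not the exercise's actual values:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2021, 1, 1),  # placeholder start date
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# 'setup' and 'cleanup' only run when triggered manually, so they get no
# schedule; 'schedule' runs once a day as described above.
setup_dag = DAG("setup", default_args=default_args, schedule_interval=None)
schedule_dag = DAG("schedule", default_args=default_args, schedule_interval="@daily")
cleanup_dag = DAG("cleanup", default_args=default_args, schedule_interval=None)

# Placeholder tasks: the real 'setup' DAG creates the Sqoop jobs and database
# objects, and 'cleanup' removes them along with the HDFS folders.
create_sqoop_jobs = BashOperator(
    task_id="create_sqoop_jobs",
    bash_command="echo 'sqoop job --create ...'",  # hypothetical command
    dag=setup_dag,
)
drop_artifacts = BashOperator(
    task_id="drop_artifacts",
    bash_command="echo 'drop tables and delete HDFS folders'",  # hypothetical command
    dag=cleanup_dag,
)
```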
The 'schedule' DAG has 4 tasks (sketched in code after this list):
- mysql_to_hive - copies the user and activity_log tables from MySQL to Hive
- csv_to_hive - loads the user upload dump CSV into HDFS and creates an external table referencing the HDFS location
- user_total - generates the user_total report, which includes the total number of users and the number of new users added
- user_report - generates a report containing user-related metrics such as activity and the number of uploads/deletes
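
A sketch of how these four tasks might be defined inside the 'schedule' DAG, continuing from the schedule_dag object above; the operator choices, Sqoop job names, HDFS paths, and HQL statements are assumptions for illustration, not the exercise's actual code:

```python
from airflow.operators.bash_operator import BashOperator
from airflow.operators.hive_operator import HiveOperator

# Run the Sqoop jobs created by the 'setup' DAG to import both MySQL tables.
mysql_to_hive = BashOperator(
    task_id="mysql_to_hive",
    bash_command="sqoop job --exec import_user && sqoop job --exec import_activity_log",  # hypothetical job names
    dag=schedule_dag,
)

# Push the CSV dump into HDFS and point an external Hive table at that folder.
csv_to_hive = BashOperator(
    task_id="csv_to_hive",
    bash_command=(
        "hdfs dfs -mkdir -p /user/hive/uploads && "  # hypothetical HDFS path
        "hdfs dfs -put -f /data/user_upload_dump.csv /user/hive/uploads/ && "
        "hive -e \"CREATE EXTERNAL TABLE IF NOT EXISTS uploads (user_id INT, file_name STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/uploads'\""
    ),
    dag=schedule_dag,
)

# Report tasks; the HQL below is a placeholder for the exercise's real queries.
user_total = HiveOperator(
    task_id="user_total",
    hql="INSERT OVERWRITE TABLE user_total SELECT COUNT(*) FROM user",
    dag=schedule_dag,
)
user_report = HiveOperator(
    task_id="user_report",
    hql=(
        "INSERT OVERWRITE TABLE user_report "
        "SELECT u.id, COUNT(a.id) FROM user u "
        "JOIN activity_log a ON u.id = a.user_id GROUP BY u.id"
    ),
    dag=schedule_dag,
)
```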
The task flow is as follows:
user_total.set_upstream(mysql_to_hive)
user_total.set_upstream(csv_to_hive)
user_report.set_upstream(mysql_to_hive)
user_report.set_upstream(csv_to_hive)
This ensures that both 'mysql_to_hive' and 'csv_to_hive' complete before report generation starts. The two report generation tasks may run in parallel once the loading tasks have completed.
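
As an aside, the same dependencies can be expressed with Airflow's bitshift composition (available since Airflow 1.8); this is an equivalent alternative to the set_upstream calls, not the exercise's code:

```python
# Both loading tasks must finish before either report task starts; the two
# report tasks have no dependency on each other, so they may run in parallel.
[mysql_to_hive, csv_to_hive] >> user_total
[mysql_to_hive, csv_to_hive] >> user_report
```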