impala_and_airflow_practise

This exercise has an Airflow job, airflow_code.py, which defines three DAGs:

setup - used to set up the Sqoop jobs and the database.

airflow trigger_dag setup

schedule - used to schedule the data transfer to Hive and the report generation. This DAG is scheduled to run daily and does not have to be triggered manually; however, the setup DAG must be run before it runs.

cleanup - used to clean up the Sqoop jobs, database tables and HDFS folders.

airflow trigger_dag cleanup
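
As a rough sketch (not copied from the repository), the three DAGs could be declared in airflow_code.py roughly as follows. The dag ids match the trigger commands above; default_args, the start date and the daily interval are assumptions:

from datetime import datetime
from airflow import DAG

# Assumed defaults - the actual owner and start_date live in airflow_code.py
default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}

# setup and cleanup are only run via 'airflow trigger_dag', so no schedule
setup_dag = DAG("setup", default_args=default_args, schedule_interval=None)
cleanup_dag = DAG("cleanup", default_args=default_args, schedule_interval=None)

# schedule runs daily without manual triggering
schedule_dag = DAG("schedule", default_args=default_args, schedule_interval="@daily")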

The 'schedule' DAG has 4 tasks (a sketch of one possible set of task definitions follows the list):

  1. mysql_to_hive - copies the user and activity_log tables from MySQL to Hive.
  2. csv_to_hive - loads the user upload dump CSV to HDFS and creates an external table referencing the HDFS location.
  3. user_total - generates the user_total report, which includes total users and new users added.
  4. user_report - generates a report containing user-related metrics such as activity, number of uploads/deletes etc.

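A minimal sketch of how these four tasks might be defined, assuming plain BashOperator tasks attached to the schedule DAG from the sketch above; the operators, Sqoop job names, file paths and HQL scripts below are illustrative assumptions and may differ from what airflow_code.py actually uses:

from airflow.operators.bash_operator import BashOperator

# Hypothetical Sqoop job names - the real jobs are created by the setup DAG
mysql_to_hive = BashOperator(
    task_id="mysql_to_hive",
    bash_command="sqoop job --exec import_user && sqoop job --exec import_activity_log",
    dag=schedule_dag,
)

# Hypothetical CSV and HDFS paths for the user upload dump
csv_to_hive = BashOperator(
    task_id="csv_to_hive",
    bash_command="hdfs dfs -put -f user_upload_dump.csv /data/uploads/",
    dag=schedule_dag,
)

# Hypothetical HQL scripts for the two reports
user_total = BashOperator(
    task_id="user_total",
    bash_command="hive -f user_total.hql",
    dag=schedule_dag,
)

user_report = BashOperator(
    task_id="user_report",
    bash_command="hive -f user_report.hql",
    dag=schedule_dag,
)
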
The task flow is as follows:

user_total.set_upstream(mysql_to_hive)
user_total.set_upstream(csv_to_hive)
user_report.set_upstream(mysql_to_hive)
user_report.set_upstream(csv_to_hive)

This ensures that both 'mysql_to_hive' and 'csv_to_hive' complete before report generation starts. The two report generation tasks may run in parallel once the loading tasks have completed.
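
Airflow also allows the same dependencies to be written with the bitshift operators, which is equivalent to the set_upstream calls above:

[mysql_to_hive, csv_to_hive] >> user_total
[mysql_to_hive, csv_to_hive] >> user_report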
