Demo ETL process using Google Dataflow (Apache Beam)

This Python process does the following (a minimal sketch follows the list):

  • accepts several parameters from the command line
  • reads in XML files from a GCS bucket using a file pattern
  • parses each XML file using ElementTree
  • extracts the desired data and writes it to a BigQuery table
  • writes any errors to a separate BigQuery table via a "side output"
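
The sketch below illustrates the shape of such a pipeline. It is indicative only: the element names (record, id, name), the flag names (--input, --output_table, --error_table), and the BigQuery schemas are assumptions for illustration, not necessarily what this repository uses.

```python
import argparse
import logging
import xml.etree.ElementTree as ET

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions


class ParseXml(beam.DoFn):
    """Parse one XML file: good rows to the main output, errors to a side output."""

    def process(self, readable_file):
        try:
            root = ET.parse(readable_file.open()).getroot()
            # Hypothetical layout: one <record> per row with <id> and <name> children.
            for record in root.iter('record'):
                yield {
                    'id': record.findtext('id'),
                    'name': record.findtext('name'),
                }
        except Exception as err:
            # Malformed files land in the error table instead of failing the job.
            yield beam.pvalue.TaggedOutput('errors', {
                'file': readable_file.metadata.path,
                'error': str(err),
            })


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True,
                        help='GCS file pattern, e.g. gs://bucket/xml/*.xml')
    parser.add_argument('--output_table', required=True)
    parser.add_argument('--error_table', required=True)
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        results = (
            p
            | 'MatchFiles' >> fileio.MatchFiles(known_args.input)
            | 'ReadMatches' >> fileio.ReadMatches()
            | 'ParseXml' >> beam.ParDo(ParseXml()).with_outputs('errors', main='rows')
        )
        results.rows | 'WriteRows' >> beam.io.WriteToBigQuery(
            known_args.output_table, schema='id:STRING,name:STRING')
        results.errors | 'WriteErrors' >> beam.io.WriteToBigQuery(
            known_args.error_table, schema='file:STRING,error:STRING')


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    run()
```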

If you have thousands of XML files to parse, Dataflow will scale up the number of "worker" nodes and complete the job more quickly than a single basic node would. No extra coding is required for this autoscaling; it happens automatically, which is one of Dataflow's appealing features.
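
For reference, a pipeline like this is typically submitted to Dataflow from the command line. The script name and the GCS/BigQuery paths below are placeholders; --runner, --project, --region, --temp_location, and --max_num_workers are standard Dataflow pipeline options (--max_num_workers only caps the autoscaling, which is on by default for batch jobs).

```sh
python etl_pipeline.py \
  --input "gs://my-bucket/xml/*.xml" \
  --output_table my-project:my_dataset.records \
  --error_table my-project:my_dataset.errors \
  --runner DataflowRunner \
  --project my-project \
  --region us-central1 \
  --temp_location gs://my-bucket/tmp \
  --max_num_workers 50
```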

Visualisation from the GCP Dataflow console:

[Image: Dataflow pipeline graph visualisation]
