Data Migration Tool Introduction

Why you want to use Data Migration Tool

As customers migrate from on-premises enterprise data warehouses to the Cloud, and specifically to BigQuery, there is no single end-to-end migration path or solution that makes the move easy and scalable. Several tools are available for SQL translation, data transfer and data validation, but they have to be configured, run and managed separately. All of this adds time, complexity and risk to the data warehouse migration effort.

What Data Migration Tool can provide you

The goal of this tool is to provide an end-to-end automated solution for customers to perform data transfer, and to translate and validate the DDL, DML and SQL queries that support data warehouse migrations to BigQuery. The solution is built using Google Cloud tools and services such as the Data Validation Tool, the BQ Translation Service, the BigQuery Data Transfer Service and Cloud Composer orchestrated DAGs. It should help accelerate migrations, reduce risk and improve time to value.

Supported Sources

| Sources | Translation | Schema Migration | Data Transfer (Bulk Load) | Incremental Load | Validation |
|---|---|---|---|---|---|
| Teradata | Yes | Yes | Yes | No | Yes |
| Hive | Yes | Yes | Yes | Yes | Yes |
| Redshift | Yes | Yes | Yes | No | Yes |
| Oracle | Yes | Yes | No | No | Yes |

User Story

The Data Migration Tool (DMT) is an end-to-end orchestrated solution for migrating a data warehouse to BigQuery. It helps users translate, create and validate tables and transfer data from on-premises data warehouses such as Teradata and Hive to native GCP BigQuery tables. These operations include:

  1. Automated extraction of DDLs
  2. DDL translation, schema migration and schema validation.
  3. Data migration and validation
  4. DML translation and dry run validation
  5. SQL translation and validation (#5 can be triggered after #3 only.)
  6. Looker Studio dashboard templates for monitoring the migration steps listed above.

Extract and translate DDL from on-prem data warehouses into BigQuery DDL syntax.

  1. Users copy their on-premises DDL files (manually or via an automated method) and a config file into a GCS bucket.
  2. The config file triggers Pub/Sub > Cloud Run > Cloud Composer > BQ translation API for DDL translation.
  3. The translated DDL lands in the GCS bucket.
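
A minimal sketch of step 1, assuming the google-cloud-storage Python client; the project, bucket and object names are placeholders, not the tool's required layout:

```python
# Copy an extracted DDL file and its config file into the DMT input bucket so
# that the resulting OBJECT_FINALIZE event starts the translation pipeline.
from google.cloud import storage

client = storage.Client(project="my-project")          # assumed project id
bucket = client.bucket("dmt-teradata-ddl-input")        # assumed input bucket

# Upload the DDL file first, then the config file that triggers the run.
bucket.blob("ddl/orders.sql").upload_from_filename("orders.sql")
bucket.blob("config/ddl_config.json").upload_from_filename("ddl_config.json")
```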

Create tables in BigQuery.

  1. Cloud Composer will create the BQ tables using the translated DDLs; a conceptual sketch follows below.
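
A minimal sketch of what the Composer task does conceptually, assuming the translated DDL sits in GCS; bucket and object names are placeholders:

```python
# Read a translated DDL file from GCS and execute it against BigQuery.
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

ddl_sql = (
    storage_client.bucket("dmt-translated-ddl")   # assumed output bucket
    .blob("ddl/orders.sql")
    .download_as_text()
)

# CREATE TABLE statements run like any other query job.
bq_client.query(ddl_sql).result()
```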

Validate Schema in BigQuery

  1. A Composer DAG will validate schema in BQ.
  2. Data Validation results will be written into BQ audit logging tables.

Transfer data from on-premise to BigQuery.

  1. Choose one way to load your data into BQ. The Data Migration Tool provides a DTS-based transfer service for Teradata. For Hive, the current prerequisite is that the data files are already present in GCS buckets; Composer DAGs then load that data into BQ (see the sketch below).
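
A minimal sketch of the Hive path, assuming the data files are already staged in GCS as Parquet; the GCS path, dataset and table names are placeholders:

```python
# Bulk-load staged Parquet files from GCS into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://dmt-hive-data/orders/*.parquet",    # assumed staging location
    "my-project.my_dataset.orders",           # assumed target table
    job_config=job_config,
)
load_job.result()  # wait for the bulk load to finish
```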

Translate SQL SELECT statements into BigQuery syntax.

  1. Users copy their on-premises SQL files and a config file into a GCS bucket.
  2. The config file automatically triggers Pub/Sub > Cloud Run > Cloud Composer > BQ translation API for SQL translation.
  3. The translated SQL and/or failed translation files land in the GCS bucket.

Validate SQL SELECT statements in BigQuery.

  1. Data should be in BQ before this step.
  2. A Composer DAG will validate the SELECT statement in BQ.
  3. Validation results will be written into BQ.

Translate DML statements into BigQuery syntax.

  1. There is no dependency on the step “Translate SQL SELECT statements into BQ syntax”.
  2. This is a dry-run feature that translates the provided DML queries from the Teradata/Oracle/Redshift dialect to the BigQuery dialect, similar to the DDL and SQL translations.

Validate DML statements in BigQuery.

  1. After the DML statements are translated into the BQ dialect, the tool uses the BigQuery dry-run feature to test the execution of the DML files (dry run only) and confirm that the translated DML queries would run without errors on the BQ side.
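
A minimal sketch of such a dry-run check with the BigQuery Python client; the DML statement below is a placeholder:

```python
# Submit the translated DML with dry_run=True so BigQuery parses and plans the
# statement without executing it or touching any data.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

translated_dml = "UPDATE my_dataset.orders SET status = 'SHIPPED' WHERE id = 1"

try:
    job = client.query(translated_dml, job_config=job_config)
    print(f"Dry run OK, would process {job.total_bytes_processed} bytes")
except Exception as exc:  # syntax or semantic errors surface here
    print(f"Dry run failed: {exc}")
```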

Design

The following design proposes how to orchestrate an end-to-end pipeline to run Translation, Migration and Validation for enterprise data warehouse migration to BigQuery.

Overview

(Architecture overview diagram)

Controller

(Controller flow diagram)

DAG Overview

(DAG overview diagram)

Infrastructure/Services

The following products/tools are used to build an end-to-end orchestrated pipeline that both migrates a data warehouse and validates that the migration was successful:

| Activity | Component |
|---|---|
| Orchestration | Cloud Composer |
| Event-based Triggers | Cloud Run, Cloud Pub/Sub |
| SQL Translation | BigQuery Batch SQL Translator (BQ Translation Service) |
| SQL Validation | Data Validation Tool |
| Data Transfer | BigQuery Data Transfer Service (Teradata); Composer DAG GCS loads (Hive) |
| Infrastructure | Terraform/Terragrunt |
| Assets Storage | BigQuery, Cloud Storage |
| Secrets | Secret Manager |
| Results/Summary | Looker Studio |

Cloud Composer

Cloud Composer / Airflow connects the different components together in DAGs; a file sensor can react to new file arrival events from GCS.
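
A minimal sketch of a file-arrival sensor inside a Composer DAG; the bucket and object names are placeholders, and the actual DMT DAGs are triggered through Pub/Sub and Cloud Run rather than defined exactly like this:

```python
# Wait for a config file to appear in GCS before continuing the DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    "wait_for_config_file",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    wait_for_config = GCSObjectExistenceSensor(
        task_id="wait_for_config",
        bucket="dmt-config-bucket",        # assumed bucket
        object="config/ddl_config.json",   # assumed object
        poke_interval=60,
    )
```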

The Composer instance should be created with the following characteristics:

  • Version 2
  • Private endpoint
  • Custom service account (not default compute service account)
  • Workload Identity binding GCP service account to ‘default’ GKE namespace

Refer to this example of using the Google-provided Composer V2 Terraform module as a starting point.

Cloud Run

Cloud Run will intercept GCS OBJECT_FINALIZE events (delivered via Pub/Sub push) and then decide whether to trigger Composer DAGs; a sketch of the request handling follows the list below. The Cloud Run endpoint should be created with the following characteristics:

  • Triggered via Cloud Storage Object Finalize events
  • Private network; Internal traffic only
  • Custom service account (not default compute service account and not the same as custom Composer service account)
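
A minimal sketch of the endpoint's request handling, assuming a Flask service on Cloud Run; the DAG-trigger call itself is represented by a hypothetical helper, since in practice it would go through the Airflow REST API on the Composer webserver:

```python
# Receive the Pub/Sub push envelope for a GCS OBJECT_FINALIZE event and decide
# whether the finalized object should trigger a Composer DAG.
import base64
import json

from flask import Flask, request

app = Flask(__name__)


def trigger_composer_dag(bucket: str, name: str) -> None:
    """Hypothetical helper: call the Airflow REST API on the Composer webserver."""
    pass


@app.route("/", methods=["POST"])
def handle_gcs_event():
    envelope = request.get_json()
    message = envelope["message"]

    if message.get("attributes", {}).get("eventType") != "OBJECT_FINALIZE":
        return ("ignored", 200)

    # The Pub/Sub payload carries the GCS object metadata as base64-encoded JSON.
    event = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    bucket, name = event["bucket"], event["name"]

    if name.endswith(".json"):  # assumed convention: config files trigger DAGs
        trigger_composer_dag(bucket, name)
    return ("ok", 200)
```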

Cloud Pub/Sub

Cloud Pub/Sub will serve as the message bus for the OBJECT_FINALIZE Cloud Storage events that are ultimately going to trigger Composer DAGs.

Pub/Sub invokes the Cloud Run endpoint, which ultimately triggers Composer DAGs.

Cloud Storage

Cloud Storage will hold the following assets:

  • Input DDL and SQL files (to be translated into BigQuery SQL)
  • Configuration trigger file (holds settings for end-to-end translation pipeline)
  • Output BigQuery SQL files (Output from Batch SQL translator)

Batch SQL Translator

The Batch SQL Translator service will convert a customer’s source SQL assets (DDLs, queries, UDFs, stored procedures) into BigQuery SQL assets (see the sketch after the dialect list below).

The Batch SQL Translator tool supports translation of the following SQL dialects:

  • Amazon Redshift SQL
  • Apache HiveQL
  • Apache Spark SQL
  • Azure Synapse T-SQL
  • IBM Netezza SQL/NZPLSQL
  • Oracle SQL, PL/SQL, Exadata
  • Snowflake SQL
  • SQL Server T-SQL
  • Teradata SQL
  • Vertica SQL
  • Presto SQL
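
A minimal sketch of submitting a batch translation job directly with the BigQuery Migration Python client, assuming a Teradata-to-BigQuery translation; the project id and GCS paths are placeholders:

```python
# Create a batch SQL translation workflow (Teradata -> BigQuery).
from google.cloud import bigquery_migration_v2

client = bigquery_migration_v2.MigrationServiceClient()
parent = "projects/my-project/locations/us"          # assumed project/location

source_dialect = bigquery_migration_v2.Dialect()
source_dialect.teradata_dialect = bigquery_migration_v2.TeradataDialect(
    mode=bigquery_migration_v2.TeradataDialect.Mode.SQL
)
target_dialect = bigquery_migration_v2.Dialect()
target_dialect.bigquery_dialect = bigquery_migration_v2.BigQueryDialect()

translation_config = bigquery_migration_v2.TranslationConfigDetails(
    gcs_source_path="gs://dmt-input/ddl",             # assumed input path
    gcs_target_path="gs://dmt-output/ddl",            # assumed output path
    source_dialect=source_dialect,
    target_dialect=target_dialect,
)

task = bigquery_migration_v2.MigrationTask(
    type_="Translation_Teradata2BQ",
    translation_config_details=translation_config,
)
workflow = bigquery_migration_v2.MigrationWorkflow(display_name="dmt-demo-translation")
workflow.tasks["translation-task"] = task

response = client.create_migration_workflow(
    request={"parent": parent, "migration_workflow": workflow}
)
print(response.name)
```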

BigQuery

The Data Validation Tool and the Composer DAGs need BigQuery datasets and tables in which to record the results of their tests.

Data Validation Tool

The Data Validation Tool (DVT) is an open-source Python CLI tool that compares source tables with the equivalent migrated tables in BigQuery (a sample invocation follows the list below).

DVT (as it is used for this SQL migration service) has the following infrastructure dependencies:

  • Connector for source database
  • Connector for BigQuery as a target database
  • BigQuery dataset/tables for writing validation results
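
A minimal sketch of a DVT column validation, roughly as a Composer validation task might invoke it; the connection names, tables and results table are placeholders, and the exact flags should be checked against the DVT documentation for your version:

```python
# Run a DVT column validation and write results to a BigQuery results table.
import subprocess

subprocess.run(
    [
        "data-validation", "validate", "column",
        "--source-conn", "teradata_conn",                  # assumed source connection
        "--target-conn", "bq_conn",                        # assumed BigQuery connection
        "--tables-list", "sales.orders=my_dataset.orders",
        "--bq-result-handler", "my-project.dmt_logs.dvt_results",  # assumed results table
    ],
    check=True,
)
```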

Looker Studio

Results of operations and statistics are curated in Looker Studio.

A Looker Studio template is provided; customers can make a copy, switch the data source, and point it at their own project -> dataset -> logging tables.

Operations include:

  • Summary result
  • DDL Translation results
  • Schema creation & validation results
  • Data migration & validation results
  • SQL Translation & validations results
  • DML Translation & validations results

Statistics include:

  • Total counts of successes and failures
  • Filters on error details, tables, SQL, DDL and DML files

Appendix

DAG Parallelization & Composer Sizing

Airflow configuration overrides can help optimize parallelization for the schema and validation DAGs. Please refer to the information below to size your Composer environment appropriately for the required scaling and the number of files to be translated and validated.

Composer Optimization

Airflow Documentation

Reference URL

Composer 1.0 Sizing

We recommend setting up a Cloud Composer cluster with a relatively small number of powerful machines, keeping in mind that if the number of machines is too small, the failure of one machine will impact the cluster severely.


| Airflow Configuration Override Parameter | Default Value | Recommended Value |
|---|---|---|
| [scheduler] parsing_processes | 2 | >= number_of_cores_per_node (e.g. 2 x number_of_cores_per_node) |
| [celery] worker_concurrency | 12 * number of workers' CPUs | 6-8 * cores_per_node or per 3.75 GB RAM (ignored if worker_autoscale is set) |
| [core] parallelism | 32 | [worker_concurrency] * number of workers |
| [core] max_active_tasks_per_dag | 100 | Equal to [parallelism] |
| [core] default_pool_task_slot_count | 128 | Equal to [parallelism] |