Data Migration Tool User Guide - Data Transfer

About the service

Features

  • Current release supports both bulk load and incremental data transfer.
  • Current release supports Hive to BigQuery translation using BQ Load API.
  • Row and column validation of the data in source and target using the Data Validation Tool (DVT).
  • The tool currently supports loading both partitioned and non-partitioned tables.
  • The tool currently supports loading Avro, Parquet, ORC, and Text files.

Additional Notes

  • If tables from multiple databases are to be migrated, create separate config files for each source schema.
  • Current tool version supports only bulk load, which is ideally done only once; note that the transfer takes place via the process that follows upload of the config files.
  • If new tables are added to a batch/config file, ideally create a new config file with a new table list even if the rest of the details remain the same. If this is not followed, multiple data transfer configs may end up responsible for migrating the same tables.
  • Note that re-uploading the same configuration file, or using a different config file for the same dataset and tables, may create duplicate data in the tables of the target dataset.
  • Please be cautious about the number of parallel processes triggered for Data Migration. BQ load uses the free shared slot pool for data ingestion; BigQuery does not make guarantees about the available capacity of this shared pool or the throughput you will see. Customers can choose flat-rate pricing for guaranteed capacity. Once the data is loaded into BigQuery, you are charged for storage.

Prerequisites

  • A service account must be created for use by the BigQuery migration agent on the jump server, with the correct permissions granted before uploading the configuration file, as below:

    • bigquery.transfers.get
    • storage.objects.create
    • storage.objects.get
    • storage.objects.list

Prepare configuration file for BQ Load

The user uploads a JSON configuration file to the data folder of the dmt-config-<customer_name provided in TF> bucket, which initiates the data migration.

When the configuration file is uploaded, a file create/update notification is sent to Pub/Sub, which triggers the Hive BQ load DAG. This DAG parallelizes the data transfer based on the user-defined key batchDistribution, using Airflow Dynamic Task Mapping.
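
For orientation, the top-level shape of this configuration file is sketched below. The keys come from the Field Descriptions section further down; the values are illustrative placeholders only, and the sample config file linked under the Configuration file section remains the authoritative reference.

```json
{
  "batchDistribution": 10,
  "unique_id": "hive_bq_load_demo",
  "type": "data",
  "source": "hive",
  "dvt_check": "Y",
  "transfer_config": {},
  "validation_config": {}
}
```

The transfer_config and validation_config sub-objects are expanded in the Field Descriptions section and in the fuller sketch that follows it.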

Hive Data Migration Flow

(Diagram: Hive data migration flow)

Audit Tables for Data Migration & Validation

The logging tables listed below are created by the Terraform templates and record activities for the Data Transfer Tool:

| Dataset.Table Name | Migration Phase | Description |
| --- | --- | --- |
| dmt_logs.dmt_dvt_results | Data Validation | Contains data validation results for DDL/SQL/data |
| dmt_logs.dmt_dvt_aggregated_results | Data Validation | Contains aggregated data validation results |
| dmt_logs.bq_load_audit | Data Migration | Contains the status of each table load |
| dmt_logs.hive_pubsub_audit | Incremental Data Load | Contains the status of each incremental file added to GCS |
| dmt_logs.hive_inc_load_table_list | Incremental Data Load | Contains the copy status of each incremental file |
| dmt_logs.hive_inc_bqload_audit | Incremental Data Load | Contains the status of each incremental file load |

Configuration file

Data Transfer sample config file location

Field Descriptions

| Configuration attribute | Description |
| --- | --- |
| batchDistribution | Number of tables to migrate in each batch (for each batch there will be a separate data transfer and agent run). Number of batches = total number of tables / batchDistribution; for example, 100 tables with a batchDistribution of 25 results in 4 batches. |
| unique_id | A unique name for identifying the data migration. Note: if the user opted for data migration along with schema migration through the tool, this unique id should be the same as the one used in the schema migration config file. |
| type | Type of migration: data |
| source | Source data warehouse: hive |
| dvt_check | Y or N. If set to Y, only tables that have passed the DVT validation check for DDL are picked for data migration. If set to N, all tables from the hive_ddl_metadata table are picked. |
| transfer_config | Sub JSON config used to get configuration details for data migration |
| transfer_config:dataSourceId | HIVE |
| transfer_config:rerun_flag | If the migration activity is stopped abruptly, the user can set this to 'Y' to migrate only the tables left over from the previous run. |
| transfer_config:displayName | Data transfer service config display name |
| transfer_config:params:database_type | Hive |
| transfer_config:params:hive_db_name | Source Hive database name that contains the tables to migrate |
| transfer_config:params:gcs_temp_bucket | GCS temp bucket used for interim outputs and calculations. Ideally dmt-temp-<project-id>, created by Terraform. |
| transfer_config:params:hive_gcs_staging_bucket_id | Bucket id that contains the stored data files. Ideally dmt-teradata-data-<project-id>, created by Terraform. |
| transfer_config:params:hive_gcs_staging_path | Folder name inside hive_gcs_staging_bucket_id that contains the data files. |
| transfer_config:params:bq_dataset_id | Target BQ dataset name. |
| transfer_config:params:destinationDatasetId | The BigQuery target dataset id. |
| transfer_config:params:project_id | Project number (numeric) to which the migration is taking place |
| transfer_config:params:region | Region in which the bucket belongs |
| validation_config | Sub JSON config used to run DVT |
| validation_config:source_config:source_type | DVT validation source data warehouse: Impala |
| validation_config:source_config:host | Hostname / IP address of on-prem Hive |
| validation_config:source_config:port | Port number of on-prem Impala |
| validation_config:validation_type | DVT validation type. Possible values: data, column |
| validation_config:validation_mode | DVT validation mode. Possible values: gke, cloudrun |
| validation_config:validation_object | DVT validation object: data |

unique_id: name to uniquely identify the load job batch.

Note: If the user opted for data migration, along with schema migration through the tool, this unique id should be the same as the one used in the schema migration config file.

displayName: prefix of the BigQuery Data Transfer Service config display name

hive_gcs_staging_bucket_id: user provided bucket name for storing data files extracted from Hive. Ideally dmt-teradata-data-<customer_name provided in TF> is created using the Foundations deployment. If a user wishes to use another bucket, it needs to be created by the user/concerned team before uploading the file.

bq_dataset_id: Target dataset id for the particular database migrated from HIVE.

project_id: numeric project id (project number) to which the migration is taking place

validation_object: the validation_object key in validation_config needs to be data for data transfer configs.

validation_type: the validation_type key in validation_config can be column or row depending on the requirement.

validation_mode: value can be 'cloudrun' or 'gke' based on where the user wants to conduct data validation.
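
Putting the fields above together, the transfer_config and validation_config sub-objects of a data transfer config might look like the sketch below. This is an illustration assembled from the Field Descriptions, not a verbatim copy of the sample config file; angle-bracket values are placeholders, and the exact nesting of keys should be confirmed against the sample config file linked above.

```json
{
  "transfer_config": {
    "dataSourceId": "HIVE",
    "displayName": "<display name prefix>",
    "rerun_flag": "N",
    "params": {
      "database_type": "Hive",
      "hive_db_name": "<source hive database>",
      "gcs_temp_bucket": "dmt-temp-<project-id>",
      "hive_gcs_staging_bucket_id": "dmt-teradata-data-<project-id>",
      "hive_gcs_staging_path": "<folder containing data files>",
      "bq_dataset_id": "<target dataset>",
      "destinationDatasetId": "<target dataset>",
      "project_id": "<numeric project number>",
      "region": "<bucket region>"
    }
  },
  "validation_config": {
    "source_config": {
      "source_type": "Impala",
      "host": "<on-prem hive host or IP>",
      "port": "<impala port>"
    },
    "validation_type": "column",
    "validation_mode": "cloudrun",
    "validation_object": "data"
  }
}
```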

Validation of data in BigQuery

Composer DAG validation_dag will validate the data migrated to BQ. Results for DVT can be viewed in <PROJECT_ID>.dmt_logs.dmt_dvt_results.

The query below can be used to see the results:

```sql
SELECT * FROM `<project-id>.dmt_logs.dmt_dvt_results`
WHERE
  -- validation_type = 'Column'
  -- run_id = '<transfer_run_id>'
  target_table_name = '<target_table_name>'
```

There are two ways to validate data: column or row. A validation_type of column verifies count aggregations on the source and target tables. A validation_type of row verifies a hash of the rows based on the primary key. Hence, for row validations, providing primary_key in the validation_config JSON is necessary.
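
Because row validation hashes rows on the primary key, the validation_config must also carry a primary_key entry for row validations. A minimal sketch, assuming the key sits alongside the other validation_config fields (the angle-bracket value and exact placement are illustrative; confirm against the sample config file):

```json
{
  "validation_config": {
    "validation_type": "row",
    "validation_mode": "cloudrun",
    "validation_object": "data",
    "primary_key": "<primary key column>"
  }
}
```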

Check the field descriptions section for detailed information on the keys in the configuration file.

Known Issues

| Hive Table Format | Data Type | Issue Description | Buganizer # | Status |
| --- | --- | --- | --- | --- |
| Text | Numeric | Hive does implicit casting to the precision mentioned in the DDL. When the same data is extracted as a text file, the entire precision is loaded into the file, which conflicts with the precision mentioned in the DDL. This causes errors during BQ load. | [b/259589505](http://b/259589505) | Fix suggested by the BQ support team is to set a feature flag at the project level which will handle the rounding off. |
| Avro, Parquet, ORC | Float | The way floating-point precision is stored in Hive differs from the data in the underlying raw files. During BQ load, the float column adds additional precision at the end to make it 16 digits of precision. | [b/260845659](http://b/260845659) | Reported to the Translation team. The team suggests using Numeric instead of Float64 by overwriting the translation API. |
| Parquet | Array, Map | For complex data types such as array or map in Parquet files, a struct of array or struct of key-value pairs is needed. The Translation API does not convert these into the required format, so the data load fails. | [b/260647109](http://b/260647109) | Reported to the Translation team. |
| Avro, Parquet, ORC | Timestamp | The BQ translator API converts Timestamp to Datetime, which causes issues loading data for source files that carry the schema inside the data files. As the data type in the source is Timestamp, the same is expected in the target. | | Suggested workaround is to use the global config to overwrite the translation API (automatically taken care of by the framework). |

Known Issues - DVT

  • DVT cannot perform validation of complex data types such as map, struct, or array.
  • To perform schema validation through DVT, each table name needs to be specified.
  • DVT fails the check if the data types do not match exactly between Hive and BigQuery. For example, the Translation API converts int32 in Hive to int64 in BQ; DVT marks this check as failed because it expects the same data type in both places.
  • DVT requires a connection to the Hive DB. This could be a challenge for customers who are not comfortable with giving external access to their Hive data warehouse.