- For Data Migration Tool introduction, please refer to /README.md
- For Installation user guide, please refer to /docs/installation/README.md
- Current release supports both bulk load and incremental data transfer.
- Current release supports Hive to BigQuery translation using BQ Load API.
- Row and column validation of the data in source and target using the Data Validation Tool.
- The tool currently supports loading both partitioned and non-partitioned tables.
- The tool currently supports loading Avro, Parquet, ORC, and Text files.
- If tables from multiple databases are to be migrated, create separate config files for each source schema.
- The current tool version supports only bulk load, which is ideally done only once; note that the transfer takes place by the process that follows the upload of the config files.
- If new tables are added to a batch/config file, ideally create a new config file with a new table list even if the rest of the details remain the same. If this is not followed, multiple data transfer configs may end up responsible for migrating the same tables.
- Note that re-uploading the same configuration file, or using a different config file for the same dataset and tables, may cause data duplication in the tables of the target dataset.
- Please be cautious about the number of parallel processes triggered for data migration. BQ load uses the free shared slot pool for data ingestion, and BigQuery does not make guarantees about the available capacity of this shared pool or the throughput you will see. Customers can choose flat-rate pricing for guaranteed capacity. Once the data is loaded into BigQuery, you are charged for storage.
A service account must be created for use by the BigQuery migration agent on the jump server, with the following permissions, before uploading the configuration file (a custom-role sketch follows this list):
- bigquery.transfers.get
- storage.objects.create
- storage.objects.get
- storage.objects.list
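One possible way to bundle these permissions is a custom IAM role; the sketch below is illustrative only (the role title and description are placeholders, not part of the tool), and granting equivalent predefined roles works just as well.

```json
{
  "title": "DMT Migration Agent (illustrative)",
  "description": "Placeholder custom role sketch bundling the permissions listed above for the migration agent service account.",
  "stage": "GA",
  "includedPermissions": [
    "bigquery.transfers.get",
    "storage.objects.create",
    "storage.objects.get",
    "storage.objects.list"
  ]
}
```

This definition could be supplied to `gcloud iam roles create` and the resulting role bound to the agent's service account.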
The user uploads a configuration JSON file to the data folder of the dmt-config-<customer_name provided in TF> bucket, which initiates data migration.
When the configuration file is uploaded, a file create/update notification is sent to Pub/Sub, which triggers the Hive BQ load DAG. This DAG is responsible for parallelizing the data transfer, based on the user-defined key batchDistribution, using Airflow Dynamic Task Mapping.
The following logging tables are created by the Terraform templates and record activities for the Data Transfer Tool:

| Dataset.Table Name | Migration Phase | Description |
|---|---|---|
| dmt_logs.dmt_dvt_results | Data Validation | Contains data validation results for DDL/SQL/data |
| dmt_logs.dmt_dvt_aggregated_results | Data Validation | Contains aggregated data validation results |
| dmt_logs.bq_load_audit | Data Migration | Contains the status of each table load |
| dmt_logs.hive_pubsub_audit | Incremental Data Load | Contains the status of each incremental file added to GCS |
| dmt_logs.hive_inc_load_table_list | Incremental Data Load | Contains the copy status of each incremental file |
| dmt_logs.hive_inc_bqload_audit | Incremental Data Load | Contains the load status of each incremental file |
Data Transfer sample config file location
| Configuration attribute | Description |
|---|---|
| batchDistribution | Number of tables to migrate in each batch (for each batch there will be a separate data transfer and agent run). Number of batches = Total number of tables / batchDistribution |
| unique_id | Unique name for identifying the data migration. Note: if the user opted for data migration along with schema migration through the tool, this unique id should be the same as the one used in the schema migration config file. |
| type | Type of migration: data |
| source | Source data warehouse: hive |
| dvt_check | Can be Y or N. If set to Y, only tables that have passed the DVT validation check for DDL are picked for data migration. If set to N, all tables from the hive_ddl_metadata table are picked for data migration. |
| transfer_config | Sub-JSON config used to get configuration details for data migration |
| transfer_config:dataSourceId | HIVE |
| transfer_config:rerun_flag | If the migration activity is stopped abruptly, the user can set this to 'Y' to migrate only the tables remaining from the previous run. |
| transfer_config:displayName | Data Transfer Service config display name |
| transfer_config:params:database_type | Hive |
| transfer_config:params:hive_db_name | Source Hive database name that contains the tables to migrate |
| transfer_config:params:gcs_temp_bucket | GCS temp bucket used for interim outputs and calculations. Ideally dmt-temp-<project-id> is created from Terraform. |
| transfer_config:params:hive_gcs_staging_bucket_id | Bucket ID that contains the stored data files. Ideally dmt-teradata-data-<project-id> is created from Terraform. |
| transfer_config:params:hive_gcs_staging_path | Folder name inside hive_gcs_staging_bucket_id that contains the data files |
| transfer_config:params:bq_dataset_id | Target BQ dataset name |
| transfer_config:params:destinationDatasetId | The BigQuery target dataset ID |
| transfer_config:params:project_id | Project number (numeric) to which migration is taking place |
| transfer_config:params:region | Region in which the bucket belongs |
| validation_config | Sub-JSON config used to run DVT |
| validation_config:source_config:source_type | DVT validation source data warehouse: Impala |
| validation_config:source_config:host | Hostname/IP address of on-prem Hive |
| validation_config:source_config:port | Port number of on-prem Impala |
| validation_config:validation_type | DVT validation type. Possible values: data, column |
| validation_config:validation_mode | DVT validation mode. Possible values: gke, cloudrun |
| validation_config:validation_object | DVT validation object: data |
- unique_id: Name to uniquely identify the load job batch. Note: if the user opted for data migration along with schema migration through the tool, this unique id should be the same as the one used in the schema migration config file.
- displayName: Prefix of the BigQuery Data Transfer Service config display name.
- hive_gcs_staging_bucket_id: User-provided bucket name for storing data files extracted from Hive. Ideally dmt-teradata-data-<customer_name provided in TF> is created by the Foundations deployment. If a user wishes to use another bucket, it needs to be created by the user/concerned team before uploading the file.
- bq_dataset_id: Target dataset ID for the particular database migrated from Hive.
- project-id: Numeric project ID to which migration is taking place.
- validation_object: The validation_object key in validation_config needs to be data for data transfer configs.
- validation_type: The validation_type key in validation_config can be column or row depending on the requirement.
- validation_mode: Value can be cloudrun or gke based on where the user wants to conduct data validation.
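As a reading aid, the sketch below shows how these attributes might be assembled into a single configuration file. The nesting is inferred from the attribute paths in the table above, every value is a placeholder to be replaced with your environment's details, and the sample config file referenced earlier remains the authoritative structure.

```json
{
  "unique_id": "<unique_migration_name>",
  "batchDistribution": 10,
  "type": "data",
  "source": "hive",
  "dvt_check": "Y",
  "transfer_config": {
    "dataSourceId": "HIVE",
    "rerun_flag": "N",
    "displayName": "<display_name_prefix>",
    "params": {
      "database_type": "Hive",
      "hive_db_name": "<source_hive_database>",
      "gcs_temp_bucket": "dmt-temp-<project-id>",
      "hive_gcs_staging_bucket_id": "dmt-teradata-data-<project-id>",
      "hive_gcs_staging_path": "<data_files_folder>",
      "bq_dataset_id": "<target_bq_dataset>",
      "destinationDatasetId": "<target_bq_dataset>",
      "project_id": "<numeric-project-number>",
      "region": "<bucket-region>"
    }
  },
  "validation_config": {
    "source_config": {
      "source_type": "Impala",
      "host": "<hive-host-or-ip>",
      "port": "<impala-port>"
    },
    "validation_type": "column",
    "validation_mode": "cloudrun",
    "validation_object": "data"
  }
}
```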
Composer DAG validation_dag will validate the data migrated to BQ. Results for DVT can be viewed in <PROJECT_ID>.dmt_logs.dmt_dvt_results.
The query below can be used to see the results (uncomment the optional filters as needed):

```sql
SELECT *
FROM `<project-id>.dmt_logs.dmt_dvt_results`
WHERE
  target_table_name = '<target_table_name>'
  -- AND validation_type = 'Column'
  -- AND run_id = '<transfer_run_id>'
```
There are two ways to validate data: column or row. A validation_type of column verifies count aggregation on the source and target tables. A validation_type of row verifies a hash of the rows based on the primary key. Hence, for row validations, providing primary_key in the validation_config JSON is necessary.
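For the row-validation case specifically, the validation_config fragment would also carry the primary key. The placement of primary_key alongside the other validation keys below is an assumption based on the field descriptions above, and the column name is a placeholder.

```json
{
  "validation_config": {
    "validation_type": "row",
    "validation_mode": "cloudrun",
    "validation_object": "data",
    "primary_key": "<primary_key_column>"
  }
}
```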
Check the field descriptions section for detailed information on the keys in the configuration file.
| Hive Table Format | Data Type | Issue Description | Buganizer # | Status |
|---|---|---|---|---|
| Text | Numeric | Hive does implicit casting to the precision mentioned in the DDL. When the same data is extracted as a text file, the full precision is written to the file, which conflicts with the precision mentioned in the DDL. This causes errors during BQ load. | [b/259589505](http://b/259589505) | The fix suggested by the BQ support team is to set a feature flag at the project level that handles the rounding off. |
| Avro, Parquet, ORC | Float | The way floating-point precision is stored in Hive differs from the data in the underlying raw files. During BQ load, the float column gains additional precision at the end to make it 16 digits of precision. | [b/260845659](http://b/260845659) | Reported to the Translation team. The team suggests using Numeric instead of Float64 by overwriting the Translation API. |
| Parquet | Array, Map | For complex data types such as array or map in a Parquet file, a struct of array or a struct of key-value pairs is needed. The Translation API does not convert this into the required format, so the data load fails. | [b/260647109](http://b/260647109) | Reported to the Translation team |
| Avro, Parquet, ORC | Timestamp | The BQ Translation API converts Timestamp to Datetime, which causes issues when loading data for source files that carry the schema inside the data files. As the data type in the source is Timestamp, the same is expected in the target. | | The suggested workaround is to use the global config to overwrite the Translation API (automatically taken care of by the framework). |
- DVT cannot perform validation of complex data types such as map, struct, or array.
- To perform schema validation through DVT, each table name needs to be specified.
- DVT fails the check if the data type does not match exactly between Hive and BigQuery. For example, the Translation API converts int32 in Hive to int64 in BQ; DVT marks this check as failed because it expects the same data type in both places.
- DVT requires a connection to the Hive DB. This could be a challenge for customers who are not comfortable with giving external access to their Hive data warehouse.