Data Migration Tool User Guide - Data Transfer

About the service

Features

  • Current release supports both bulk load and incremental data transfer.
  • Current release supports Hive to BigQuery translation using BQ Load API.
  • Row and column validation of the data in source and target using the Data Validation Tool (DVT).
  • The tool currently supports loading both partitioned and non-partitioned tables.
  • The tool currently supports loading Avro, Parquet, ORC, and Text files.

Additional Notes

  • If tables from multiple databases are to be migrated, create separate config files for each source schema.
  • Current tool version supports only bulk load, which is ideally done only once; note that the transfer takes place via the process that follows upload of the config files.
  • If new tables are added to a batch/config file, ideally create a new config file with a new table list even if the rest of the details remain the same. If this is not followed, multiple data transfer configs may end up responsible for migrating the same tables.
  • Note that re-uploading the same configuration file, or using a different config file for the same dataset and tables, may create duplicate data in the tables of the target dataset.
  • Please be cautious about the number of parallel processes triggered for Data Migration. BQ load uses the free shared slot pool for data ingestion; BigQuery does not make guarantees about the available capacity of this shared pool or the throughput you will see. Customers can choose flat-rate pricing for guaranteed capacity. Once the data is loaded into BigQuery, you are charged for storage.

Prerequisites

  • A service account must be created for use by the BigQuery migration agent on the jump server, with the correct permissions granted before uploading the configuration file, as below:

    • bigquery.transfers.get
    • storage.objects.create
    • storage.objects.get
    • storage.objects.list

Prepare configuration file for BQ Load

The user uploads a JSON configuration file to the data folder of the dmt-config-<customer_name provided in TF> bucket, which initiates the data migration.

When the configuration file is uploaded, a file create/update notification is sent to Pub/Sub, which triggers the Hive BQ load DAG. This DAG parallelizes the data transfer based on the user-defined key batchDistribution, using Airflow Dynamic Task Mapping.
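
For orientation, the top-level shape of this configuration file is sketched below. The keys come from the Field Descriptions section further down; the values are illustrative placeholders only, and the sample config file linked under the Configuration file section remains the authoritative reference.

```json
{
  "batchDistribution": 10,
  "unique_id": "hive_bq_load_demo",
  "type": "data",
  "source": "hive",
  "dvt_check": "Y",
  "transfer_config": {},
  "validation_config": {}
}
```

The transfer_config and validation_config sub-objects are expanded in the Field Descriptions section and in the fuller sketch that follows it.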

Hive Data Migration Flow

(Diagram: Hive data migration flow)

Audit Tables for Data Migration & Validation

The logging tables listed below are created by the Terraform templates and record activities for the Data Transfer Tool:

| Dataset.Table Name | Migration Phase | Description |
| --- | --- | --- |
| dmt_logs.dmt_dvt_results | Data Validation | Contains data validation results for DDL/SQL/data |
| dmt_logs.dmt_dvt_aggregated_results | Data Validation | Contains aggregated data validation results |
| dmt_logs.bq_load_audit | Data Migration | Contains the status of each table load |
| dmt_logs.hive_pubsub_audit | Incremental Data Load | Contains the status of each incremental file added to GCS |
| dmt_logs.hive_inc_load_table_list | Incremental Data Load | Contains the copy status of each incremental file |
| dmt_logs.hive_inc_bqload_audit | Incremental Data Load | Contains the status of each incremental file load |

Configuration file

Data Transfer sample config file location

Field Descriptions

| Configuration attribute | Description |
| --- | --- |
| batchDistribution | Number of tables to migrate in each batch (for each batch there will be a separate data transfer and agent run). Number of batches = total number of tables / batchDistribution; for example, 100 tables with a batchDistribution of 25 results in 4 batches. |
| unique_id | A unique name for identifying the data migration. Note: if the user opted for data migration along with schema migration through the tool, this unique id should be the same as the one used in the schema migration config file. |
| type | Type of migration: data |
| source | Source data warehouse: hive |
| dvt_check | Y or N. If set to Y, only tables that have passed the DVT validation check for DDL are picked for data migration. If set to N, all tables from the hive_ddl_metadata table are picked. |
| transfer_config | Sub JSON config used to get configuration details for data migration |
| transfer_config:dataSourceId | HIVE |
| transfer_config:rerun_flag | If the migration activity is stopped abruptly, the user can set this to 'Y' to migrate only the tables left over from the previous run. |
| transfer_config:displayName | Data transfer service config display name |
| transfer_config:params:database_type | Hive |
| transfer_config:params:hive_db_name | Source Hive database name that contains the tables to migrate |
| transfer_config:params:gcs_temp_bucket | GCS temp bucket used for interim outputs and calculations. Ideally dmt-temp-<project-id>, created by Terraform. |
| transfer_config:params:hive_gcs_staging_bucket_id | Bucket id that contains the stored data files. Ideally dmt-teradata-data-<project-id>, created by Terraform. |
| transfer_config:params:hive_gcs_staging_path | Folder name inside hive_gcs_staging_bucket_id that contains the data files. |
| transfer_config:params:bq_dataset_id | Target BQ dataset name. |
| transfer_config:params:destinationDatasetId | The BigQuery target dataset id. |
| transfer_config:params:project_id | Project number (numeric) to which the migration is taking place |
| transfer_config:params:region | Region in which the bucket belongs |
| validation_config | Sub JSON config used to run DVT |
| validation_config:source_config:source_type | DVT validation source data warehouse: Impala |
| validation_config:source_config:host | Hostname / IP address of on-prem Hive |
| validation_config:source_config:port | Port number of on-prem Impala |
| validation_config:validation_type | DVT validation type. Possible values: data, column |
| validation_config:validation_mode | DVT validation mode. Possible values: gke, cloudrun |
| validation_config:validation_object | DVT validation object: data |

unique_id: name to uniquely identify the load job batch.

Note: If the user opted for data migration, along with schema migration through the tool, this unique id should be the same as the one used in the schema migration config file.

displayName: prefix of the BigQuery Data Transfer Service config display name

hive_gcs_staging_bucket_id: user provided bucket name for storing data files extracted from Hive. Ideally dmt-teradata-data-<customer_name provided in TF> is created using the Foundations deployment. If a user wishes to use another bucket, it needs to be created by the user/concerned team before uploading the file.

bq_dataset_id: Target dataset id for the particular database migrated from HIVE.

project_id: numeric project id (project number) to which the migration is taking place

validation_object: the validation_object key in validation_config needs to be data for data transfer configs.

validation_type: the validation_type key in validation_config can be column or row depending on the requirement.

validation_mode: value can be 'cloudrun' or 'gke' based on where the user wants to conduct data validation.
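
Putting the fields above together, the transfer_config and validation_config sub-objects of a data transfer config might look like the sketch below. This is an illustration assembled from the Field Descriptions, not a verbatim copy of the sample config file; angle-bracket values are placeholders, and the exact nesting of keys should be confirmed against the sample config file linked above.

```json
{
  "transfer_config": {
    "dataSourceId": "HIVE",
    "displayName": "<display name prefix>",
    "rerun_flag": "N",
    "params": {
      "database_type": "Hive",
      "hive_db_name": "<source hive database>",
      "gcs_temp_bucket": "dmt-temp-<project-id>",
      "hive_gcs_staging_bucket_id": "dmt-teradata-data-<project-id>",
      "hive_gcs_staging_path": "<folder containing data files>",
      "bq_dataset_id": "<target dataset>",
      "destinationDatasetId": "<target dataset>",
      "project_id": "<numeric project number>",
      "region": "<bucket region>"
    }
  },
  "validation_config": {
    "source_config": {
      "source_type": "Impala",
      "host": "<on-prem hive host or IP>",
      "port": "<impala port>"
    },
    "validation_type": "column",
    "validation_mode": "cloudrun",
    "validation_object": "data"
  }
}
```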

Validation of data in BigQuery

Composer DAG validation_dag will validate the data migrated to BQ. Results for DVT can be viewed in <PROJECT_ID>.dmt_logs.dmt_dvt_results.

The query below can be used to see the results:

```sql
SELECT * FROM `<project-id>.dmt_logs.dmt_dvt_results`
WHERE
  -- validation_type = 'Column'
  -- run_id = '<transfer_run_id>'
  target_table_name = '<target_table_name>'
```

There are two ways to validate data: column or row. A validation_type of column verifies count aggregations on the source and target tables. A validation_type of row verifies a hash of the rows based on the primary key. Hence, for row validations, providing primary_key in the validation_config JSON is necessary.
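
Because row validation hashes rows on the primary key, the validation_config must also carry a primary_key entry for row validations. A minimal sketch, assuming the key sits alongside the other validation_config fields (the angle-bracket value and exact placement are illustrative; confirm against the sample config file):

```json
{
  "validation_config": {
    "validation_type": "row",
    "validation_mode": "cloudrun",
    "validation_object": "data",
    "primary_key": "<primary key column>"
  }
}
```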

Check the field descriptions section for detailed information on the keys in the configuration file.

Known Issues

| Hive Table Format | Data Type | Issue Description | Buganizer # | Status |
| --- | --- | --- | --- | --- |
| Text | Numeric | Hive does implicit casting to the precision mentioned in the DDL. When the same data is extracted as a text file, the entire precision is loaded into the file, which conflicts with the precision mentioned in the DDL. This causes errors during BQ load. | [b/259589505](http://b/259589505) | Fix suggested by the BQ support team is to set a feature flag at the project level which will handle the rounding off. |
| Avro, Parquet, ORC | Float | The way floating-point precision is stored in Hive differs from the data in the underlying raw files. During BQ load, the float column adds additional precision at the end to make it 16 digits of precision. | [b/260845659](http://b/260845659) | Reported to the Translation team. The team suggests using Numeric instead of Float64 by overwriting the translation API. |
| Parquet | Array, Map | For complex data types such as array or map in Parquet files, a struct of array or struct of key-value pairs is needed. The Translation API does not convert these into the required format, so the data load fails. | [b/260647109](http://b/260647109) | Reported to the Translation team. |
| Avro, Parquet, ORC | Timestamp | The BQ translator API converts Timestamp to Datetime, which causes issues loading data for source files that carry the schema inside the data files. As the data type in the source is Timestamp, the same is expected in the target. | | Suggested workaround is to use the global config to overwrite the translation API (automatically taken care of by the framework). |

Known Issues - DVT

  • DVT cannot perform validation of complex data types such as map, struct, or array.
  • To perform schema validation through DVT, each table name needs to be specified.
  • DVT fails the check if the data types do not match exactly between Hive and BigQuery. For example, the Translation API converts int32 in Hive to int64 in BQ; DVT marks this check as failed because it expects the same data type in both places.
  • DVT requires a connection to the Hive DB. This could be a challenge for customers who are not comfortable with giving external access to their Hive data warehouse.