Data Migration Tool User Guide - Data Transfer

Tool Features

  • Initial release will support bulk load data transfer only.
  • Initial release will support Redshift to BigQuery data transfer using BQ DTS, with row and column validation of the data in source and target.

Things to Take Note of

  • If tables from multiple schemas are to be migrated, create separate config files for each source schema.
  • The current tool version supports only bulk load, which is ideally done only once; note that the transfer takes place automatically once the config file is uploaded.
  • If new tables are to be added to a batch/config file, create a new config file with a new table list even if the rest of the details remain the same. Otherwise, multiple data transfer configs may end up migrating the same tables.
  • Note that re-uploading the same configuration file, or using a different config file for the same dataset and tables, will duplicate data in the target dataset's tables.
  • Please be cautious of the number of parallel processes being triggered for Data Migration. BigQuery has a load quota of 15 TB per load job.

Prepare configuration file for BigQuery DTS

The user uploads a configuration JSON file to the data folder of the dmt-config-<customer_name provided in TF> bucket, which initiates data migration.

When the configuration file is uploaded, a file create/update notification is published to Pub/Sub, which triggers the DAG controller_dag via Cloud Run. This DAG triggers the relevant schema/data migration DAG based on the operation type and data source given in the user config. For Redshift data migration it triggers redshift_data_load_dag, which parallelizes the data transfer config creation based on the user-defined key batchDistribution.
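For reference, the upload could look like the following (a minimal sketch; the config file name is a hypothetical placeholder, and the bucket follows the naming convention above):

gsutil cp redshift_data_config.json gs://dmt-config-<customer_name provided in TF>/data/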

Prerequisites

  • Target dataset creation to be done by the user/concerned team before uploading the configuration file.
  • If your Redshift cluster is private
    • Set up a VPN connection between the cluster's VPC and a GCP VPC (VPN Setup Guide).
    • Follow the instructions in Configure inbound rules for SQL clients to allowlist the IP ranges you'll reserve for the private cluster. In a later step, you define the private IP range to reserve in this VPC network when you upload the config file (under the key migration_infra_cidr).
    • Add a custom route advertisement under the Cloud Router for the subnet/IP range you wish to reserve (see the example command after this list).
  • If your Redshift cluster is public
    • Follow the instructions in Configure inbound rules for SQL clients to allowlist the following IP addresses. You can allowlist the IP addresses that correspond to your dataset's location, or you can allowlist all of the IP addresses in the table here. These Google-owned IP addresses are reserved for Amazon Redshift data migrations.
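As a sketch of the custom route advertisement step for a private cluster (router name, region, and CIDR are placeholders; the advertised range should match the migration_infra_cidr value you plan to use in the config file):

gcloud compute routers update <cloud_router_name> \
  --region=<region> \
  --advertisement-mode=custom \
  --set-advertisement-groups=all_subnets \
  --set-advertisement-ranges=10.0.0.0/24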

Data Migration Flow

(Data migration flow diagram)

Data Migration Tool Deployment - Redshift Data Migration

Data Migration deployment will take care of the workflow needed to perform

  • Initial data load
  • Data Validation using DVT tool
  • Options to run DVT on Cloud Run or a GKE Airflow pod operator, depending on scaling requirements.

Ensure that you have followed the steps to deploy the Data Migration Tool architecture as described in the readme, with the option _DATA_SOURCE=redshift.

If you previously deployed only the translation architecture, you can run a delta architecture deployment for Redshift migration with the command below, provided the Terraform state files in the remote GCS bucket from the translation deployment are intact.

gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=redshift
  • Check for success messages in your Cloud Build logs (see the example commands after this list) for the deployment of
    • Google Cloud Storage
    • Cloud Pub/Sub
    • Cloud Composer
    • Cloud Run
    • BigQuery
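One way to check this from the CLI is to list recent builds and inspect the log of the deployment build (a minimal sketch; the build ID is a placeholder taken from the list output):

gcloud builds list --project ${SOURCE_PROJECT} --limit=5
gcloud builds log <BUILD_ID> --project ${SOURCE_PROJECT}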

Audit Tables for Data Migration & Validation

The logging tables below are created by the Terraform templates and record the activities of the Data Migration Tool.

| Dataset.Table Name | Migration Phase (Translation, Schema Mig, Data Mig, Validation) | Description |
| --- | --- | --- |
| dmt_logs.dmt_redshift_transfer_tracking | Data Migration | Contains mapping for unique_id, transfer_id and config file |
| dmt_logs.dmt_redshift_transfer_run_summary | Data Migration | Contains summary for DTS run |
| dmt_logs.dmt_redshift_transfer_run_jobwise_details | Data Migration | Contains summary for each job in DTS runs |
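Once a transfer has run, these audit tables can be queried directly to track progress; for example (a minimal sketch using the bq CLI, with the project ID as a placeholder):

bq query --use_legacy_sql=false \
  'SELECT * FROM `<project-id>.dmt_logs.dmt_redshift_transfer_run_summary` LIMIT 10'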

Trigger Data Migration Tool

  • Data Migration
  • Data Validations
  • SQL Validations

Sample configuration file

{
    "unique_id":"unique_id_to_filter_bq_result_table",
   "batchDistribution": 1,
   "type": "data",
   "source": "redshift",
   "table_list_file": "gs://<dmt-redshift-data-<project-id-or-customer-name-given-in-deployment>/tables_list.csv",
   "transfer_config": {
       "dataSourceId": "redshift",
       "displayName": "unique_id_to_filter_bq_result_table",
       "params": {
           "jdbc_url": "jdbc:redshift://<host>.<region>.redshift.amazonaws.com:5439/dev",
           "database_username": "<username>",
           "database_password": "secret-<redshift_secret_name>",
           "access_key_id": "<aws_access_key_id>",
           "secret_access_key": "secret-<aws_access_key>",
           "s3_bucket": "s3://<bucketname>/",
           "redshift_schema": "<redshift_schema_name>",
           "migration_infra_cidr": "<vpc>:<10.0.0.0/24(use this variable only when your Redshift cluster is private. If its public remove this line)>"
       },
       "emailPreferences": {
           "enableFailureEmail": false
       },
       "notificationPubsubTopic": "projects/<project-id>/topics/dmt-redshift-dts-notification-topic-<project-id-or-customer-name-given-in-deployment>",
       "destinationDatasetId": "<bq_target_dataset>"
   },
   "validation_config": {
       "source_config": {
           "source_type": "redshift",
           "host": "<host>.<region>.redshift.amazonaws.com",
           "port": "5439",
           "user-name": "<source_db_username>",
           "password": "secret-<redshift_secret_name>"
       },
       "target_config": {
           "target_type": "BigQuery",
           "project-id": "<project-id>"
       },
        "validation_params_file_path" : "gs://<dmt-redshift-config-<project-id-or-customer-name-given-in-deployment>/validation/redshift/validation_params.csv",
       "validation_mode": "<gke/cloudrun>",
       "validation_type": "<row/column>",
       "pod_operator_mem" :"4000M",
       "pod_operator_cpu" :"800m"
   }
}
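The tables_list.csv referenced by table_list_file must already exist in GCS before the config file is uploaded. A minimal sketch of that upload, using the bucket path from the sample config above:

gsutil cp tables_list.csv gs://dmt-redshift-data-<project-id-or-customer-name-given-in-deployment>/tables_list.csv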

Data Transfer sample config file location

Field Descriptions

JSON attribute Description
batchDistribution Number of tables to migrate in each batch (for each batch there will be a separate data transfer and agent run).

[Number of batches = Total number of tables / batchDistribution; for example, 100 tables with batchDistribution = 25 results in 4 batches and 4 transfer configs.]

unique_id Provide a unique name for identifying the data migration.

Note: If the user opted for data migration along with schema migration through the tool, this unique_id should be the same as the one used in the schema migration config file.

type Type of migration : data
source Source database: redshift
table_list_file File uploaded to the same GCS bucket as the one used for uploading the config file. This file should list the table names to be migrated from a particular database.

Note: the table_list_file key only needs to be provided if the user opts only for data migration through the tool (without schema translation).

Table names in the CSV file should always be in the same case as they exist in source Redshift and should ultimately match the validation_config source_target_table mapping in the JSON config, to avoid DVT validation failures.

transfer_config Sub-JSON config used to create the data transfer config on BigQuery.

Refer to REST Resource: projects.locations.transferConfigs | BigQuery | Google Cloud for the exhaustive list of keys under the transfer_config sub-JSON.

transfer_config:dataSourceId redshift
transfer_config:displayName Prefix of the BigQuery Data Transfer Service config display name
transfer_config:params:jdbc_url The JDBC connection URL is used to locate the Amazon Redshift cluster.
transfer_config:params:database_username The username to access your database to unload specified tables.
transfer_config:params:database_password The password used with the username to access your database to unload specified tables.
transfer_config:params:access_key_id The access key ID to sign requests made to AWS.
transfer_config:params:secret_access_key The secret access key used with the access key ID to sign requests made to AWS.
transfer_config:params:s3_bucket The Amazon S3 URI beginning with "s3://" and specifying a prefix for temporary files to be used.
transfer_config:params:redshift_schema The Amazon Redshift schema that contains all the tables to be migrated.
transfer_config:params:table_name_patterns Table name patterns separated by a semicolon (;). The table pattern is a regular expression for table(s) to migrate. If not provided, all tables under the database schema are migrated.
transfer_config:params:migration_infra_cidr Optional: Use this key if your Redshift cluster is not public and you have configured a VPN to a VPC network in your GCP project.
Specify the VPC network name and the private IP address range to use in the tenant project VPC network. Specify the IP address range as a CIDR block.
  • The form is VPC_network_name:CIDR, for example: my_vpc:10.251.1.0/24.
  • Use standard private VPC network address ranges in the CIDR notation, starting with 10.x.x.x.
  • The IP range must have more than 10 IP addresses.
  • The IP range must not overlap with any subnet in your project VPC network or the Amazon Redshift VPC network.
  • If you have multiple transfers configured for the same Amazon Redshift instance, make sure to use the same VPC_network_name:CIDR value in each, so that multiple transfers can reuse the same migration infrastructure.
Caution: After being configured, the value of this CIDR block is immutable.
validation_config:source_config:source_type Source data warehouse for DVT validation: redshift

validation_config:source_config:host Hostname/ IP address of Redshift cluster
validation_config:source_config:port Port number of Redshift cluster
validation_config:source_config:user-name Username with permissions to query on source tables
validation_config:source_config:password Secret Manager key name, given as secret-<secret-key-name>

(The secret should be created in Secret Manager as secret-<secret-key-name>; see the example command after this table.)

For example - secret-edw_credentials

validation_config:target_config:target_type BigQuery
validation_config:target_config:project-id BigQuery GCP Project ID
validation_config:validation_params_file_path GCS location of the CSV file or Excel sheet containing table or file names along with the DVT validation flags.

Examples:

gs://dmt-config-dmt-demo-project/validation/redshift/validation_params.csv

gs://dmt-config-dmt-demo-project/validation/redshift/validation_params.xlsx

Read the instructions below to understand how to populate and upload this file/sheet.

validation_config:validation_type DVT validation type

Possible values - row, column

validation_config:validation_mode DVT validation mode

Possible values - gke, cloudrun

validation_config:pod_operator_mem Pod Operator memory configuration when DVT mode is chosen as GKE

Default Value - 4000M

validation_config:pod_operator_cpu Pod Operator cpu configuration when DVT mode is chosen as GKE

Default Value - 800m
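For the Secret Manager entries referenced by database_password, secret_access_key and the validation_config password, a possible creation flow looks like the following (a sketch only; the secret name follows the secret-<secret-key-name> convention described above, the echoed value is an illustrative placeholder, and you should confirm the exact naming convention against your deployment):

gcloud secrets create secret-edw_credentials --replication-policy=automatic
echo -n '<redshift_password>' | gcloud secrets versions add secret-edw_credentials --data-file=-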

unique_id: name to uniquely identify the batches or DTS for this config file.

Note: if the user opted for data migration along with schema migration through the tool, this unique_id should be the same as the one used in the schema migration config file.

displayName: prefix of the BigQuery Data Transfer Service config display name.

notificationPubsubTopic: field necessary to continue the flow from Data Transfer run logging, including data validation. Given in the format projects/<project-id>/topics/dmt-redshift-dts-notification-topic-<customer_name provided in TF>, as in the sample config above. Use that same topic, as the required mappings for this topic are already created by the Foundations deployment.

destinationDatasetId: Target dataset id for the particular schema migrated from redshift.

table_list_file: file uploaded to the same GCS bucket as the one used for uploading the config file. This file should list the table names to be migrated from a particular schema, one per line. E.g.:

  • DEPARTMENT
  • EMPLOYEE
  • SALARY
  • HR_RECORDS

Note that this key only needs to be provided if the user opts only for data migration through the tool (without schema translation). In that case, the table structure is created by BigQuery DTS itself rather than by the CompilerWorks schema migration feature of the tool.

Table names in the CSV file should always be in the same case as they exist in source Redshift and should ultimately match the contents of the CSV/Excel file uploaded as the validation parameters file in GCS.

project-id: project ID to which the migration is taking place.

Note that target table names are case-sensitive due to the case-sensitive nature of BigQuery. So, in the case of data migration only, they always need to be in the same case as they exist in source Redshift and must ultimately match the validation_config source_target_table mapping in config.json for DVT validations.

validation_object: the validation_object key in validation_config needs to be data for data transfer configs.

validation_type: the validation_type key in validation_config can be column or row depending on the requirement.

validation_mode: Value can be ‘cloudrun’ or ‘gke’ based on where the user wants to conduct data validation.

primary_key: if validation_type is row, please provide another key called primary_key in the same section with the primary key for the particular table.

Refer to REST Resource: projects.locations.transferConfigs | BigQuery | Google Cloud for the exhaustive list of keys under the transfer_config sub-JSON.
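After the config file is processed, the DTS configs created by the tool can be listed to confirm they were created as expected (a sketch using the bq CLI; the transfer location and project ID are placeholders):

bq ls --transfer_config --transfer_location=us --project_id=<project-id>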

Instructions To Populate And Upload Validation Parameters File

Please do not edit the headers, data validation rules or other sheets in the Excel/Google Sheet, as doing so will alter the data validation checks put in place.


Fill in the columns according to the use case. The data validation rules within the sheet will help you fill in the fields correctly. To learn which flags are relevant for which translation (ddl, sql, custom-query) and migration (data), refer to the tab named DVT Guide. This sheet contains links to the DVT GitHub pages, which explain all possible flags in the DVT CLI.

You can refer to the tab Example Sheet to get an idea of which flags can be supplied for which validation.

  • To follow along with Excel sheet

    • Open this Excel Template File in Microsoft Excel (not in Google Sheets, as opening .xlsx in Google Sheets will corrupt the existing data validation rules).
    • Fill in the sheet Validation. Once filled, save the file in .xlsx format. (Excel will convert the template file format .xltx to .xlsx for you.)

  • To follow along with CSV file format

    • Open this Google Sheet in browser (Viewer Access).
    • Make a copy, to get an editable sheet.
    • Once filled, download CSV file using the Menu Option. (File --> Download --> Comma Separated Values (.csv))

    Once filled, upload the corresponding Excel (.xlsx) or CSV file to the GCS bucket (inside the relevant folder within dmt-config-dmt-demo-project/validation) and specify the GCS path in the configuration file, as shown in the example below.
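For example, to upload a filled-in CSV for Redshift validations to the path used in the sample config above (bucket name is a placeholder from that config):

gsutil cp validation_params.csv gs://dmt-redshift-config-<project-id-or-customer-name-given-in-deployment>/validation/redshift/validation_params.csv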

Validation data in BigQuery

The Composer DAG validation_dag will validate the data migrated to BigQuery. DVT results can be viewed in <PROJECT_ID>.dmt_logs.dmt_dvt_results.

The query below can be used to see the results:

SELECT * FROM `dmt-test-project.dmt_logs.dmt_dvt_results` WHERE
-- validation_type = 'Column'
-- run_id = '<transfer_run_id>'
target_table_name = '<target_table_name>'

There are two ways to validate data: column or row. validation_type set to column verifies count aggregation on the source and target tables; validation_type set to row verifies row hashes based on the primary key. Hence, for row validations, providing primary_key in the validation_config JSON is necessary.

Check the field descriptions section for detailed information on the keys in the configuration file.