- For Data Migration Tool introduction, please refer to /README.md
- For Installation user guide, please refer to /docs/installation/README.md
- Initial release will support bulk load data transfer only.
- Initial release will support Redshift to BigQuery migration using BQ DTS, with row and column validation of the data in source and target.
- If tables from multiple schemas are to be migrated, create separate config files for each source schema.
- The current tool version supports only bulk load, which is ideally done only once; the transfer is initiated by the process that follows upload of the config files.
- If new tables are added to a batch/config file, create a new config file with a new table list even if the rest of the details remain the same. Otherwise, multiple data transfer configs may end up migrating the same tables.
- Note that re-uploading the same configuration file, or using a different config file for the same dataset and tables, will create duplicate data in the tables of the target dataset.
- Please be cautious of the number of parallel processes being triggered for Data Migration. BigQuery has a load quota of 15 TB per load job.
The user uploads a configuration JSON file to the data folder of the dmt-config-<customer_name provided in TF> bucket, which initiates data migration.
As the configuration is uploaded, a file create/update trigger is sent to Pub/Sub, which triggers the DAG controller_dag via Cloud Run. This DAG is responsible for triggering the relevant schema/data migration DAG based on the operation type and data source given in the user config. For Redshift data migration, it triggers redshift_data_load_dag, which parallelizes data transfer config creation based on the user-defined key batchDistribution.
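For example, a minimal upload that kicks off this flow could look like the following (bucket suffix and file name are illustrative placeholders; use the bucket created by your deployment):

```sh
# Upload the data migration config JSON to the config bucket's data folder.
# This object create/update event is what triggers controller_dag via Pub/Sub and Cloud Run.
gsutil cp redshift_data_config.json gs://dmt-config-<customer_name>/data/
```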
- The target dataset must be created by the user/concerned team before uploading the configuration file.
- If your Redshift cluster is private
  - Set up a VPN connection between the cluster's VPC and a GCP VPC (VPN Setup Guide).
  - Follow the instructions in Configure inbound rules for SQL clients to allowlist the IP ranges you'll reserve for the private cluster. In a later step, you define the private IP range to reserve in this VPC network when you upload the config file (under the key migration_infra_cidr).
  - Add a custom route advertisement under the Cloud Router for the subnet/IP range you wish to reserve (see the example command after this list).
- If your Redshift cluster is public
  - Follow the instructions in Configure inbound rules for SQL clients to allowlist the following IP addresses. You can allowlist the IP addresses that correspond to your dataset's location, or you can allowlist all of the IP addresses in the table here. These Google-owned IP addresses are reserved for Amazon Redshift data migrations.
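For the private-cluster case above, adding a custom route advertisement on the Cloud Router might look like the sketch below (router name, region, and CIDR are placeholder assumptions; verify the flags against your gcloud version):

```sh
# Advertise the reserved private IP range from the Cloud Router that serves the VPN tunnel.
gcloud compute routers update <router-name> \
  --region=<region> \
  --advertisement-mode=custom \
  --set-advertisement-groups=all_subnets \
  --set-advertisement-ranges=10.0.0.0/24
```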
Data Migration deployment will take care of the workflow needed to perform
- Initial data load
- Data Validation using DVT tool
- Option to run DVT on Cloud Run or a GKE Airflow pod operator depending on scaling requirements.
Ensure that you have followed the steps to deploy Data Migration Tool architecture as mentioned in the readme with the option of _DATA_SOURCE=redshift
If you had previously deployed only the translation architecture, you can run a delta architecture deployment for Redshift migration with the command below, provided the Terraform state files in the remote GCS bucket from the translation deployment are intact.
gcloud builds submit . --project ${SOURCE_PROJECT} \
--config cloudbuild_deploy.yaml \
--substitutions \
_DATA_SOURCE=redshift
- Check for success messages in your Cloud Build Logs for deployment of
- Google Cloud Storage
- Cloud Pub/Sub
- Cloud Composer
- Cloud Run
- BigQuery
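If you prefer the CLI over the console, one way to check the deployment status and logs is sketched below (the build ID comes from your own run):

```sh
# List recent builds in the source project, then stream logs for a specific build.
gcloud builds list --project ${SOURCE_PROJECT} --limit=5
gcloud builds log <BUILD_ID> --project ${SOURCE_PROJECT}
```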
The following logging tables are created by the Terraform templates and record activities for the Data Migration Tool:
| Dataset.Table Name | Migration Phase (Translation, Schema Mig, Data Mig, Validation) | Description |
|---|---|---|
| dmt_logs.dmt_redshift_transfer_tracking | Data Migration | Contains mapping for unique_id, transfer_id and config file |
| dmt_logs.dmt_redshift_transfer_run_summary | Data Migration | Contains summary for DTS run |
| dmt_logs.dmt_redshift_transfer_run_jobwise_details | Data Migration | Contains summary for each job in DTS runs |
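For example, to see which DTS transfer configs were created for a given config file, a query along these lines can be used (project ID and unique_id are placeholders; columns as per the tracking table above):

```sql
-- Map a config file's unique_id to the DTS transfer configs created for it.
SELECT *
FROM `<project-id>.dmt_logs.dmt_redshift_transfer_tracking`
WHERE unique_id = '<unique_id_to_filter_bq_result_table>';
```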
- Data Migration
- Data Validations
- SQL Validations
{
"unique_id":"unique_id_to_filter_bq_result_table",
"batchDistribution": 1,
"type": "data",
"source": "redshift",
"table_list_file": "gs://<dmt-redshift-data-<project-id-or-customer-name-given-in-deployment>/tables_list.csv",
"transfer_config": {
"dataSourceId": "redshift",
"displayName": "unique_id_to_filter_bq_result_table",
"params": {
"jdbc_url": "jdbc:redshift://<host>.<region>.redshift.amazonaws.com:5439/dev",
"database_username": "<username>",
"database_password": "secret-<redshift_secret_name>",
"access_key_id": "<aws_access_key_id>",
"secret_access_key": "secret-<aws_access_key>",
"s3_bucket": "s3://<bucketname>/",
"redshift_schema": "<redshift_schema_name>",
"migration_infra_cidr": "<vpc>:<10.0.0.0/24(use this variable only when your Redshift cluster is private. If its public remove this line)>"
},
"emailPreferences": {
"enableFailureEmail": false
},
"notificationPubsubTopic": "projects/<project-id>/topics/dmt-redshift-dts-notification-topic-<project-id-or-customer-name-given-in-deployment>",
"destinationDatasetId": "<bq_target_dataset>"
},
"validation_config": {
"source_config": {
"source_type": "redshift",
"host": "<host>.<region>.redshift.amazonaws.com",
"port": "5439",
"user-name": "<source_db_username>",
"password": "secret-<redshift_secret_name>"
},
"target_config": {
"target_type": "BigQuery",
"project-id": "<project-id>"
},
"validation_params_file_path" : "gs://<dmt-redshift-config-<project-id-or-customer-name-given-in-deployment>/validation/redshift/validation_params.csv",
"validation_mode": "<gke/cloudrun>",
"validation_type": "<row/column>",
"pod_operator_mem" :"4000M",
"pod_operator_cpu" :"800m"
}
}
Data Transfer sample config file location
| JSON attribute | Description |
|---|---|
| batchDistribution | Number of tables to migrate in each batch (for each batch there will be a separate data transfer and agent run). Number of batches = total number of tables / batchDistribution. |
| unique_id | Provide a unique name for identifying the data migration. Note: if the user opted for data migration along with schema migration through the tool, this unique_id should be the same as the one used in the schema migration config file. |
| type | Type of migration: data |
| source | Source database: redshift |
| table_list_file | File uploaded to the same GCS bucket as the config file. This file should provide the table names to be migrated from a particular database. Note: tables in the CSV file should always be in the same case as they exist in source Redshift and should ultimately match the validation_config source_target_table mapping in the JSON config for DVT validations, to avoid failures. |
| transfer_config | Sub-JSON config used to create the data transfer config on BigQuery. Refer to REST Resource: projects.locations.transferConfigs \| BigQuery \| Google Cloud for exhaustive keys under the transfer_config sub-JSON. |
| transfer_config:dataSourceId | redshift |
| transfer_config:displayName | Prefix of the BigQuery Data Transfer Service config display name |
| transfer_config:params:jdbc_url | The JDBC connection URL used to locate the Amazon Redshift cluster. |
| transfer_config:params:database_username | The username to access your database to unload specified tables. |
| transfer_config:params:database_password | The password used with the username to access your database to unload specified tables. |
| transfer_config:params:access_key_id | The access key ID to sign requests made to AWS. |
| transfer_config:params:secret_access_key | The secret access key used with the access key ID to sign requests made to AWS. |
| transfer_config:params:s3_bucket | The Amazon S3 URI beginning with "s3://", specifying a prefix for temporary files to be used. |
| transfer_config:params:redshift_schema | The Amazon Redshift schema that contains all the tables to be migrated. |
| transfer_config:params:table_name_patterns | Table name patterns separated by a semicolon (;). The table pattern is a regular expression for the table(s) to migrate. If not provided, all tables under the database schema are migrated. |
| transfer_config:params:migration_infra_cidr | Optional: use this key if your Redshift cluster is not public and you have configured a VPN to a VPC network in your GCP project. Specify the VPC network name and the private IP address range to use in the tenant project VPC network, as a CIDR block. |
| validation_config:source_config:source_type | Source data warehouse for DVT validation: redshift |
| validation_config:source_config:host | Hostname/IP address of the Redshift cluster |
| validation_config:source_config:port | Port number of the Redshift cluster |
| validation_config:source_config:user-name | Username with permissions to query the source tables |
| validation_config:source_config:password | Secret Manager key name (the secret should be created in Secret Manager as secret-<secret-key-name>), for example secret-edw_credentials |
| validation_config:target_config:target_type | BigQuery |
| validation_config:target_config:project-id | BigQuery GCP project ID |
| validation_config:validation_params_file_path | GCS location of the CSV file or Excel sheet containing table or file names along with DVT validation flags. Examples: gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.csv, gs://dmt-config-dmt-demo-project/validation/teradata/validation_params.xlsx. Read the instructions below to understand how to populate and upload this file/sheet. |
| validation_config:validation_type | DVT validation type. Possible values: row, column |
| validation_config:validation_mode | DVT validation mode. Possible values: gke, cloudrun |
| validation_config:pod_operator_mem | Pod operator memory configuration when DVT mode is GKE. Default value: 4000M |
| validation_config:pod_operator_cpu | Pod operator CPU configuration when DVT mode is GKE. Default value: 800m |
unique_id: name to uniquely identify the batches or DTS runs for this config file.
Note: if the user opted for data migration along with schema migration through the tool, this unique_id should be the same as the one used in the schema migration config file.
displayName: prefix of the BigQuery Data Transfer Service config display name.
notificationPubsubTopic: field necessary to continue the flow from Data Transfer run logging, including data validation. Given in the format projects/<project-name>/topics/edw-dts-notification-topic-<customer_name provided in TF>. Use the same topic as mentioned above, as the required mappings for this topic are already done by the Foundations deployment.
destinationDatasetId: target dataset ID for the particular schema migrated from Redshift.
table_list_file: file uploaded to the same GCS bucket as the config file. It should list the table names to be migrated from a particular schema, one per line. E.g.:
- DEPARTMENT
- EMPLOYEE
- SALARY
- HR_RECORDS
Note that this key only needs to be provided if the user opts only for data migration through the tool (without schema translation). In that case, the table structure is created by BigQuery DTS itself rather than by the CompilerWorks schema migration feature of the tool.
Tables in the CSV file should always be in the same case as they exist in source Redshift and should match the contents of the CSV/Excel file uploaded as the validation parameters file in GCS.
project-id: project ID to which the migration is taking place.
Note that target table names are case sensitive due to the case-sensitive nature of BigQuery. So, in the case of data migration only, they need to always be in the same case as they exist in source Redshift and must ultimately match the validation_config source_target_table mapping in config.json for DVT validations.
validation_object: the validation_object key in validation_config needs to be data for data transfer configs.
validation_type: the validation_type key in validation_config can be column or row depending on the requirement.
validation_mode: value can be cloudrun or gke based on where the user wants to conduct data validation.
primary_key: if validation_type is row, provide another key called primary_key in the same section with the primary key for the particular table (see the sketch after this list).
Refer to REST Resource: projects.locations.transferConfigs | BigQuery | Google Cloud for exhaustive keys under the transfer_config sub-JSON.
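For illustration only, a row-validation variant of the validation_config section with the primary_key key added (all values are placeholders; source_config and target_config are elided) might look like:

```json
"validation_config": {
  "source_config": { "...": "..." },
  "target_config": { "...": "..." },
  "validation_params_file_path": "gs://<config-bucket>/validation/redshift/validation_params.csv",
  "validation_mode": "cloudrun",
  "validation_type": "row",
  "primary_key": "<primary_key_column>"
}
```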
Please do not edit the headers, data validation rules, or other sheets in the Excel/Google Sheet, as this will alter the data validation checks put in place.
Fill in the columns according to the use case. The data validation rules within the sheet will help you fill the fields correctly.
To know more about which flags are relevant for which translation (ddl, sql, custom-query) and migration (data), refer to the tab named DVT Guide. This sheet contains links to the DVT GitHub pages, which explain all possible flags in the DVT CLI.
You can refer to the tab Example Sheet to get an idea of which flags can be supplied for which validation.
- To follow along with the Excel sheet:
  - Open this Excel Template File in Microsoft Excel (not in Google Sheets, as opening .xlsx in Google Sheets will corrupt the existing Data Validation Rules).
  - Fill in the sheet Validation. Once filled, save the file in .xlsx format. (Excel will convert the template file format .xltx to .xlsx for you.)
- To follow along with the CSV file format:
  - Open this Google Sheet in a browser (Viewer Access).
  - Make a copy to get an editable sheet.
  - Once filled, download the CSV file using the menu option (File --> Download --> Comma Separated Values (.csv)).
Once filled, upload the corresponding Excel (.xlsx) or CSV file to the GCS bucket (inside the relevant folder within dmt-config-dmt-demo-project/validation) and specify the GCS path in the configuration file.
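For example, uploading the filled-in CSV could look like this (path shown is illustrative and should match the value you set in validation_params_file_path):

```sh
# Upload the validation parameters file to the config bucket's validation folder.
gsutil cp validation_params.csv gs://dmt-config-dmt-demo-project/validation/redshift/
```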
Composer DAG validation_dag will validate the data migrated to BQ. Results for DVT can be viewed in <PROJECT_ID>.dmt_logs.dmt_dvt_results
The query below can be used to see the results (uncomment filters as needed):
SELECT * FROM `dmt-test-project.dmt_logs.dmt_dvt_results` WHERE
-- validation_type = 'Column'
-- run_id = '<transfer_run_id>'
target_table_name = '<target_table_name>'
There are two ways to validate data: column or row. A validation_type of column verifies count aggregation on the source and target tables; a validation_type of row verifies a hash of the rows based on the primary key. Hence, for row validations, providing primary_key in the validation_config JSON is necessary.
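As a rough illustration of what the two modes correspond to in the DVT CLI (connection names, table mapping, and primary key column are placeholder assumptions; the tool constructs the equivalent invocations for you from the config):

```sh
# Column validation: compares count aggregations between source and target.
data-validation validate column \
  --source-conn <redshift_conn> --target-conn <bq_conn> \
  --tables-list <redshift_schema>.<table>=<bq_dataset>.<table>

# Row validation: compares row hashes, so a primary key is required.
data-validation validate row \
  --source-conn <redshift_conn> --target-conn <bq_conn> \
  --tables-list <redshift_schema>.<table>=<bq_dataset>.<table> \
  --primary-keys <primary_key_column> \
  --hash '*'
```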
Check the field descriptions section for detailed information on the keys in the configuration file.