flake news

A batch loading tool for migrating data from an RDBMS to Snowflake

Currently only supporting MSSQL Server


tl;dr

  1. Setup & install, then export the environment variables

  2. Edit the table_config.yml file with the database, schema and tables you want

  3. Generate the table_rules.json file and the table_ddl.sql file

python -m flakenews -c table_config.yml

  4. Create all the destination tables on Snowflake - hint: table_ddl.sql has what you need

  5. Run the batch-load one-off process to load all the tables

python -m flakenews -r table_rules.json

  6. Go push some Jira tickets to the done column.

When do you need flake news?

Very rarely.

  • If you have a database that is already bulk-loaded, for example a datamart, and you want to temporarily transfer data from there to Snowflake.
  • If that data mart experiences minimally logged operations like TRUNCATE that preclude the use of CDC
  • If there are infrequent burst loads that would be overwhelming and expensive on tools like Fivetran

When flake news is wrong

Frequently.

  • If you have an application database that experiences small transactional changes
    • you should be using some form of change data capture along with a friendly SaaS tool like Fivetran, Stitchdata, Hevo and similar, or use something like Debezium.
    • For MSSQL Server that means using CDC if possible, or CT if not.
    • For PostgreSQL that means using replication slots
    • For MySQL that means using the binlog

Local Development

Setup & Requirements

  1. Python 3.8+
  2. Create a python virtual environment: python -m venv .venv
  3. Install required python packages: pip install -r requirements.txt
  4. On macOS you also need to run brew install freetds, which is required by pymssql
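
Collected into one shell session, that looks roughly like this (the brew line is macOS-only, and activation assumes the .venv created above):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
brew install freetds   # macOS only, needed by pymssql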

1. Environment variables

Export your environment variables

Note: On Windows you can set each of these environment variables with a line like:

[System.Environment]::SetEnvironmentVariable('FN_SQL_SERVER','<name of server here>','User')
...

MSSQL

Use forward slash / for the following, not a backslash.

  • FN_SQL_SERVER - either dbhostname/myinstance or servername
  • FN_SQL_USER - either companydomain/username or a username
  • FN_SQL_PASSWORD
  • FN_SQL_PORT - optional, default is 1433
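
For example, on macOS / Linux (all values below are placeholders):

export FN_SQL_SERVER=dbhostname/myinstance
export FN_SQL_USER=companydomain/username
export FN_SQL_PASSWORD='<your password here>'
export FN_SQL_PORT=1433   # optional, 1433 is the default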

Snowflake

Snowflake database / schema / table names need to be uppercase because the SDK will quote them.

  • FN_SF_ACCOUNT - e.g. mh85760.ap-southeast-2; a region / privatelink segment may be required
  • FN_SF_USER - login name
  • FN_SF_PASSWORD
  • FN_SF_AUTHENTICATOR - for OKTA use externalbrowser
  • FN_SF_ROLE
  • FN_SF_WAREHOUSE
  • FN_SF_DATABASE
  • FN_SF_SCHEMA
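
A sketch of the same for the Snowflake variables, again with placeholder values (remember the uppercase rule above):

export FN_SF_ACCOUNT=mh85760.ap-southeast-2
export FN_SF_USER='<login name>'
export FN_SF_PASSWORD='<your password here>'
export FN_SF_AUTHENTICATOR=externalbrowser   # only needed for OKTA / SSO
export FN_SF_ROLE='<role>'
export FN_SF_WAREHOUSE='<warehouse>'
export FN_SF_DATABASE='<database>'
export FN_SF_SCHEMA='<schema>'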

Create a table_config.yml file

flake news can only work with a single MSSQL Server, but can target multiple databases and tables within that server

  • To begin, create a table_config.yml file similar to the table_config_example.yml shown below, replacing the fields with your real information:
version: 2

databases:
  - name: demodata
    schemas:
      - name: dbo
        tables:
          - name: bulk_data
          #- name: another_table_in_same_schema
      #- name: another_schema_in_same_database
  - name: alternate
    schemas:
      - name: dbo
        tables:
          - name: hey

Generate table rules and ddl

With the environment variables ready and the table_config.yml file filled out, you can generate:

  1. table_ddl.sql file - all of the create or replace statements that you need to run on Snowflake
  2. table_rules.json file - used for running the batch load.
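
Both come from the same generate command shown in the tl;dr, pointed at your config file:

python -m flakenews -c table_config.yml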

Development setup

  • Azure Data Studio - a cross-platform app for connecting to SQL Server (replaces SSMS)
    • Or some other SQL tool like DBeaver, or SSMS if you are on Windows
  • Docker Desktop for Mac / Windows, or something else to run docker and docker-compose on your OS

Testing with the local container environment

  1. Export environment variables
  • Mac
export FN_SQL_SERVER=localhost
export FN_SQL_USER=sa
export FN_SQL_PASSWORD=Fake!News9000
export FN_SQL_PORT=1433
  • Windows

Make sure you can run PowerShell scripts - run a terminal as admin and then at the prompt:

Set-ExecutionPolicy RemoteSigned -Scope CurrentUser -Force

Export environment variables

[System.Environment]::SetEnvironmentVariable('FN_SQL_SERVER','localhost','User')
[System.Environment]::SetEnvironmentVariable('FN_SQL_USER','sa','User')
[System.Environment]::SetEnvironmentVariable('FN_SQL_PASSWORD','Fake!News9000','User')
[System.Environment]::SetEnvironmentVariable('FN_SQL_PORT','1433','User')

You might need to restart the terminal

There is no Snowflake local endpoint. If you want to test the complete load you will need a Snowflake account - either an existing one, or it is easy enough to create a 14-day trial account. You will also need to export all of the Snowflake environment variables (see the Snowflake section above). If using a trial account you don't need to set the authenticator; if using OKTA you can set FN_SF_AUTHENTICATOR to externalbrowser.

  2. Launch the MSSQL Server and S3 endpoint containers with setup_test_environment.sh

When you are done, shut them down with teardown_test_environment.sh
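
On Mac / Linux that is roughly (assuming the scripts sit in the repo root and are executable):

./setup_test_environment.sh
# ... run your tests ...
./teardown_test_environment.sh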

On Windows use the equivalent powershell scripts

TODO: Docker on Windows requires Hyper-V to be enabled in BIOS - can't test yet

  3. Test the creation of the table_ddl.sql and table_rules.json files
python -m flakenews -c ./table_config_example.yml
  4. You can connect to the MSSQL Server with sqlcmd, Azure Data Studio, SSMS, DBeaver or another tool, using the credentials from those environment variables, to check the data or run any SQL you like; data will not be persisted when the containers are shut down.
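
For example, with sqlcmd and the test credentials exported above (the query is just an illustrative smoke test):

sqlcmd -S localhost,1433 -U sa -P 'Fake!News9000' -Q "SELECT @@VERSION"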

  5. Test the S3 service container:

Requires the awscli tool to be installed.

  • Make a bucket for flake news: aws --endpoint-url=http://localhost:4566 s3 mb s3://flakenews

  • List buckets: aws --endpoint-url=http://localhost:4566 s3 ls
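
A quick round trip to confirm the endpoint works end to end (the file name here is just an example):

echo "hello" > test.csv
aws --endpoint-url=http://localhost:4566 s3 cp test.csv s3://flakenews/test.csv
aws --endpoint-url=http://localhost:4566 s3 ls s3://flakenews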

Use cases

  1. Output Snowflake create table DDL statements from a source database and generate a table_rules.json file with detailed metadata about tables to load.

    Use a table_config.yml file to restrict to a subset of tables

python -m flakenews -c ./table_config.yml
  2. Initial load to Snowflake - extracts csv files from a source database, splitting at every <batch_size> rows, then uploads the files to Snowflake tables
  • loads to a staging table which is then copied to the destination
  • For now, you would need to truncate the Snowflake table to do a full re-load

Use the table_rules.json created earlier to upload to Snowflake:

python -m flakenews -r ./table_rules.json
  3. Single table full reload
  • Truncate the table on Snowflake
  • Back up the table_rules.json and table_config.yml
  • Make a new cut-down table_config_single_table.yml file and use that to output a new table_rules.json file
python -m flakenews -c ./table_config_single_table.yml
  • and then run the initial load with that, as sketched below - it does a full load to a transient table which is then meta-swapped to the destination table
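
Sketched as commands, using the file names from the steps above:

python -m flakenews -c ./table_config_single_table.yml
python -m flakenews -r ./table_rules.json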
  4. Initial load to S3 - TODO

  5. Ongoing load - not implemented, consider Fivetran or other tool first

  6. Deploying and running - TBD

  • For now this is a tool to run manually in the context of one-off loads
  • Extensions are possible, e.g. using MSSQL Server Change Tracking (CT), Rowversion, or other tools, but for now it is stateless. To use one of those ongoing load methods requires keeping state e.g. in Snowflake.
  • Options for running on AWS or automating: Buildkite, Fargate, Lambda, Step Functions, AWS MWAA (airflow)
    • Lambda's 15 minute limit and 512MB /tmp are too limiting
    • Step Functions makes it a bit more rubegoldbergian - splits and watermarks across a dynamic loop (array driven) state machine...
    • Fargate has 20GB storage and is long-lived and could do a full reload easily in a kind of stateless way (not counting the data itself as state, and assuming there's no delta-loading, just a merge or swap on Snowflake)
  • Logging & Observability - logging has been added in a way that is compatible with AWS Lambda and other AWS services, so it should appear on CloudWatch or other connected services.

Notes

  • For development, testing or manual operation, the simplest connection credentials to use are Windows Authentication rather than username / password; however, both are supported.

  • Using SSO web browser auth with Snowflake unattended is hard - web browsers will pop up, so the solution is to use the secure-local-storage add-on installed along with snowflake-connector-python.
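
That add-on is normally installed as a pip extra of the connector (check the Snowflake connector docs for the exact extra name if this has changed):

pip install "snowflake-connector-python[secure-local-storage]"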

  • Uploading directly to Snowflake, e.g. with SnowSQL or similar, is not the optimal choice - but then neither is flake news. Snowpipe is often cheaper, as it uses managed virtual warehouse clusters for processing and has some other bulk upload optimisations. Uploading directly also means this tool requires a Snowflake login and the associated secrets management that goes along with it. It does, however, reduce the infrastructure and setup requirements, as you don't need an S3 bucket or a Snowpipe object.