This repo contains the code referenced in the AWS blog post "Build a Genomics data lake on AWS".
ETL - contains the transformation scripts and the CloudFormation template that spins up the EMR cluster.
EMRGenomics.py - Lambda function that is triggered by the CloudFormation template to create the EMR cluster that processes the VCFs.
EventEMRGenomics.py - Event trigger Lambda function
emr_config.json - JSON file with EMR configuration for this example. This file can be edited to change EMR configuration parameters.
vcfToParquetTransform.py - PySpark script that performs the VCF-to-Parquet transformation using the Hail API. It can be customized to perform any specific transformation steps you require.
genomics_datalake_emr.template - CloudFormation template that you can deploy in your account to set up the solution.
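As a rough illustration of the kind of transformation vcfToParquetTransform.py performs, here is a minimal sketch of a VCF-to-Parquet conversion with Hail. It is not the script shipped in this repo: the function names, the output naming scheme, the S3 paths, and the GRCh37 reference genome are all assumptions made for the example. Hail is imported inside the conversion function so the path helper can be used without a Hail installation.

```python
import posixpath


def parquet_output_path(vcf_path: str, output_prefix: str) -> str:
    """Derive a Parquet output location from a VCF file name.

    Hypothetical naming scheme for this sketch: strip the VCF suffix
    and append ".parquet" under the given output prefix.
    """
    name = posixpath.basename(vcf_path)
    for suffix in (".vcf.bgz", ".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return posixpath.join(output_prefix, name + ".parquet")


def vcf_to_parquet(vcf_path: str, output_prefix: str,
                   reference_genome: str = "GRCh37") -> str:
    """Read a VCF with Hail and write its variant-level rows as Parquet."""
    import hail as hl  # lazy import: only needed when the conversion runs

    hl.init()
    # force_bgz=True lets Hail read block-compressed files named .vcf.gz
    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome,
                       force_bgz=True)
    variants = mt.rows()        # Hail Table with one row per variant
    df = variants.to_spark()    # convert to a Spark DataFrame
    out = parquet_output_path(vcf_path, output_prefix)
    df.write.mode("overwrite").parquet(out)
    return out
```

On a cluster with Hail installed, a call might look like vcf_to_parquet("s3://bucket/in/chr22.vcf.gz", "s3://bucket/out"), where the bucket names are placeholders. This sketch keeps only variant-level fields; per-sample genotypes would need mt.entries() or a similar step, depending on the tables you want in the data lake.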
For instructions on creating the Glue data catalog tables for 1000 Genomes on the Registry of Open Data, see the DataLakeAsCode repo at https://github.com/aws-samples/data-lake-as-code/tree/roda#readme. That repo also has CloudFormation templates for ClinVar and gnomAD.