Build a Genomics Data Lake on AWS

This repo contains the code referenced in the AWS blog post "Build a Genomics data lake on AWS".

ETL - contains the transformation scripts and the CloudFormation template that spins up the EMR cluster

EMRGenomics.py - Lambda function, triggered by the CloudFormation template, that creates an EMR cluster to process VCF files.

EventEMRGenomics.py - Event trigger Lambda function

emr_config.json - JSON file with EMR configuration for this example. This file can be edited to change EMR configuration parameters.

vcfToParquetTransform.py - PySpark script that performs the VCF-to-Parquet transformation using the Hail API. It can be customized to perform any specific transformation steps you require.

genomics_datalake_emr.template - CloudFormation template that you can deploy in your account to set up the solution.

1000Genomes.ipynb - Python notebook with sample queries

For instructions on how to create the Glue data catalog tables for 1000 Genomes on the Registry of Open Data, please check the DataLakeAsCode repo at https://github.com/aws-samples/data-lake-as-code/tree/roda#readme. The repo also has CloudFormation templates for ClinVar and gnomAD.
