Auxiliary Data Import

These instructions provide examples of importing certain auxiliary annotation data used by the GMS, such as reference genomes, feature sets, and transcript annotations.

These instructions assume that you have followed the installation instructions up to and including the prime-system.pl command. If you do not require the demonstration dataset, you can prime the system without downloading it by passing the --data=none option to prime-system.pl.

Importing a New Human Reference Genome

A new human reference genome can be imported by defining a new imported-reference-sequence model. Defining the model also starts a build of that model.

Below is an example of downloading and importing GRCh37-lite. The URI shell variable is used in this example for the sake of brevity. The processing-profile-id and species-name values refer to the processing profile and taxon that were imported into GMS during system priming.

$ URI='ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/'\
'Homo_sapiens/GRCh37/special_requests/GRCh37-lite.fa.gz'
$ wget $URI
$ gunzip GRCh37-lite.fa.gz
$ genome model define imported-reference-sequence             \
  --fasta-file=$PWD/GRCh37-lite.fa                            \
  --processing-profile-id=1990904                             \
  --species-name=human                                        \
  --version=37-lite-test                                      \
  --prefix=GRC                                                \
  --assembly-name=GRCh37-lite                                 \
  --build-name=GRCh37-lite-build37                            \
  --sequence-uri=$URI
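
Before defining the model, it can be worth confirming that the download decompressed cleanly. The sketch below is not part of GMS: count_fasta_records is a hypothetical helper that counts the header lines (those starting with '>') in a FASTA file, on the assumption that each sequence record has exactly one header line.

```shell
# Hypothetical helper (not a GMS command): count the sequence records in a FASTA file.
count_fasta_records() {
  # Each FASTA record begins with a header line that starts with '>'.
  grep -c '^>' "$1"
}

# Example usage, assuming GRCh37-lite.fa is in the current directory:
#   test -s GRCh37-lite.fa && count_fasta_records GRCh37-lite.fa
```

A missing or empty file, or a record count of zero, suggests the download or the gunzip step should be repeated before the import is started.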

Creating a Modified Reference from a Previously Imported Reference Genome

$ genome model define imported-reference-sequence             \
  --append-to=106942997                                       \
  --fasta-file=/ERCC/ERCC92.fa                                \
  --use-default-sequence-uri                                  \
  --species-name=human                                        \
  --version=37_ERCC

Importing a New Version of dbSNP

$ genome model imported-variation-list import-dbsnp-build                                           \
  --vcf-file-url ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b141_GRCh37p13/VCF/00-All.vcf.gz   \
  --version 141                                                                                     \
  --reference-sequence-build 106942997                                                              \
  --flat-file-pattern ds_flat_chX.flat.gz                                                           \
  --contig-names-translation-file /reference/scaffold_names                                         \
  --from-names-column 2                                                                             \
  --to-names-column 3
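
The two column options select which fields of the contig names translation file are paired together. The sketch below is not part of GMS: preview_name_translation is a hypothetical helper for checking the chosen columns by eye before running the import. It assumes the file is whitespace-delimited and that columns are numbered from 1; neither assumption is confirmed by these instructions.

```shell
# Hypothetical helper (not a GMS command): show which names would be paired.
# Assumptions: whitespace-delimited file, columns numbered from 1.
preview_name_translation() {
  # $1 = translation file, $2 = "from" column number, $3 = "to" column number
  awk -v from="$2" -v to="$3" '{ print $from " -> " $to }' "$1"
}

# Example usage with the values from the command above:
#   preview_name_translation /reference/scaffold_names 2 3 | head
```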

Importing a New Version of Ensembl
