Auxiliary Data Import

These instructions provide examples of importing auxiliary annotation data used by the GMS, such as reference genomes, feature sets, and transcript annotations.

These instructions assume that you have followed the installation instructions up to and including the prime-system.pl command. If you do not require the demonstration dataset, you can prime the system without downloading it by passing the --data=none option to prime-system.pl.

Importing a New Human Reference Genome

A new human reference genome can be imported by defining a new imported-reference-sequence model. Defining the model also starts a build of that model.

Below is an example of downloading and importing GRCh37-lite. The URI shell variable is used in this example for the sake of brevity. The processing-profile-id and species-name refer to the processing profile and taxon that were imported into GMS during system priming.

The GRCh37-lite reference is imported into GMS during prime-system.pl and assigned the build ID 106942997.

$ URI='ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/'\
'vertebrates_mammals/Homo_sapiens/GRCh37/special_requests/GRCh37-lite.fa.gz'
$ wget "$URI"
$ gunzip GRCh37-lite.fa.gz
$ genome model define imported-reference-sequence             \
  --fasta-file=$PWD/GRCh37-lite.fa                            \
  --processing-profile-id=1990904                             \
  --species-name=human                                        \
  --version=37-lite                                           \
  --prefix=GRC                                                \
  --assembly-name=GRCh37-lite                                 \
  --build-name=GRCh37-lite-build37                            \
  --sequence-uri=$URI
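
Before defining the model, it can be useful to confirm that the download and decompression produced a complete FASTA file. The following is a minimal sketch, not part of GMS: the function names count_fasta_records and list_fasta_names are hypothetical helpers that rely only on standard grep and awk.

```shell
# Hypothetical helpers (not GMS commands) for inspecting a FASTA file.

# Print the number of sequence records, i.e. lines starting with '>'.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Print the name of each sequence: the first whitespace-delimited
# word on each header line, with the leading '>' removed.
list_fasta_names() {
  awk '/^>/ { sub(/^>/, ""); print $1 }' "$1"
}
```

For example, `count_fasta_records GRCh37-lite.fa` prints the number of sequences in the decompressed file, which can be compared against the number of sequences you expect the assembly to contain.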

Creating a Modified Reference Based on an Existing Reference Genome

An existing reference genome may be used as the basis for a new reference genome that combines the existing reference with a new FASTA file. To create the modified reference, you need the build ID of the existing reference and the path to the new FASTA file.

The existing reference genomes can be listed along with their build IDs:

$ genome model build list --filter="model.type_name='imported reference sequence'"

Once the build ID of the existing reference and the path to the new FASTA file are both known, a new model can be defined. Defining the model automatically starts a build, so there is no need to start one separately.

The following command is an example of appending the sequences given in a file named ERCC92.fa to the GRCh37-lite reference.

$ genome model define imported-reference-sequence  \
  --append-to=106942997                            \
  --fasta-file=/home/ubuntu/ERCC92.fa              \
  --use-default-sequence-uri                       \
  --species-name=human                             \
  --version=37_ERCC
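
When appending sequences, a name that appears in both the existing reference and the new FASTA file would be ambiguous. Whether GMS checks for this itself is not stated here, so the following is only a sketch of a pre-flight check under that assumption. The function name shared_fasta_names is hypothetical, and the check needs a local copy of the existing reference FASTA.

```shell
# Hypothetical helper (not a GMS command): print the sequence names
# that occur in both FASTA files, one per line. A name is the first
# whitespace-delimited word after '>' on a header line.
shared_fasta_names() {
  names_a=$(mktemp) && names_b=$(mktemp) || return 1
  awk '/^>/ { sub(/^>/, ""); print $1 }' "$1" | sort -u > "$names_a"
  awk '/^>/ { sub(/^>/, ""); print $1 }' "$2" | sort -u > "$names_b"
  # comm -12 keeps only the lines common to both sorted inputs.
  comm -12 "$names_a" "$names_b"
  rm -f "$names_a" "$names_b"
}
```

For example, `shared_fasta_names GRCh37-lite.fa /home/ubuntu/ERCC92.fa` prints nothing when the two files share no sequence names.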

Importing a Variation List

import-dbsnp-build

A list of variants may be imported into the GMS from outside sources using genome model imported-variation-list. You may import variants directly from dbSNP with the import-dbsnp-build sub-command. To import a variation list, you must supply the build ID of a reference sequence that uses the same coordinates as the variation list.

Notice the use of http:// at the beginning of the VCF file URL. The ftp:// protocol is not supported.

$ genome model imported-variation-list import-dbsnp-build \
  --version 141 --reference-sequence-build 106942997      \
  --vcf-file-url 'http://ftp.ncbi.nih.gov/snp/organisms/human_9606_b141_GRCh37p13/VCF/00-All.vcf.gz'
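
If you only have an ftp:// link to a VCF file, the scheme can be rewritten before passing the URL to --vcf-file-url. The sketch below assumes the server offers the same path over HTTP, as the NCBI host in the example above does; that will not hold for every server. The function name ftp_to_http is a hypothetical helper, not part of GMS.

```shell
# Hypothetical helper (not a GMS command): rewrite a leading ftp://
# scheme to http://. URLs with any other scheme pass through unchanged.
ftp_to_http() {
  case "$1" in
    ftp://*) printf 'http://%s\n' "${1#ftp://}" ;;
    *)       printf '%s\n' "$1" ;;
  esac
}
```

The result can be captured in a variable, for example `VCF_URL=$(ftp_to_http "$FTP_URL")`, where FTP_URL is a variable you have set yourself, and then passed as --vcf-file-url "$VCF_URL".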

A copy of dbSNP version 141 for the GRCh37-lite reference is imported into GMS during prime-system.pl and assigned the build ID 127786607.

import-variants

Other variants in VCF or BED format may be imported with the import-variants sub-command.

Importing a New Version of Ensembl
