For most users, we recommend the Bystro web app at https://bystro.io.
The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.
To install Bystro locally, follow the instructions in INSTALL.md.
Bystro relies on pre-processors, pluggable via Bystro's YAML config, to normalize variant inputs (handling VCF issues such as padding) and to compute per-site values. These include whether a site is a transition or transversion, the sample MAF (minor allele frequency), which samples are heterozygous, homozygous, or missing, and the heterozygosity, homozygosity, and missingness of the site, among others.
- VCF format: Bystro-Vcf
- SNP format: Bystro-SNP
- Create your own to support other formats!
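
As a rough illustration of the per-site values a pre-processor computes, the sketch below classifies a single-base substitution and summarizes genotype calls. It is not Bystro's implementation: the function names, the `ref`/`het`/`hom`/`missing` call labels, the diploid assumption, and the choice of denominators (all samples for the three rates, called alleles for the sample MAF) are assumptions made for the example.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
    both_purines = ref in PURINES and alt in PURINES
    both_pyrimidines = ref in PYRIMIDINES and alt in PYRIMIDINES
    return "transition" if both_purines or both_pyrimidines else "transversion"


def sample_summary(calls):
    """Summarize hypothetical per-sample calls: 'ref', 'het', 'hom', 'missing'."""
    n = len(calls)
    if n == 0:
        raise ValueError("no samples")
    het = calls.count("het")
    hom = calls.count("hom")
    missing = calls.count("missing")
    called = n - missing
    alt_alleles = het + 2 * hom  # assumes diploid samples
    return {
        "heterozygosity": het / n,
        "homozygosity": hom / n,
        "missingness": missing / n,
        # Assumed definition: alt alleles over all called alleles.
        "sample_maf": alt_alleles / (2 * called) if called else None,
    }
```

For example, `classify_snp("A", "G")` is a transition and `classify_snp("A", "C")` is a transversion.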
For descriptions of the annotation fields, please read FIELDS.md.
- The config file describes the state of both the database and the annotation. It is required for both annotating and building.
- It has several keys:
  - `tracks`: The highest level of organization for database values. Tracks have a `name` property, which must be unique, and a `type`, which must be one of:
    - sparse: Any bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns.
      - This is used for dbSNP and ClinVar records, but many files can be fit to this format.
      - Mapping fields can be managed by the `fieldMap` key.
    - score: Accepts any wigFix file.
      - Used for phastCons and phyloP.
    - cadd:
      - Accepts any CADD file, or Bystro's custom "bed-like" CADD file, which has 2 header lines and chrom, chromStart, and chromEnd columns, followed by the standard CADD fields.
      - CADD format: http://cadd.gs.washington.edu
    - gene: A UCSC gene track (ex: knownGene, refGene, sgdGene).
      - The `local_files` for this are created using an `sql_statement`.
      - Ex: `SELECT * FROM hg38.refGene LEFT JOIN hg38.kgXref ON hg38.kgXref.refseq = hg38.refGene.name`
  - `chromosomes`: The allowable chromosomes.
    - Each row of every track must be identified by one of these chromosomes (during building).
    - Each row of any input file submitted for annotation must also be identified by one of these chromosomes (during annotation).
    - However, Bystro is flexible about the chr prefix.

    Ex: For the following config

        chromosomes:
          - chr1
          - chr2
          - chr3

    Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy:
    - We currently follow UCSC conventions for `chromosomes`, meaning they should be prepended by chr.
    - Bystro will automatically prepend chr to chromosomes read from an input file during annotation.
    - Bystro allows the transformation of any field during building, configurable in the YAML config file for that assembly, making it easy to prepend chr to the source file's chromosome field.

    Ex: ClinVar doesn't have a chr prefix, so during building we specify:

        tracks:
          - name: clinvar
            build_field_transformations:
              chrom: chr .
            fieldMap:
              Chromosome: chrom

    Here `fieldMap` allows us to rename header fields, and `build_field_transformations` allows us to define a prepend operation (`chr .` can be interpreted as the Perl command `"chr" . $chrom`).

    So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.

    In this example, chromosomes `1` and `chr1` will be built/annotated, but `1_rand` will not.
  - The following keys describe where the Bystro database and any source files are located.
    - `files_dir`: The parent folder within which each track's `local_files` are located.
      - Bystro automatically checks for `local_files` at `parent/trackName/file`.

      Ex: For the config file containing

          files_dir: /path/to/files/
          tracks:
            - name: refSeq
              local_files:
                - hg19.refGene.chr1.gz
                # and more files

      Bystro will expect files in `/path/to/files/refSeq/hg19.refGene.chr1.gz`.
    - `database_dir`: Each database is held within `database_dir`, in a folder named after the `assembly`.

      Ex: For the config file containing

          assembly: hg19
          database_dir: /path/to/databases/

      Bystro will look for the database in `/path/to/databases/hg19`.
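
The rules described for these config keys can be sketched in a few lines of Python. This is a minimal illustration, not Bystro's own code (Bystro is configured in YAML and its transformations are described in Perl terms). The function names are hypothetical, the config is written as a plain dict rather than loaded from YAML, the set of track types is limited to the four named in this section, and giving the `refSeq` track the type `gene` is an assumption.

```python
from pathlib import PurePosixPath

# Track types named in this section; treat this set as illustrative.
TRACK_TYPES = {"sparse", "score", "cadd", "gene"}


def validate_tracks(tracks):
    """Check that track names are unique and each type is a listed one."""
    seen = set()
    for track in tracks:
        if track["name"] in seen:
            raise ValueError(f"duplicate track name: {track['name']}")
        if track["type"] not in TRACK_TYPES:
            raise ValueError(f"unknown track type: {track['type']}")
        seen.add(track["name"])


def normalize_chromosome(name):
    """Prepend 'chr' when the input omits it (e.g. '1' becomes 'chr1')."""
    name = name.strip()
    return name if name.startswith("chr") else "chr" + name


def is_accepted(name, config):
    """True if the normalized name is listed under `chromosomes`."""
    return normalize_chromosome(name) in config["chromosomes"]


def expected_local_files(config):
    """Map each track name to its files_dir/trackName/file paths."""
    files_dir = PurePosixPath(config["files_dir"])
    return {
        t["name"]: [str(files_dir / t["name"] / f) for f in t.get("local_files", [])]
        for t in config["tracks"]
    }


def database_path(config):
    """The database lives at database_dir/assembly."""
    return str(PurePosixPath(config["database_dir"]) / config["assembly"])


config = {
    "assembly": "hg19",
    "chromosomes": ["chr1", "chr2", "chr3"],
    "database_dir": "/path/to/databases/",
    "files_dir": "/path/to/files/",
    "tracks": [
        # The track type here is assumed for the example.
        {"name": "refSeq", "type": "gene", "local_files": ["hg19.refGene.chr1.gz"]},
    ],
}

validate_tracks(config["tracks"])
print(is_accepted("1", config), is_accepted("1_rand", config))  # True False
print(expected_local_files(config)["refSeq"][0])
print(database_path(config))
```

With this config, the last two lines print `/path/to/files/refSeq/hg19.refGene.chr1.gz` and `/path/to/databases/hg19`, matching the locations given in the examples above.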