Jasmine User Manual
Jasmine is a program for merging structural variant calls across samples.
Each sample should have its variant calls defined in a VCF file (format 4.2 or later), and each variant should have the following information:
- A unique variant ID
- The chromosome and position where the variant begins
- The length of the variant (either by using the correct REF/ALT format, or with the SVLEN INFO field)
- The sequence being inserted for insertion calls (optional, but enables sequence identity to be taken into account)
- The type of the variant (as an INFO field SVTYPE, or surrounded by <> in the ALT field)
- The strand of the variant, if taking strand into account when merging, denoted by the STRANDS INFO field
Once each sample has a VCF file in this format, the paths to all of these files should be listed on separate lines of a plain-text file, and this file should be passed to Jasmine with the file_list command line argument. The other required command line argument, out_file, indicates the path of the merged VCF that will be produced.
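As a minimal sketch of the setup described above, the snippet below builds the contents of a file list (one VCF path per line) and assembles a command line. The sample file names are made up, and the executable name "jasmine" and the key=value argument syntax are assumptions; check them against your installation.

```python
def file_list_text(vcf_paths):
    """Return the file list contents: one VCF path per line."""
    return "".join(f"{path}\n" for path in vcf_paths)


def build_command(list_path, out_path):
    # Assumption: the program is installed as "jasmine" and takes its
    # arguments as key=value pairs. Adjust to match your installation.
    return ["jasmine", f"file_list={list_path}", f"out_file={out_path}"]


if __name__ == "__main__":
    # Hypothetical sample file names, for illustration only.
    vcfs = ["sample1.vcf", "sample2.vcf", "sample3.vcf"]
    print(file_list_text(vcfs), end="")
    print(" ".join(build_command("filelist.txt", "merged.vcf")))
```

The returned list can be passed to subprocess.run once the file list has been written to disk.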
There are a few optional INFO fields which can be included in the input VCF files, if the user wants to modify parameters to the program on a per-variant basis:
- JASMINE_DIST is the maximum distance (defined below) other variants can be from this one in order to be considered merging candidates. A pair of variants can be merged if their distance falls within the distance threshold for either of the two variants (whichever threshold is higher).
- JASMINE_ID is the sequence identity required for the variant to merge with another variant in the case that both are insertions and have sequences provided. Similar to the distance threshold, only the less strict identity threshold of the two variants in a pair must be satisfied for them to be allowed to merge.
Jasmine has a number of optional parameters to customize the merging for different use cases. A list of them is below, and a summary of them can be viewed by running the program without any arguments.
Constraints can be imposed on which pairs of variants are allowed to be merged.
-
Distance: Each variant is represented in 2-D space by its breakpoints (start, length). The distance between two variants is equal to the Euclidean distance between their breakpoint representations. If the JASMINE_DIST INFO field is present, then that value is used for each individual variant. Otherwise, if max_dist_linear is specified, that value is used. If not, the max_dist value is used, or its default value of 1000 if it is left unspecified. If both max_dist and max_dist_linear are specified, then the linearly-scaled threshold will be used, unless for a particular variant this value is higher than max_dist, in which case it is set to max_dist.
- max_dist_linear - a floating point number such that the distance threshold for each variant will be this proportion of its length
- max_dist - a constant integer value such that the distance threshold for every variant will be equal to this value
- kd_tree_norm - the type of norm p to use in the variant distance metric. The distance between two variants is defined as (|x distance|^p + |y distance|^p)^(1/p). The default value is 2, which is equivalent to the Euclidean distance. A value of 1 uses Manhattan distance instead.
- --use_end - use the end breakpoint of each variant in place of its length
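The distance rules above can be sketched in a few lines. This is an illustration of the description in this manual, not Jasmine's actual implementation: the function names are hypothetical, the absolute value of the length is taken on the assumption that some callers report negative lengths for deletions, and "within the threshold" is assumed to mean less than or equal.

```python
def distance(a, b, p=2.0):
    """p-norm distance between two variants given as (start, length) points."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return (dx ** p + dy ** p) ** (1.0 / p)


def threshold(length, jasmine_dist=None, max_dist=None, max_dist_linear=None):
    """Resolve the distance threshold for one variant."""
    if jasmine_dist is not None:
        # A per-variant JASMINE_DIST value takes precedence.
        return jasmine_dist
    if max_dist_linear is not None:
        scaled = max_dist_linear * abs(length)
        # If max_dist is also given, it caps the linearly scaled threshold.
        return min(scaled, max_dist) if max_dist is not None else scaled
    # Fall back to max_dist, or to its default of 1000.
    return max_dist if max_dist is not None else 1000


def within_distance(a, b, thr_a, thr_b, p=2.0):
    """A pair qualifies if it satisfies the higher of the two thresholds."""
    return distance(a, b, p) <= max(thr_a, thr_b)


if __name__ == "__main__":
    v1, v2 = (100, 50), (103, 54)
    print(distance(v1, v2))        # Euclidean (p = 2)
    print(distance(v1, v2, p=1))   # Manhattan (p = 1)
    print(within_distance(v1, v2, threshold(50), threshold(54)))
```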
-
Sequence identity: For a pair of insertions, in addition to checking distance, the actual content of the sequences can be taken into account by requiring their sequence identity to be above a certain threshold. If the JASMINE_ID INFO field is present, then this field's value will be used for each individual variant. Otherwise, the min_seq_id value (default of 0, meaning no sequence identity filtering) will be used as the threshold for all variants. By default, the Jaccard similarity of the two sequences based on 9-mer content is used, but some of the parameters below allow that to be changed.
- min_seq_id - the sequence identity threshold required for two insertions to be merged
- k_jaccard - the value of k used for representing sequences as kmer sets for Jaccard similarity calculation
- --use_edit_dist - use edit distance as the sequence identity measure instead of Jaccard similarity
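To make the default identity measure concrete, here is a minimal sketch of Jaccard similarity over k-mer sets, together with the "less strict threshold" rule for a pair. The function names are hypothetical, and the handling of sequences shorter than k (returning 0) is an assumption rather than documented behavior.

```python
def kmer_set(seq, k=9):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a, b, k=9):
    """Shared k-mers divided by total distinct k-mers across both sequences."""
    set_a, set_b = kmer_set(a.upper(), k), kmer_set(b.upper(), k)
    union = set_a | set_b
    if not union:
        # Assumption: both sequences are shorter than k, so there is
        # nothing to compare and the similarity is treated as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


def passes_identity(similarity, id_a, id_b):
    """Only the less strict of the two variants' thresholds must be met."""
    return similarity >= min(id_a, id_b)


if __name__ == "__main__":
    sim = jaccard_similarity("ACGTA", "ACGTT", k=3)
    print(sim, passes_identity(sim, 0.4, 0.8))
```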
-
Type and strand: By default, only variants of the same type and strand can be merged, but these requirements can be lifted if the user wants to be able to merge variants with different values for these fields.
- --ignore_strand - don't require variants to have the same strand
- --ignore_type - don't require variants to have the same type
-
Component requirements: When considering whether or not to merge a pair of variants, the distance requirement by default applies only to that edge. However, in some cases with many samples, it's possible for a chain of variants to all be joined together through edges between consecutive variants, resulting in a component in which some pairs of variants are far apart. While this is not necessarily problematic, there are options to make merging requirements more strict.
- --centroid_merging - requires that the centroid of the points in every component is within the distance threshold for each variant
- --clique_merging - requires that all of the variants within each component are pairwise mergeable
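The difference between the two stricter modes can be illustrated with a small chain of variants. The sketch below assumes Euclidean distance on (start, length) points and uses hypothetical function names; it shows the checks as described in this manual, not Jasmine's internal algorithm.

```python
import math
from itertools import combinations


def clique_ok(points, thresholds):
    """Every pair in the component must be mergeable on its own."""
    return all(
        math.dist(points[i], points[j]) <= max(thresholds[i], thresholds[j])
        for i, j in combinations(range(len(points)), 2)
    )


def centroid_ok(points, thresholds):
    """The component's centroid must lie within each variant's threshold."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return all(
        math.dist(p, (cx, cy)) <= t for p, t in zip(points, thresholds)
    )


if __name__ == "__main__":
    # A chain: consecutive variants are 80 apart, the two ends are 160 apart.
    chain = [(0, 100), (80, 100), (160, 100)]
    limits = [100, 100, 100]
    print(clique_ok(chain, limits))    # the end-to-end pair exceeds 100
    print(centroid_ok(chain, limits))  # each point is within 100 of (80, 100)
```

In this example the chain fails the clique requirement but passes the centroid requirement, which shows that clique merging is the stricter of the two for this configuration.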
In addition to performing merging, there are a number of other functions that Jasmine can perform prior to the merging, all of which are optional.
-
Treating duplications as insertions: With some read technologies, due to their high error rate, there are cases where duplications are called as insertions because the inserted sequence wasn't resolved precisely enough to align to the preceding reference sequence. This can be mitigated by taking all duplications and reframing them as insertion calls during merging, allowing them to merge with insertions, and then converting any merged variants representing duplications back to their original calls afterwards.
- --dup_to_ins - perform this preprocessing step
- max_dup_length - leave any duplications above this length alone
-
Marking specific calls: A common use case of structural variant merging is to determine which variants are unique to each particular sample. When doing so, it is often desirable to filter out variants which have evidence of presence in other samples, even if their calls in other samples do not fit the same read support or length requirements. One way of doing this is to obtain for each sample two separate callsets - a large, sensitive callset, as well as a smaller, specific callset. Then, to get a callset unique to a given sample with high confidence, the merged results can be used by looking for variants which are supported by that sample's specific callset and none of the other samples' sensitive callsets. Jasmine allows the user to mark the specific calls in each sample's full (sensitive) callset based on length and read support requirements to avoid the need to produce two separate callsets where one is a subset of the other.
- --mark_specific - perform this preprocessing step
- spec_reads - the minimum number of reads required to support a variant for it to be considered a specific call
- spec_len - the minimum length of a variant for it to be considered a specific call
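The marking rule and its use for finding sample-unique variants can be sketched as follows. The function names are hypothetical, the threshold values in the example are arbitrary rather than Jasmine's defaults, and the support vector is assumed to be a string of 0s and 1s with one position per sample.

```python
def is_specific(read_support, length, spec_reads, spec_len):
    """A call is specific if it meets both the read and length minimums."""
    # abs() is an assumption, in case deletions carry negative lengths.
    return read_support >= spec_reads and abs(length) >= spec_len


def unique_to_sample(supp_vec, sample_index, specific_in_sample):
    """True if only this sample supports the variant and its call is specific."""
    only_this_sample = (
        supp_vec.count("1") == 1 and supp_vec[sample_index] == "1"
    )
    return only_this_sample and specific_in_sample


if __name__ == "__main__":
    # Arbitrary illustrative thresholds: 10 reads, length 30.
    specific = is_specific(read_support=12, length=-250, spec_reads=10, spec_len=30)
    print(unique_to_sample("010", 1, specific))
```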
-
Running Iris: In addition, Jasmine allows the user to run Iris, a program developed for more precisely resolving the sequences of insertion calls. It is a separate repository which is used as an optional Jasmine submodule, and usage information for it can be found on its GitHub page.
- --run_iris - perform this preprocessing step
- iris_args - any optional command-line arguments that should be used for running Iris, separated by commas
-
--preprocess_only - run only the preprocessing and not the SV merging itself, which is particularly useful for generating finalized input files to be used in comparisons with other tools.
-
--normalize_type - convert the type of each variant to INS, DEL, DUP, INV, or TRA prior to processing to account for unusual type labels output by some variant callers
-
--normalize_chrs - normalize the chromosome names to account for different datasets (particularly in human) which label the chromosome names with and without the "chr" prefix.
- chr_norm_file - by default, this converts the chromosome names according to the NCBI standard (1, 2, 3, ..., X, Y, MT), but the chr_norm_file parameter can be used to pass a file where each line consists of a chromosome as well as a space-separated list of all aliases for that chromosome name which are present in the dataset.
- --output_genotypes - output the genotypes in the final VCF file based on the genotypes in the input VCFs
-
genome_file - the reference genome used when making the variant calls, required for preprocessing if running Iris or duplication-to-insertion conversion
-
bam_list - the list of BAM-format alignment files, in the same order as the corresponding VCF files in file_list, required for preprocessing if running Iris
-
out_dir - the directory where intermediate files go, particularly the outputs of any preprocessing steps
-
min_support - the minimum number of variants which must be merged into an output variant for it to be output in the merged VCF file
-
--keep_var_ids - make the ID of each output variant equal to that of the first variant merged into it rather than following the default behavior of changing the IDs to ensure uniqueness
- threads - the number of threads to use for merging (any preprocessing steps' thread number should be specified separately as indicated above)
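For the chr_norm_file parameter listed above, the following sketch shows one way such a file could be read into an alias lookup table. It assumes the first token on each line is the normalized chromosome name and the remaining tokens are its aliases; the function names and example lines are hypothetical.

```python
def parse_chr_norm(lines):
    """Map every alias (and the name itself) to the normalized name."""
    alias_to_name = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # Assumption: the first token is the name to normalize to.
        name, aliases = tokens[0], tokens[1:]
        alias_to_name[name] = name
        for alias in aliases:
            alias_to_name[alias] = name
    return alias_to_name


def normalize(chrom, alias_to_name):
    """Return the normalized name, leaving unknown names unchanged."""
    return alias_to_name.get(chrom, chrom)


if __name__ == "__main__":
    table = parse_chr_norm(["1 chr1 Chr1", "MT chrM M"])
    print(normalize("chr1", table), normalize("chrM", table))
```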
Each variant call in the merged VCF file that is produced has a number of INFO fields which get added in and indicate information about the variants which were merged to produce the call:
- SUPP is the number of variants which were merged into this call
- SUPP_VEC is a binary vector with a length equal to the total number of samples, with a 1 in each position corresponding to a sample which contains this variant
- SVMETHOD indicates the method used to merge the variants, and is equal to "JASMINE" for all output variants to distinguish them from calls produced by other merging software.
- IDLIST is the list of variant IDs which were merged to produce this call. They are comma-separated, and appear in order of sample (i.e., variants which occur in samples listed earlier in the input file list are listed first).
- END is the average end breakpoint of all variants which were merged.
- STARTVARIANCE is the variance of the start breakpoint of all variants which were merged.
- ENDVARIANCE is the variance of the end breakpoint of all variants which were merged.
- SVLEN is the average length of all variants which were merged.
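As a small illustration of how these fields might be consumed downstream, the sketch below parses an INFO string and recovers the supporting samples from SUPP_VEC. The INFO string and variant IDs are made up for the example, and the helper names are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def supporting_samples(fields):
    """Zero-based indices of the samples marked with a 1 in SUPP_VEC."""
    return [i for i, bit in enumerate(fields["SUPP_VEC"]) if bit == "1"]


if __name__ == "__main__":
    # Made-up INFO value for illustration only.
    info = "SUPP=2;SUPP_VEC=101;SVMETHOD=JASMINE;IDLIST=0_12,2_7;END=1500;SVLEN=500"
    fields = parse_info(info)
    print(supporting_samples(fields))
    print(fields["IDLIST"].split(","))
```

The indices follow the order of the VCF files in the input file list, so they can be mapped back to sample names by reading that file.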