This repository contains the visualization specification for the DECIDER visualization showcased in "Deciphering Cancer Genomes with GenomeSpy: A Grammar-Based Visualization Toolkit" by Lavikka et al. (2024). It also contains a small example data set that demonstrates the visualization.
An interactive version with the full data set is available for exploration at https://csbi.ltdk.helsinki.fi/p/genomespy-paper-2024/. All visualization and data processing occur in the web browser, which receives the data as static files.
This README provides an overview of the visualization specification and of how the data are processed and visualized in GenomeSpy. A detailed description of the visualization grammar and data processing can be found in the GenomeSpy documentation.
Alternatively, you can access, explore, and modify the visualization locally:
- Clone this repository or download it as a zip archive.
- Start a local web server in the repository directory. For instance, using Python 3:
  python3 -m http.server
- Open the visualization in a web browser by navigating to http://localhost:8000.
The index.html file loads the GenomeSpy App JavaScript bundle and initializes
the visualization with the specification in the spec.json file. The
specification has been split into multiple files to make it easier to
understand, adapt, and reuse. These files are imported into the main
specification. The track-based layout, familiar from genome browsers, is
implemented using vertical concatenation, one of the view composition methods
provided by GenomeSpy.
In addition to several annotation tracks, spec.json displays the ovarian cancer
samples as sub-tracks that can be manipulated interactively through filtering,
grouping, etc. The samples are displayed using the Sample View, which applies
the same specification to all samples.
The list of samples and their metadata are stored in decider-data/samples.tsv.
Although the GenomeSpy App provides default scales and color schemes for
metadata attributes, some of them have been customized for better visual
presentation. For instance, the timepoints have been given (subjectively) more
meaningful colors:
"sampleTime": {
  "type": "ordinal",
  "scale": {
    "domain": ["primary", "interval", "relapse"],
    "range": ["#facf5a", "#f95959", "#455d7a"]
  }
},
The segmented copy-number data are stored in decider-data/segments.tsv, which
contains one row
per segment per sample. The visualization specification is in the
cnv-segments.json
file, which provides two layers: logR
(Log2 copy number ratio) and LOH (loss of heterozygosity). In addition, both
layers are summarized into a G score and LOH score, respectively.
Both layers use the same data but with partially different encodings. However, the sample identifier and the segments' start and end coordinates are common:
"encoding": {
  "sample": {
    // Using the "sample" field as the sample identifier
    "field": "sample"
  },
  "x": {
    // Using both "chr" and "startpos" fields for the genomic coordinates.
    // GenomeSpy automatically linearizes the coordinates to a single axis.
    "chrom": "chr",
    "pos": "startpos",
    "type": "locus",
    ...
  },
  "x2": {
    "chrom": "chr",
    "pos": "endpos"
  }
},
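To illustrate what linearizing the coordinates means, here is a minimal Python sketch. It is not GenomeSpy's implementation: the chromosome sizes are hypothetical toy values, the helper names `build_offsets` and `linearize` are made up for this example, and the position convention (0-based vs. 1-based) is ignored.

```python
from itertools import accumulate

# Hypothetical toy chromosome sizes; a real genome assembly defines its own.
CHROM_SIZES = {"chr1": 1000, "chr2": 800, "chr3": 600}


def build_offsets(chrom_sizes):
    """Map each chromosome to the total length of the chromosomes before it."""
    starts = [0, *accumulate(chrom_sizes.values())][:-1]
    return dict(zip(chrom_sizes, starts))


def linearize(chrom, pos, offsets):
    """Convert a (chromosome, position) pair into one linear coordinate."""
    return offsets[chrom] + pos


OFFSETS = build_offsets(CHROM_SIZES)

# A segment described with the field names used in the encoding above.
segment = {"chr": "chr2", "startpos": 100, "endpos": 250}
x = linearize(segment["chr"], segment["startpos"], OFFSETS)
x2 = linearize(segment["chr"], segment["endpos"], OFFSETS)
print(x, x2)  # 1100 1250
```

Because every chromosome gets a fixed offset, segments from different chromosomes can share a single continuous x axis.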
LogR is visualized as a heatmap with a diverging color scale where gray
represents zero, which corresponds to the average ploidy of the sample.
Deletions and amplifications are shown as blue and red, respectively. The values
are read from the purifiedLogR field, which represents the logR value after
removing the normal contamination. Thus, it allows for easy comparison of
samples, while still capturing all subclonal copy-number variation.
"encoding": {
  "color": {
    "type": "quantitative",
    "field": "purifiedLogR",
    "scale": {
      "domain": [-2.5, 0, 2.5],
      "range": ["#0050f8", "#f6f6f6", "#ff3000"],
      "clamp": true
    }
  },
  ...
},
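The effect of the three-stop domain and the `clamp` setting can be sketched in Python as follows. The domain and colors are copied from the snippet above. The piecewise-linear interpolation in RGB space is an assumption made for illustration, and GenomeSpy's actual color interpolation may differ. The function names are hypothetical.

```python
# Color stops copied from the specification above.
DOMAIN = [-2.5, 0.0, 2.5]
RANGE = ["#0050f8", "#f6f6f6", "#ff3000"]


def hex_to_rgb(color):
    """Parse '#rrggbb' into a tuple of three integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Format three channel values as '#rrggbb'."""
    return "#" + "".join(f"{round(channel):02x}" for channel in rgb)


def diverging_color(value, domain=DOMAIN, colors=RANGE):
    """Map a logR value to a color, clamping it to the domain first."""
    value = max(domain[0], min(domain[-1], value))
    # Pick the first domain interval whose upper bound covers the value.
    for i in range(len(domain) - 1):
        lo, hi = domain[i], domain[i + 1]
        if value <= hi:
            break
    t = (value - lo) / (hi - lo)
    start, end = hex_to_rgb(colors[i]), hex_to_rgb(colors[i + 1])
    # Assumption: plain linear interpolation of each RGB channel.
    return rgb_to_hex(a + (b - a) * t for a, b in zip(start, end))


print(diverging_color(0.0))   # #f6f6f6
print(diverging_color(-4.0))  # #0050f8 (clamped to -2.5)
print(diverging_color(4.0))   # #ff3000 (clamped to 2.5)
```

Clamping keeps extreme logR values at the end colors instead of extrapolating beyond them.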
The G score summarizes the logR values separately for deletions and amplifications, as described in Beroukhim et al. (2007). It is the first step in the GISTIC method, which Beroukhim et al. developed to find recurrent copy-number alterations and driver regions in cancer genomes. However, it is also useful on its own for summarizing copy-number data for visualization purposes, as it facilitates the comparison of different groups of samples.
The G score track in the visualization is implemented using GenomeSpy's data transformation pipeline. In brief, it comprises the following steps:
- Merge the segments from each sample group into a sorted (by chromosome and start position) stream of segments.
- Using a filter transform, retain only the segments having the absolute value of the logR above a certain threshold (e.g., 0.1).
- Using a formula transform, clamp the logR values to a certain range to avoid extreme values.
- Split the data stream into two new streams that are filtered separately for deletions and amplifications.
- Use the coverage transform to calculate a weighted coverage for the segments, using the logR value as the weight. This step produces a score that is equivalent to the G score.
- Finally, divide the coverage values by the number of samples in the group to get a normalized value that allows for comparison between groups with different numbers of samples.
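The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the GenomeSpy pipeline itself: it handles a single chromosome, the `limit` value of 1.5 is hypothetical, deletions are reported as positive scores, and the function names are made up. The field names and the default threshold of 0.5 follow the snippets in this README.

```python
from collections import defaultdict


def weighted_coverage(intervals):
    """Sweep-line coverage over (start, end, weight) intervals.

    Returns (start, end, summed_weight) pieces wherever the sum is non-zero.
    """
    events = defaultdict(float)
    for start, end, weight in intervals:
        events[start] += weight
        events[end] -= weight
    pieces, total, prev = [], 0.0, None
    for pos in sorted(events):
        if prev is not None and abs(total) > 1e-9:
            pieces.append((prev, pos, total))
        total += events[pos]
        prev = pos
    return pieces


def g_scores(segments, n_samples, threshold=0.5, limit=1.5):
    """Return (amplification, deletion) score pieces for one sample group."""
    amp, dele = [], []
    for seg in segments:
        logr = seg["purifiedLogR"]
        if abs(logr) <= threshold:      # filter: drop near-neutral segments
            continue
        logr = max(-limit, min(limit, logr))  # clamp extreme values
        target = amp if logr > 0 else dele    # split by direction
        target.append((seg["startpos"], seg["endpos"], abs(logr)))

    def normalize(pieces):
        # Divide by the group size so that groups are comparable.
        return [(start, end, w / n_samples) for start, end, w in pieces]

    return normalize(weighted_coverage(amp)), normalize(weighted_coverage(dele))


# Hypothetical segments from two samples on one chromosome.
SEGMENTS = [
    {"sample": "s1", "startpos": 0, "endpos": 100, "purifiedLogR": 1.0},
    {"sample": "s2", "startpos": 50, "endpos": 150, "purifiedLogR": 2.0},
    {"sample": "s1", "startpos": 100, "endpos": 200, "purifiedLogR": -1.0},
    {"sample": "s2", "startpos": 150, "endpos": 200, "purifiedLogR": 0.2},
]
amplifications, deletions = g_scores(SEGMENTS, n_samples=2)
print(amplifications)  # [(0, 50, 0.5), (50, 100, 1.25), (100, 150, 0.75)]
print(deletions)       # [(100, 200, 0.5)]
```

In the example, the logR of 2.0 is clamped to 1.5 and the logR of 0.2 is dropped by the threshold filter before the coverage is computed.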
The logR threshold and logR limits have been implemented as parameters in the specification to make it easier to adjust them interactively. Moreover, the specification uses templates to avoid code duplication – deletions and amplifications are handled separately and layered into the same view.
For instance, the following code snippet shows how the threshold parameter is
defined and used in the G score calculation:
"params": [
  {
    "name": "threshold",
    "value": 0.5,
    "bind": {
      // Provides the user with a slider to adjust the threshold interactively
      "input": "range",
      "name": "LogR threshold",
      ...
    }
  },
  ...
],
...
"transform": [
  {
    "type": "filter",
    // Only segments whose absolute logR exceeds the threshold are retained
    "expr": "abs(datum.purifiedLogR) > threshold"
  },
  ...
]
Typically, LOH is visualized as B-allele frequency (BAF) or allelic fraction, where 0.5 represents the heterozygous state. However, to emphasize the abnormal state (the loss) and allow for easier comparison of samples, we define LOH as follows: abs(BAF - 0.5) × 2. Thus, LOH gets values between zero (fully heterozygous) and one (fully homozygous, i.e., all heterozygosity lost).
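As a worked example of the definition above, the following Python sketch converts BAF values to LOH. The function name `loh_from_baf` is hypothetical and is not part of the specification.

```python
def loh_from_baf(baf):
    """Compute LOH = abs(BAF - 0.5) * 2 for a B-allele frequency in [0, 1]."""
    if not 0.0 <= baf <= 1.0:
        raise ValueError("BAF must be between 0 and 1")
    return abs(baf - 0.5) * 2


for baf in (0.5, 0.25, 0.0, 1.0):
    print(baf, loh_from_baf(baf))
# 0.5 -> 0.0 (fully heterozygous)
# 0.25 -> 0.5
# 0.0 and 1.0 -> 1.0 (fully homozygous)
```

The transformation is symmetric around 0.5, so it does not matter which allele is lost.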
We encode LOH as the height of translucent gray bars ("rect" marks) overlaid on
the logR heatmap:
"encoding": {
  "y": {
    "field": "purifiedLoh",
    "type": "quantitative",
    "axis": null,
    "scale": {
      // 0 = fully heterozygous, 1 = fully homozygous
      "domain": [0, 1],
      // Clamp to ensure that the bars do not extend beyond the track height
      "clamp": true
    }
  },
  ...
}
LOH is summarized similarly to the G score: the LOH score is calculated as a
weighted coverage of the purifiedLoh values.
The short-variants.json file specifies a visualization for single- and
multi-nucleotide variants (SNVs and MNVs). They are visualized using "point"
marks, employing multiple visual channels for different attributes:
- Color encodes the functional category, e.g., missense, stopgain, etc.
- Size encodes the variant allele frequency (VAF) of the mutation.
- Shape encodes the filter status: circle = PASS, cross = all others.
The mutation data are stored as multiple files in the data directory. Each file
is named after the patient ID and contains the mutations for all samples of the
respective patient. The VAF of each sample is stored in a sample-specific
column. For instance, the decider-data/patient1.tsv file contains the following
columns:
- CHROM
- POS
- REF
- ALT
- ID
- FILTER
- CADD_phred
- CLNSIG
- Func
- Gene.refGene
- patient1_p1.AF
- patient1_r1.AF
- patient1_r2.AF
Columns 1-10 are common to all samples, while columns 11-13 contain the VAF for
each sample. This "wide" data is transformed into "long" format using the
"regexFold" transform:
"transform": [
  {
    "type": "regexFold",
    // Match all columns that end with ".AF"
    // The part before the ".AF" is used as the sample identifier
    "columnRegex": "^(.*)\\.AF$",
    // The value from the matched column is stored in a new "VAF" field
    "asValue": "VAF",
    // The extracted sample identifier is stored in a new "sample" field
    "asKey": "sample"
  },
  ...
The data could be alternatively stored in a single file with a row for each mutation and sample. However, the "wide" format is more compact, avoiding redundant information, such as genomic coordinates, functional annotations, and other attributes, which are the same for all samples.
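The wide-to-long reshaping can be sketched in Python as follows. This only mimics the behavior described above and is not GenomeSpy's code: the function `regex_fold` and the example values are hypothetical, and dropping the matched columns from the output rows is an assumption.

```python
import re


def regex_fold(rows, column_regex, as_value, as_key):
    """Yield one output row per matched column of each input row."""
    pattern = re.compile(column_regex)
    for row in rows:
        shared, folded = {}, {}
        for column, value in row.items():
            match = pattern.match(column)
            if match:
                # The first capture group becomes the key (sample identifier).
                folded[match.group(1)] = value
            else:
                shared[column] = value
        for key, value in folded.items():
            yield {**shared, as_key: key, as_value: value}


# One hypothetical "wide" row with made-up values.
wide_rows = [
    {
        "CHROM": "chr1",
        "POS": 100,
        "Func": "exonic",
        "patient1_p1.AF": 0.1,
        "patient1_r1.AF": 0.4,
    }
]
long_rows = list(
    regex_fold(wide_rows, r"^(.*)\.AF$", as_value="VAF", as_key="sample")
)
for row in long_rows:
    print(row["sample"], row["VAF"])
# patient1_p1 0.1
# patient1_r1 0.4
```

Each output row repeats the shared columns, which is the redundancy that the "wide" files avoid on disk.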
Score-based semantic zooming facilitates exploration by displaying only the most important mutations at each zoom level. The interactive filtering is coupled to the zoom level and is performed using GPU acceleration to ensure smooth interaction even with large datasets.
The visualization uses the "CADD_phred" field as the semantic score. The
semanticZoomFraction property controls the fraction of mutations displayed at
each zoom level. In this case, the fraction can be adjusted using an
interactive slider, which makes the setting available as a parameter called
semanticZoomSlider.
"params": [
  {
    "name": "semanticZoomSlider",
    "value": 0.015,
    "bind": {
      "input": "range",
      ...
    },
    ...
  }
],
"mark": {
  "type": "point",
  "semanticZoomFraction": { "expr": "semanticZoomSlider" },
  ...
},
"encoding": {
  "semanticScore": {
    "field": "CADD_phred"
  },
  ...
}
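The idea of score-based filtering can be illustrated with a simplified Python sketch that keeps only the top-scoring fraction of a set of variants. This is a rough approximation for illustration: how GenomeSpy couples the fraction to the zoom level and performs the filtering on the GPU is not reproduced here, and the function name and example values are hypothetical.

```python
import math


def top_fraction_by_score(variants, fraction, score_field="CADD_phred"):
    """Keep the highest-scoring fraction of the given variants."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    keep = math.ceil(len(variants) * fraction)
    ranked = sorted(variants, key=lambda v: v[score_field], reverse=True)
    return ranked[:keep]


# Hypothetical variants, assumed to lie within the visible region.
VARIANTS = [
    {"POS": 10, "CADD_phred": 3.2},
    {"POS": 20, "CADD_phred": 28.0},
    {"POS": 30, "CADD_phred": 15.5},
    {"POS": 40, "CADD_phred": 22.1},
]
shown = top_fraction_by_score(VARIANTS, fraction=0.5)
print([v["POS"] for v in shown])  # [20, 40]
```

Raising the fraction reveals lower-scoring variants, which is what the slider bound to semanticZoomSlider adjusts.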
Several annotation tracks are included in the visualization to provide context for the copy-number aberrations and mutations.
The chromosome ideogram (e.g., cytobands) is visualized using the
cytobands.json specification. A comprehensively commented specification is
available in the Annotation tracks ObservableHQ notebook.
The blacklist tracks show regions that were excluded from copy-number segmentation due to repetitive sequences, mapping artifacts, or other reasons. Two blacklists are provided.
The two blacklist tracks are bundled together using vertical concatenation. The additional level of nesting allows the user to toggle the visibility of both tracks simultaneously.
The ENCODE blacklist contains regions provided in Amemiya et al. 2019. The spec
and data are stored in the following files: hg38-blacklist.v2.json and
external-data/hg38-blacklist.v2.tsv.
The DECIDER blacklist covers regions that contain aberrations in at least three
normal samples in the DECIDER cohort. The aberrations may be due to mapping
artifacts, common germline copy-number variations, or other reasons. The spec
and data are stored in the following files: decider-cnv-blacklist.json and
decider-data/decider-cnv-blacklist-v1.tsv.
The RefSeq gene visualization is explained comprehensively in the Annotation tracks ObservableHQ notebook.
This track visualizes "A highly annotated database of genes associated with platinum resistance in cancer", introduced in Huang et al. 2021.
The data are stored in the platinum_resistance.tsv file.
The visualization specification is in the public domain under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. You are free to adapt it for your own data sets.