You can use Variant Transforms to load multiple files at once
by providing a pattern like --input_pattern gs://mybucket/*.vcf
. This
will create a merged BigQuery schema that contains all variants from all
matching files. Note that the VCF fields must be compatible across files (e.g.
INFO
and FORMAT
fields with the same key have compatible types and numbers
across files).
However, by default, calls between files will not be merged. For instance, consider the following VCF files:
gs://mybucket/a.vcf
:
## <headers omitted>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2
1 100 rs1 A T 29 PASS DP=14 GT:GQ 0|1:48 0/0:30
gs://mybucket/b.vcf
:
## <headers omitted>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S3
1 100 rs1 A T 7 q10 DP=11;AA=G GT:GQ 1/1:2
If you load both files together (i.e. by specifying
--input_pattern gs://mybucket/*.vcf
), you would end up with 2 rows in
BigQuery. One with calls S1
and S2
, and another with just call S3
even
though all calls have the same variant.
Variant merging can be used to merge calls across files. There are different
merging strategies and you can specify any of the following
strategies by adding --variant_merge_strategy <merge_strategy>
to the
arguments.
This strategy merges calls that have the same reference_name
,
start_position
, end_position
, reference_bases
, and all alternate_bases
across files. By default, the variant fields are merged as follows:
Calls
are merged together in no particular order.- The highest
quality
value is chosen among all variants being merged. - All
filter
values are merged and deduplicated. - All
names
(ID
column in the VCF spec) are merged and deduplicated. - For each
INFO
field, one is chosen as representative for the entire merged variant record in no particular order. Note that for repeated fields, a single set is chosen as representative for that particular field.
Depending on your use case, you may customize the merge strategy as follows:
- Adding
--copy_quality_to_calls
and/or--copy_filter_to_calls
to the arguments will copy thequality
and/orfilter
values in each file to the set of calls specified in that file. This option is useful when merging single-call VCF files asquality
andfilter
values usually correspond to call-level quality/filter. Note that the variant-levelquality
andfilter
values will still be merged according to the above logic. - Adding
--info_keys_to_move_to_calls_regex <regex>
will moveINFO
fields that match the provided regular expression to the associated calls. This is useful to ensure allINFO
fields are kept when merging variants. For instance, specifying.*
will move allINFO
keys to calls, or specifying^(AA|AF)$
will only moveAA
andAF
fields to calls. See here for more details on the regular expression specifications. Note that if a key is the same in bothINFO
andFORMAT
, then such a key cannot be moved to calls and an error will be raised.
The examples below help with illustrating the details of the above options.
You may also read the code and documentation in
move_to_calls_strategy.py.
Particularly the get_merged_variants
method implements the merging logic.
The following examples illustrate the merging outcome using the example files
gs://mybucket/a.vcf
and gs://mybucket/b.vcf
that are defined above.
- Default (i.e. just adding
--variant_merge_strategy MOVE_TO_CALLS
):
reference_name: "1"
start_position: 100
end_position: 101
reference_bases: "A"
alternate_bases: {alt: "T"}
names: ["rs1"]
quality: 29
filter: ["PASS", "q10"]
call: {
[
name: "S3",
genotype: [1, 1]
phaseset: null
GQ: 2
],
[
name: "S1",
genotype: [0, 1]
phaseset: "*"
GQ: 48
],
[
name: "S2",
genotype: [0, 0]
phaseset: null
GQ: 30
]
}
DP: 14 # This can also be 11.
AA: "G"
- With
--copy_quality_to_calls
and--copy_filter_to_calls
:
reference_name: "1"
start_position: 100
end_position: 101
reference_bases: "A"
alternate_bases: {alt: "T"}
names: ["rs1"]
quality: 29
filter: ["PASS", "q10"]
call: {
[
name: "S3",
genotype: [1, 1]
phaseset: null
GQ: 2
quality: 7
filter: ["q10"]
],
[
name: "S1",
genotype: [0, 1]
phaseset: "*"
GQ: 48
quality: 29
filter: ["PASS"]
],
[
name: "S2",
genotype: [0, 0]
phaseset: null
GQ: 30
quality: 29
filter: ["PASS"]
]
}
DP: 14 # This can also be 11.
AA: "G"
- With
--info_keys_to_move_to_calls_regex ^DP$
,--copy_quality_to_calls
,--copy_filter_to_calls
:
reference_name: "1"
start_position: 100
end_position: 101
reference_bases: "A"
alternate_bases: {alt: "T"}
names: ["rs1"]
quality: 29
filter: ["PASS", "q10"]
call: {
[
name: "S3",
genotype: [1, 1]
phaseset: null
GQ: 2
quality: 7
filter: ["q10"]
DP: 11
],
[
name: "S1",
genotype: [0, 1]
phaseset: "*"
GQ: 48
quality: 29
filter: ["PASS"]
DP: 14
],
[
name: "S2",
genotype: [0, 0]
phaseset: null
GQ: 30
quality: 29
filter: ["PASS"]
DP: 14
]
}
AA: "G"
This strategy is not yet production ready and can't be used by users. This section is included here to provide more context about our ongoing work (Issue #80 and Issue #88).
The MOVE_TO_CALLS
strategy is effective for merging calls that have exactly
the same reference and alternate bases. However, this strategy is not effective
for merging
gVCF files
since non-variants are represented as intervals in the genome.
Consider the following VCF file:
gs://mybucket/c.g.vcf
:
## <headers omitted>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S4
1 50 . T <NON_REF> . . END=150 GT:GQ 0/0:40
If you specify --input_pattern gs://mybucket/*.vcf
with MOVE_TO_CALLS
merge
strategy, the non-variant region from call S4
will not be merged with calls
S1
, S2
, and S3
. However, we know that call S4
has genotype 0/0
in
chr1:100 since it matches the reference from position 50-150 in chr1.
The MERGE_WITH_NON_VARIANTS
strategy will infer 0/0
for call S4
at
chr1:100 so that the resulting BigQuery table has a single row with all calls
S1-4
.
Our initial work is focused on correctly calling SNPs (Issue #80). We will then attempt to merge insertions/deletions (Issue #88).