-
Notifications
You must be signed in to change notification settings - Fork 0
/
BCFv2.tex
85 lines (77 loc) · 4.7 KB
/
BCFv2.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
\documentclass[10pt]{article}
\usepackage{color}
\definecolor{gray}{rgb}{0.7,0.7,0.7}
\usepackage{framed}
\usepackage{enumitem}
\usepackage{longtable}
\addtolength{\textwidth}{3.4cm}
\addtolength{\hoffset}{-1.7cm}
\addtolength{\textheight}{4cm}
\addtolength{\voffset}{-2cm}
\makeindex
\title{BCF2 Quick Reference}
\author{}
\date{}
\begin{document}
\maketitle
{\small
In BCF2, a typed value consists of a typing byte and the actual value with type
mandated by the typing byte. In the typing byte, the lowest four bits give the
atomic type. If the number represented by the higher 4 bits is smaller than 15,
it is the size of the following vector; if the number equals 15, the following
typed integer is the array size. The table below gives the atomic types and
their missing values:
\begin{center}
{\small\begin{tabular}{rlrl}
\hline
Bit 0--3 & C type & Missing value & Description \\
\hline
1 & {\tt int8\_t} & {\tt 0x80} & signed 8-bit integer \\
2 & {\tt int16\_t} & {\tt 0x8000} & signed 16-bit integer \\
3 & {\tt int32\_t} & {\tt 0x80000000} & signed 32-bit integer \\
5 & {\tt float} & {\tt 0x7F800001} & IEEE 32-bit floating pointer number \\
7 & {\tt char} & `{\tt \char92 0}' & character \\
\hline
\end{tabular}}
\end{center}
In BCF2, a genotype (GT) is encoded as an integer vector with each integer
describing an allele and its phase w.r.t. the previous allele. The first allele
does not carry the phase information. In the vector, each integer is organized
as `{\tt \char40 allele+1\char41\char60\char60 1\char124 phased}' where
{\tt allele} is set to -1 if the allele in GT is a dot `.' (thus the higher
bits are all 0). The vector is be padded with missing values if the GT having fewer ploidy.
}
\begin{table}[ht]
\centering
{\small
\begin{tabular}{|l|l|l|p{8.2cm}|l|r|}
\cline{1-6}
\multicolumn{3}{|c|}{\bf Field} & \multicolumn{1}{c|}{\bf Description} & \multicolumn{1}{c|}{\bf Type} & \multicolumn{1}{c|}{\bf Value} \\\cline{1-6}
\multicolumn{3}{|l|}{\sf magic} & BCF2 magic string & {\tt char[4]} & {\tt BCF\char92 2}\\\cline{1-6}
\multicolumn{3}{|l|}{\sf l\_text} & Length of the header text, including any {\sf NULL} padding & {\tt uint32\_t} & \\\cline{1-6}
\multicolumn{3}{|l|}{\sf text} & {\sf NULL}-terminated plain VCF header text & {\tt char[{\sf l\_text}]} & \\\cline{1-6}
\multicolumn{6}{|c|}{\textcolor{gray}{\it List of VCF records (until the end of the BGZF section)}} \\\cline{2-6}
& \multicolumn{2}{l|}{\sf l\_shared} & Data length from {\sf ID} to the end of {\sf INFO} & {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf l\_indiv} & Data length of {\sf FORMAT} and individual genotype fields & {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf CHROM} & Reference sequence ID & {\tt int32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf POS} & 0-based leftmost coordinate & {\tt int32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf rlen} & Length of reference sequence & {\tt int32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf QUAL} & Variant quality; {\tt 0x7F800001} for a missing value & {\tt float} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf n\_allele\_info} & {\tt n\_allele\char60\char60 16\char124 n\_info}& {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf n\_fmt\_sample} & {\tt n\_fmt\char60\char60 24\char124 n\_sample}& {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf ID} & Variant identifier & {\tt typed str} & \\\cline{2-6}
& \multicolumn{5}{c|}{\textcolor{gray}{\it List of alleles in the REF and ALT fields (n=n\_allele)}} \\\cline{3-6}
& & {\sf allele} & A reference or alternate allele & {\tt typed str} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf FILTER} & List of filters; filters are defined in the dictionary & {\tt typed vec} & \\\cline{2-6}
& \multicolumn{5}{c|}{\textcolor{gray}{\it List of key-value pairs in the INFO field (n=n\_info)}} \\\cline{3-6}
& & {\sf info\_key} & Info key, defined in the dictionary & {\tt typed int} & \\\cline{3-6}
& & {\sf info\_value} & Value & {\tt typed val} &\\\cline{2-6}
& \multicolumn{5}{c|}{\textcolor{gray}{\it List of FORMATs and sample information (n=n\_fmt)}} \\\cline{3-6}
& & {\sf fmt\_key} & Format key, defined in the dectionary & {\tt typed int} & \\\cline{3-6}
& & {\sf fmt\_type} & Typing byte of each individual value, possibly followed by a typed int for the vector length & {\tt uint8\_t+} & \\\cline{3-6}
& & {\sf fmt\_value} & Array of values. The information of each individual is concatenated in the vector. Every value is of the same {\sf fmt\_type}.
Variable-length vectors are padded with missing values; a string is stored as a vector of {\tt char}. & (by {\sf fmt\_type}) &\\
\cline{1-6}
\end{tabular}}
\end{table}
\end{document}