-
Notifications
You must be signed in to change notification settings - Fork 270
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
13 changed files
with
439 additions
and
11 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,10 @@ | ||
# BED format | ||
|
||
The BED format is a widely used format to store the genome | ||
annotation. It's flexible and has many variation depends on how many | ||
fields exists in the file. The official definition of BED format can | ||
be found | ||
[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). Formats | ||
like BEDPE, narrowPeak, broadPeak, or gappedPeak can be regarded as | ||
variation of BED format. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,26 @@ | ||
# SAM/BAM/BAMPE format | ||
|
||
SAM and BAM formats are the most common format used to store alignment | ||
information of sequencing reads. The BAM format is in fact the binary | ||
version of SAM which is a text file. Please refer to [SAM format | ||
specification](https://samtools.github.io/hts-specs/SAMv1.pdf) for | ||
detail. As in MACS3, we call the BAM file from paired-end reads | ||
mapping as `BAMPE` -- BAM file for Paired-End data. When you specify | ||
the input format as `BAMPE` such as `-f BAMPE` in `callpeaks`, you | ||
will trigger the PE mode for MACS3 where we will consider the whole | ||
fragment between read pair as the DNA fragment bound by the protein of | ||
interest. On the contrast, even if the BAM storing PE mapping | ||
information, if you specify `-f BAM`, MACS3 will treat the input as | ||
single-end data and use only the 1st mate of read pair to estimate the | ||
fragment length using the cross-correlation model. | ||
|
||
Most of MACS3 modules only take the mapping location of reads from | ||
SAM/BAM/BAMPE file. If necessary, you can use `macs3 filterdup | ||
--keep-dup all` or `macs3 randsample -p 100` to convert the | ||
SAM/BAM/BAMPE into BED or [BEDPE](./BEDPE.md) format, then gzip them | ||
to save storage. | ||
|
||
The only exception is that the `callvar` command in MACS3 will need to | ||
read the actual sequences and mapping information such as the | ||
mismatch, insertion, deletion, etc from BAM. Also, please make sure | ||
that the BAM file for `callvar` has to be sorted and indexed. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,13 @@ | ||
# bedGraph format | ||
|
||
The official definition of bedGraph can be found | ||
[here](https://genome.ucsc.edu/goldenPath/help/bedgraph.html). As for | ||
all other files generated directly from MACS3, you may need to add a | ||
trackline to the beginning of the file when you upload the file to | ||
UCSC genome browser for display. A minimal trackline for bedGraph is | ||
`track type=bedGraph`. Please read the above link to the official | ||
definition page for more instruction. The bedGraph format file can be | ||
further converted into a binary format named 'bigWig' for smaller file | ||
size and for fast and random access. Check this | ||
[gist](https://gist.github.com/taoliu/2469050) for a script to convert | ||
the plain text bedGraph to binary bigWig. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,58 @@ | ||
# broadPeak format | ||
|
||
You can find the official definition of broadPeak format | ||
[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format13). We chose | ||
this format for broad peak calling results. It is a BED variant, and | ||
comparing with narrowPeak format, it just misses the last peak summit | ||
column. | ||
|
||
The fields/columns in the file are defined as followed (based on UCSC | ||
definition, with specific notes on MACS3 output). | ||
|
||
1. chrom - Name of the chromosome (or contig, scaffold, etc.), as used | ||
in the input alignment file for MACS3. | ||
2. chromStart - The starting position of the feature in the chromosome | ||
or scaffold. The first base in a chromosome is numbered **0**. | ||
3. chromEnd - The ending position of the feature in the chromosome or | ||
scaffold. The chromEnd base **is not included** in the display of | ||
the feature. For example, the first 100 bases of a chromosome are | ||
defined as chromStart=0, chromEnd=100, and span the bases numbered | ||
0-99. This BED-style coordination system is machine-friendly but | ||
not human-friendly. As a note, this is different with the | ||
coordination system used in GFF format file, where the same region | ||
in the above example will be defined as chromStart=1, chromEnd=100. | ||
4. name - Name given to a region (preferably unique). In MACS3, it | ||
will be the experiment name plus a number such as 'FoxA1_99'. | ||
5. score - Indicates how dark the peak will be displayed in the | ||
browser (0-1000). Thus, it's for the purpose of displaying on | ||
genome browser. In MACS3 `callpeak` output, we use the | ||
-log10qvalue*10. However, it may happen when the value in this | ||
column goes above 1000, and cause trouble while loading it in | ||
genome browsers. In this case, use the following `awk` command to | ||
fix: `awk -F'\t' '{ if ($5 > 1000) $5=1000; OFS="\t"; print }' peak.broadPeak` | ||
6. strand - `+`/`-` to denote strand or orientation (whenever | ||
applicable). We use `.` since we can't decide the strand of DNA | ||
where the binding happens. | ||
7. signalValue - Measurement of overall (usually, average) enrichment | ||
for the region. In MACS3, we use the fold-enrichment values | ||
here. Please note that, same as the columns 8 and 9, this value | ||
corresponds to the mean value (in this case, the fold-enrichment) | ||
across the whole broad peak region. | ||
8. pValue - Measurement of statistical significance (**-log10**). We | ||
are using a 'local Poisson' model where the Poisson lambda is | ||
estimated (by default) as the average input signal from surrounding | ||
region. | ||
9. qValue - Measurement of statistical significance using false | ||
discovery rate (**-log10**). We used Benjamini-horchberg process | ||
over the whole genome, treating the p-value calculation on each | ||
basepair as an independent test, to get the FDR and q-value. | ||
|
||
But please also note that, by default, we don't include a header line | ||
(called trackline) in the broadPeak output from MACS3. However, this | ||
line will be necessary while uploading the broadPeak file to UCSC for | ||
display. Therefore, if you plan to upload the broadPeak file, either | ||
you turn on `--trackline` option in corresponding MACS3 subcommands, | ||
or add the trackline mannually to the beginning of the file. A minimal | ||
trackline is like: | ||
|
||
`track type=broadPeak name="track name" description="track description"` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,97 @@ | ||
# `callpeak` output XLS format | ||
|
||
This plain text file format is used as the standard output from | ||
`callpeak` command. It's not a Microsoft XLS format. We use the suffix | ||
as`.xls` because we can trick the OS to open this plain text file with | ||
Microsoft Excel. This is a legacy feature from early MACS. In fact, | ||
this is a `tsv` format file with commented header lines. | ||
|
||
## Header lines | ||
|
||
The commented header lines starting with '#' include important runtime | ||
messages from `macs3 callpeak` command that have been used to generate | ||
this output file. Here is an example from a run based on the data in | ||
the `test` directory of MACS3 GitHub repository. | ||
|
||
``` | ||
# This file is generated by MACS version 3.0.2 | ||
# Command line: callpeak -t CTCF_12878_5M.bed.gz -c Input_12878_5M.bed.gz -n speedtest -B --trackline | ||
# ARGUMENTS LIST: | ||
# name = speedtest | ||
# format = AUTO | ||
# ChIP-seq file = ['CTCF_12878_5M.bed.gz'] | ||
# control file = ['Input_12878_5M.bed.gz'] | ||
# effective genome size = 2.91e+09 | ||
# band width = 300 | ||
# model fold = [5, 50] | ||
# qvalue cutoff = 5.00e-02 | ||
# The maximum gap between significant sites is assigned as the read length/tag size. | ||
# The minimum length of peaks is assigned as the predicted fragment length "d". | ||
# Larger dataset will be scaled towards smaller dataset. | ||
# Range for calculating regional lambda is: 1000 bps and 10000 bps | ||
# Broad region calling is off | ||
# Paired-End mode is off | ||
# tag size is determined as 36 bps | ||
# total tags in treatment: 4999978 | ||
# tags after filtering in treatment: 4636124 | ||
# maximum duplicate tags at the same position in treatment = 1 | ||
# Redundant rate in treatment: 0.07 | ||
# total tags in control: 4999975 | ||
# tags after filtering in control: 4895150 | ||
# maximum duplicate tags at the same position in control = 1 | ||
# Redundant rate in control: 0.02 | ||
# d = 101 | ||
# alternative fragment length(s) may be 101 bps | ||
``` | ||
|
||
The first paragraph contains major parameters for this run, and the | ||
second paragraph contains important statistics of the dataset. The | ||
numbers for 'tags after filtering' or 'fragments after filtering' (for | ||
PE data), after removing duplicates, are considerred as the library | ||
size when MACS3 tries to estimate genome-wide background or to | ||
normalize signals. You can also find the estimated fragment size, for | ||
single-end dataset; for PE data, you will see the average insert | ||
length of all the read pairs after filtering. | ||
|
||
## Content of peak calls | ||
|
||
After the commented header lines, you will see the content of peak | ||
calls. The content will start with a header line describing each | ||
columns: | ||
|
||
``` | ||
chr start end length abs_summit pileup -log10(pvalue) fold_enrichment -log10(qvalue) name | ||
``` | ||
|
||
The columns include: | ||
|
||
1. chromosome name - Name of the chromosome (or contig, scaffold, etc.), as used | ||
in the input alignment file for MACS3. | ||
2. start position of peak - The starting position of the feature in the chromosome | ||
or scaffold. The first base in a chromosome is numbered **1** which | ||
is different with BED-style file but similar to GFF-style file. | ||
3. end position of peak - The ending position of the feature in the chromosome or | ||
scaffold. The chromEnd base **is included** in the peak which | ||
is different with BED-style file but similar to GFF-style file. | ||
4. length of peak region - `=end-start+1`. | ||
5. absolute peak summit position - This is the field we record the | ||
so-called peak summit location, where the enrichment is the | ||
highest in the peak region. This is the absolute location of | ||
summit, different with the relative summit location in narrowPeak | ||
format. | ||
6. pileup height at peak summit | ||
7. p-score - It's the `-log10(pvalue)` from local Poisson test for the | ||
peak summit (e.g. pvalue =1e-10, then this value should be 10) | ||
8. fold enrichment for this peak summit against random Poisson | ||
distribution with local lambda, | ||
9. q-score - It's the `-log10(qvalue)` at peak summit. We used | ||
Benjamini-horchberg process over the whole genome, treating the | ||
p-value calculation on each basepair as an independent test, to get | ||
the FDR and q-value. | ||
|
||
Note: When `--broad` is enabled for broad peak calling, the pileup, | ||
p-value, q-value, and fold change in this file will be the mean value | ||
across the entire peak region, since peak summit won't be called in | ||
broad peak calling mode. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,76 @@ | ||
# gappedPeak format | ||
|
||
You can find the official definition of gappedPeak format | ||
[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format14). We chose | ||
this format for broad peak calling, since it can represent a nested | ||
peak calling result where there are some highlighted narrower regions | ||
inside of a wider genomic region. | ||
|
||
The fields/columns in the file are defined as followed (based on UCSC | ||
definition, with specific notes on MACS3 output). | ||
|
||
1. chrom - Name of the chromosome (or contig, scaffold, etc.), as used | ||
in the input alignment file for MACS3. | ||
2. chromStart - The starting position of the feature in the chromosome | ||
or scaffold. The first base in a chromosome is numbered **0**. | ||
3. chromEnd - The ending position of the feature in the chromosome or | ||
scaffold. The chromEnd base **is not included** in the display of | ||
the feature. For example, the first 100 bases of a chromosome are | ||
defined as chromStart=0, chromEnd=100, and span the bases numbered | ||
0-99. This BED-style coordination system is machine-friendly but | ||
not human-friendly. As a note, this is different with the | ||
coordination system used in GFF format file, where the same region | ||
in the above example will be defined as chromStart=1, chromEnd=100. | ||
4. name - Name given to a region (preferably unique). In MACS3, it | ||
will be the experiment name plus a number such as 'FoxA1_99'. | ||
5. score - Indicates how dark the peak will be displayed in the | ||
browser (0-1000). Thus, it's for the purpose of displaying on | ||
genome browser. In MACS3 `callpeak` output, we use the | ||
-log10qvalue*10. However, it may happen when the value in this | ||
column goes above 1000, and cause trouble while loading it in | ||
genome browsers. In this case, use the following `awk` command to | ||
fix: `awk -F'\t' '{ if ($5 > 1000) $5=1000; OFS="\t"; print }' peak.broadPeak` | ||
6. strand - `+`/`-` to denote strand or orientation (whenever | ||
applicable). We use `.` since we can't decide the strand of DNA | ||
where the binding happens. | ||
7. thickStart - The starting position at which the feature is drawn | ||
thickly. Not used in gappedPeak type, set to 0. | ||
8. thickEnd - The ending position at which the feature is drawn | ||
thickly. Not used in gappedPeak type, set to 0. | ||
9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). Not used | ||
in gappedPeak type, set to 0. | ||
10. blockCount - The number of blocks (exons) in the BED line. In | ||
MACS3 broad peak calling mode, the 'blocks' are not exons but the | ||
stronger peaks within this peak region. Because of the restriction | ||
in this format -- see the blockStarts, we have to make the first | ||
base and the last base in the region in the 'blocks' if they are | ||
not. | ||
11. blockSizes - A comma-separated list of the block sizes. The number | ||
of items in this list should correspond to blockCount. | ||
12. blockStarts - A comma-separated list of block starts. The first | ||
value must be 0 and all of the blockStart positions should be | ||
calculated relative to chromStart. The number of items in this | ||
list should correspond to blockCount. | ||
13. signalValue - Measurement of overall (usually, average) enrichment | ||
for the region. In MACS3, we use the fold-enrichment values | ||
here. Please note that, same as the columns 8 and 9, this value | ||
corresponds to the mean value (in this case, the fold-enrichment) | ||
across the whole broad peak region. | ||
14. pValue - Measurement of statistical significance (**-log10**). We | ||
are using a 'local Poisson' model where the Poisson lambda is | ||
estimated (by default) as the average input signal from surrounding | ||
region. | ||
15. qValue - Measurement of statistical significance using false | ||
discovery rate (**-log10**). We used Benjamini-horchberg process | ||
over the whole genome, treating the p-value calculation on each | ||
basepair as an independent test, to get the FDR and q-value. | ||
|
||
Please also note that, by default, we don't include a header line | ||
(called trackline) in the gappedPeak output from MACS3. However, this | ||
line will be necessary while uploading the gappedPeak file to UCSC for | ||
display. Therefore, if you plan to upload the gappedPeak file, either | ||
you turn on `--trackline` option in corresponding MACS3 subcommands, | ||
or add the trackline mannually to the beginning of the file. A minimal | ||
trackline is like: | ||
|
||
`track type=gappedPeak name="track name" description="track description"` |
Oops, something went wrong.