Skip to content

Commit

Permalink
update docs
Browse files Browse the repository at this point in the history
  • Loading branch information
taoliu committed Sep 6, 2024
1 parent ddcab9c commit c02c662
Show file tree
Hide file tree
Showing 13 changed files with 439 additions and 11 deletions.
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# MACS: Model-based Analysis for ChIP-Seq

![Status](https://img.shields.io/pypi/status/macs3.svg) ![License](https://img.shields.io/github/license/macs3-project/MACS) ![Programming languages](https://img.shields.io/github/languages/top/macs3-project/MACS) ![CI x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20x64/badge.svg?branch=master) ![CI non x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20non%20x64/badge.svg?branch=master) ![CI Mac OS](https://github.com/macs3-project/MACS/actions/workflows/build-and-test-MACS3-macos.yml/badge.svg?branch=master)
![Status](https://img.shields.io/pypi/status/macs3.svg) ![License](https://img.shields.io/github/license/macs3-project/MACS) ![Programming languages](https://img.shields.io/github/languages/top/macs3-project/MACS) ![CI x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20x64/badge.svg?branch=master) ![CI non x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20non%20x64/badge.svg?branch=master) ![CI Mac OS](https://github.com/macs3-project/MACS/actions/workflows/build-and-test-MACS3-macos.yml/badge.svg?branch=master) [![CZI's Essential Open Source Software for Science](https://chanzuckerberg.github.io/open-science/badges/CZI-EOSS.svg)](https://czi.co/EOSS)


[![PyPI
download](https://img.shields.io/pypi/dm/macs3?label=pypi%20downloads)](https://pypistats.org/packages/macs3)
Expand Down
9 changes: 9 additions & 0 deletions docs/source/docs/BED.md
Original file line number Diff line number Diff line change
@@ -1 +1,10 @@
# BED format

The BED format is a widely used format to store the genome
annotation. It's flexible and has many variation depends on how many
fields exists in the file. The official definition of BED format can
be found
[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). Formats
like BEDPE, narrowPeak, broadPeak, or gappedPeak can be regarded as
variation of BED format.

7 changes: 4 additions & 3 deletions docs/source/docs/BEDPE.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,11 @@ The BEDPE format is specifically designed for keeping the alignment
locations of each read pair from Paired-End library. This is not a
general format but only a format for MACS3. It only contains three
columns -- the chromosome, the leftmost position of read pair, and the
rightmost position of the read pair. These three columns All other information from
rightmost position of the read pair. All other information from
alignment will not be kept in this format, such as the length of the
read, the mismatches/gaps in the alignment, and etc. An example is as
followed:
read, the mismatches/gaps in the alignment, and etc. We can use this
format to generate a simplified alignment file for PE library and gzip
it to minimize the file size. An example is as followed:

```
chrXIII 0 60
Expand Down
11 changes: 5 additions & 6 deletions docs/source/docs/INSTALL.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,12 +82,11 @@ Then activate it by

`$ source MACS3env/bin/activate`

If you use 'conda', it will also provide virtual environment. Please
read:
[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)
or [miniconda](https://docs.conda.io/en/latest/miniconda.html) For
example, after installing 'conda', you can use `conda create -n MACS3`
to create a new environment called 'MACS3' then switch to this
If you use 'conda' through 'miniforge' project, it will also provide
virtual environment. Please read:
[miniforge](https://github.com/conda-forge/miniforge). For example,
after installing 'conda', you can use `conda create -n MACS3` to
create a new environment called 'MACS3' then switch to this
environment with `conda activate MACS3`.

There is another solution, [pipx](https://pipx.pypa.io/), to make a
Expand Down
24 changes: 24 additions & 0 deletions docs/source/docs/SAMBAMBAMPE.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,26 @@
# SAM/BAM/BAMPE format

SAM and BAM formats are the most common format used to store alignment
information of sequencing reads. The BAM format is in fact the binary
version of SAM which is a text file. Please refer to [SAM format
specification](https://samtools.github.io/hts-specs/SAMv1.pdf) for
detail. As in MACS3, we call the BAM file from paired-end reads
mapping as `BAMPE` -- BAM file for Paired-End data. When you specify
the input format as `BAMPE` such as `-f BAMPE` in `callpeaks`, you
will trigger the PE mode for MACS3 where we will consider the whole
fragment between read pair as the DNA fragment bound by the protein of
interest. On the contrast, even if the BAM storing PE mapping
information, if you specify `-f BAM`, MACS3 will treat the input as
single-end data and use only the 1st mate of read pair to estimate the
fragment length using the cross-correlation model.

Most of MACS3 modules only take the mapping location of reads from
SAM/BAM/BAMPE file. If necessary, you can use `macs3 filterdup
--keep-dup all` or `macs3 randsample -p 100` to convert the
SAM/BAM/BAMPE into BED or [BEDPE](./BEDPE.md) format, then gzip them
to save storage.

The only exception is that the `callvar` command in MACS3 will need to
read the actual sequences and mapping information such as the
mismatch, insertion, deletion, etc from BAM. Also, please make sure
that the BAM file for `callvar` has to be sorted and indexed.
12 changes: 12 additions & 0 deletions docs/source/docs/bedGraph.md
Original file line number Diff line number Diff line change
@@ -1 +1,13 @@
# bedGraph format

The official definition of bedGraph can be found
[here](https://genome.ucsc.edu/goldenPath/help/bedgraph.html). As for
all other files generated directly from MACS3, you may need to add a
trackline to the beginning of the file when you upload the file to
UCSC genome browser for display. A minimal trackline for bedGraph is
`track type=bedGraph`. Please read the above link to the official
definition page for more instruction. The bedGraph format file can be
further converted into a binary format named 'bigWig' for smaller file
size and for fast and random access. Check this
[gist](https://gist.github.com/taoliu/2469050) for a script to convert
the plain text bedGraph to binary bigWig.
57 changes: 57 additions & 0 deletions docs/source/docs/broadPeak.md
Original file line number Diff line number Diff line change
@@ -1 +1,58 @@
# broadPeak format

You can find the official definition of broadPeak format
[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format13). We chose
this format for broad peak calling results. It is a BED variant, and
comparing with narrowPeak format, it just misses the last peak summit
column.

The fields/columns in the file are defined as followed (based on UCSC
definition, with specific notes on MACS3 output).

1. chrom - Name of the chromosome (or contig, scaffold, etc.), as used
in the input alignment file for MACS3.
2. chromStart - The starting position of the feature in the chromosome
or scaffold. The first base in a chromosome is numbered **0**.
3. chromEnd - The ending position of the feature in the chromosome or
scaffold. The chromEnd base **is not included** in the display of
the feature. For example, the first 100 bases of a chromosome are
defined as chromStart=0, chromEnd=100, and span the bases numbered
0-99. This BED-style coordination system is machine-friendly but
not human-friendly. As a note, this is different with the
coordination system used in GFF format file, where the same region
in the above example will be defined as chromStart=1, chromEnd=100.
4. name - Name given to a region (preferably unique). In MACS3, it
will be the experiment name plus a number such as 'FoxA1_99'.
5. score - Indicates how dark the peak will be displayed in the
browser (0-1000). Thus, it's for the purpose of displaying on
genome browser. In MACS3 `callpeak` output, we use the
-log10qvalue*10. However, it may happen when the value in this
column goes above 1000, and cause trouble while loading it in
genome browsers. In this case, use the following `awk` command to
fix: `awk -F'\t' '{ if ($5 > 1000) $5=1000; OFS="\t"; print }' peak.broadPeak`
6. strand - `+`/`-` to denote strand or orientation (whenever
applicable). We use `.` since we can't decide the strand of DNA
where the binding happens.
7. signalValue - Measurement of overall (usually, average) enrichment
for the region. In MACS3, we use the fold-enrichment values
here. Please note that, same as the columns 8 and 9, this value
corresponds to the mean value (in this case, the fold-enrichment)
across the whole broad peak region.
8. pValue - Measurement of statistical significance (**-log10**). We
are using a 'local Poisson' model where the Poisson lambda is
estimated (by default) as the average input signal from surrounding
region.
9. qValue - Measurement of statistical significance using false
discovery rate (**-log10**). We used Benjamini-horchberg process
over the whole genome, treating the p-value calculation on each
basepair as an independent test, to get the FDR and q-value.

But please also note that, by default, we don't include a header line
(called trackline) in the broadPeak output from MACS3. However, this
line will be necessary while uploading the broadPeak file to UCSC for
display. Therefore, if you plan to upload the broadPeak file, either
you turn on `--trackline` option in corresponding MACS3 subcommands,
or add the trackline mannually to the beginning of the file. A minimal
trackline is like:

`track type=broadPeak name="track name" description="track description"`
2 changes: 1 addition & 1 deletion docs/source/docs/callpeak.md
Original file line number Diff line number Diff line change
Expand Up @@ -350,7 +350,7 @@ the essentials.
## Output files
1. `NAME_peaks.xls` is a tabular file which contains information about
called peaks. You can open it in excel and sort/filter using excel
called peaks. You can open it in Excel and sort/filter using excel
functions. Information include:
- chromosome name
Expand Down
96 changes: 96 additions & 0 deletions docs/source/docs/callpeakxls.md
Original file line number Diff line number Diff line change
@@ -1 +1,97 @@
# `callpeak` output XLS format

This plain text file format is used as the standard output from
`callpeak` command. It's not a Microsoft XLS format. We use the suffix
as`.xls` because we can trick the OS to open this plain text file with
Microsoft Excel. This is a legacy feature from early MACS. In fact,
this is a `tsv` format file with commented header lines.

## Header lines

The commented header lines starting with '#' include important runtime
messages from `macs3 callpeak` command that have been used to generate
this output file. Here is an example from a run based on the data in
the `test` directory of MACS3 GitHub repository.

```
# This file is generated by MACS version 3.0.2
# Command line: callpeak -t CTCF_12878_5M.bed.gz -c Input_12878_5M.bed.gz -n speedtest -B --trackline
# ARGUMENTS LIST:
# name = speedtest
# format = AUTO
# ChIP-seq file = ['CTCF_12878_5M.bed.gz']
# control file = ['Input_12878_5M.bed.gz']
# effective genome size = 2.91e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# The maximum gap between significant sites is assigned as the read length/tag size.
# The minimum length of peaks is assigned as the predicted fragment length "d".
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
# Broad region calling is off
# Paired-End mode is off
# tag size is determined as 36 bps
# total tags in treatment: 4999978
# tags after filtering in treatment: 4636124
# maximum duplicate tags at the same position in treatment = 1
# Redundant rate in treatment: 0.07
# total tags in control: 4999975
# tags after filtering in control: 4895150
# maximum duplicate tags at the same position in control = 1
# Redundant rate in control: 0.02
# d = 101
# alternative fragment length(s) may be 101 bps
```

The first paragraph contains major parameters for this run, and the
second paragraph contains important statistics of the dataset. The
numbers for 'tags after filtering' or 'fragments after filtering' (for
PE data), after removing duplicates, are considerred as the library
size when MACS3 tries to estimate genome-wide background or to
normalize signals. You can also find the estimated fragment size, for
single-end dataset; for PE data, you will see the average insert
length of all the read pairs after filtering.

## Content of peak calls

After the commented header lines, you will see the content of peak
calls. The content will start with a header line describing each
columns:

```
chr start end length abs_summit pileup -log10(pvalue) fold_enrichment -log10(qvalue) name
```

The columns include:

1. chromosome name - Name of the chromosome (or contig, scaffold, etc.), as used
in the input alignment file for MACS3.
2. start position of peak - The starting position of the feature in the chromosome
or scaffold. The first base in a chromosome is numbered **1** which
is different with BED-style file but similar to GFF-style file.
3. end position of peak - The ending position of the feature in the chromosome or
scaffold. The chromEnd base **is included** in the peak which
is different with BED-style file but similar to GFF-style file.
4. length of peak region - `=end-start+1`.
5. absolute peak summit position - This is the field we record the
so-called peak summit location, where the enrichment is the
highest in the peak region. This is the absolute location of
summit, different with the relative summit location in narrowPeak
format.
6. pileup height at peak summit
7. p-score - It's the `-log10(pvalue)` from local Poisson test for the
peak summit (e.g. pvalue =1e-10, then this value should be 10)
8. fold enrichment for this peak summit against random Poisson
distribution with local lambda,
9. q-score - It's the `-log10(qvalue)` at peak summit. We used
Benjamini-horchberg process over the whole genome, treating the
p-value calculation on each basepair as an independent test, to get
the FDR and q-value.

Note: When `--broad` is enabled for broad peak calling, the pileup,
p-value, q-value, and fold change in this file will be the mean value
across the entire peak region, since peak summit won't be called in
broad peak calling mode.
75 changes: 75 additions & 0 deletions docs/source/docs/gappedPeak.md
Original file line number Diff line number Diff line change
@@ -1 +1,76 @@
# gappedPeak format

You can find the official definition of gappedPeak format
[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format14). We chose
this format for broad peak calling, since it can represent a nested
peak calling result where there are some highlighted narrower regions
inside of a wider genomic region.

The fields/columns in the file are defined as followed (based on UCSC
definition, with specific notes on MACS3 output).

1. chrom - Name of the chromosome (or contig, scaffold, etc.), as used
in the input alignment file for MACS3.
2. chromStart - The starting position of the feature in the chromosome
or scaffold. The first base in a chromosome is numbered **0**.
3. chromEnd - The ending position of the feature in the chromosome or
scaffold. The chromEnd base **is not included** in the display of
the feature. For example, the first 100 bases of a chromosome are
defined as chromStart=0, chromEnd=100, and span the bases numbered
0-99. This BED-style coordination system is machine-friendly but
not human-friendly. As a note, this is different with the
coordination system used in GFF format file, where the same region
in the above example will be defined as chromStart=1, chromEnd=100.
4. name - Name given to a region (preferably unique). In MACS3, it
will be the experiment name plus a number such as 'FoxA1_99'.
5. score - Indicates how dark the peak will be displayed in the
browser (0-1000). Thus, it's for the purpose of displaying on
genome browser. In MACS3 `callpeak` output, we use the
-log10qvalue*10. However, it may happen when the value in this
column goes above 1000, and cause trouble while loading it in
genome browsers. In this case, use the following `awk` command to
fix: `awk -F'\t' '{ if ($5 > 1000) $5=1000; OFS="\t"; print }' peak.broadPeak`
6. strand - `+`/`-` to denote strand or orientation (whenever
applicable). We use `.` since we can't decide the strand of DNA
where the binding happens.
7. thickStart - The starting position at which the feature is drawn
thickly. Not used in gappedPeak type, set to 0.
8. thickEnd - The ending position at which the feature is drawn
thickly. Not used in gappedPeak type, set to 0.
9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). Not used
in gappedPeak type, set to 0.
10. blockCount - The number of blocks (exons) in the BED line. In
MACS3 broad peak calling mode, the 'blocks' are not exons but the
stronger peaks within this peak region. Because of the restriction
in this format -- see the blockStarts, we have to make the first
base and the last base in the region in the 'blocks' if they are
not.
11. blockSizes - A comma-separated list of the block sizes. The number
of items in this list should correspond to blockCount.
12. blockStarts - A comma-separated list of block starts. The first
value must be 0 and all of the blockStart positions should be
calculated relative to chromStart. The number of items in this
list should correspond to blockCount.
13. signalValue - Measurement of overall (usually, average) enrichment
for the region. In MACS3, we use the fold-enrichment values
here. Please note that, same as the columns 8 and 9, this value
corresponds to the mean value (in this case, the fold-enrichment)
across the whole broad peak region.
14. pValue - Measurement of statistical significance (**-log10**). We
are using a 'local Poisson' model where the Poisson lambda is
estimated (by default) as the average input signal from surrounding
region.
15. qValue - Measurement of statistical significance using false
discovery rate (**-log10**). We used Benjamini-horchberg process
over the whole genome, treating the p-value calculation on each
basepair as an independent test, to get the FDR and q-value.

Please also note that, by default, we don't include a header line
(called trackline) in the gappedPeak output from MACS3. However, this
line will be necessary while uploading the gappedPeak file to UCSC for
display. Therefore, if you plan to upload the gappedPeak file, either
you turn on `--trackline` option in corresponding MACS3 subcommands,
or add the trackline mannually to the beginning of the file. A minimal
trackline is like:

`track type=gappedPeak name="track name" description="track description"`
Loading

0 comments on commit c02c662

Please sign in to comment.