update docs

macs3-project · Sep 6, 2024 · c02c662 · c02c662
1 parent ddcab9c
commit c02c662
Show file tree

Hide file tree

Showing 13 changed files with 439 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,7 @@
 # MACS: Model-based Analysis for ChIP-Seq
 
-![Status](https://img.shields.io/pypi/status/macs3.svg) ![License](https://img.shields.io/github/license/macs3-project/MACS) ![Programming languages](https://img.shields.io/github/languages/top/macs3-project/MACS) ![CI x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20x64/badge.svg?branch=master) ![CI non x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20non%20x64/badge.svg?branch=master) ![CI Mac OS](https://github.com/macs3-project/MACS/actions/workflows/build-and-test-MACS3-macos.yml/badge.svg?branch=master)
+![Status](https://img.shields.io/pypi/status/macs3.svg) ![License](https://img.shields.io/github/license/macs3-project/MACS) ![Programming languages](https://img.shields.io/github/languages/top/macs3-project/MACS) ![CI x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20x64/badge.svg?branch=master) ![CI non x64](https://github.com/macs3-project/MACS/workflows/MACS3%20CI%20non%20x64/badge.svg?branch=master) ![CI Mac OS](https://github.com/macs3-project/MACS/actions/workflows/build-and-test-MACS3-macos.yml/badge.svg?branch=master) [![CZI's Essential Open Source Software for Science](https://chanzuckerberg.github.io/open-science/badges/CZI-EOSS.svg)](https://czi.co/EOSS)
+
 
 [![PyPI
 download](https://img.shields.io/pypi/dm/macs3?label=pypi%20downloads)](https://pypistats.org/packages/macs3)

diff --git a/docs/source/docs/BED.md b/docs/source/docs/BED.md
@@ -1 +1,10 @@
 # BED format
+
+The BED format is a widely used format to store the genome
+annotation. It's flexible and has many variation depends on how many
+fields exists in the file. The official definition of BED format can
+be found
+[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format1). Formats
+like BEDPE, narrowPeak, broadPeak, or gappedPeak can be regarded as
+variation of BED format.
+
diff --git a/docs/source/docs/BEDPE.md b/docs/source/docs/BEDPE.md
@@ -4,10 +4,11 @@ The BEDPE format is specifically designed for keeping the alignment
 locations of each read pair from Paired-End library. This is not a
 general format but only a format for MACS3. It only contains three
 columns -- the chromosome, the leftmost position of read pair, and the
-rightmost position of the read pair. These three columns All other information from
+rightmost position of the read pair. All other information from
 alignment will not be kept in this format, such as the length of the
-read, the mismatches/gaps in the alignment, and etc. An example is as
-followed:
+read, the mismatches/gaps in the alignment, and etc. We can use this
+format to generate a simplified alignment file for PE library and gzip
+it to minimize the file size. An example is as followed:
 
 ```
 chrXIII	0	60

diff --git a/docs/source/docs/INSTALL.md b/docs/source/docs/INSTALL.md
@@ -82,12 +82,11 @@ Then activate it by
 
 `$ source MACS3env/bin/activate`
 
-If you use 'conda', it will also provide virtual environment. Please
-read:
-[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)
-or [miniconda](https://docs.conda.io/en/latest/miniconda.html) For
-example, after installing 'conda', you can use `conda create -n MACS3`
-to create a new environment called 'MACS3' then switch to this
+If you use 'conda' through 'miniforge' project, it will also provide
+virtual environment. Please read:
+[miniforge](https://github.com/conda-forge/miniforge). For example,
+after installing 'conda', you can use `conda create -n MACS3` to
+create a new environment called 'MACS3' then switch to this
 environment with `conda activate MACS3`.
 
 There is another solution, [pipx](https://pipx.pypa.io/), to make a

diff --git a/docs/source/docs/SAMBAMBAMPE.md b/docs/source/docs/SAMBAMBAMPE.md
@@ -1,2 +1,26 @@
 # SAM/BAM/BAMPE format
 
+SAM and BAM formats are the most common format used to store alignment
+information of sequencing reads. The BAM format is in fact the binary
+version of SAM which is a text file. Please refer to [SAM format
+specification](https://samtools.github.io/hts-specs/SAMv1.pdf) for
+detail. As in MACS3, we call the BAM file from paired-end reads
+mapping as `BAMPE` -- BAM file for Paired-End data. When you specify
+the input format as `BAMPE` such as `-f BAMPE` in `callpeaks`, you
+will trigger the PE mode for MACS3 where we will consider the whole
+fragment between read pair as the DNA fragment bound by the protein of
+interest.  On the contrast, even if the BAM storing PE mapping
+information, if you specify `-f BAM`, MACS3 will treat the input as
+single-end data and use only the 1st mate of read pair to estimate the
+fragment length using the cross-correlation model.
+
+Most of MACS3 modules only take the mapping location of reads from
+SAM/BAM/BAMPE file. If necessary, you can use `macs3 filterdup
+--keep-dup all` or `macs3 randsample -p 100` to convert the
+SAM/BAM/BAMPE into BED or [BEDPE](./BEDPE.md) format, then gzip them
+to save storage. 
+
+The only exception is that the `callvar` command in MACS3 will need to
+read the actual sequences and mapping information such as the
+mismatch, insertion, deletion, etc from BAM. Also, please make sure
+that the BAM file for `callvar` has to be sorted and indexed.
diff --git a/docs/source/docs/bedGraph.md b/docs/source/docs/bedGraph.md
@@ -1 +1,13 @@
 # bedGraph format
+
+The official definition of bedGraph can be found
+[here](https://genome.ucsc.edu/goldenPath/help/bedgraph.html). As for
+all other files generated directly from MACS3, you may need to add a
+trackline to the beginning of the file when you upload the file to
+UCSC genome browser for display. A minimal trackline for bedGraph is
+`track type=bedGraph`. Please read the above link to the official
+definition page for more instruction. The bedGraph format file can be
+further converted into a binary format named 'bigWig' for smaller file
+size and for fast and random access. Check this
+[gist](https://gist.github.com/taoliu/2469050) for a script to convert
+the plain text bedGraph to binary bigWig.
diff --git a/docs/source/docs/broadPeak.md b/docs/source/docs/broadPeak.md
@@ -1 +1,58 @@
 # broadPeak format
+
+You can find the official definition of broadPeak format
+[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format13). We chose
+this format for broad peak calling results. It is a BED variant, and
+comparing with narrowPeak format, it just misses the last peak summit
+column.
+
+The fields/columns in the file are defined as followed (based on UCSC
+definition, with specific notes on MACS3 output).
+
+1. chrom - Name of the chromosome (or contig, scaffold, etc.), as used
+   in the input alignment file for MACS3.
+2. chromStart - The starting position of the feature in the chromosome
+   or scaffold. The first base in a chromosome is numbered **0**.
+3. chromEnd - The ending position of the feature in the chromosome or
+   scaffold. The chromEnd base **is not included** in the display of
+   the feature. For example, the first 100 bases of a chromosome are
+   defined as chromStart=0, chromEnd=100, and span the bases numbered
+   0-99. This BED-style coordination system is machine-friendly but
+   not human-friendly. As a note, this is different with the
+   coordination system used in GFF format file, where the same region
+   in the above example will be defined as chromStart=1, chromEnd=100.
+4. name - Name given to a region (preferably unique). In MACS3, it
+   will be the experiment name plus a number such as 'FoxA1_99'.
+5. score - Indicates how dark the peak will be displayed in the
+   browser (0-1000). Thus, it's for the purpose of displaying on
+   genome browser. In MACS3 `callpeak` output, we use the
+   -log10qvalue*10. However, it may happen when the value in this
+   column goes above 1000, and cause trouble while loading it in
+   genome browsers. In this case, use the following `awk` command to
+   fix: `awk -F'\t' '{ if ($5 > 1000) $5=1000; OFS="\t"; print }' peak.broadPeak`
+6. strand - `+`/`-` to denote strand or orientation (whenever
+   applicable). We use `.` since we can't decide the strand of DNA
+   where the binding happens.
+7. signalValue - Measurement of overall (usually, average) enrichment
+   for the region. In MACS3, we use the fold-enrichment values
+   here. Please note that, same as the columns 8 and 9, this value
+   corresponds to the mean value (in this case, the fold-enrichment)
+   across the whole broad peak region.
+8. pValue - Measurement of statistical significance (**-log10**). We
+   are using a 'local Poisson' model where the Poisson lambda is
+   estimated (by default) as the average input signal from surrounding
+   region.
+9. qValue - Measurement of statistical significance using false
+   discovery rate (**-log10**). We used Benjamini-horchberg process
+   over the whole genome, treating the p-value calculation on each
+   basepair as an independent test, to get the FDR and q-value.
+
+But please also note that, by default, we don't include a header line
+(called trackline) in the broadPeak output from MACS3. However, this
+line will be necessary while uploading the broadPeak file to UCSC for
+display. Therefore, if you plan to upload the broadPeak file, either
+you turn on `--trackline` option in corresponding MACS3 subcommands,
+or add the trackline mannually to the beginning of the file. A minimal
+trackline is like:
+
+`track type=broadPeak name="track name" description="track description"`
diff --git a/docs/source/docs/callpeak.md b/docs/source/docs/callpeak.md
@@ -350,7 +350,7 @@ the essentials.
 ## Output files
 
 1. `NAME_peaks.xls` is a tabular file which contains information about
-   called peaks. You can open it in excel and sort/filter using excel
+   called peaks. You can open it in Excel and sort/filter using excel
    functions. Information include:
    
    - chromosome name

diff --git a/docs/source/docs/callpeakxls.md b/docs/source/docs/callpeakxls.md
@@ -1 +1,97 @@
 # `callpeak` output XLS format
+
+This plain text file format is used as the standard output from
+`callpeak` command. It's not a Microsoft XLS format. We use the suffix
+as`.xls` because we can trick the OS to open this plain text file with
+Microsoft Excel. This is a legacy feature from early MACS. In fact,
+this is a `tsv` format file with commented header lines.
+
+## Header lines
+
+The commented header lines starting with '#' include important runtime
+messages from `macs3 callpeak` command that have been used to generate
+this output file. Here is an example from a run based on the data in
+the `test` directory of MACS3 GitHub repository.
+
+```
+# This file is generated by MACS version 3.0.2
+# Command line: callpeak -t CTCF_12878_5M.bed.gz -c Input_12878_5M.bed.gz -n speedtest -B --trackline
+# ARGUMENTS LIST:
+# name = speedtest
+# format = AUTO
+# ChIP-seq file = ['CTCF_12878_5M.bed.gz']
+# control file = ['Input_12878_5M.bed.gz']
+# effective genome size = 2.91e+09
+# band width = 300
+# model fold = [5, 50]
+# qvalue cutoff = 5.00e-02
+# The maximum gap between significant sites is assigned as the read length/tag size.
+# The minimum length of peaks is assigned as the predicted fragment length "d".
+# Larger dataset will be scaled towards smaller dataset.
+# Range for calculating regional lambda is: 1000 bps and 10000 bps
+# Broad region calling is off
+# Paired-End mode is off
+
+# tag size is determined as 36 bps
+# total tags in treatment: 4999978
+# tags after filtering in treatment: 4636124
+# maximum duplicate tags at the same position in treatment = 1
+# Redundant rate in treatment: 0.07
+# total tags in control: 4999975
+# tags after filtering in control: 4895150
+# maximum duplicate tags at the same position in control = 1
+# Redundant rate in control: 0.02
+# d = 101
+# alternative fragment length(s) may be 101 bps
+
+```
+
+The first paragraph contains major parameters for this run, and the
+second paragraph contains important statistics of the dataset. The
+numbers for 'tags after filtering' or 'fragments after filtering' (for
+PE data), after removing duplicates, are considerred as the library
+size when MACS3 tries to estimate genome-wide background or to
+normalize signals. You can also find the estimated fragment size, for
+single-end dataset; for PE data, you will see the average insert
+length of all the read pairs after filtering.
+
+## Content of peak calls
+
+After the commented header lines, you will see the content of peak
+calls. The content will start with a header line describing each
+columns:
+
+```
+chr     start   end     length  abs_summit      pileup  -log10(pvalue)  fold_enrichment -log10(qvalue)  name
+```
+
+The columns include:
+
+1. chromosome name - Name of the chromosome (or contig, scaffold, etc.), as used
+   in the input alignment file for MACS3.
+2. start position of peak - The starting position of the feature in the chromosome
+   or scaffold. The first base in a chromosome is numbered **1** which
+   is different with BED-style file but similar to GFF-style file.
+3. end position of peak - The ending position of the feature in the chromosome or
+   scaffold. The chromEnd base **is included** in the peak which
+   is different with BED-style file but similar to GFF-style file.
+4. length of peak region - `=end-start+1`.
+5. absolute peak summit position - This is the field we record the
+    so-called peak summit location, where the enrichment is the
+    highest in the peak region. This is the absolute location of
+    summit, different with the relative summit location in narrowPeak
+    format. 
+6. pileup height at peak summit
+7. p-score - It's the `-log10(pvalue)` from local Poisson test for the
+   peak summit (e.g. pvalue =1e-10, then this value should be 10)
+8. fold enrichment for this peak summit against random Poisson
+  distribution with local lambda,
+9. q-score - It's the `-log10(qvalue)` at peak summit. We used
+   Benjamini-horchberg process over the whole genome, treating the
+   p-value calculation on each basepair as an independent test, to get
+   the FDR and q-value.
+
+Note: When `--broad` is enabled for broad peak calling, the pileup,
+p-value, q-value, and fold change in this file will be the mean value
+across the entire peak region, since peak summit won't be called in
+broad peak calling mode.
diff --git a/docs/source/docs/gappedPeak.md b/docs/source/docs/gappedPeak.md
@@ -1 +1,76 @@
 # gappedPeak format
+
+You can find the official definition of gappedPeak format
+[here](https://genome.ucsc.edu/FAQ/FAQformat.html#format14). We chose
+this format for broad peak calling, since it can represent a nested
+peak calling result where there are some highlighted narrower regions
+inside of a wider genomic region.
+
+The fields/columns in the file are defined as followed (based on UCSC
+definition, with specific notes on MACS3 output).
+
+1. chrom - Name of the chromosome (or contig, scaffold, etc.), as used
+   in the input alignment file for MACS3.
+2. chromStart - The starting position of the feature in the chromosome
+   or scaffold. The first base in a chromosome is numbered **0**.
+3. chromEnd - The ending position of the feature in the chromosome or
+   scaffold. The chromEnd base **is not included** in the display of
+   the feature. For example, the first 100 bases of a chromosome are
+   defined as chromStart=0, chromEnd=100, and span the bases numbered
+   0-99. This BED-style coordination system is machine-friendly but
+   not human-friendly. As a note, this is different with the
+   coordination system used in GFF format file, where the same region
+   in the above example will be defined as chromStart=1, chromEnd=100.
+4. name - Name given to a region (preferably unique). In MACS3, it
+   will be the experiment name plus a number such as 'FoxA1_99'.
+5. score - Indicates how dark the peak will be displayed in the
+   browser (0-1000). Thus, it's for the purpose of displaying on
+   genome browser. In MACS3 `callpeak` output, we use the
+   -log10qvalue*10. However, it may happen when the value in this
+   column goes above 1000, and cause trouble while loading it in
+   genome browsers. In this case, use the following `awk` command to
+   fix: `awk -F'\t' '{ if ($5 > 1000) $5=1000; OFS="\t"; print }' peak.broadPeak`
+6. strand - `+`/`-` to denote strand or orientation (whenever
+   applicable). We use `.` since we can't decide the strand of DNA
+   where the binding happens.
+7. thickStart - The starting position at which the feature is drawn
+   thickly. Not used in gappedPeak type, set to 0.
+8. thickEnd - The ending position at which the feature is drawn
+   thickly. Not used in gappedPeak type, set to 0.
+9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). Not used
+   in gappedPeak type, set to 0.
+10. blockCount - The number of blocks (exons) in the BED line. In
+    MACS3 broad peak calling mode, the 'blocks' are not exons but the
+    stronger peaks within this peak region. Because of the restriction
+    in this format -- see the blockStarts, we have to make the first
+    base and the last base in the region in the 'blocks' if they are
+    not. 
+11. blockSizes - A comma-separated list of the block sizes. The number
+    of items in this list should correspond to blockCount.
+12. blockStarts - A comma-separated list of block starts. The first
+	value must be 0 and all of the blockStart positions should be
+	calculated relative to chromStart. The number of items in this
+	list should correspond to blockCount.
+13. signalValue - Measurement of overall (usually, average) enrichment
+   for the region. In MACS3, we use the fold-enrichment values
+   here. Please note that, same as the columns 8 and 9, this value
+   corresponds to the mean value (in this case, the fold-enrichment)
+   across the whole broad peak region.
+14. pValue - Measurement of statistical significance (**-log10**). We
+   are using a 'local Poisson' model where the Poisson lambda is
+   estimated (by default) as the average input signal from surrounding
+   region.
+15. qValue - Measurement of statistical significance using false
+   discovery rate (**-log10**). We used Benjamini-horchberg process
+   over the whole genome, treating the p-value calculation on each
+   basepair as an independent test, to get the FDR and q-value.
+
+Please also note that, by default, we don't include a header line
+(called trackline) in the gappedPeak output from MACS3. However, this
+line will be necessary while uploading the gappedPeak file to UCSC for
+display. Therefore, if you plan to upload the gappedPeak file, either
+you turn on `--trackline` option in corresponding MACS3 subcommands,
+or add the trackline mannually to the beginning of the file. A minimal
+trackline is like:
+
+`track type=gappedPeak name="track name" description="track description"`