GitHub - galaxy001/pirs: profile based Illumina pair-end Reads Simulator


pIRS (profile based Illumina pair-end Reads Simulator)

Contents
========
1. Introduction
2. Program framework
3. Usage
4. Examples
5. Output file format
6. Notes

1 Introduction
==============

pIRS is a program for simulating paired-end reads from a reference genome. It
is optimized for simulating reads similar to those generated from the Illumina
platform.

See `INSTALL' for installation instructions.

There are two subcommands: `pirs simulate' and `pirs diploid'. See section 3
for more details, or run `pirs simulate -h' or `pirs diploid -h'.

2 Program framework
===================
2.1 Profile Generator
Six tools are supplied. They require SOAP2 or BWA, and soap.coverage (http://soap.genomics.org.cn/soapaligner.html).
The full process is illustrated in getprofile.sh.example.
2.1.1 GC%-Depth Profile Stat.
a). Run soap and soap.coverage to get the .depth single file(s). The files may be gzip-compressed.
b). Run gc_coverage_bias on all the depth single files. You will get the GC-depth statistics in 1% GC steps, along with other files.
c). Run gc_coverage_bias_plot on the GC-depth stat file. You will get a PNG plot and a .gc file in 5% GC steps.
d). Manually check the .gc file for abnormal levels caused by low depth in certain GC% windows.
2.1.2 Base-Calling Profile Stat:
a). Run soap or bwa to get .{soap,single} or .sam file(s).
b). Run error_matrix_calculator on those file(s). You will get *.{count,ratio}.matrix files.
c). You can use error_matrix_merger to merge several .{count,ratio}.matrix files.
However, it is up to you to make sure that the read lengths match.
2.1.3 InDel Profile Stat:
a). Choose samples with no polymorphic InDels, such as the Coliphage samples that are shipped with Illumina sequencers.
b). Run bwa to get .sam/.bam file.
c). Run indelstat_sam_bam to get the profile.
2.1.4 Insert size & mapping ratio stat:
a). Run soap or bwa to get .{soap,single} or .sam file(s).
b). Run alignment_stator (*).
(*) alignment_stator cannot yet compute the mapping ratio for .sam files.
2.2 Simulator
Two commands:
pirs diploid: generates a diploid genome sequence. It reads the input genome sequence,
simulates SNPs, InDels, and SVs (structural variations) on it, and then outputs the
resulting genome sequence.
pirs simulate: simulates Illumina data and outputs paired-end read files.
Note:
a) If you only want to simulate the diploid genome sequence, "pirs diploid" is enough.
b) If you want to simulate sequencing data from a haploid genome, "pirs simulate" is all you need.
c) If you want to simulate sequencing data from a diploid genome, first run "pirs diploid"
to get the other copy of the diploid genome sequence, and then run "pirs simulate" with both the original
genome sequence and that output sequence as input.
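
Case c) can be sketched as a small Python helper that only assembles the two
command lines. The function name `diploid_read_workflow` is hypothetical, and
the name of the intermediate file is an assumption based on the output naming
shown in section 4.1; the flags are the ones documented in section 3.

```python
def diploid_read_workflow(reference, prefix, coverage, read_len=100, insert_mean=180):
    """Return the argv lists for `pirs diploid` followed by `pirs simulate`."""
    # Assumed name of the second genome copy, following section 4.1.
    second_copy = f"{prefix}.snp.indel.invertion.fa"
    diploid_cmd = ["pirs", "diploid", "-o", prefix, reference]
    simulate_cmd = [
        "pirs", "simulate", "--diploid",
        "-x", str(coverage), "-l", str(read_len), "-m", str(insert_mean),
        "-o", prefix,
        reference, second_copy,
    ]
    return diploid_cmd, simulate_cmd


if __name__ == "__main__":
    import shlex
    for cmd in diploid_read_workflow("Human_ref.fa", "Human", 5):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that the
second step only starts if the first one succeeded.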

3 Usage
=======

pirs <command> [option]
diploid generate diploid genome.
simulate simulate Illumina reads.

3.1 pirs diploid
Usage: pirs diploid [OPTIONS...] REFERENCE

Simulate a diploid genome by creating a copy of a haploid genome with
heterozygosity introduced. REFERENCE specifies a FASTA file containing
the reference genome. It may be compressed (gzip). It may contain multiple
sequences (scaffolds or chromosomes), each marked with a separate FASTA tag
line. The introduced heterozygosity takes the form of SNPs, indels, and
large-scale structural variation (insertions, deletions and inversions).
If REFERENCE is '-', the reference sequence is read from stdin, but it must be
uncompressed.

The probabilities of SNPs, indels, and large-scale structural variation can be
specified with the '-s', '-d', and '-v' options, respectively. You can also
set the ratio of transitions to transversions (for SNPs) with the '-R' option.
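
As an illustration of what the '-R' ratio means, here is a minimal Python
sketch. It assumes that a ratio R gives a transition probability of
R / (R + 1) and that the two possible transversions are equally likely; this
is one reading of the option, not necessarily how pIRS implements it. The
function name `snp_base` is hypothetical.

```python
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def snp_base(ref_base, ts_tv_ratio=2.0, rng=random):
    """Pick the alternative base for a SNP at `ref_base`."""
    # Assumption: P(transition) = R / (R + 1), e.g. 2/3 for the default R = 2.0.
    p_transition = ts_tv_ratio / (ts_tv_ratio + 1.0)
    if rng.random() < p_transition:
        return TRANSITION[ref_base]
    # Otherwise choose one of the two transversions with equal probability.
    return rng.choice(TRANSVERSIONS[ref_base])
```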

Indels are split evenly between insertions and deletions. The length
distribution of the indels is as follows and is derived from panda
re-sequencing data:
1bp 64.82%
2bp 17.17%
3bp 7.20%
4bp 7.29%
5bp 2.18%
6bp 1.34%
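
A minimal Python sketch of drawing one indel from this table follows. The
percentages are copied from the list above; the function name `sample_indel`
is hypothetical and the code is an illustration, not the pIRS implementation.

```python
import random

# Length (bp) -> percentage, copied from the table above.
INDEL_LENGTH_PERCENT = {1: 64.82, 2: 17.17, 3: 7.20, 4: 7.29, 5: 2.18, 6: 1.34}


def sample_indel(rng=random):
    """Return (kind, length) for one simulated indel."""
    kind = rng.choice(["insertion", "deletion"])  # split evenly
    lengths = list(INDEL_LENGTH_PERCENT)
    weights = list(INDEL_LENGTH_PERCENT.values())
    length = rng.choices(lengths, weights=weights, k=1)[0]
    return kind, length
```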

Large-scale structural variation is split evenly among large-scale insertions,
deletions, and inversions. By default, the length distribution of these
large-scale features is as follows:
100bp 70%
200bp 20%
500bp 7%
1000bp 2%
2000bp 1%

`pirs diploid' does not use multiple threads, even if pIRS was configured with
--enable-multiple threads.

OPTIONS:
-s, --snp-rate=RATE A floating-point number in the interval [0, 1] that
specifies the heterozygous SNP rate. Default: 0.001

-d, --indel-rate=RATE A floating-point number in the interval [0, 1] that
specifies the heterozygous indel rate.
Default: 0.0001

-v, --sv-rate=RATE A floating-point number in the interval [0, 1] that
specifies the large-scale structural variation
(insertion, deletion, inversion) rate in the diploid
genome. Default: 0.000001

-R, --transition-to-transversion-ratio=RATIO
In a SNP, a transition is when a purine or pyrimidine
is changed to one of the same (A <=> G, C <=> T)
while a transversion is when a purine is changed
into a pyrimidine or vice-versa. This option
specifies a floating-point number RATIO that gives
the ratio of the transition probability to the
transversion probability for simulated SNPs.
Default: 2.0

-o, --output-prefix=PREFIX
Use PREFIX as the prefix of the output file and logs.
Default: "pirs_diploid"

-O, --output-file=FILE
Use FILE as the name of the output file. Use '-'
for standard output; this also moves the
informational messages from stdout to stderr.

-c, --output-file-type=TYPE
The string "text" or "gzip" to specify the type of
the output FASTA file containing the diploid copy
of the genome, as well as the log files.
Default: "text"

-n, --no-logs Do not write the log files.

-S, --random-seed=SEED Use SEED as the random seed. Default:
time(NULL) * getpid()

-q, --quiet Do not print informational messages.

-h, --help Show this help and exit.

-V, --version Show version information and exit.

EXAMPLE:
./pirs diploid ref_sequence.fa -s 0.001 -d 0.0001 -v 0.000001\
-o ref_sequence >pirs.out 2>pirs.err

3.2 pirs simulate

Usage: ./pirs simulate [OPTION]... REFERENCE.FASTA...

pIRS is a program for simulating paired-end reads from a genome. It is
optimized for simulating reads from the Illumina platform. The input to
pIRS is any number of reference sequences. Typically you would just provide
one FASTA file containing your reference sequence, but you may provide two
if you have generated a diploid sequence with `pirs diploid', or if you have
chromosomes split up into multiple FASTA files. The output of pIRS is two
FASTQ files containing the simulated paired-end reads, as well as several log
files.

Input sequences are assumed to be haploid. If you instead want to simulate
reads from a diploid genome, you must give the --diploid option so that
the diploidy is taken into account when computing coverage. If you do
not do this, you will get twice as many reads as you wanted.
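
The effect of --diploid on the number of reads can be illustrated with a
little arithmetic. The sketch below assumes the usual definition
coverage = (read pairs * 2 * read length) / sequence length; the function name
`read_pairs_per_sequence` is hypothetical and the exact rounding used by pIRS
may differ.

```python
def read_pairs_per_sequence(seq_len, coverage, read_len, diploid=False):
    """Approximate number of read pairs to draw from one input sequence."""
    if diploid:
        # The two input sequences are copies of the same genome,
        # so each one receives half of the requested coverage.
        coverage = coverage / 2.0
    return round(coverage * seq_len / (2 * read_len))
```

For a 1,000,000 bp reference, 5x coverage and 100 bp reads, this gives 25000
pairs for a haploid run, and 12500 pairs from each of the two copies with
--diploid, so the total stays at 25000.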

pIRS simulates a normally-distributed insert (fragment) length using the
Box-Muller method. Usually you want the insert length standard deviation to
be 1/20 or 1/10 of the insert length mean (see the -m and -v options).
This program also simulates Illumina sequencing errors, quality scores, and
GC bias based on empirical distribution profiles. Users may use the default
profiles in this package, which were generated from a large amount of real
sequencing data, or they may generate their own profiles.
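
For readers unfamiliar with the Box-Muller method, here is a minimal Python
sketch of drawing insert lengths with it. The function names are hypothetical,
and rounding to a whole base with a floor of 1 is an assumption made for the
example; it is not taken from the pIRS source.

```python
import math
import random


def box_muller(mean, sd, rng=random):
    """Draw one normally distributed value with the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mean + sd * z


def insert_length(mean=180, sd=None, rng=random):
    """Insert length in whole bases; sd defaults to 10% of the mean."""
    if sd is None:
        sd = 0.1 * mean
    return max(1, round(box_muller(mean, sd, rng)))
```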

OPTIONS:
-l LEN, --read-len=LEN
Generate reads having a length of LEN. Default: 100

-x VAL, --coverage=VAL
Set the average sequencing coverage (sometimes called depth).
It may be either a floating-point number or an integer.

-m LEN, --insert-len-mean=LEN
Generate inserts (fragments) having an average length of LEN.
Default: 180

-v LEN, --insert-len-sd=LEN
Set the standard deviation of the insert (fragment) length.
Default: 10% of insert length mean.

-j, --jumping, --cyclicize
Make the paired-end reads face away from each other, as
in a jumping library. Default: the reads face towards each
other.

-d, --diploid
This option asserts that reads are being simulated from a
diploid genome. It causes the program to abort if there
are not exactly two reference sequences; in addition, the
coverage is divided in half, since the two reference
sequences are in reality the same genome. This option
is not required to simulate diploid reads, but without it
you must halve the coverage yourself (otherwise you will
get twice as many reads as you wanted).

-B FILE, --base-calling-profile=FILE, --subst-error-profile=FILE
Use FILE as the base-calling profile. This profile will be
used to simulate substitution errors. Default:
PREFIX/share/pirs/Base-Calling_Profiles/humNew.PE100.matrix.gz

-I FILE, --indel-error-profile=FILE, --indel-profile=FILE
Use FILE as the indel-error profile. This profile will be
used to simulate insertions and deletions in the reads that
are artifacts of the sequencing process. Default:
PREFIX/share/pirs/InDel_Profiles/phixv2.InDel.matrix

-G FILE, --gc-bias-profile=FILE, --gc-content-bias-profile=FILE
Use FILE as the GC content bias profile. This profile will
adjust the read coverage based on the GC content of
fragments. Defaults:
PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_100.dat,
PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_150.dat,
PREFIX/share/pirs/GC-depth_Profiles/humNew.gcdep_200.dat,
depending on the mean insert length.

-e RATE, --error-rate=RATE, --subst-error-rate=RATE
Set the substitution error rate. The base-calling profile
will still be used, but the average frequency of errors will
be changed to RATE. Set to 0 to disable substitution errors
completely. In that case, the base-calling profile will not
be used. Default: default error rate of base-calling
profile.

Note: since pIRS parameterizes the error rate by
several parameters, it is very difficult to determine exactly
what needs to be done to make the error rate be a given
value. We try to adjust the probabilities of getting each
quality score in order to accommodate the user-supplied error
rate. However, depending on your input sequences, the actual
error rate simulated by pIRS could be off by 20% or more.
Please check the informational output to see the final error
rate that was actually simulated.

-A ALGO, --substitution-error-algorithm=ALGO, --subst-error-algo=ALGO
Set the algorithm used for simulating substitution errors.
It may be set to the string "dist" or "qtrans". The
default is "qtrans".

The "dist" algorithm looks up the substitution error rate
for each base pair based on the current cycle and the true
base. This lookup produces a quality score and a called base
that may or may not be the same as the true base. In the
base-calling profile, the matrix we use is marked as the
[DistMatrix].

The "qtrans" algorithm is a Markov-chain model based on the
previous quality score and current cycle. The next quality
score is looked up with a certain probability based on these
parameters. The matrix used for this is marked as
[QTransMatrix] in the base-calling profile. Then, the
DistMatrix is used to find a called base for the quality score.
The DistMatrix is also used to call the base in the first
cycle.

-M MODE, --mask=MODE, --eamss=MODE
Use the EAMSS algorithm for masking read quality. MODE may be
the string "quality" or "lowercase". The EAMSS algorithm
identifies low-quality, GC-rich regions near the ends of reads.
"quality" mode will change the quality scores on these
regions to (2 + quality_shift), while "lowercase" mode
will change the base pairs to lower case, but not change
the quality values. Default: Do not use EAMSS.

-Q VAL, --quality-shift=VAL, --phred-offset=VAL
Set the ASCII shift of the quality value (usually 64 or 33 for
Illumina data). Default: 33

--no-quality-values
--fasta
Do not simulate quality values. The simulated reads will be
written as a FASTA file rather than a FASTQ file.
Substitution errors may still be done; if you do not want
to simulate any substitution errors, provide --error-rate=0 or
--no-substitution-errors.

--no-subst-errors
--no-substitution-errors
Do not simulate substitution errors. Equivalent to
--error-rate=0.

--no-indels
--no-indel-errors
Do not simulate indels. The indel error profile will not be
used.

--no-gc-bias
--no-gc-content-bias
Do not simulate GC bias. The GC bias profile will not be
used.

-o PREFIX, --output-prefix=PREFIX
Use PREFIX as the prefix of the output files. Default:
"pirs_reads"

-c TYPE, --output-file-type=TYPE
The string "text" or "gzip" to specify the type of
the output FASTQ files containing the simulated reads
of the genome, as well as the log files. Default: "text"

-z, --compress
Equivalent to -c gzip.

-n, --no-logs, --no-log-files
Do not write the log files.

-S SEED, --random-seed=SEED
Use SEED as the random seed. Default:
time(NULL) * getpid(). Note: If pIRS was not compiled with
--disable-threads, each thread actually uses its own random
number generator that is seeded by this base seed added to
the thread number; also, if you need pIRS's output to be
exactly reproducible, you must specify the random seed as well
as use only 1 simulator thread (--threads=1, or configure
with --disable-threads, or run on system with 4 or fewer
processors).

-t NUM_THREADS, --threads=NUM_THREADS
Use NUM_THREADS threads to simulate reads. This option is
not available if pIRS was compiled with the --disable-threads
option. Default: number of processors minus 2 if writing
uncompressed output, or number of processors minus 3 if
writing compressed output, or 1 if there are not this many
processors

-q, --quiet Do not print informational messages.

-h, --help Show this help and exit.

-V, --version Show version information and exit.
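
To make the "qtrans" description under the -A option more concrete, here is a
toy Python sketch of a Markov chain over quality scores, where each score
depends on the previous score and the current cycle. The transition table
`toy_transitions` is invented for illustration only; it does not reflect the
contents or layout of the real [QTransMatrix]. The first-cycle quality is
simply passed in here, whereas the text above says pIRS uses the DistMatrix
for the first cycle.

```python
import random


def toy_transitions(cycle, prev_q):
    """Hypothetical stand-in for a [QTransMatrix] lookup: next-quality weights."""
    lower = max(2, prev_q - 5)
    if lower == prev_q:
        return {prev_q: 1.0}
    # Toy rule: qualities mostly stay put early on and drift down in later cycles.
    stay = 0.9 if cycle < 50 else 0.7
    return {prev_q: stay, lower: 1.0 - stay}


def simulate_qualities(read_len, first_q=40, rng=random):
    """Walk the chain: each quality depends on the previous one and the cycle."""
    quals = [first_q]
    for cycle in range(1, read_len):
        table = toy_transitions(cycle, quals[-1])
        next_q = rng.choices(list(table), weights=list(table.values()), k=1)[0]
        quals.append(next_q)
    return quals
```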

4 Examples
==========
4.1 Simulating a diploid genome sequence.
Example command line:
pirs diploid Human_ref.fa -s 0.001 -R 2 -d 0.00001 -v 0.000001 -o Human >simulate_seq.o 2>simulate_seq.e
Output files:
a) Human.snp.indel.invertion.fa: the other copy of the diploid genome sequence.
b) Human_indel.lst: InDel information list.
c) Human_snp.lst: SNP information list.
d) Human_invertion.lst: inversion information list.
e) simulate_seq.o, simulate_seq.e: logs of the program's run.
4.2 Simulate Illumina paired-end reads from a haploid genome.
Example command line:
pirs simulate Human_ref.fa -m 170 -l 90 -x 5 -v 10 -o Human >simulate_170.o 2>simulate_170.e
Output files:
a) Human_90_170_1.fq, Human_90_170_2.fq: the paired-end read files.
b) Human_90_170.error_rate.distr: the error distribution file.
c) Human_90_170.insert_len.distr: the insert length distribution file.
d) Human_90_170.read.info: information about every simulated read.
e) simulate_170.o, simulate_170.e: logs of the program's run.
4.3 Simulate Illumina paired-end reads from a diploid genome.
Example command line:
pirs simulate --diploid Human_ref.fa Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10 \
-o Human >simulate_800.o 2>simulate_800.e
Output files:
a) Human_70_800_1.fq, Human_70_800_2.fq: the paired-end read files.
b) Human_70_800.error_rate.distr: the error distribution file.
c) Human_70_800.insertsize.distr: the insert size distribution file.
d) Human_70_800.read.info.gz: information about every simulated read.
e) simulate_800.e, simulate_800.o: logs of the program's run.

5 Output file format
====================
a) *.fq/*.fq.gz

FASTQ files containing the simulated reads. The files are given names
similar to PREFIX_70_800_1.fq, where PREFIX is the prefix provided by the -o
option (default: "pirs_reads"), 70 is the read length, 800 is the mean
insert length, and 1 means the file for read 1 of the read pairs. These
files will be written as GZIP files with the .gz suffix if you provide the
`-c gzip' option.
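
The naming scheme can be written down as a small Python helper, which is handy
when a pipeline needs to locate the files afterwards. The function name
`fastq_names` is hypothetical; the pattern is taken from the description above
and the examples in section 4.

```python
def fastq_names(prefix="pirs_reads", read_len=100, insert_mean=180, gzip=False):
    """Return the expected (read 1, read 2) FASTQ file names."""
    suffix = ".fq.gz" if gzip else ".fq"
    base = f"{prefix}_{read_len}_{insert_mean}"
    return f"{base}_1{suffix}", f"{base}_2{suffix}"
```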

@read_800_21/1
ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG
+
egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb
@read_800_22/1
CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT
+
gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg
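
Quality strings such as the ones above are decoded by subtracting the ASCII
shift set with -Q from each character code. The sample records appear to use a
shift of 64 (an inference from the characters used, since 'g' would be an
unusually high score at 33); note that the documented default for -Q is 33.
The function name `phred_scores` is hypothetical.

```python
def phred_scores(quality_string, offset=64):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]
```

For example, `phred_scores("egc")` returns `[37, 39, 35]`.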

b) *.read.info

Information about each simulated read.

Column 1: read ID.
Column 2: Name of the reference file; this is mainly provided so that you
can tell which chromosome set a read came from if you simulate reads
from a diploid genome produced with the `pirs diploid' command.
Column 3: FASTA/FASTQ sequence ID of the contig/scaffold/chromosome.
Column 4: position(1-based) in chromosome.
Column 5: "+" forward direction; "-" reverse direction.
Column 6: real insert size.
Column 7: length of read-end masking by EAMSS algorithm.
Column 8: read-position(1-based) of substitution error and raw base(->)error base.
Column 9: read-position(1-based) of insertion error and the base of insertion.
Column 10: read-position(1-based) of deletion error and the base of deletion

c) *_snp.lst
For example:

I 3 3 G A
I 45342 45355 C T
I 104775 104680 C T
.....
Column 1: chromosome sequence ID.
Column 2: position(1-based) of SNP in reference chromosome.
Column 3: position(1-based) of SNP in simulated diploid chromosome.
Column 4: base of reference chromosome sequence.
Column 5: the base of SNP.
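
A minimal Python sketch for reading such records follows. It assumes the
columns are separated by whitespace and that bases are written in upper case,
as in the example lines above; the names `Snp`, `parse_snp_line` and
`is_transition` are hypothetical.

```python
from collections import namedtuple

Snp = namedtuple("Snp", "chrom ref_pos diploid_pos ref_base alt_base")
PURINES = {"A", "G"}


def parse_snp_line(line):
    """Parse one whitespace-separated *_snp.lst record."""
    chrom, ref_pos, dip_pos, ref_base, alt_base = line.split()
    return Snp(chrom, int(ref_pos), int(dip_pos), ref_base, alt_base)


def is_transition(snp):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (snp.ref_base in PURINES) == (snp.alt_base in PURINES)


# The three example records shown above.
example = ["I 3 3 G A", "I 45342 45355 C T", "I 104775 104680 C T"]
snps = [parse_snp_line(line) for line in example]
```

Counting `is_transition` over a whole file gives a quick check of the
transition to transversion ratio that was actually simulated.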

d) *_indel.lst
For example:
I - 3 3 1 C
IV + 5 5 2 AC
.....
Column 1: chromosome sequence ID.
Column 2: "-" Deletion; "+" Insertion.
Column 3: position(1-based) of InDel in the reference chromosome sequence.
For deletions, it is the position of the first deleted base. For
insertions, it is the position of the reference base corresponding
to the base directly before the insertion.
Column 4: position(1-based) of InDel in the diploid sequence. For
deletions, it is the position of the diploid sequence base
corresponding to the reference base directly before the deletion.
For insertions, it is the position of the first inserted base.
Column 5: length of InDel (always positive).
Column 6: sequence of the InDel.

I - 3 3 1 C
Ref: a t G c a // 3 is the position of the G base in the chromosome sequence.
: a t G - a // the G base is followed by a deletion of the "C" base.
IV + 5 5 2 A C
Ref: t t g c A - - g t t// 5 is the position of the A base in the chromosome sequence.
: t t g c A A C g t t// the A base is followed by an insertion of the "AC" bases.

e) *_inversion.lst
For example:
I 50191 50195 100
I 948984 948903 200
Column 1: chromosome sequence ID.
Column 2: position(1-based) of beginning of inversion in the reference
chromosome sequence.
Column 3: position(1-based) of beginning of inversion of the diploid
chromosome sequence.
Column 4: length of inversion.

6 Notes
=======
a) pIRS does not simulate reads containing the "N" character. If your input
reference contains "N" characters, reads generated in those regions will be
discarded.
b) When running a simulation without the --no-subst-errors option, the maximum
length of the simulated reads depends on the number of cycles recorded in
the base-calling profile. The user must set the read length to no more
than half the number of cycles recorded in the base-calling profile.
c) When masking quality values, the program uses the same EAMSS algorithm
from CASAVA v1.8.0. This is only done if the --eamss option is supplied.
d) The program parses one chromosome at a time. Reads are evenly distributed
among the chromosomes.
e) To re-do a simulation and get the exact same results, you must set the
random seed with the -S parameter. In addition, you must use the
single-threaded version of pIRS, or else set --threads=1 (for only one
simulator thread).
f) pIRS will shift quality values when the substitution-error rate set by
the user differs from that in the profile. The quality scores in the output
data range from 2 to the maximum score in the profile. You can find the actual
substitution-error rate of the output data in the *.error_rate.distr file.
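
The link between quality values and error rate in note f) follows from the
Phred definition Q = -10 * log10(p). The Python sketch below shows how
shifting scores changes the implied average error rate. It is only an
illustration of the principle: a uniform shift and an upper limit of 40 are
assumptions made for the example, not the adjustment pIRS actually performs.

```python
def mean_error_rate(phred_quals):
    """Average per-base error probability implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in phred_quals) / len(phred_quals)


def shift_qualities(phred_quals, delta, q_min=2, q_max=40):
    """Shift every score by `delta`, clamped to [q_min, q_max]."""
    # q_min = 2 matches the lower bound in note f); q_max = 40 is hypothetical.
    return [min(q_max, max(q_min, q + delta)) for q in phred_quals]
```

Lowering the scores raises the implied error rate, and raising them lowers it,
within the limits set by the clamping.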

For update & support, please refer to http://code.google.com/p/pirs/ .