forked from ngs-docs/2013-davis-assembly
-
Notifications
You must be signed in to change notification settings - Fork 0
/
titus-notes.txt
81 lines (52 loc) · 2.88 KB
/
titus-notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
================================
Notes on using khmer to do stuff
================================
Software
========
You'll need khmer and screed installed::
pip install khmer
pip install screed
You will also need the khmer repository for some of the sandbox scripts::
git clone https://github.com/ged-lab/khmer.git -b master
We'll be using the `khmer command line scripts
<http://khmer.readthedocs.org/en/latest/scripts.html>`__.
0. Assembling mRNAseq and metagenome data in the cloud
======================================================
See `Eel Pond mRNAseq assembly protocol
<https://khmer-protocols.readthedocs.org/en/v0.8.3/mrnaseq/index.html>`__
and `Kalamazoo metagenome assembly protocols
<https://khmer-protocols.readthedocs.org/en/v0.8.3/metagenomics/index.html>`__.
1. Generating a saturation curve for shotgun data
=================================================
You can use normalize-by-median.py to assess information saturation in your
shotgun data::
normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 -R report.txt <FASTA/FASTQ/GZ/BZ2 files>
Note that the '-x' and '-N' parameters should multiple to be under the total
amount of data in your data set.
'report.txt' column 1 will then contain the number of reads examined,
and column 2 will contain the number of reads kept. When this levels
off, you've saturated to a coverage of 5.
2. Estimating genomic richness/non-repetitive genome content
============================================================
You can use the three-pass assembly protocol (see `Kalamazoo <https://khmer-protocols.readthedocs.org/en/v0.8.3/metagenomics/index.html>`__) ::
normalize-by-median.py -k 20 -C 20 -x 2e9 -N 4 --savehash C20.kh <FASTA/FASTQ/GZ/BZ2 files>
filter-abund.py --loadhash C20.kh *.keep
normalize-by-median.py -k 20 -C 5 -x 2e9 -N 4 *.keep.abundfilt
Finally, get the complete stats on the remaining reads::
python /path/to/khmer/sandbox/readstats.py *.keep.abundfilt.keep
Divide the total number of bases remaining by 5 and you'll have an estimate
of your total genome size. (Technically, the total number of nodes in
your De Bruijn graph. :)
3. Generating a coverage plot/coverage spectrum without a reference
===================================================================
First, load all the k-mers in the data set into a counting file::
load-into-counting.py -k 20 -C 20 -x 2e9 -N 4 counts.kh reads.fq
Then calculate the spectrum of read coverages::
python /path/to/khmer/sandbox/calc-median-distribution.py counts.kh reads.fq reads.hist
4. Generating an error profile for individual shotgun reads
===========================================================
For this, you'll need to use the kmer_error_profile branch::
git clone https://github.com/ged-lab/khmer.git -b kmer_error_profile
cd khmer && make all
and now run the calc-error-profile.py script::
/path/to/khmer/scripts/calc-error-profile.py reads.fq