Update README.md

rderelle · Sep 22, 2024
commit 7813322 · 1 parent 91a5616
Showing 1 changed file with 22 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -7,21 +7,27 @@
 
 ### Overview
 
-Fastlin is an ultra-fast program to perform lineage typing of <i>Mycobacterium tuberculosis</i> complex (MTBC) FASTQ read data and FASTA assemblies. Using the split-kmer approach, it can accuratly predict MTBC lineages and strain mixtures in seconds.
+Fastlin is an ultra-fast program to perform lineage typing of <i>Mycobacterium tuberculosis</i> complex (MTBC) FASTQ read data, BAM files and FASTA assemblies. Using the split-kmer approach, it can accurately predict MTBC lineages and strain mixtures in seconds.
 
 Reference: [fastlin: an ultra-fast program for Mycobacterium tuberculosis complex lineage typing.](https://doi.org/10.1093/bioinformatics/btad648)
 
+Main updates since publication:
++ 0.2.3 : FASTA files as input (also using [seq_io](https://github.com/markschl/seq_io))
++ 0.3.0 : multi-threading available
++ 0.4.0 : BAM files as input (using [rust-htslib](https://github.com/rust-bio/rust-htslib))
+
+There are no planned updates at the moment. Please open an issue if you have any suggestions or requests.
 
 ### Installation
-To install fastlin via cargo, you must have the [rust toolchain](https://www.rust-lang.org/tools/install) installed.
+To install fastlin via cargo (you must have the [rust toolchain](https://www.rust-lang.org/tools/install) installed):
 ```
 cargo install fastlin
 ```
-Or you can copy the code from this repository and install it using this command:
+Or you can download the latest release from this repository and compile it using cargo:
 ```
-cargo install --path .
+cargo install --path directory_release
 ```
-Alternatively, you can install precompiled binaries using Conda (Linux and macOS Intel processors):
+Alternatively, you can install precompiled binaries using Conda:
 ```
 conda install -c bioconda fastlin
 ```
@@ -30,18 +36,22 @@ You will also need a barcode file (see Input files below).
 ### Running fastlin
 The default command line is:
 ```
-fastlin -d /path/directory_fastq_files -b barcodes_file.txt
+fastlin -d your_directory -b your_barcodes.txt
 ```
 If your dataset consists of FASTQ files that are not BAM-derived, then you can apply a maximum kmer coverage threshold to reduce runtimes: 
 ```
-fastlin -d /path/directory_fastq_files -b barcode_file.txt -x 80
+fastlin -d your_directory -b your_barcodes.txt -x 80
 ```
 
 ### Input files
-<p>Fastlin takes as input the path of the directory containing the fastq and/or fasta files. The directory can contain a mix of FASTA geome assemblies, paired-end and single-end FASTQ files. These data files should be gzipped, with the following extensions:</p>
+<p>Fastlin takes as input the path of the directory containing FASTQ, BAM and/or FASTA files. The directory can contain a mix of FASTA genome assemblies, BAM alignment files, paired-end and single-end FASTQ files. FASTQ and FASTA files should be gzipped, with the following extensions:</p>
 
 - **.fastq.gz** or **.fq.gz** for FASTQ read data. The names of paired-end files should be in the form name_1.fq.gz and name_2.fq.gz (or equivalent with fastq.gz)
-- **.fas.gz**, **.fasta.gz** or **.fna.gz** for FASTA genome assemblies. In the cases of FASTA files, (i) the min-occurence paramter is automatically set to 1 and (ii) the maximum kmer coverage is ignored.
+- **.bam** or **.BAM** for BAM files. Here the maximum kmer coverage is ignored since BAM files can be sorted.
+- **.fas.gz**, **.fasta.gz** or **.fna.gz** for FASTA genome assemblies. Here, the minimum occurrence is set to 1 and the maximum kmer coverage is ignored.
+
+<p>Please note that BAM files are analyzed in the same way as FASTQ files, by scanning reads without considering quality scores.
+You may find faster scripts or programs that focus solely on specific genomic positions.</p> 
 
 <p>The MTBC barcode file can be downloaded from https://www.github.com/rderelle/barcodes-fastlin. 
 Alternatively, you can build and test your own kmer barcodes using the Python scripts available in that directory.</p> 
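The extension rules in this hunk can be illustrated with a short Python sketch. It is not part of fastlin and does not reproduce its internal logic: the `EXTENSIONS` table only restates the suffixes listed above, and the names `classify` and `group_fastq` are hypothetical helpers introduced here. The pairing rule (a trailing `_1` or `_2` before the extension) follows the naming convention given for paired-end files.

```python
import re
from collections import defaultdict

# Suffixes copied from the README text; the dictionary keys are labels
# chosen for this sketch, not fastlin identifiers.
EXTENSIONS = {
    "fastq": (".fastq.gz", ".fq.gz"),
    "bam": (".bam", ".BAM"),
    "fasta": (".fas.gz", ".fasta.gz", ".fna.gz"),
}


def classify(filename):
    """Return 'fastq', 'bam' or 'fasta', or None for an unrecognized name."""
    for label, suffixes in EXTENSIONS.items():
        if filename.endswith(suffixes):
            return label
    return None


def group_fastq(filenames):
    """Group FASTQ files by sample, treating name_1/name_2 as mates."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        if classify(name) != "fastq":
            continue
        stem = re.sub(r"\.(fastq|fq)\.gz$", "", name)
        sample = re.sub(r"_[12]$", "", stem)  # assumption: mates end in _1/_2
        groups[sample].append(name)
    return dict(groups)


if __name__ == "__main__":
    files = ["a_1.fq.gz", "a_2.fq.gz", "b.fastq.gz", "c.bam", "d.fna.gz"]
    print({f: classify(f) for f in files})
    print(group_fastq(files))
```

A check like this can be run on a directory listing before launching fastlin, to spot files whose extension would not be recognized (for example an uncompressed `.fasta`).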
@@ -53,7 +63,7 @@ A full description of fastlin parameters can be found [here](https://github.com/
 ### Output file
 Fastlin output consists of a tab-delimited file with the following fields:
 + sample: sample name
-+ data type: 'assembly', 'single' (reads) or 'paired' (-end reads)
++ data type: 'assembly', 'BAM', 'single' (reads) or 'paired' (-end reads)
 + k_cov: theoretical kmer coverage of the FASTQ file(s) based on the number of extracted kmers
 + mixture: pure ('no') or mixed ('yes') sample
 + lineages: detected lineages (median kmer occurrences within parentheses)
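As a rough illustration of how the fields listed above could be read back, here is a minimal Python sketch. It covers only the five columns shown in this hunk and keeps any further columns untouched in `extra`. The exact layout of the lineages cell is an assumption: the sketch supposes entries of the form `name (median)`, separated by commas when several lineages are reported. The names `parse_row` and `LINEAGE_RE` are hypothetical, and the sample row is rebuilt from the values quoted in the README's ERRxxxxx example rather than copied from a real output file.

```python
import re

# Assumed cell format: "<lineage> (<median kmer occurrences>)", e.g. "2 (45)".
LINEAGE_RE = re.compile(r"([\w.]+)\s*\((\d+)\)")


def parse_row(line):
    """Parse one tab-delimited output line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    sample, data_type, k_cov, mixture, lineages = fields[:5]
    return {
        "sample": sample,
        "data_type": data_type,
        "k_cov": int(k_cov),
        "mixture": mixture == "yes",
        "lineages": [(name, int(median))
                     for name, median in LINEAGE_RE.findall(lineages)],
        "extra": fields[5:],  # columns not described in this hunk
    }


if __name__ == "__main__":
    row = parse_row("ERRxxxxx\tpaired\t118\tno\t2 (45)")
    name, median = row["lineages"][0]
    # Ratio of strain abundance to theoretical kmer coverage.
    print(name, median, round(median / row["k_cov"], 2))
```

Comparing the median occurrence with `k_cov`, as in the last line, is the same reasoning the README applies to ERRxxxxx: a ratio far below 1 suggests that many extracted kmers do not come from the typed strain.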
@@ -65,7 +75,7 @@ ERRxxxxx&nbsp;&nbsp;&nbsp;&nbsp;paired&nbsp;&nbsp;&nbsp;&nbsp;118&nbsp;&nbsp;&nb
 
 The sample ERRxxxxx contains a single strain belonging to lineage 2. This typing is supported by 7 kmer barcodes, with a median number of occurrences of 45. Since the abundance of the strain is far below the theoretical kmer coverage (equal here to 118), we can conclude that the sample is likely to contain a high level of contamination or sequencing errors.
 
-### Multithreading
+### Multi-threading
 <p>By default, fastlin runs on 1 thread. The number of threads can be increased using the '-t' parameter, which will split the sample set among all threads (for a single sample, increasing the number of threads will have no impact on runtime).</p>
 
 <p>Here are some examples of runtimes (in seconds) using real-world Mtb genomic data on an M2 MacBook Air:</p>
@@ -74,6 +84,7 @@ The sample ERRxxxxx contains a single strain belonging to lineage 2. This typing
 | data               | 1 thread  | 4 threads |
 |--------------------|-----------|-----------|
 | 12 paired FASTQ    |   66.9    | 19.3      |
+| 4 BAM files        |   26.9    | 9.4       |
 | 190 genomes FASTA  |   6.7     | 1.8       |
 
 </div>
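To put the runtimes in the table into perspective, the following Python sketch computes the speedup (1-thread time divided by 4-thread time) and the parallel efficiency (speedup divided by the number of threads). The figures are copied from the table; the helper names `speedup` and `efficiency` are introduced only for this illustration.

```python
# Runtimes in seconds, copied from the table: (1 thread, 4 threads).
RUNTIMES = {
    "12 paired FASTQ": (66.9, 19.3),
    "4 BAM files": (26.9, 9.4),
    "190 genomes FASTA": (6.7, 1.8),
}


def speedup(t_one, t_many):
    """How many times faster the multi-threaded run is."""
    return t_one / t_many


def efficiency(t_one, t_many, threads):
    """Fraction of the ideal linear speedup that is achieved."""
    return speedup(t_one, t_many) / threads


if __name__ == "__main__":
    for label, (t1, t4) in RUNTIMES.items():
        print(f"{label}: speedup {speedup(t1, t4):.2f}x, "
              f"efficiency {efficiency(t1, t4, 4):.0%}")
```

Because fastlin splits the sample set among threads, one plausible reading of the lower figure for the 4 BAM files is that with only one sample per thread the total runtime is bounded by the slowest file. This is an interpretation of the numbers, not something stated in the README.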
@@ -85,5 +96,3 @@ The sample ERRxxxxx contains a single strain belonging to lineage 2. This typing
 dummy1&nbsp;&nbsp;&nbsp;single&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;no&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Error in file "reads/dummy1.fastq.gz": FASTQ parse error: sequence length is 150, but quality length is 50 (record 'ERR551806.5' at line 17).  
 dummy2&nbsp;&nbsp;&nbsp;single&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;no&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Error in file "reads/dummy2.fastq.gz": invalid gzip header  
 dummy3&nbsp;&nbsp;&nbsp;single&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;no&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Error in file "reads/dummy3.fastq.gz": corrupt deflate stream
-
-