Merge pull request #71 from rki-mf1/dev

merge dev into main for v0.5.0
rki-mf1 · Aug 26, 2024 · a0e8c7e
2 parents cd23ddc + 8975d28
commit a0e8c7e
Showing 14 changed files with 665 additions and 35 deletions.
diff --git a/README.md b/README.md
@@ -12,6 +12,7 @@
 2. [Installation](#installation)
 3. [Usage](#usage)
 4. [Help](#help)
+5. [Citation](#citation)
 
 
 ## System requirements:
@@ -87,13 +88,35 @@ This generates the following data within the `<project_root>/results/` directory
 - a report (CSV) with statistics across all tested individuals
 
 ### Tuning the workflow parameters
-Many internal settings can be adjusted at the nextflow level.
+CIEVaD exposes the vast majority of the internal software tools' parameters for fine-tuning.
 The parameters to adjust the workflows are listed on their respective help pages.
-To inspect the help pages type `--help` after the script name.
-Parameters can be adjusted via the CLI or within the _nextflow.config_ file.
+To inspect the help pages type `--help` after the script name, e.g. `nextflow run hap.nf --help` for the hap.nf workflow.
+Parameters can be adjusted via the CLI or directly within the _nextflow.config_ file.
 Mind that parameters provided by the CLI will overwrite parameters set in config.
+More information about tuning crucial parameters, e.g. [read quality](https://github.com/rki-mf1/cievad/wiki/Parameterization-of-the-workflow) and [genome coverage](https://github.com/rki-mf1/cievad/wiki/FAQ---Troubleshooting), can be found in the Wiki.
 
 ## Help:
 
-Visit the project [wiki](https://github.com/rki-mf1/cievad/wiki) for more information, help and FAQs. <br>
+Visit the project [wiki](https://github.com/rki-mf1/cievad/wiki) for more detailed information on parameters, help and FAQs. <br>
 Please file issues, bug reports and questions to the [issues](https://github.com/rki-mf1/cievad/issues) section.
+
+## Citation:
+
+We have a [preprint](https://www.biorxiv.org/content/10.1101/2024.06.21.600013v1) available for CIEVaD.
+For the time being, if you use CIEVaD please cite
+```
+@article {Krannich2024.06.21.600013,
+	author = {Krannich, Thomas and Ternovoj, Dimitri and Paraskevopoulou, Sofia and Fuchs, Stephan},
+	title = {CIEVaD: a lightweight workflow collection for rapid and on demand deployment of end-to-end testing of genomic variant detection},
+	elocation-id = {2024.06.21.600013},
+	year = {2024},
+	doi = {10.1101/2024.06.21.600013},
+	publisher = {Cold Spring Harbor Laboratory},
+	abstract = {The identification of genomic variants has become a routine task in the thriving age of genome sequencing. Particularly small genomic variants of single or few nucleotides are routinely investigated for their impact on an organism{\textquoteright}s phenotype. Hence, precise and robust detection of the variants{\textquoteright} exact genomic location and change in nucleotide composition is vital in many biological applications. Although a plethora of methods exist for the many key steps of variant detection, thoroughly testing the detection process and evaluating its results is still a cumbersome procedure. In this work, we present a collection of trivial to apply and highly modifiable workflows to facilitate the generation of synthetic test data as well as to evaluate the accordance of a user-provided set of variants with the test data. Availability: The workflows are implemented in Nextflow and are freely available and open-source at https://github.com/rki-mf1/cievad under the GPL-3.0 license. Competing Interest Statement: The authors have declared no competing interest.},
+	URL = {https://www.biorxiv.org/content/early/2024/06/21/2024.06.21.600013},
+	eprint = {https://www.biorxiv.org/content/early/2024/06/21/2024.06.21.600013.full.pdf},
+	journal = {bioRxiv}
+}
+```
+
+
diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-'0.4.1'
+'0.5.0'
diff --git a/aux/Nstretches.py b/aux/Nstretches.py
@@ -0,0 +1,58 @@
+import re
+import matplotlib.pyplot as plt
+import argparse
+
+def parse_fasta(fasta_file):
+    sequences = {}
+    with open(fasta_file, 'r') as file:
+        sequence_id = None
+        sequence = ''
+        for line in file:
+            line = line.strip()
+            if line.startswith('>'):
+                if sequence_id is not None:
+                    sequences[sequence_id] = sequence
+                sequence_id = line[1:]  # Remove the '>' character
+                sequence = ''
+            else:
+                sequence += line
+        if sequence_id is not None:
+            sequences[sequence_id] = sequence
+    return sequences
+
+def find_n_stretches(sequence):
+    return [(m.start(), m.end()) for m in re.finditer(r'N+', sequence)]
+
+def generate_histogram(n_stretches, output_file):
+    lengths = [end - start for start, end in n_stretches]
+    plt.hist(lengths, bins=range(1, max(lengths, default=1) + 2), edgecolor='black')  # default avoids max() on an empty list
+    plt.title('Histogram of N Stretches')
+    plt.xlabel('Length of N Stretches')
+    plt.ylabel('Frequency')
+    plt.savefig(output_file)
+    plt.close()  # discard the figure so a later call starts from an empty plot
+
+def write_bed_file(stretches_by_id, bed_file):
+    with open(bed_file, 'w') as file:
+        for sequence_id, n_stretches in stretches_by_id.items():
+            for start, end in n_stretches:
+                file.write(f'{sequence_id}\t{start}\t{end}\n')
+
+def process_fasta(fasta_file, histogram_output, bed_output):
+    # Collect the stretches of all sequences first so each output file is written once, not overwritten per sequence
+    stretches_by_id = {sid: find_n_stretches(seq) for sid, seq in parse_fasta(fasta_file).items()}
+    generate_histogram([s for stretches in stretches_by_id.values() for s in stretches], histogram_output)
+    write_bed_file(stretches_by_id, bed_output)
+
+def main():
+    parser = argparse.ArgumentParser(description="Process a FASTA file to find 'N' stretches, generate a histogram, and output a BED file.")
+    parser.add_argument('fasta_file', type=str, help="Input FASTA file")
+    parser.add_argument('histogram_output', type=str, help="Output filename for the histogram (PNG format)")
+    parser.add_argument('bed_output', type=str, help="Output filename for the BED file")
+
+    args = parser.parse_args()
+
+    process_fasta(args.fasta_file, args.histogram_output, args.bed_output)
+
+if __name__ == "__main__":
+    main()
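
Note on `aux/Nstretches.py`: the coordinates written to the BED file come straight from the regex match spans, which are 0-based and end-exclusive, the same convention BED uses. A minimal sketch of that behaviour on a toy sequence follows. `find_n_stretches` is copied from the diff above; `bed_lines` and the sequence name `toy` are hypothetical and exist only for this illustration.

```python
import re

def find_n_stretches(sequence):
    # Same regex as in aux/Nstretches.py: each maximal run of 'N' is one match,
    # reported as a (start, end) pair that is 0-based and end-exclusive.
    return [(m.start(), m.end()) for m in re.finditer(r'N+', sequence)]

def bed_lines(sequence_id, n_stretches):
    # Hypothetical helper (not part of the commit): format intervals as BED rows.
    return [f'{sequence_id}\t{start}\t{end}' for start, end in n_stretches]

stretches = find_n_stretches('ACNNNTGNA')
print(stretches)                                   # [(2, 5), (7, 8)]
print([end - start for start, end in stretches])   # [3, 1]
for row in bed_lines('toy', stretches):
    print(row)
```

Because the end coordinate is exclusive, `end - start` is the stretch length directly, which is what the histogram in the script relies on. Lowercase `n` is not matched by this pattern.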