Mandy, Imtiaj, and Courtney are reworking the StrainSort Pipeline for Cryptosporidium.
Data, java programs and scripts saved here: /work/tcglab/Crypto_MHS/crypto_diversity_test_04.01.2024
- Paired-end, FASTQ-formatted sequencing reads (of any length) obtained from the sample(s) of interest
- Compiled reference genomes or sequences (such as scaffolds) to be used as a reference database, in FASTA format. It is best to remove contamination before use.
- A strain key - must have the headers of the reference sequences in the first column and the strain associated with that sequence in the second column. Other data can be present.
- A single reference genome from the species of interest, in FASTA format. This will be used to gather coverage information and to create a common reference across all samples for diversity analysis.
We will use the whole genome without masking the repeats. Contamination was removed using NCBI's FCS tool.
This script indexes our reference genomes to create a reference database. In this case we used the reference genomes that were not masked.
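Since the abundance step below uses Kallisto, the reference database is presumably a Kallisto index. A minimal sketch, assuming a single combined FASTA of the unmasked reference genomes named crypto_reference_genomes.fasta (file names are illustrative):

```bash
# Build a Kallisto index from the combined, unmasked reference genomes
kallisto index -i crypto_refs.idx crypto_reference_genomes.fasta
```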
This script loops through the samples and counts the raw reads in the FASTQ files.
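A minimal sketch of the counting loop, assuming gzipped FASTQ files named like sample_R1.fastq.gz / sample_R2.fastq.gz (naming is illustrative):

```bash
# Each FASTQ record spans 4 lines, so read count = line count / 4
for fq in *_R1.fastq.gz *_R2.fastq.gz; do
    count=$(( $(zcat "$fq" | wc -l) / 4 ))
    echo -e "${fq}\t${count}"
done > raw_read_counts.txt
```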
This reformats the read counts so that they can be pasted into an Excel sheet. Use reformat_Read_Counts.class with Java v13.0.2 if you can; this is the already compiled program.
Goes from this format:
    sample_1 1,000 1,000
    sample_2 1,000 1,000
to this format:
    Sample      Read1 Counts    Read2 Counts
    sample_1    1,000           1,000
    sample_2    1,000           1,000
reformat_Read_Counts.java is the non-compiled code. You can recompile it with any Java version, and it will then run with that Java version. Only recompile if you cannot access Java v13.0.2.
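If you do need to recompile, the standard commands are shown below (the program's expected arguments, if any, are not shown here):

```bash
# Recompile with whatever Java version is loaded, then run with that same version
javac reformat_Read_Counts.java
java reformat_Read_Counts
```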
This script trims off the Nextera adapters and quality filters the reads.
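The exact trimmer and settings live in the script; as an illustration only, a Trimmomatic-style command for Nextera adapter removal plus quality filtering might look like this (adapter file, thresholds, and file names are assumptions):

```bash
trimmomatic PE -threads 4 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.trimmed.fastq.gz sample_R1.unpaired.fastq.gz \
    sample_R2.trimmed.fastq.gz sample_R2.unpaired.fastq.gz \
    ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
```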
The script is set up the same way as the script that counts the raw reads. We want to count the trimmed reads to be sure that we are not losing a large number of reads during trimming and quality filtering.
This script reformats the read counts from the previous script so that they are easily pasted into an Excel sheet.
This script uses Kallisto to quantify the estimated number of reads per reference genome. This is how we will get our estimated abundance values. It is also set up to produce a pseudobam file, which is what we will use to separate the reads. It outputs files into a folder per sample (which is really inconvenient).
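A minimal sketch of the quantification call (index and file names are illustrative; --pseudobam writes the pseudoalignments needed for read separation):

```bash
kallisto quant -i crypto_refs.idx -o ${sample}_kallisto --pseudobam \
    ${sample}_R1.trimmed.fastq.gz ${sample}_R2.trimmed.fastq.gz
```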
This script renames the output files to include their sample name. This way we can move them together into one folder.
This is the script that moves them all together.
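A sketch of the rename-and-collect step, assuming each Kallisto output folder is named after its sample (folder and file names are assumptions):

```bash
mkdir -p all_abundances
for dir in *_kallisto; do
    sample=${dir%_kallisto}
    # Prefix each abundance file with its sample name, then collect them in one folder
    cp "${dir}/abundance.tsv" "all_abundances/${sample}_abundance.tsv"
done
```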
After this script you can export the abundance.tsv files out of the cluster and use them with the R script provided to make estimated abundance figures (see below).
This script sets up the text files that allow for species separation.
You will need a strain_key.txt that has the headers of the reference sequences in the first column and the strain associated with that sequence in the second column. Other data can be present.
To execute this script: make a directory named lineage_files, move lineage_file_setup.class and the strain_key.txt into that folder, then run the script (a command sketch follows below).
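In other words (the class is assumed to take no arguments; check the script for the exact invocation):

```bash
mkdir lineage_files
mv lineage_file_setup.class strain_key.txt lineage_files/
cd lineage_files
java lineage_file_setup
```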
The output should be text files with the species name from the second column as the file name. Each file will contain the headers from the first column that are associated with that species.
Example: file name - C_andersoni.txt
Inside C_andersoni.txt:
    lcl|C_andersoni-LRBS01000121.1
    lcl|C_andersoni-LRBS01000027.1
    lcl|C_andersoni-LRBS01000095.1
    lcl|C_andersoni-LRBS01000131.1
    lcl|C_andersoni-LRBS01000092.1
    etc...
These will be used in the next script.
There should also be an All_strain_name.txt file that lists all of the species present in the strain_key.txt. They will be on a single line so that they can be used in the next script.
For a reference on SAM format flags, see: https://www.samformat.info/sam-format-flag
This script separates the reads in the pseudobam files created by Kallisto, using the species text files created in the previous step. Use the list of species from All_strain_name.txt in line 72 of this script.
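The script's exact approach may differ, but one way to pull out the reads assigned to a given species from a pseudobam is to select alignments by reference name with samtools (file names are illustrative):

```bash
# Sort and index the pseudobam, then extract reads mapped to the
# reference headers listed in one species file
samtools sort -o sample_pseudo.sorted.bam sample_pseudoalignments.bam
samtools index sample_pseudo.sorted.bam
samtools view -b sample_pseudo.sorted.bam $(tr '\n' ' ' < C_andersoni.txt) \
    > sample_C_andersoni.bam
```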
Here we will map all of the reads from all of the samples that were divided in the previous step to the C. parvum genome. This will allow us to determine coverage stats and is needed for downstream analysis with GATK.
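The aligner used is defined in the script; a BWA-MEM sketch, assuming the species-separated reads have been converted back to FASTQ (e.g. with samtools fastq) and the C. parvum reference is C_parvum.fasta:

```bash
bwa index C_parvum.fasta
bwa mem -t 8 C_parvum.fasta sample_sep_R1.fastq.gz sample_sep_R2.fastq.gz \
    | samtools sort -o sample_C_parvum.sorted.bam -
samtools index sample_C_parvum.sorted.bam
```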
Using the samtools coverage command, we obtain the breadth of coverage and depth of coverage for each sample. This will help us determine which samples move forward in the analysis.
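For example (the output columns include coverage, the percent of bases covered, and meandepth):

```bash
samtools coverage sample_C_parvum.sorted.bam > sample_C_parvum_coverage.tsv
```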
This step sets read groups and marks duplicates.
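A sketch of the two operations using GATK's Picard tools (read group values and file names are placeholders):

```bash
gatk AddOrReplaceReadGroups \
    -I sample_C_parvum.sorted.bam -O sample_rg.bam \
    -RGID sample_1 -RGLB lib1 -RGPL ILLUMINA -RGPU unit1 -RGSM sample_1
gatk MarkDuplicates \
    -I sample_rg.bam -O sample_rg_md.bam -M sample_md_metrics.txt
```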
This script calls haplotypes from each of the inputs.
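Presumably this is GATK HaplotypeCaller in GVCF mode, since the later steps import GVCFs; a sketch (the reference needs a .fai index and a sequence dictionary):

```bash
gatk HaplotypeCaller \
    -R C_parvum.fasta \
    -I sample_rg_md.bam \
    -O sample.g.vcf.gz \
    -ERC GVCF
```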
This script converts the reference genome to BED format for the next step.
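One common way to do this is via the FASTA index (a sketch; file names are illustrative):

```bash
# Make a BED covering each contig end to end
samtools faidx C_parvum.fasta
awk 'BEGIN{OFS="\t"} {print $1, 0, $2}' C_parvum.fasta.fai > C_parvum.bed
```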
This script uses the haplotypes called from the inputs and the BED file from the previous step to create a database of the called haplotypes. This step also requires a sample map file that indicates where the GVCF files are. I provided an example file for this case: gvcfs-for-db-import.sample_map
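This corresponds to GATK GenomicsDBImport; a sketch using the sample map and the BED intervals from the previous step (the workspace name is illustrative):

```bash
gatk GenomicsDBImport \
    --sample-name-map gvcfs-for-db-import.sample_map \
    --genomicsdb-workspace-path crypto_gvcf_db \
    -L C_parvum.bed
```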
This script creates one VCF file with all genotyped inputs.
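Presumably GATK GenotypeGVCFs reading from the GenomicsDB workspace; a sketch:

```bash
gatk GenotypeGVCFs \
    -R C_parvum.fasta \
    -V gendb://crypto_gvcf_db \
    -O all_samples.vcf.gz
```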
Filtering the VCF file
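The filters applied are defined in the script; as an illustration only, a GATK VariantFiltration call with two of GATK's commonly recommended hard-filter expressions might look like this (thresholds are not the pipeline's actual settings):

```bash
gatk VariantFiltration \
    -R C_parvum.fasta \
    -V all_samples.vcf.gz \
    --filter-name "QD2"  --filter-expression "QD < 2.0" \
    --filter-name "FS60" --filter-expression "FS > 60.0" \
    -O all_samples.filtered.vcf.gz
```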
To visualize the estimated abundance of your samples you will need:
- The R markdown script provided: Kallisto_crypto_diversity_viz.Rmd
- The folder with the abundance tsv files.
- A species key - Kallisto_key.tsv
Create a folder and put all of the things listed above in it. Then open the Kallisto_crypto_diversity_viz.Rmd file and follow the instructions within it.
Download from here: https://1drv.ms/u/s!AnQ1r7STGRvc0AaaIjauDy18eTS9?e=GveoOW