Mandy, Imtiaj, and Courtney are reworking the StrainSort Pipeline for Cryptosporidium.
Data, java programs and scripts saved here: /work/tcglab/Crypto_MHS/crypto_diversity_test_04.01.2024
- Paired-end, FASTQ-formatted sequencing reads (of any length) obtained from the sample(s) of interest
- Compiled reference genomes or sequences (such as scaffolds) to be used as a reference database, in FASTA format. It is best to remove contamination before use.
- A strain key - must have the headers of the reference sequences in the first column and the strain associated with that sequence in the second column. Other data can be present.
- A single reference genome from the species of interest, in FASTA format. This will be used to gather coverage information and to create a common reference across all samples for diversity analysis.
We will use the whole genome without masking the repeats. Contamination was removed using NCBI's FCS tool.
This script indexes our reference genomes to create a reference database. In this case we used the reference genomes that were not masked.
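Since the abundance step below uses Kallisto, the reference database is presumably a Kallisto index. A minimal sketch, assuming a single combined FASTA of the unmasked reference genomes named crypto_reference_genomes.fasta (file names are illustrative):

```bash
# Build a Kallisto index from the combined, unmasked reference genomes
kallisto index -i crypto_refs.idx crypto_reference_genomes.fasta
```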
This script loops through the samples and counts the raw reads in the FASTQ files.
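A minimal sketch of the counting loop, assuming gzipped FASTQ files named like sample_R1.fastq.gz / sample_R2.fastq.gz (naming is illustrative):

```bash
# Each FASTQ record spans 4 lines, so read count = line count / 4
for fq in *_R1.fastq.gz *_R2.fastq.gz; do
    count=$(( $(zcat "$fq" | wc -l) / 4 ))
    echo -e "${fq}\t${count}"
done > raw_read_counts.txt
```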
This reformats the read counts so that they can be pasted into an Excel sheet. Use reformat_Read_Counts.class with Java v13.0.2 if you can; this is the already compiled program.
Goes from this format:
    sample_1 1,000 1,000
    sample_2 1,000 1,000
to this format:
    Sample      Read1 Counts    Read2 Counts
    sample_1    1,000           1,000
    sample_2    1,000           1,000
reformat_Read_Counts.java is the non-compiled code. You can recompile it with any Java version, and it will then run with that Java version. Only recompile if you cannot access Java v13.0.2.
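If you do need to recompile, the standard commands are shown below (the program's expected arguments, if any, are not shown here):

```bash
# Recompile with whatever Java version is loaded, then run with that same version
javac reformat_Read_Counts.java
java reformat_Read_Counts
```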
This script trims off the Nextera adapters and quality filters the reads.
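The exact trimmer and settings live in the script; as an illustration only, a Trimmomatic-style command for Nextera adapter removal plus quality filtering might look like this (adapter file, thresholds, and file names are assumptions):

```bash
trimmomatic PE -threads 4 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.trimmed.fastq.gz sample_R1.unpaired.fastq.gz \
    sample_R2.trimmed.fastq.gz sample_R2.unpaired.fastq.gz \
    ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36
```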
The script is set up the same way as the script that counts the raw reads. We want to count the trimmed reads to be sure that we are not losing a large number of reads during trimming and quality filtering.
This script reformats the read counts from the previous script so that they are easily pasted into an Excel sheet.
This script uses Kallisto to quantify the estimated number of reads per reference genome. This is how we will get our estimated abundance values. It is also set up to produce a pseudobam file, which is what we will use to separate the reads. It outputs files into a folder per sample (which is really inconvenient).
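A minimal sketch of the quantification call (index and file names are illustrative; --pseudobam writes the pseudoalignments needed for read separation):

```bash
kallisto quant -i crypto_refs.idx -o ${sample}_kallisto --pseudobam \
    ${sample}_R1.trimmed.fastq.gz ${sample}_R2.trimmed.fastq.gz
```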
This script renames the output files to include their sample name. This way we can move them together into one folder.
This is the script that moves them all together.
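A sketch of the rename-and-collect step, assuming each Kallisto output folder is named after its sample (folder and file names are assumptions):

```bash
mkdir -p all_abundances
for dir in *_kallisto; do
    sample=${dir%_kallisto}
    # Prefix each abundance file with its sample name, then collect them in one folder
    cp "${dir}/abundance.tsv" "all_abundances/${sample}_abundance.tsv"
done
```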
After this script you can export the abundance.tsv files out of the cluster and use them with the R script provided to make estimated abundance figures (see below).
This script sets up the text files that allow for species separation.
You will need a strain_key.txt that has the headers of the reference sequences in the first column and the strain associated with that sequence in the second column. Other data can be present.
To execute this script: make a directory named lineage_files, move lineage_file_setup.class and the strain_key.txt into that folder, then run the script (a command sketch follows below).
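In other words (the class is assumed to take no arguments; check the script for the exact invocation):

```bash
mkdir lineage_files
mv lineage_file_setup.class strain_key.txt lineage_files/
cd lineage_files
java lineage_file_setup
```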
The output should be text files with the species name from the second column as the file name. Each file will contain the headers from the first column that are associated with that species.
Example: file name - C_andersoni.txt
Inside C_andersoni.txt:
    lcl|C_andersoni-LRBS01000121.1
    lcl|C_andersoni-LRBS01000027.1
    lcl|C_andersoni-LRBS01000095.1
    lcl|C_andersoni-LRBS01000131.1
    lcl|C_andersoni-LRBS01000092.1
    etc...
These will be used in the next script.
There should also be an All_strain_name.txt file that lists all of the species present in the strain_key.txt. They will be on a single line so that they can be used in the next script.
For a reference on SAM format flags, see: https://www.samformat.info/sam-format-flag
This script separates the reads in the pseudobam files created by Kallisto, using the species text files created in the previous step. Use the list of species from All_strain_name.txt in line 72 of this script.
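The script's exact approach may differ, but one way to pull out the reads assigned to a given species from a pseudobam is to select alignments by reference name with samtools (file names are illustrative):

```bash
# Sort and index the pseudobam, then extract reads mapped to the
# reference headers listed in one species file
samtools sort -o sample_pseudo.sorted.bam sample_pseudoalignments.bam
samtools index sample_pseudo.sorted.bam
samtools view -b sample_pseudo.sorted.bam $(tr '\n' ' ' < C_andersoni.txt) \
    > sample_C_andersoni.bam
```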
Here we will map all of the reads from all of the samples that were divided in the previous step to the C. parvum genome. This will allow us to determine coverage stats and is needed for downstream analysis with GATK.
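The aligner used is defined in the script; a BWA-MEM sketch, assuming the species-separated reads have been converted back to FASTQ (e.g. with samtools fastq) and the C. parvum reference is C_parvum.fasta:

```bash
bwa index C_parvum.fasta
bwa mem -t 8 C_parvum.fasta sample_sep_R1.fastq.gz sample_sep_R2.fastq.gz \
    | samtools sort -o sample_C_parvum.sorted.bam -
samtools index sample_C_parvum.sorted.bam
```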
Using the samtools coverage command, we obtain the breadth of coverage and depth of coverage for each sample. This will help us determine which samples move forward in the analysis.
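For example (the output columns include coverage, the percent of bases covered, and meandepth):

```bash
samtools coverage sample_C_parvum.sorted.bam > sample_C_parvum_coverage.tsv
```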
This step sets read groups and marks duplicates.
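A sketch of the two operations using GATK's Picard tools (read group values and file names are placeholders):

```bash
gatk AddOrReplaceReadGroups \
    -I sample_C_parvum.sorted.bam -O sample_rg.bam \
    -RGID sample_1 -RGLB lib1 -RGPL ILLUMINA -RGPU unit1 -RGSM sample_1
gatk MarkDuplicates \
    -I sample_rg.bam -O sample_rg_md.bam -M sample_md_metrics.txt
```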
This script calls haplotypes from each of the inputs.
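Presumably this is GATK HaplotypeCaller in GVCF mode, since the later steps import GVCFs; a sketch (the reference needs a .fai index and a sequence dictionary):

```bash
gatk HaplotypeCaller \
    -R C_parvum.fasta \
    -I sample_rg_md.bam \
    -O sample.g.vcf.gz \
    -ERC GVCF
```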
This script converts the reference genome to BED format for the next step.
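One common way to do this is via the FASTA index (a sketch; file names are illustrative):

```bash
# Make a BED covering each contig end to end
samtools faidx C_parvum.fasta
awk 'BEGIN{OFS="\t"} {print $1, 0, $2}' C_parvum.fasta.fai > C_parvum.bed
```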
This script uses the haplotypes called from the inputs and the BED file from the previous step to create a database of the called haplotypes. This step also requires a sample map file that indicates where the GVCF files are. I provided an example file for this case: gvcfs-for-db-import.sample_map
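This corresponds to GATK GenomicsDBImport; a sketch using the sample map and the BED intervals from the previous step (the workspace name is illustrative):

```bash
gatk GenomicsDBImport \
    --sample-name-map gvcfs-for-db-import.sample_map \
    --genomicsdb-workspace-path crypto_gvcf_db \
    -L C_parvum.bed
```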
This script creates one VCF file with all genotyped inputs.
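Presumably GATK GenotypeGVCFs reading from the GenomicsDB workspace; a sketch:

```bash
gatk GenotypeGVCFs \
    -R C_parvum.fasta \
    -V gendb://crypto_gvcf_db \
    -O all_samples.vcf.gz
```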
Filtering the VCF file
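The filters applied are defined in the script; as an illustration only, a GATK VariantFiltration call with two of GATK's commonly recommended hard-filter expressions might look like this (thresholds are not the pipeline's actual settings):

```bash
gatk VariantFiltration \
    -R C_parvum.fasta \
    -V all_samples.vcf.gz \
    --filter-name "QD2"  --filter-expression "QD < 2.0" \
    --filter-name "FS60" --filter-expression "FS > 60.0" \
    -O all_samples.filtered.vcf.gz
```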
To visualize the estimated abundance of your samples you will need:
- The R markdown script provided: Kallisto_crypto_diversity_viz.Rmd
- The folder with the abundance tsv files.
- A species key - Kallisto_key.tsv
Create a folder and put all of the things listed above in it. Then open the Kallisto_crypto_diversity_viz.Rmd file and follow the instructions within it.
Download from here: https://1drv.ms/u/s!AnQ1r7STGRvc0AaaIjauDy18eTS9?e=GveoOW