diff --git a/README.md b/README.md index e7706eb..7c903c4 100644 --- a/README.md +++ b/README.md @@ -70,10 +70,325 @@ conda config --add channels r conda install --yes --name elvira_env python=3 r-base=4 fastp trinity transrate transdecoder busco blast hmmer fastqc ``` +## Reference generation + +**STEP 1**: Reads quality assessment. We checked the quality of our +reads using the modules in FastQC. + +``` bash +# Reads quality assessment +fastqc \ + --extract \ + --nogroup \ + --threads $SLURM_NTASKS \ + --outdir qc_dir \ + sample_1.fq.gz sample_2.fq.gz +``` + +**STEP 2**: Reads trimming using fastp (Chen et al. 2018). We removed +remaining sequencing adapters (detected by reads-pairs overlapping), +poly-X tails (we did not allow more than 15bp consecutive position with +the nucleotide), and low complexity reads (reads with many repetitions). +After we trimmed the low-quality positions, requiring a mean quality of +30 phred-score. The final length required after trimming was 70bp. + +``` bash +# Reads quality control +fastp \ + --thread $SLURM_NTASKS \ + --in1 sample_1.fq.gz \ + --in2 sample_2.fq.gz \ + --out1 sample_1.trim.fq.gz \ + --out2 sample_2.trim.fq.gz \ + --json sample.trim.json\ + --html sample.trim.html \ + --detect_adapter_for_pe \ + --low_complexity_filter \ + --complexity_threshold 50 \ + --cut_right \ + --cut_window_size 10 \ + --cut_mean_quality 30 \ + --qualified_quality_phred 30 \ + --unqualified_percent_limit 25 \ + --average_qual 30 \ + --length_required 70 \ + --correction \ + --overrepresentation_analysis \ + --overrepresentation_sampling 5 +``` + +**STEP 3**: Transcriptome *de novo* assembly. We used the clean reads +obtained in the previous step to reconstruct the transcripts using +Trinity (Haas et al. 2013). + +``` bash +# Transcriptome assembly +Trinity \ + --CPU $SLURM_NTASKS \ + --max_memory 64G \ + --seqType fq \ + --min_contig_length 300 \ + --KMER_SIZE 20 \ + --min_glue 10 \ + --min_per_id_same_path 84 \ + --max_diffs_same_path 16 \ + --left sample_1.trim.fq.gz \ + --right sample_2.trim.fq.gz \ + --output sample-Trinity \ + --full_cleanup +``` + +**STEP 4**: Transcriptome assembly evaluation. We checked the number of +molluscan orthologs assembled using BUSCO (Manni et al. 2021) +(transcriptome completeness); and we checked the number of transcripts +that were assembled correctly using TransRate (Smith-Unna et al. 2016) +(transcriptome correctness). + +``` bash +# Check the transcriptome completeness +busco \ + --force \ + --cpu $SLURM_NTASKS \ + --evalue 1e-6 \ + --in sample_transcripts.fna \ + --out mollusca_ort \ + --mode transcriptome \ + --lineage_dataset mollusca_odb10 + +# Check the transcriptome correctness +transrate \ + --threads $SLURM_NTASKS \ + --assembly sample_transcripts.fna \ + --left sample_1.trim.fq.gz \ + --right sample_1.trim.fq.gz \ + --output sample_cor +``` + +**STEP 5**: Protein-coding sequences identification. We extracted the +protein coding sequences from the transcript assembled correctly using +TransDecoder. After that, we removed the redundant sequences using +seqkit (Shen et al. 2016). + +``` bash +# Create the transcript to gene map +${TRINITY_HOME}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \ + sample.good.fasta > sample.good.gtm + +# Extract the protein-coding sequences +TransDecoder.LongOrfs \ + -t sample.good.fasta \ + --gene_trans_map sample.good.gtm \ + --output_dir sample.good-cds + +TransDecoder.Predict \ + --single_best_only \ + -t sample.good.fasta \ + --output_dir sample.good-cds + +# Remove redundant sequences +seqkit rmdup \ + --threads $SLURM_NTASKS \ + --ignore-case \ + --by-seq \ + --out-file sample.good.rmdup.faa \ + sample.good.faa +``` + +**STEP 6**: Possible biological contamination removal. After removing +the possible miss-assemblies, we pulled the potential contaminants from +other organisms (bacteria, algae, etc.). We aligned the reads to the +NCBI non-redundant protein database and from that we extracted the +taxonomic information from these matches using the [taxonomizr R +package](https://cran.r-project.org/web/packages/taxonomizr/index.html). +We only kept the sequences that matched with a molluscan protein from +the genera *Elysia*, *Aplysia* and *Plakobranchus*. Previously to define +the filtering condition, we explored the taxonomy of all the matches to +find the most appropriate limits. + +``` bash +# Local alignment to identify the product +blastp \ + -num_threads $SLURM_NTASKS \ + -task blastp-fast \ + -evalue 1e-3 \ + -max_hsps 1 \ + -max_target_seqs 1 \ + -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs" \ + -db nr \ + -query sample.good.rmdup.faa \ + -out sample.good.rmdup.nr.tsv +``` + +``` r +# Attach the packages +library(taxonomizr) +library(Biostrings) +library(stringr) +library(readr) +library(dplyr) + +# Prepare the database (do only the 1st time) +getNamesAndNodes() +getAccession2taxid(types = c("nucl_wgs", "nucl_gb", "prot")) +read.names.sql("names.dmp", "accessionTaxa.sql") +read.nodes.sql("nodes.dmp", "accessionTaxa.sql") +read.accession2taxid(list.files(".", "accession2taxid.gz$"), "accessionTaxa.sql") + +# Load the transcripts sequences and the output from STEP 5A +prot_good <- readAAStringSet("sample.good.rmdup.faa") +blast_out <- read_delim(file = "sample.good.rmdup.nr.tsv", col_names = FALSE) %>% + select("X1", "X2") %>% + rename("tx_name" = "X1", "nr_accs" = "X2") + +# Extract the taxonomic information from the proteins ID +prot_taxa <- tibble( + "nr_accs" = unique(sort(blast_out %>% pull("nr_accs"))), + "nr_egid" = accessionToTaxa(nr_accs, "accessionTaxa.sql") + ) %>% + bind_cols(getTaxonomy(nr_accs %>% pull(nr_egid), "accessionTaxa.sql")) + +# Filter by Genus to keep only sea slugs +prot_filt <- prot_taxa %>% + filter(genus %in% c("Aplysia", "Elysia", "Plakobranchus")) %>% + inner_join(blast_out) +prot_filt <- prot_good[prot_filt %>% pull(tx_name)] + +# Rename the final transcripts to handle simple names +new_sqaccs <- paste0("SPNAME", str_pad(1:length(prot_good), width = nchar(length(prot_good)), pad = "0")) +old_sqaccs <- names(prot_filt) +accs_convr <- tibble("qseqid" = old_sqaccs, "qseqid_new" = new_sqaccs) +names(prot_filt) <- new_sqaccs + +# Change the sequence accession in the alignment +col_names <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qlen", "slen", "qcovs") +res_blast <- read_delim(file = "sample.good.rmdup.nr.tsv", col_names = col_names) %>% + inner_join(accs_convr) %>% + select(!qseqid) %>% + rename("qseqid" = "qseqid_new") + +# Export the final transcriptome +writeXStringSet(x = prot_filt, filepath = "sample.good.rmdup.slugs.faa", append = FALSE) +write_delim(x = res_blast, file = "sample.good.rmdup.slugs.nr.tsv", delim = "\t", col_names = FALSE) +``` + +**STEP 7**: Protein homology prediction. We tried to predict the product +of the transcripts by local alignment using BLASTP (Altschul et al. +1990; **shiryev2007improved?**; Camacho et al. 2009). We annotated the +transcriptome using four databases: NCBI Nr, NCBI RefSeq Protein (Sayers +et al. 2021), UniProt TrEMBL and UniProt TrEMBL limited to molluscan +entries (UniProt Consortium 2021). + +``` bash +# Identify the homologous proteins +blastp \ + -num_threads $SLURM_NTASKS \ + -task blastp \ + -evalue 1e-6 \ + -max_hsps 1 \ + -max_target_seqs 1 \ + -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs" \ + -db dbname \ + -query sample.good.rmdup.slugs.faa \ + -out sample.good.rmdup.slugs.dbname.tsv +``` + +**STEP 8**: Protein domain identification. We identified the different +protein domains from the filtered transcripts using HMMER (Eddy 2011) +and the database Pfam (Mistry et al. 2021). + +``` bash +# Identify the different protein domains +hmmsearch \ + --cpu $SLURM_NTASKS \ + --tblout sample.good.rmdup.slugs.seqdom.tsv \ + --domtblout sample.good.rmdup.slugs.ptrdom.tsv \ + -E 1e-6 \ + --domE 1e-6 \ + --incE 1e-6 \ + --incdomE 1e-6 \ + pfam \ + sample.good.rmdup.slugs.faa +``` + +**STEP 9**: Gene Ontology annotation. We predicted the GO associated +with each transcript using the homologous proteins (STEP 7 using TrEMBL +DB) and the protein domains (STEP 8). Additionally, we required an extra +file containing the conversion between protein accession and the GO IDs; +this information can be obtained using the ID mapping tool. + +``` bash +# Extract the protein accessions to convert into GO IDs +awk '{print $1}' sample.good.rmdup.slugs.trembl.tsv | sort -u > sample.good.rmdup.slugs.prot2go.tsv +``` + +``` r +# Attach the packages +library(elvira) + +# Add GO annotations to the transcripts +txann <- go_annotation( + trans_blast = "sample.good.rmdup.slugs.trembl.tsv", + trans_hmm = "sample.good.rmdup.slugs.seqdom.tsv", + unip_goid = "sample.good.rmdup.slugs.prot2go.tsv" + ) + +# Export the annotation +write_delim(txann, file = "sample.good.rmdup.slugs.trans2go.tsv", delim = "\t", col_names = FALSE) +``` + +**STEP 10**: Additional functional annotation using an orthologs +database. After filtering the sequences and obtaining the final +transcriptome (`sample.good.rmdup.slugs.faa`) we sent it for functional +annotation using the eggNOG webportal (Huerta-Cepas et al. 2019; +Cantalapiedra et al. 2021) with the following options: + +- protein sequences +- minimum hit e-value = `1e-6` +- percentage identity = `60` +- minimum % of query coverage = `50` +- Gene Ontology evidence = `Transfer all annotations` +- PFAM refinement = `realign queries to the whole PFAM DB` +- SMART annotation = `Perform SMART annotation`. + +The other options were set by default. + # References