Release v2.0.0 · CDCgov/phoenix

Implemented Enhancements:

Entry point for scaffolds added using either -entry SCAFFOLDS or -entry CDC_SCAFFOLDS, which runs everything after the SPAdes step. New input parameters --indir and --scaffold_ext were added to support this entry point commit f12da60.
- Supports scaffold files from shovill, spades and unicycler.
Entry point for SRA added using either -entry SRA or -entry CDC_SRA. These entry points pull samples from SRA based on the file passed to --input_sra, which lists one SRR number per line commit a86ad3f.
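As a rough illustration of the --input_sra file layout (one SRR number per line), the sketch below sanity-checks such a file before a run. The helper name `read_srr_list` and the accession pattern (`SRR` followed by digits) are assumptions made for this example and are not part of PHoeNIx.

```python
import re

# Assumption: run accessions look like "SRR" followed by one or more digits.
SRR_PATTERN = re.compile(r"^SRR\d+$")


def read_srr_list(text):
    """Return the accessions found in `text` (one per line).

    Blank lines are skipped; any other line that does not match the
    assumed accession pattern raises ValueError.
    """
    accessions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue  # tolerate blank lines
        if not SRR_PATTERN.match(entry):
            raise ValueError(f"line {line_number}: {entry!r} is not an SRR accession")
        accessions.append(entry)
    return accessions
```

Calling `read_srr_list` on the contents of the file before launching the pipeline would surface stray headers or typos early.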
Check now performed on input samplesheets to confirm the same sample id, forward read and reverse read aren't used multiple times in the samplesheet commit fd6127f.
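The kind of check described above could be sketched as follows. This is not the pipeline's own code: the column names `sample`, `fastq_1` and `fastq_2` and the function name `find_duplicates` are assumptions for illustration, and the sketch only looks for repeats within each column.

```python
import csv
import io
from collections import Counter

# Assumption: the samplesheet is a CSV with these three columns.
CHECKED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def find_duplicates(samplesheet_text):
    """Return {column: [values used more than once]} for a CSV samplesheet."""
    rows = list(csv.DictReader(io.StringIO(samplesheet_text)))
    duplicates = {}
    for column in CHECKED_COLUMNS:
        counts = Counter(row[column] for row in rows)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[column] = repeated
    return duplicates
```

An empty dictionary means no sample id, forward read or reverse read was reused; anything else lists the offending values per column.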
Changed many modules to process_single rather than process_low to reduce resource requirements for these steps.
Updates to run PHX on nf-tower with an AWS back-end. Also, updated tower.yml file to have working reports.
AMRFinder+ was updated to v3.11.11, which allows point mutation calling for Burkholderia cepacia species complex, Burkholderia pseudomallei species complex, Serratia marcescens and Staphylococcus_pseudintermedius.
Argument --coverage added. It can be passed to raise the coverage cutoff below which a sample fails minimum QC standards (default is 30x).
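To make the effect of the cutoff concrete, here is a minimal sketch. How PHoeNIx estimates coverage and whether the comparison is inclusive are not stated above, so both the formula (sequenced bases divided by assembly length) and the `>=` comparison are assumptions; only the 30x default comes from these notes.

```python
DEFAULT_COVERAGE_CUTOFF = 30.0  # default named in the release notes (30x)


def estimate_coverage(total_bases, assembly_length):
    """Assumption: coverage is approximated as sequenced bases / assembly length."""
    if assembly_length <= 0:
        raise ValueError("assembly_length must be positive")
    return total_bases / assembly_length


def passes_coverage(coverage, cutoff=DEFAULT_COVERAGE_CUTOFF):
    """Assumption: a sample passes when coverage is at or above the cutoff."""
    return coverage >= cutoff
```

Under these assumptions, a 5 Mb assembly with 150 Mb of sequenced bases sits exactly at the default cutoff and would fail if the cutoff were raised to 40.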
Public Kraken2 database is required rather than requesting from sharefile. For PHoeNIx >=2.0.0 you will need to download the public Standard-8 version kraken2 database created on or after March 14th, 2023 from Ben Langmead's github page. You CANNOT use an older version of the public kraken databases on Ben Langmead's github page. We thank @BenLangmead and @jenniferlu717 for taking the time to include an extra file in public kraken databases created after March 14th, 2023 to allow them to work in PHoeNIx!
- For PHoeNIx <=1.1.1 you will need to download the public Standard-8 version kraken2 database created on May 17, 2021 from Ben Langmead's github page. The download link is https://genome-idx.s3.amazonaws.com/kraken/k2_standard_8gb_20210517.tar.gz.
- The kraken database can be passed as an uncompressed folder or in its downloaded .tar.gz form.

Output File Changes:

The folder fastqc was changed to fastqc_trimd to clarify it contains results from the trimmed data.
PROKKA module now outputs .fsa file (nucleotide file of genes) rather than .fna as the .fna file is really just the assembly file again.
Added version for base container information for FAIRY, ASSET_CHECK, FORMAT_ANI, FETCH_FAILED_SUMMARIES, CREATE_SUMMARY_LINE, GATHER_SUMMARY_LINES, and GENERATE_PIPELINE_STATS. This was added to software_versions.yml.
Changing the file/folder structure of some files for clarity and to make it less cluttered:
- Folders Annotation and Assembly were changed to annotation and assembly respectively to keep continuity.
- Files kraken2_asmbld/*.unclassified.fastq.gz and kraken2_asmbld/*.classified.fastq.gz were changed to kraken2_asmbld/*.unclassified.fasta.gz and kraken2_asmbld/*.classified.fasta.gz as they are actually fasta files.
- *.fastANI.txt --> moved from ~/ANI/fastANI to ~/ANI.
- The file *_trimmed_read_counts.txt that was in fastp_trimd was moved to the folder qc_stats.
- Files *_fastqc.zip and *_fastqc.html in folder fastqc_trimd moved to qc_stats.
- *.bbduk.log --> moved from ~/removedAdapters to ~/${sample}/qc_stats, and removedAdapters is no longer an output folder.
- raw_stats folder was created and contains ${sample}_raw_read_counts.txt and ${sample}_FAIry_synopsis.txt, previously these were in the folders fastp_trimd and FAIry, respectively.
Sample GC% added to *_GC_content_20230504.txt file.
*_trimmed_read_counts.txt has Paired_Sequenced_[reads] column added as Total_Sequenced_[reads] is the number of the paired sequences and singletons.
Files produced from FastANI, MASH and FORMAT_ANI had the MASH database's date appended to the file name for tracking and validation. Files are now named *${sample}_REFSEQ_20230504.ani.txt, ${samplename}_REFSEQ_20230504.fastANI.txt, ${samplename}_REFSEQ_20230504_best_MASH_hits.txt and ${samplename}_REFSEQ_20230504.txt.
GRiPHin file updates
- New columns for WARNINGS, ALERTS, Minimum_QC_Issues, Total_Raw_[reads], Paired_Trimmed_[reads] and GC%.
- New column Primary_MLST_Source was added to show whether the assembly (MLST program) or reads (SRST2) were used for MLST determination.
- Auto_PassFail and PassFail_Reason were changed to Minimum_QC_Checks and Minimum_QC_Issues, respectively. This was to clarify that these are minimum requirements for QC.
- The column Total_Sequenced_[bp] was removed from the report for lack of utility.
- Q30_R1_[%], Q30_R2_[%], and Total_Sequenced_[reads] were relabelled as Raw_Q30_R1_[%], Raw_Q30_R2_[%] and Total_Trimmed_[reads], respectively for clarity.
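For downstream scripts that read GRiPHin headers, the renames and the removal listed above could be captured in a small lookup. This is a sketch only: the function name `upgrade_header` is hypothetical, and the column `Sample` used in the usage note is a placeholder rather than a documented GRiPHin column.

```python
# Old name -> new name, taken from the GRiPHin changes listed above.
RENAMED_COLUMNS = {
    "Auto_PassFail": "Minimum_QC_Checks",
    "PassFail_Reason": "Minimum_QC_Issues",
    "Q30_R1_[%]": "Raw_Q30_R1_[%]",
    "Q30_R2_[%]": "Raw_Q30_R2_[%]",
    "Total_Sequenced_[reads]": "Total_Trimmed_[reads]",
}

# Column dropped from the report in this release.
REMOVED_COLUMNS = {"Total_Sequenced_[bp]"}


def upgrade_header(old_header):
    """Translate a pre-2.0.0 list of column names to the 2.0.0 names."""
    return [
        RENAMED_COLUMNS.get(name, name)
        for name in old_header
        if name not in REMOVED_COLUMNS
    ]
```

For instance, `upgrade_header(["Sample", "Auto_PassFail", "Total_Sequenced_[bp]"])` yields `["Sample", "Minimum_QC_Checks"]`.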

Fixed Bugs:

Added module GET_RAW_STATS to get raw stats. Previously this information was pulled from the FASTP_TRIMD step; however, the input data there was post-BBDUK, which removes PhiX reads and adapters. Thus, the previous raw count was slightly off.
Fixed python version information not showing up for GET_TAXA_FOR_AMRFINDER and GATHERING_TRIMD_READ_QC_STATS. This was added to software_versions.yml.
Fixed issue where sample names containing an underscore caused incorrect parsing and the contig number not showing up in GRiPHin reported genes commit a0fdff5.
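The general failure mode behind this kind of bug can be sketched as below. The actual string PHoeNIx parses is not shown in these notes, so the `<sample>_<contig>` label format and both function names are hypothetical; the point is only that splitting from the left breaks once the sample name itself contains an underscore.

```python
def split_naive(label):
    """Take the first two underscore-separated fields (breaks on 'my_sample')."""
    sample, contig = label.split("_")[:2]
    return sample, contig


def split_from_right(label):
    """Split once from the right, so underscores in the sample name survive."""
    sample, contig = label.rsplit("_", 1)
    return sample, contig
```

With the hypothetical label `my_sample_12`, the naive version returns `("my", "sample")` and loses the contig number, while splitting from the right returns `("my_sample", "12")`.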
Fixed AttributeError: 'DataFrame' object has no attribute 'map' error that came up in the GRiPHin step when a set of samples had both a macrolide and a macrolide_lincosamide_streptogramin AR gene commit 460bdbc.
Phoenix_Output_Report.tsv was reporting %Coverage for FastANI in the Taxa_Confidence column rather than %ID. Now both are reported when FastANI is successful commit 3b26fec.
GRiPHin_Report.xlsx was switched from reporting rounded numbers for coverage/similarity % to reporting the floor, as reporting 100% when 99.5% is the actual number is misleading and doesn't alert the user to SNPs in genes. With the floor, 99.5% is now reported as 99% commit 5477627.
Corrected GAMMA modules not printing the right version in the software_versions.yml file commit 5477627.
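The difference between the two reporting styles is easy to see in a few lines. This is an illustration of rounding versus flooring in general, not the GRiPHin code itself, and the function names are made up for the example.

```python
import math


def report_rounded(percent):
    """Old-style reporting: round to the nearest whole percent."""
    return round(percent)


def report_floor(percent):
    """New-style reporting: always round down to a whole percent."""
    return math.floor(percent)
```

For a value of 99.5, `report_rounded` gives 100 and hides the mismatch, while `report_floor` gives 99; a true 100.0 is still reported as 100 by both.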

Database Updates:

Curated AR gene database was updated on 2023-05-17 (yyyy-mm-dd) which includes:
- AMRFinderPlus database
  - Version 2023-04-17.1
- ARG-ANNOT
  - Latest version NT v6 July 2019
- ResFinder
  - Bumped from v2.0.0 to v2.1.0 including until 2023-04-12 commit f46d8fc.
Updated AMRFinder Database used by AMRFinder+ and GAMMA to v2023-04-17.1.
The SRST2_MLST and MLST steps now use the mlst_db provided in ~/phoenix/assets/databases. This database is static and no longer pulls updates from PubMLST.org, which keeps the pipeline running when PubMLST.org is down and keeps the schemes from changing if you run the same sample at different times. This was implemented to deal with PubMLST.org being down fairly often and with pipeline validation in mind.

Container Updates:

AMRFinder+ was updated from 3.10.45 to 3.11.11.
BUSCO was updated from 5.4.3 to 5.4.7.
MultiQC was updated from 1.11 to 1.14.
MLST was updated from 2.22.1 to 2.23.0.
