v2.1.0 (02/11/2024)
Implemented Enhancements:
- Added handling for "unknown" assemblers in the scaffolds entry point so genomes can be downloaded from NCBI and run through PHoeNIx.
- For entry points `CDC_PHOENIX` or `PHOENIX` you can now use the argument `--create_ncbi_sheet` to generate partially filled-out Excel sheets for uploading to NCBI. You will still need to fill in some lab/sample-specific information and review for accuracy, but this should speed up the process. As a reminder, please do not submit raw sequencing data to the CDC HAI-Seq BioProject (531911) that is auto-populated in these sheets unless you are a state public health laboratory, a CDC partner or have been directed to do so by DHQP. The BioProject accession IDs in these files are specifically designated for domestic HAI bacterial pathogen sequencing data, including from the Antimicrobial Resistance Laboratory Network (AR Lab Network), state public health labs, surveillance programs, and outbreaks. For inquiries about the appropriate BioProject location for your data, please contact [email protected].
- New Terra workflow for combining `Phoenix_Summary.tsv`, `GRiPHin_Summary.tsv` and `GRiPHin_Summary.xlsx` of multiple runs into one file. This workflow will also combine the NCBI Excel sheets created when using `--create_ncbi_sheet`.
- `software_versions.yml` now contains versions for all custom scripts used in the pipeline to streamline its validation process and align it with CLIA requirements, ensuring smoother compliance.
- MultiQC now contains graphs and data from BBDuk, FastP, Quast and Kraken. BUSCO is also part of MultiQC if the entry point runs it (i.e. CDC_* entries).
- AMRFinder+ species that are screened for point mutations were updated with Enterobacter asburiae, Vibrio vulnificus and Vibrio parahaemolyticus.
- A check was added to ensure only SRR numbers are passed to `-entry CDC_SRA` and `SRA`.
- After extensive QC cut-off review, additional warnings and minimum QC cut-offs were added:
  - Minimum PASS/FAIL:
    - > 500 scaffolds
    - FAIry (file integrity check) - see Fixed Bugs section below for details.
  - Warnings:
    - 200-500 scaffolds -> high, but not enough for failure
    - Taxa Quality Checks:
      - FastANI Coverage <90% and Match <95%
      - For entries that run BUSCO (i.e. CDC_* entries): BUSCO <97%
    - Contamination Checks:
      - <70% of reads/weighted scaffolds assigned to top genus hit.
      - Added weighted scaffolds to the kraken <30% unclassified check (was just on reads before)
      - Added weighted scaffolds to the kraken "only 1 genus >25% of assigned" check (was just on reads before)
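The new cut-offs above can be read as a small set of threshold rules. The following is a minimal sketch of that reading, not PHoeNIx's actual implementation: the function names and return labels are hypothetical, and it assumes that the FastANI coverage and match thresholds each raise their own warning and that a scaffold count of exactly 500 is a warning rather than a failure.

```python
def scaffold_count_status(n_scaffolds):
    """Classify an assembly by scaffold count (hypothetical helper).

    >500 scaffolds fails; 200-500 is flagged as high but passes.
    """
    if n_scaffolds > 500:
        return "FAIL"
    if n_scaffolds >= 200:
        return "WARN"
    return "PASS"


def taxa_and_contamination_warnings(ani_coverage, ani_match,
                                    top_genus_pct, busco=None):
    """Collect warnings for the taxa and contamination cut-offs.

    All values are percentages. `busco` is None for entry points that
    do not run BUSCO (assumption: only CDC_* entries supply it).
    """
    warnings = []
    if ani_coverage < 90:
        warnings.append("FastANI coverage <90%")
    if ani_match < 95:
        warnings.append("FastANI match <95%")
    if busco is not None and busco < 97:
        warnings.append("BUSCO <97%")
    if top_genus_pct < 70:
        warnings.append("<70% assigned to top genus hit")
    return warnings
```

For example, `scaffold_count_status(350)` falls in the warning band, while an isolate with 85% FastANI coverage would collect a coverage warning without failing outright.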
Output File Changes:
- The default outdir phx produces was changed. If the user doesn't pass `--outdir`, the default was changed from `results` to `phx_output`. This was changed in response to feedback from the compliance program, to avoid confusion regarding the difference between public health results (i.e. summary) and diagnostic results (i.e. report).
- The `phx_output/FAIry` folder will contain a `*_summaryline_failure.tsv` file for any isolate where file corruption was detected.
- The `*.tax` file had the NCBI assigned taxID added after the `:` for easy lookup.
Fixed Bugs:
- Updated `tower.yml` file to reflect file name changes in v2.0.2. This will enable nf-tower reports to properly show up. commit e1b2b91
- `GRiPHin_Summary.xlsx` was highlighting coverage outside 40-100x despite the `--coverage` setting; changes were made to respect the `--coverage` flag.
- Added a fix to handle cases where auto-selection by the mlst script chooses the wrong taxonomy. PHoeNIx will force a rerun in cases where the taxonomy is known but the initial mlst run used an incorrect scheme. Known instances found so far include E. coli (Pasteur) being incorrectly identified as Aeromonas and E. coli (Pasteur) being identified as Klebsiella. The scoring in the MLST program was updated and can now cause lower-count perfect hits (e.g. 6 of 6 Aeromonas genes at 100%) to be scored higher than novel correct hits (e.g. 7 of 8 at 100%, 1 novel gene).
- Corrected an issue where, in some cases when an mlst scheme could not be determined, a proper output file was not created.
- Fixed issue with MLST where certain characters in a filename would cause an array index out of bounds error.
- Fixed issue where samples that failed SPAdes did not have the `--coverage` parameter respected when generating the synopsis file.
- Fixed `-entry CDC_SCAFFOLDS` providing incorrect headers (missing `BUSCO` and `BUSCO_DB`).
- Updated FAIry (file integrity check) to catch additional file integrity errors.
  - FAIry detects and reports when:
    - Corrupt fastq files prevent gzip and zcat from completing; a synopsis file is generated when needed.
    - R1/R2 fastq files do not have an equal number of reads.
    - No reads or scaffolds are left after the read trimming and filtering steps, respectively.
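To make the first two FAIry conditions concrete, the checks can be approximated with Python's standard `gzip` module. This is only an illustrative sketch under stated assumptions, not the FAIry code itself: the function names and status strings are hypothetical, and it assumes standard four-line FASTQ records.

```python
import gzip
import zlib


def count_fastq_reads(path):
    """Return the number of reads in a gzipped FASTQ file.

    Returns None if the file cannot be fully decompressed
    (truncated or otherwise corrupt gzip stream).
    """
    try:
        # Binary mode: the whole stream is read so that truncation
        # or a bad checksum is detected before we trust the count.
        with gzip.open(path, "rb") as handle:
            n_lines = sum(1 for _ in handle)
    except (OSError, EOFError, zlib.error):
        return None
    return n_lines // 4  # assumes 4 lines per FASTQ record


def check_read_pair(r1_path, r2_path):
    """Report the first integrity problem found for an R1/R2 pair."""
    r1 = count_fastq_reads(r1_path)
    r2 = count_fastq_reads(r2_path)
    if r1 is None or r2 is None:
        return "FAIL: corrupt gzip file"
    if r1 != r2:
        return f"FAIL: unequal read counts (R1={r1}, R2={r2})"
    if r1 == 0:
        return "FAIL: no reads"
    return "PASS"
```

Reading each file to the end is deliberate: a truncated gzip file often decompresses cleanly for most of its length and only raises an error at the final block.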
Container Updates:
- Containers are now called with their sha256 to streamline PHoeNIx's validation process and align it with CLIA requirements.
- Containers updated to include developers' bug fixes:
  - fastp: v0.23.2 to v0.23.4 bug fixes.
  - fastqc: v0.11.9 to v0.12.1 bug fixes.
  - kraken2: v2.1.2 to v2.1.3 which has improvements on efficiency and bug fixes.
  - fastani: v1.33 to v1.34 bug fixes. Specifically, it fixed multi-threading output bugs. Output and interface of FastANI remain the same as before.
  - amrfinderplus: v3.11.11 to v3.11.26 which has improvements on efficiency and bug fixes.
  - sratools: v3.0.3 to v3.0.9 updates and bug fixes.
- Container for SRA entry steps `SRATOOLS_FASTERQDUMP` and `SRATOOLS_PREFETCH` was switched to a quay.io/biocontainers container to address issues with the old container and ICA. commit 68815e3
- The srst2 container version stays the same, but it is now in a custom container built from commit `73f885f55c748644412ccbaacecf12a771d0cae9` as there has been a bug fix for a rounding penalty to integer without a new release. In addition, a fix was added to address issues related to handling grepping of '(' and ')'. The updated container is hosted on quay.io.
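Calling containers by their sha256 means each image reference ends in an immutable digest rather than a mutable tag. As a rough illustration, a validation script could confirm that a reference is digest-pinned with a check like the one below. The helper name and the example image references are hypothetical placeholders, not the references PHoeNIx actually uses.

```python
import re

# An image reference pinned by digest looks like
# <registry>/<repository>@sha256:<64 lowercase hex characters>
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref):
    """Return True if the image reference ends in a sha256 digest."""
    return DIGEST_REF.match(image_ref) is not None


# Placeholder digest for illustration only (not a real image digest).
example_pinned = "quay.io/biocontainers/fastp@sha256:" + "0" * 64
example_tagged = "quay.io/biocontainers/fastp:some-tag"
```

Here `is_digest_pinned(example_pinned)` is true and `is_digest_pinned(example_tagged)` is false, since a tag alone can be repointed to a different image after validation.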
Database Updates:
- MLST database was pulled from PubMLST and updated on Jan 24th, 2024.
- The Plasmid Replicons database was updated to include an update to the Enterobacteriales.fsa database.
- Curated AR gene database was updated on 2024-01-24 (yyyy-mm-dd) which includes:
  - AMRFinderPlus database
    - Version 2023-11-15.1
  - ARG-ANNOT hasn't changed since the last time the database was created and contains updates since version NT v6 July 2019
  - ResFinder
    - Includes until 2024-01-28 commit 97d1fe0cd0a119172037f6bdb29f8a1c7c6e6019