Release v2.1.0 · CDCgov/phoenix

v2.1.0 (02/11/2024)


Implemented Enhancements:

Added handling for "unknown" assemblers in the scaffolds entry point so genomes can be downloaded from NCBI and run through PHoeNIx.
For entry points CDC_PHOENIX or PHOENIX you can now use the argument --create_ncbi_sheet to generate partially filled-out Excel sheets for uploading to NCBI. You will still need to fill in some lab/sample-specific information and review the sheets for accuracy, but this should speed up the process. As a reminder, please do not submit raw sequencing data to the CDC HAI-Seq BioProject (531911), which is auto-populated in these sheets, unless you are a state public health laboratory, a CDC partner or have been directed to do so by DHQP. The BioProject accession IDs in these files are specifically designated for domestic HAI bacterial pathogen sequencing data, including from the Antimicrobial Resistance Laboratory Network (AR Lab Network), state public health labs, surveillance programs, and outbreaks. For inquiries about the appropriate BioProject location for your data, please contact [email protected].
New Terra workflow for combining Phoenix_Summary.tsv, GRiPHin_Summary.tsv and GRiPHin_Summary.xlsx of multiple runs into one file. This workflow will also combine the NCBI Excel sheets created when using --create_ncbi_sheet.
software_versions.yml now contains versions for all custom scripts used in the pipeline to streamline its validation process and align it with CLIA requirements, ensuring smoother compliance.
MultiQC now contains graphs and data from BBDuk, FastP, Quast and Kraken. BUSCO is also part of MultiQC if the entry point runs it (i.e. CDC_* entries).
The list of AMRFinder+ species that are screened for point mutations was updated with Enterobacter asburiae, Vibrio vulnificus and Vibrio parahaemolyticus.
A check was added to ensure only SRR numbers are passed to -entry CDC_SRA and SRA.
After an extensive review of QC cut-offs, additional warnings and minimum QC cut-offs were added:
- Minimum PASS/FAIL:
  - > 500 scaffolds
  - FAIry (file integrity check) - see Fixed Bugs section below for details.
- Warnings:
  - 200-500 scaffolds -> high, but not enough for failure
  - Taxa Quality Checks:
    - FastANI Coverage <90% and Match <95%
    - For entries that run BUSCO: BUSCO <97%
  - Contamination Checks:
    - <70% of reads/weighted scaffolds assigned to the top genus hit.
    - Added weighted scaffolds to the Kraken <30% unclassified check (previously reads only)
    - Added weighted scaffolds to the Kraken check that only 1 genus has >25% of assigned (previously reads only)
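The cut-offs listed above can be sketched as a small function. This is only an illustration, not PHoeNIx's own code: the function name `qc_flags` and its parameters are hypothetical, the 200 and 500 boundaries are assumed to fall in the warning range, and the FastANI coverage and match thresholds are assumed to be reported as separate warnings.

```python
from typing import List, Optional, Tuple


def qc_flags(
    scaffolds: int,
    fastani_coverage: float,
    fastani_match: float,
    top_genus_pct: float,
    busco_pct: Optional[float] = None,
) -> Tuple[str, List[str]]:
    """Return ("PASS" or "FAIL", warnings) for one isolate (hypothetical helper)."""
    warnings: List[str] = []
    status = "PASS"

    # Minimum PASS/FAIL: more than 500 scaffolds fails the isolate.
    if scaffolds > 500:
        status = "FAIL"
    elif scaffolds >= 200:
        # Assumption: 200 and 500 themselves count as "high, but not failing".
        warnings.append("high scaffold count (200-500)")

    # Taxa quality checks (assumed here to be independent warnings).
    if fastani_coverage < 90:
        warnings.append("FastANI coverage <90%")
    if fastani_match < 95:
        warnings.append("FastANI match <95%")
    # BUSCO is only available for entry points that run it.
    if busco_pct is not None and busco_pct < 97:
        warnings.append("BUSCO <97%")

    # Contamination check on reads or weighted scaffolds.
    if top_genus_pct < 70:
        warnings.append("<70% assigned to top genus hit")

    return status, warnings


if __name__ == "__main__":
    print(qc_flags(350, 95.0, 99.0, 85.0, busco_pct=98.0))
```

The FAIry file integrity check and the Kraken unclassified/multi-genus checks are left out of this sketch because the notes do not give enough detail to model them.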

Output File Changes:

The default output directory PHoeNIx produces was changed. If the user doesn't pass --outdir, the default is now phx_output instead of results. This was changed in response to feedback from the compliance program, to avoid confusion regarding the difference between public health results (i.e. summary) and diagnostic results (i.e. report).
The phx_output/FAIry folder will contain a *_summaryline_failure.tsv file for any isolate where file corruption was detected.
The NCBI-assigned taxID was added to the *.tax file after the : for easy lookup.

Fixed Bugs:

Updated tower.yml file to reflect file name changes in v2.0.2. This will enable nf-tower reports to properly show up (commit e1b2b91).
GRiPHin_Summary.xlsx was highlighting coverage outside 40-100x despite the --coverage setting; changes were made to respect the --coverage flag.
Added a fix to handle cases where auto-selection by the mlst script chooses the wrong taxonomy. PHoeNIx will force a rerun in cases where the taxonomy is known but the initial mlst is run against an incorrect scheme. Known instances found so far include E. coli (Pasteur) being incorrectly identified as Aeromonas and E. coli (Pasteur) being incorrectly identified as Klebsiella. The scoring in the MLST program was updated and can now cause lower-count perfect hits (e.g. 6 of 6 Aeromonas genes at 100%) to be scored higher than novel correct hits (e.g. 7 of 8 at 100%, 1 novel gene).
Corrected an instance where, in some cases when an MLST scheme could not be determined, a proper output file was not created.
Fixed issue with MLST where certain characters in a filename would cause an array index out of bounds error.
Fixed issue where samples that failed SPAdes did not have the --coverage parameter respected when generating the synopsis file.
Fixed -entry CDC_SCAFFOLDS providing incorrect headers (missing BUSCO and BUSCO_DB).
Updated FAIry (file integrity check) to catch additional file integrity errors.
- FAIry detects and reports when:
  - Fastq files are corrupt and prevent gzip and zcat from completing; a synopsis file is generated when needed.
  - R1/R2 fastq files do not have an equal number of reads.
  - No reads are left after read trimming, or no scaffolds are left after filtering.
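The first two kinds of check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not FAIry's actual implementation: the helper names `count_fastq_reads` and `check_pair` and the returned status strings are hypothetical, Python's `gzip` module stands in for the gzip/zcat calls, and reads are assumed to be standard four-line FASTQ records. Scaffold filtering is not covered.

```python
import gzip
import zlib


def count_fastq_reads(path):
    """Return the read count of a gzipped FASTQ, or None if it cannot be decompressed."""
    lines = 0
    try:
        # Reading to the end forces full decompression, so truncation
        # or a bad checksum surfaces as an exception.
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                lines += 1
    except (OSError, EOFError, zlib.error):
        return None
    # Assumption: four lines per FASTQ record.
    return lines // 4


def check_pair(r1_path, r2_path):
    """Classify a read pair as 'corrupt', 'unequal_reads', 'no_reads' or 'ok'."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    if n1 is None or n2 is None:
        return "corrupt"
    if n1 != n2:
        return "unequal_reads"
    if n1 == 0:
        return "no_reads"
    return "ok"
```

A call such as `check_pair("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (hypothetical file names) returns one of the four status strings.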

Container Updates:

Containers are now called with their sha256 to streamline PHoeNIx's validation process and align it with CLIA requirements.
Containers updated to include developers' bug fixes:
- fastp: v0.23.2 to v0.23.4 bug fixes.
- fastqc: v0.11.9 to v0.12.1 bug fixes.
- kraken2: v2.1.2 to v2.1.3 which has improvements on efficiency and bug fixes.
- fastani: v1.33 to v1.34 bug fixes. Specifically, it fixed multi-threading output bugs. Output and interface of FastANI remains same as before.
- amrfinderplus: v3.11.11 to v3.11.26 which has improvements on efficiency and bug fixes.
- sratools: v3.0.3 to v3.0.9 updates and bug fixes.
Container for SRA entry steps SRATOOLS_FASTERQDUMP and SRATOOLS_PREFETCH was switched to a quay.io/biocontainers container to address issues with the old container and ICA (commit 68815e3).
The srst2 container version stays the same, but it is now a custom container built from commit 73f885f55c748644412ccbaacecf12a771d0cae9, as there has been a bug fix for a rounding penalty to integer without a new release. In addition, a fix was added to address issues related to handling grepping of '(' and ')'. The updated container is hosted on quay.io.
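As an aside on calling containers by their sha256: a digest-pinned image reference ends in `@sha256:` followed by 64 hexadecimal characters, which makes the exact image content verifiable. The sketch below shows one way to check that a reference is pinned this way. The function name `is_digest_pinned` and the example image names are hypothetical and are not taken from the PHoeNIx configuration.

```python
import re

# A digest-pinned reference ends with "@sha256:" plus 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the container reference is pinned to a sha256 digest."""
    return DIGEST_RE.search(image_ref) is not None


if __name__ == "__main__":
    # Hypothetical references, for illustration only.
    pinned = "quay.io/example/tool@sha256:" + "a" * 64
    tagged = "quay.io/example/tool:1.0.0"
    print(is_digest_pinned(pinned), is_digest_pinned(tagged))
```

A tag such as `:1.0.0` can be re-pointed to a different image later, whereas a digest cannot, which is why digest pinning suits validation work.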

Database Updates:

MLST database was pulled from PubMLST and updated on Jan 24th, 2024.
The Plasmid Replicons database was updated to incorporate an updated Enterobacteriales.fsa database.
Curated AR gene database was updated on 2024-01-24 (yyyy-mm-dd) which includes:
- AMRFinderPlus database
  - Version 2023-11-15.1
- ARG-ANNOT
  - Unchanged since the last time the database was created; contains updates since version NT v6 (July 2019)
- ResFinder
  - Includes changes up to 2024-01-28 (commit 97d1fe0cd0a119172037f6bdb29f8a1c7c6e6019)
