v2.0.0
Implemented Enhancements:
- Entry point for scaffolds added using either -entry SCAFFOLDS or -entry CDC_SCAFFOLDS, which runs everything after the SPAdes step. New input parameters --indir and --scaffold_ext were added to support this entry point (commit f12da60).
  - Supports scaffold files from shovill, spades and unicycler.
- Entry point for SRA added using either -entry SRA or -entry CDC_SRA. These entry points pull samples from SRA based on what is passed to --input_sra, which is a file with one SRR number per line (commit a86ad3f).
- A check is now performed on input samplesheets to confirm that the same sample id, forward read and reverse read aren't used multiple times in the samplesheet (commit fd6127f).
- Changed many modules to process_single rather than process_low to reduce resource requirements for these steps.
- Updates to run PHX on nf-tower with an AWS back-end. Also updated the tower.yml file to have working reports.
- AMRFinder+ was updated to v3.11.11, which allows point mutation calling for Burkholderia cepacia species complex, Burkholderia pseudomallei species complex, Serratia marcescens and Staphylococcus_pseudintermedius.
- Argument --coverage added. Pass it to raise the coverage cutoff that causes a sample to fail the minimum QC standards (default is 30x).
- A public Kraken2 database is now required rather than requesting one from sharefile. For PHoeNIx >=2.0.0 you will need to download the public Standard-8 version kraken2 database created on or after March 14th, 2023 from Ben Langmead's github page. You CANNOT use an older version of the public kraken databases on Ben Langmead's github page. We thank @BenLangmead and @jenniferlu717 for taking the time to include an extra file in public kraken databases created after March 14th, 2023 to allow them to work in PHoeNIx!
- For PHoeNIx <=1.1.1 you will need to download the public Standard-8 version kraken2 database created on May 17, 2021 from Ben Langmead's github page. The download link is https://genome-idx.s3.amazonaws.com/kraken/k2_standard_8gb_20210517.tar.gz.
- The kraken database can be passed as an uncompressed folder or just in its downloaded .tar.gz form.
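The samplesheet check mentioned above can be pictured with a short sketch. This is only an illustration, not the pipeline's actual implementation: the function name find_duplicates and the column names sample, fastq_1 and fastq_2 are assumptions, so adjust them to the real samplesheet header.

```python
import csv
import io
from collections import Counter


def find_duplicates(samplesheet_text, columns=("sample", "fastq_1", "fastq_2")):
    """Return {column: [values used more than once]} for a CSV samplesheet.

    The column names are assumed for illustration only.
    """
    rows = list(csv.DictReader(io.StringIO(samplesheet_text)))
    duplicates = {}
    for column in columns:
        counts = Counter(row[column] for row in rows)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[column] = repeated
    return duplicates


# Hypothetical samplesheet: sample "A" and one reverse read are reused.
example = """sample,fastq_1,fastq_2
A,A_R1.fastq.gz,A_R2.fastq.gz
B,B_R1.fastq.gz,B_R2.fastq.gz
A,C_R1.fastq.gz,B_R2.fastq.gz
"""
print(find_duplicates(example))
```

An empty result means no sample id or read file is repeated in the sheet.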
Output File Changes:
- The folder fastqc was changed to fastqc_trimd to clarify that it contains results from the trimmed data.
- PROKKA module now outputs a .fsa file (nucleotide file of genes) rather than .fna, as the .fna file is really just the assembly file again.
- Added version for base container information for FAIRY, ASSET_CHECK, FORMAT_ANI, FETCH_FAILED_SUMMARIES, CREATE_SUMMARY_LINE, GATHER_SUMMARY_LINES, and GENERATE_PIPELINE_STATS. This was added to software_versions.yml.
- Changed the file/folder structure of some files for clarity and to make it less cluttered:
  - Folders Annotation and Assembly were changed to annotation and assembly, respectively, to keep continuity.
  - Files kraken2_asmbld/*.unclassified.fastq.gz and kraken2_asmbld/*.classified.fastq.gz were changed to kraken2_asmbld/*.unclassified.fasta.gz and kraken2_asmbld/*.classified.fasta.gz as they are actually fasta files.
  - *.fastANI.txt --> moved from ~/ANI/fastANI to ~/ANI.
  - The file *_trimmed_read_counts.txt that was in fastp_trimd was moved to the folder qc_stats.
  - Files *_fastqc.zip and *_fastqc.html in folder fastqc_trimd moved to qc_stats.
  - *.bbduk.log --> moved from ~/removedAdapters to ~/${sample}/qc_stats and removedAdapters is no longer an output folder.
  - raw_stats folder was created and contains ${sample}_raw_read_counts.txt and ${sample}_FAIry_synopsis.txt; previously these were in the folders fastp_trimd and FAIry, respectively.
- Sample GC% added to the *_GC_content_20230504.txt file.
- *_trimmed_read_counts.txt has a Paired_Sequenced_[reads] column added, as Total_Sequenced_[reads] is the number of the paired sequences and singletons.
- Files produced from FastANI, MASH and FORMAT_ANI had the mash database's date appended to the file name for tracking and validation. Files are now named *${sample}_REFSEQ_20230504.ani.txt, ${samplename}_REFSEQ_20230504.fastANI.txt, ${samplename}_REFSEQ_20230504_best_MASH_hits.txt and ${samplename}_REFSEQ_20230504.txt.
- GRiPHin file updates
  - New columns for WARNINGS, ALERTS, Minimum_QC_Issues, Total_Raw_[reads], Paired_Trimmed_[reads] and GC%.
  - New column Primary_MLST_Source was added to show if the assembly (MLST program) or reads (SRST2) were used for MLST determination.
  - Auto_PassFail and PassFail_Reason were changed to Minimum_QC_Checks and Minimum_QC_Issues, respectively. This was to clarify these are minimum requirements for QC.
  - The column Total_Sequenced_[bp] was removed from the report for lack of utility.
  - Q30_R1_[%], Q30_R2_[%], and Total_Sequenced_[reads] were relabelled as Raw_Q30_R1_[%], Raw_Q30_R2_[%] and Total_Trimmed_[reads], respectively, for clarity.
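For anyone comparing older GRiPHin reports against v2.0.0 output, the column renames and the removal listed above can be captured in a small lookup. This is a sketch for illustration only: the helper to_v2_column and the placeholder column Some_Other_Column are hypothetical and not part of the pipeline.

```python
# Old -> new GRiPHin column names, copied from the changes listed above.
RENAMED = {
    "Auto_PassFail": "Minimum_QC_Checks",
    "PassFail_Reason": "Minimum_QC_Issues",
    "Q30_R1_[%]": "Raw_Q30_R1_[%]",
    "Q30_R2_[%]": "Raw_Q30_R2_[%]",
    "Total_Sequenced_[reads]": "Total_Trimmed_[reads]",
}
# Column dropped in v2.0.0 with no replacement.
REMOVED = {"Total_Sequenced_[bp]"}


def to_v2_column(name):
    """Return the v2.0.0 name for an older column, or None if it was removed."""
    if name in REMOVED:
        return None
    # Columns that were not renamed keep their original name.
    return RENAMED.get(name, name)


# "Some_Other_Column" is a placeholder for any column that did not change.
old_header = ["Some_Other_Column", "Auto_PassFail", "Total_Sequenced_[bp]", "Q30_R1_[%]"]
print([to_v2_column(name) for name in old_header])
```

The lookup only translates names; dropping the removed column's values from old data would be a separate step.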
Fixed Bugs:
- Added module GET_RAW_STATS to get raw stats. Previously this information was pulled from the FASTP_TRIMD step; however, the input data there was post-BBDUK, which removes PhiX reads and adapters. Thus, the previous raw count was slightly off.
- Fixed python version information not showing up for GET_TAXA_FOR_AMRFINDER and GATHERING_TRIMD_READ_QC_STATS. This was added to software_versions.yml.
- Fixed issue where sample names with an underscore in them caused incorrect parsing and the contig number not showing up in GRiPHin reported genes (commit a0fdff5).
- Fixed the AttributeError: 'DataFrame' object has no attribute 'map' error that came up in the GRiPHin step when your set of samples had both a macrolide and macrolide_lincosamide_streptogramin AR gene (commit 460bdbc).
- Phoenix_Output_Report.tsv was reporting %Coverage for FastANI in the Taxa_Confidence column rather than %ID. Now both are reported when FastANI is successful (commit 3b26fec).
- GRiPHin_Report.xlsx was switched from reporting rounded numbers for coverage/similarity % to reporting the floor, as reporting 100% when 99.5% is the actual number is misleading and doesn't alert the user to SNPs in genes. With the floor, 99.5% is now reported as 99% (commit 5477627).
- Corrected GAMMA modules not printing the right version in the software_versions.yml file (commit 5477627).
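The difference between rounding and taking the floor in the GRiPHin_Report.xlsx fix above is easy to see in a few lines. This is a minimal sketch of the arithmetic only; the function name report_percent is hypothetical and does not come from the pipeline code.

```python
import math


def report_percent(value):
    """Report a percentage as its floor so a near match never shows as 100."""
    return math.floor(value)


# Compare the two behaviours for values at and just below a perfect match.
for value in (99.5, 99.99, 100.0):
    print(value, "rounded:", round(value), "floor:", report_percent(value))
```

With the floor, only an exact 100% match is reported as 100, so any value below it stays visibly below 100.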
Database Updates:
- Curated AR gene database was updated on 2023-05-17 (yyyy-mm-dd) which includes:
  - AMRFinderPlus database
    - Version 2023-04-17.1
  - ARG-ANNOT
    - Latest version NT v6 July 2019
  - ResFinder
    - Bumped from v2.0.0 to v2.1.0 including until 2023-04-12 (commit f46d8fc).
- Updated AMRFinder Database used by AMRFinder+ and GAMMA to v2023-04-17.1.
- The SRST2_MLST and MLST steps now use the mlst_db provided in ~/phoenix/assets/databases. This database is now static and no longer pulls updates from PubMLST.org. This keeps the pipeline running when PubMLST.org is down and keeps the schemes from changing if you run the same sample at different times. It was implemented to deal with PubMLST.org being down fairly often and with pipeline validation in mind.
Container Updates: