Bin QC Improvements #707

Merged
merged 38 commits into from
Dec 13, 2024

Conversation

dialvarezs
Contributor

@dialvarezs dialvarezs commented Oct 27, 2024

This PR adds:

  • CheckM2 as an alternative for bin qc
  • Updates CheckM and GUNC modules
  • A new BIN_QC subworkflow, integrating CheckM, CheckM2, BUSCO and GUNC

Closes #607.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

- Update modules
- Update integration in mag and with other tools (bin_summary, gtdb-tk)
- Update test
- Update schema
@dialvarezs dialvarezs marked this pull request as draft October 27, 2024 21:31
@jfy133
Member

jfy133 commented Oct 27, 2024

@nf-core-bot fix linting

@jfy133
Member

jfy133 commented Oct 27, 2024

Before you continue (sorry this is a bit late):

I generally don't like to deprecate old versions of tools for a while, but rather keep them as alternative tools.

In some cases people want to stick with the original version for compatibility with previous runs

Could you 'revert' (or reinstall) the old checkm module and wrap it in an if/else statement (but within the subworkflow :) )?

@muabnezor did a similar thing when adding porechop_ABI here: #674

@dialvarezs
Contributor Author

@jfy133 that makes sense, I will revert the CheckM removal. My bad for not asking before 😅.

@dialvarezs dialvarezs changed the title from "feat: Replace from CheckM to CheckM2" to "feat: Add CheckM2" Oct 27, 2024
@dialvarezs dialvarezs marked this pull request as ready for review October 28, 2024 10:30
@dialvarezs
Contributor Author

dialvarezs commented Oct 28, 2024

Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?

@jfy133
Member

jfy133 commented Oct 29, 2024

Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?

Yes that would be perfect! We need to subworkflow the sh*t out of this monster 😅 thank you!!!

@dialvarezs dialvarezs force-pushed the dev-checkm2 branch 2 times, most recently from a4f42ef to da52285 Compare November 1, 2024 00:27
@dialvarezs
Contributor Author

It should be ready now. There is a last minor issue that should be solved by this PR: nf-core/modules#7119

@jfy133
Member

jfy133 commented Dec 5, 2024

Running the failed test one more time. The CONCOCT test has failed a few times before too, right?

@dialvarezs
Contributor Author

From the CI logs it seems to be a lack of space in the runner.

@dialvarezs
Contributor Author

dialvarezs commented Dec 5, 2024

Oh, I realized something now. I totally forgot to skip Bin QC if params.skip_binqc is set. 😅
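
The control flow discussed in this thread (skip bin QC entirely when `skip_binqc` is set, otherwise run exactly one of the tools chosen via `binqc_tool`) can be sketched in a few lines. This is only an illustration in Python, not the Nextflow code from the PR; the function name `select_binqc_tool` and the plain-dict `params` are hypothetical, and BUSCO as the fallback is assumed from the "default" remark about test.config.

```python
from typing import Optional

# Tool names as used with --binqc_tool in this thread.
SUPPORTED_BINQC_TOOLS = ("busco", "checkm", "checkm2")


def select_binqc_tool(params: dict) -> Optional[str]:
    """Return the bin QC tool to run, or None when bin QC is skipped.

    Hypothetical helper: it mirrors the if/else wrapping discussed above,
    not the actual subworkflow implementation.
    """
    # The skip flag takes precedence over any tool choice.
    if params.get("skip_binqc", False):
        return None

    # Assumption: BUSCO is the default when no tool is given.
    tool = params.get("binqc_tool", "busco")
    if tool not in SUPPORTED_BINQC_TOOLS:
        raise ValueError(f"Unsupported binqc_tool: {tool!r}")
    return tool


print(select_binqc_tool({"binqc_tool": "checkm2"}))
print(select_binqc_tool({"binqc_tool": "checkm2", "skip_binqc": True}))
```

Checking the skip flag first keeps the bug described here from recurring: a tool choice alone never triggers QC when the user asked to skip it.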

@jfy133
Member

jfy133 commented Dec 6, 2024

Glad you worked it out!

Unfortunately, since yesterday I have both kids home sick, so I will only have time to test again next week (I'm really, really sorry about this; apparently it's a really bad year for viruses and the like in the kindergartens and schools in my city 😣)

Member

@jfy133 jfy133 left a comment

A few more last comments, but otherwise all I'm doing for the rest of the day is running the tests!

One major thing is missing now though: we've removed the standalone CheckM test in ci.yml, but it has not been replaced by an alternative test, so now we have no CheckM test... We should make sure all three tools are executed at least once across all our tests.

I suggest:

  • test.config: run BUSCO (default, no change)
  • test_adapterremoval.config: run checkm
  • test_bbnorm.config: run checkm2

Does that make sense?
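
The suggested assignment can be sanity-checked mechanically: every bin QC tool should be exercised by at least one test profile. A minimal sketch, assuming the mapping proposed above; the dictionary and the helper `uncovered_tools` are hypothetical and not part of the pipeline.

```python
# Tools that should each be run by at least one test profile.
REQUIRED_TOOLS = {"busco", "checkm", "checkm2"}

# Proposed assignment of test configs to bin QC tools (from the comment above).
TEST_PROFILE_TOOLS = {
    "test.config": "busco",
    "test_adapterremoval.config": "checkm",
    "test_bbnorm.config": "checkm2",
}


def uncovered_tools(profile_tools: dict, required: set = REQUIRED_TOOLS) -> set:
    """Return the required tools that no test profile runs."""
    return required - set(profile_tools.values())


print(uncovered_tools(TEST_PROFILE_TOOLS))  # empty set: all tools are covered

# Dropping the standalone CheckM test without a replacement leaves a gap.
without_checkm = {k: v for k, v in TEST_PROFILE_TOOLS.items() if v != "checkm"}
print(uncovered_tools(without_checkm))
```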

Comment on lines +110 to +120
if (params.save_busco_db) {
    // publish files downloaded by BUSCO
    ch_downloads = BUSCO.out.busco_downloads
        .groupTuple()
        .map { _lin, downloads -> downloads[0] }
        .toSortedList()
        .flatten()
    BUSCO_SAVE_DOWNLOAD(ch_downloads)

    ch_versions = ch_versions.mix(BUSCO_SAVE_DOWNLOAD.out.versions.first())
}
Member

Better description of second half of my old comment on this:

  1. This should go below the BUSCO module, given that it takes the output of the BUSCO module.
  2. Do you see any reason why this cannot just be replaced with an extra publishDir entry for BUSCO in modules.config that saves the database files when --save_busco_db is set?

@jfy133
Member

jfy133 commented Dec 11, 2024

Checked on your PR vs dev:

  • Normal test (BUSCO) (busco_summary.tsv and binsummary.tsv comparable ✅)
  • Test with CheckM (checkm_summary.tsv and binsummary.tsv comparable ✅)
  • Test with CheckM2 (checkm_summary.tsv exists and binsummary.tsv has table merged with quast results ✅)
  • Test with checkm2 + GUNC (check GUNC executes basically and produces a gunc_Summary.tsv)

Sorry this has taken so long, but the tests are taking around 25-30m each time 🙄 which doesn't help

Ah, I need to do one more manual check with GTDBTk, which I'll have to do on a cluster tomorrow, as you've tweaked that slightly

  • Test with GTDBTk

But I'm pretty sure this is ready once my final comments above are addressed :) , mostly the missing addition of the CheckMs to the tests.

@jfy133
Member

jfy133 commented Dec 11, 2024

Crap, BUSCO is borked everywhere isn't it 😢

@dialvarezs
Contributor Author

dialvarezs commented Dec 11, 2024

Yeah, it's already reported here: https://gitlab.com/ezlab/busco/-/issues/776 and it seems that the only solution is to update. Today I have been looking at the code to update and migrate to the nf-core BUSCO module.

Do you mind if I integrate those changes in this PR, or do you prefer a new one?

For the meantime, can you give a review to the module update? nf-core/modules#7199
I'm getting an error with the meta that I'm not sure how to solve...

@jfy133
Member

jfy133 commented Dec 11, 2024

Self note: testing GTDBtk with following commands:

  1. CHECKM-NEW: nextflow run dialvarezs/mag -r dev-checkm2 -profile test,mpcdf_viper --outdir ./results_dialvarezs_checkm_gdtbk --binqc_tool checkm2 --skip_gtdbtk false --gtdb_db /ptmp/jfellowsy/databases/gtdb/release220/ --binqc_tool checkm --checkm_db ../databases/checkm/ --krona_db /ptmp/jfellowsy/databases/krona/taxonomy/ --centrifuge_db /ptmp/jfellowsy/mag/testdbs/minigut_cf.tar.gz --kraken2_db /ptmp/jfellowsy/mag/testdbs/minigut_kraken.tgz -resume ✅ (I was a bit confused because I specified checkm2 and accidentally gave the checkm1 db, but checkm1 executed because I passed --binqc_tool twice 😅)
  2. CHECKM-OLD: nextflow run nf-core/mag -r dev -profile test,mpcdf_viper --outdir ./results_checkm_gdtbk --skip_gtdbtk false --gtdb_db /ptmp/jfellowsy/databases/gtdb/release220/ --binqc_tool checkm --checkm_db ../databases/checkm/ --krona_db /ptmp/jfellowsy/databases/krona/taxonomy/ --centrifuge_db /ptmp/jfellowsy/mag/testdbs/minigut_cf.tar.gz --kraken2_db /ptmp/jfellowsy/mag/testdbs/minigut_kraken.tgz ✅ (multiprocessing "socket in use" error -> two CheckM jobs on the same node break CheckM; will either set maxForks or just keep resuming)
  3. CHECKM2-NEW: nextflow run dialvarezs/mag -r dev-checkm2 -profile test,mpcdf_viper --outdir ./results_dialvarezs_checkm2_gdtbk --binqc_tool checkm2 --skip_gtdbtk false --gtdb_db /ptmp/jfellowsy/databases/gtdb/release220/ --checkm_db /ptmp/jfellowsy/databases/checkm2/ --krona_db /ptmp/jfellowsy/databases/krona/taxonomy/ --centrifuge_db /ptmp/jfellowsy/mag/testdbs/minigut_cf.tar.gz --kraken2_db /ptmp/jfellowsy/mag/testdbs/minigut_kraken.tgz
  4. BUSCO-NEW: nextflow run dialvarezs/mag -r dev-checkm2 -profile test,mpcdf_viper --outdir ./results_dialvarezs_busco_gdtbk --binqc_tool busco --skip_gtdbtk false --gtdb_db /ptmp/jfellowsy/databases/gtdb/release220/ --busco_db /ptmp/jfellowsy/databases/busco/bacteria_obd10 --krona_db /ptmp/jfellowsy/databases/krona/taxonomy/ --centrifuge_db /ptmp/jfellowsy/mag/testdbs/minigut_cf.tar.gz --kraken2_db /ptmp/jfellowsy/mag/testdbs/minigut_kraken.tgz
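
The mix-up in command 1 (passing `--binqc_tool` twice) is easy to miss in commands this long. A small, hypothetical helper like the one below can flag repeated long options before a run is launched. It only inspects the command string and makes no claim about how Nextflow itself resolves duplicates; the shortened command is for illustration.

```python
import shlex
from collections import Counter


def duplicated_flags(command: str) -> list:
    """Return long options (--name) that appear more than once, in first-seen order."""
    tokens = shlex.split(command)
    # Treat "--flag=value" and "--flag value" the same by keeping only the name.
    names = [tok.split("=", 1)[0] for tok in tokens if tok.startswith("--")]
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


# Shortened version of command 1 above.
cmd = (
    "nextflow run dialvarezs/mag -r dev-checkm2 -profile test "
    "--binqc_tool checkm2 --skip_gtdbtk false --binqc_tool checkm"
)
print(duplicated_flags(cmd))
```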

@jfy133
Member

jfy133 commented Dec 11, 2024

Yeah, it's already reported here: https://gitlab.com/ezlab/busco/-/issues/776 and it seems that the only solution is to update. Today I have been looking at the code to update and migrate to the nf-core BUSCO module.

Do you mind if I integrate those changes in this PR, or do you prefer a new one?

Let's do a new one. It's breaking the small MEGAHIT fix PR too, so if it's separate we can pull it into that one and this one at the same time.

For the meantime, can you give a review to the module update? nf-core/modules#7199 I'm getting an error with the meta that I'm not sure how to solve...

Will have a look now! EDIT: fixed!

@jfy133
Member

jfy133 commented Dec 13, 2024

Old BUSCO files should be back now!

And I'm waiting for my last GTDBTk-related check, but all the other runs are looking good!

@dialvarezs
Contributor Author

Old BUSCO files should be back now!

That's good news. The BUSCO update PR ended up getting a bit large with the removal of all those local modules.

@jfy133
Member

jfy133 commented Dec 13, 2024

@dialvarezs My manual tests with GTDBTk work!

Just need the test configs as in my comment above, but otherwise this is ready :D

@nf-core-bot
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@dialvarezs
Contributor Author

@jfy133 great!
And I agree with you about the tests, I will add that right now.

@jfy133 jfy133 self-requested a review December 13, 2024 15:11
Member

@jfy133 jfy133 left a comment

If tests pass with the new configs...

@jfy133
Member

jfy133 commented Dec 13, 2024

Thank you very much @dialvarezs ! This is huge work!

@dialvarezs
Contributor Author

dialvarezs commented Dec 13, 2024

There was a last minor bug when CheckM was not run for certain bins (specifically eukaryotic ones). I updated the condition to check if the CheckM bins are a subset of the depth bins, rather than requiring them to be equal.
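
The difference between the two conditions can be illustrated with plain sets. This is a sketch with made-up bin names and a hypothetical function name, not the pipeline's actual code: when some bins (for example eukaryotic ones) never go through CheckM, an equality check fails while a subset check still passes.

```python
def checkm_bins_are_valid(checkm_bins, depth_bins) -> bool:
    """True if every bin with a CheckM result is also present in the depth bins."""
    return set(checkm_bins) <= set(depth_bins)


# Hypothetical bin names: one eukaryotic bin is not assessed by CheckM.
depth_bins = {"bin.1", "bin.2", "bin.3_eukaryote"}
checkm_bins = {"bin.1", "bin.2"}

print(checkm_bins == depth_bins)                       # old condition: False
print(checkm_bins_are_valid(checkm_bins, depth_bins))  # new condition: True
```

The subset check still rejects the case that matters, namely a CheckM result for a bin that has no depth entry.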

@dialvarezs dialvarezs merged commit c096f9a into nf-core:dev Dec 13, 2024
15 checks passed