Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pipeline linting for configuration files such that all directives which use params are closures. #2508

Open
mahesh-panchal opened this issue Nov 9, 2023 · 3 comments
Labels
Milestone

Comments

@mahesh-panchal
Copy link
Member

mahesh-panchal commented Nov 9, 2023

Description of feature

From slack: It's still common that users want to supply parameters using -c. This often leads to users encountering this issue: nextflow-io/nextflow#2662.

From the developer side, one change we can make is that all process directives which use params should be nested in a closure.
E.g.

  • ext.args = "$params.something" -> ext.args = { "$params.something" }.
  • ext.prefix = "$params.something" -> ext.prefix = { "$params.something" }.
  • publishDir = [ path: ".../$params.something", mode: params.publish_dir_mode ] -> publishDir = [ path: { ".../$params.something" } , mode: params.publish_dir_mode ]

mode: of publishDir doesn't accept a closure, so should be the only case ignored.

There are some cases of using params in enabled: of publishDir (eg. rnaseq) which also doesn't evaluate closures. These should be linted to move the enabled: test into saveAs: to allow closures.

@ewels ewels added linting and removed enhancement labels Nov 9, 2023
@ewels ewels added this to the 2.10 milestone Nov 9, 2023
@mahesh-panchal
Copy link
Member Author

It's seems there are some parameters used outside the process directive scope in the nextflow.config and these must be passed on the command-line or with a -params-file.

  • params.outdir: used in timeline, trace, report, and dag.
  • params.max_time: used in check_max function.
  • params.max_memory: used in check_max function.
  • params.max_cpus: used in check_max function.
  • params.genomes: used in nextflow.config.
  • params.igenomes_ignore: used in nextflow.config.
  • params.igenomes_base: used in igenomes.config.

@mirpedrol mirpedrol modified the milestones: 2.10, 2.11 Nov 10, 2023
@ewels ewels modified the milestones: 2.11, 2.12 Dec 8, 2023
@mirpedrol mirpedrol modified the milestones: 2.12, 2.13 Jan 29, 2024
@mashehu mashehu modified the milestones: 2.13, 3.0 Feb 16, 2024
@ewels ewels modified the milestones: 3.0, 3.1 Sep 26, 2024
@ewels ewels modified the milestones: 3.1, 3.2 Nov 21, 2024
@ewels
Copy link
Member

ewels commented Nov 21, 2024

From the developer side, one change we can make is that all process directives which use params should be nested in a closure.

@aostiles - we were wondering if RefTrace might be able to check for this kind of subtle syntax issue, before we start throwing ourselves into 100s lines of regular expressions 😅 What do you think?

@aostiles
Copy link

aostiles commented Dec 3, 2024

Yes, RefTrace can do that. I've updated it and pushed it to PyPI.
pip install reftrace will grab the lastest version.

I've updated the docs: https://reftrace.com/docs/lint-config-files to reflect the most recent changes and included an API reference. I'm working on fleshing out the docs and will make it easy for others to contribute. In the mean time, please feel free to ping me here, on Slack, or make an issue in the main repo.

Here's the current workflow:

pip install reftrace
reftrace generate  # generate a rules.py with both config and module linting rules
# edit rules.py to add/remove rules
retrace lint  # run the linter, linting all modules and *.config files

While RefTrace's core parsing logic is in Go, it's goal is to be easily usable from Python. There's nothing magic going on and it's possible to import pdb; pdb.set_trace() for rule debugging.

Specifically for this issue, one would write a rule with this signature

@configrule
def must_use_closures(config: ConfigFile, results: LintResults):

The relevant part exposed is a list of process scopes. For example:

process {
    withName: 'GUNZIP_.*|MAKE_TRANSCRIPTS_FASTA' {
        publishDir = [
            path: { "${params.outdir}/genome" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            enabled: params.save_reference
        ]
    }

    withName: 'UNTAR_.*' {
        ext.args2 = '--no-same-owner'
    }

    withName: 'UNTAR_.*|STAR_GENOMEGENERATE|STAR_GENOMEGENERATE_IGENOMES|HISAT2_BUILD' {
        publishDir = [
            path: { "${params.outdir}/genome/index" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            enabled: params.save_reference
        ]
    }
}

would be a single process scope with three named scopes. I realize this doesn't fully represent the config file format, but seemed like a good starting point.

Please let me know if you have questions or run into issues. The pip-installable tool has automated tests and I've manually tested on Mac M1 and x86 Linux. Additional platforms shouldn't be hard to add. It supports Python 3.9+. Most of the heavy lifting is done so the tool should be easy to change. I'm working on better docs so others can change it too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants