-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a config file to define hardcoded parameters #9
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Automated checks passed and local check passed:
git clone https://github.com/nextstrain/measles.git
cd measles
git checkout move-params-to-config
nextstrain build .
nextstrain view auspice
Changes look good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, @kimandrews!
Following from conversation on Slack in support of this idea, can you also move the hardcoded file paths for configuration files to the config YAML file, too? I would follow the pattern we use for the ncov workflow config files where we define a top-level "files" key in the YAML such that most of the entries in the files
rule map like this:
files:
exclude: "config/dropped_strains.txt"
reference: "config/measles_reference.gb"
colors: "config/colors.tsv"
auspice_config: "config/auspice_config.json"
Since input_fasta
doesn't get used anywhere and we instead download the data we need to start the workflow, we could just delete that line.
After discussion with the group, we decided to additionaly move hardcoded file paths for configuration files to the config YAML file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks, @kimandrews!
files: | ||
exclude: "config/dropped_strains.txt" | ||
reference: "config/measles_reference.gb" | ||
colors: "config/colors.tsv" | ||
auspice_config: "config/auspice_config.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
I would follow the pattern we use for the ncov workflow config files where we define a top-level "files" key in the YAML such that most of the entries in the files rule map like this:
@huddlej, can you share some details on the reason ncov was originally set up like this?
Specifically, why group under the top level files
key instead of adding the files to specific param groups like so?
filter:
exclude: "config/dropped_strains.txt"
...
align:
reference: "config/measles_reference.gb"
export:
colors: "config/colors.tsv"
auspice_config: "config/auspice_config.json"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joverlee521 That's a good question! The ncov pattern came from the same impulse we felt in this PR to replace a files
rule in the Snakefile with a YAML config. I think I wanted to emphasize the placement of the configuration details instead of suggesting their reorganization. (My memories from this time period are hazy though!)
The original reason for grouping all the config file paths into one section of the Snakefile must date back farther, but I suspect the main reason for that structure is that the same files can be needed by multiple rules. For example, we might need the reference
file for align
, ancestral
, and translate
. In the flu workflow, we need to reference the list of strains from include
in both augur filter rules and a later flag outliers rule. Even files that seem less ambiguously connected to a specific rule like the Auspice config JSON is to export
, we also use the color scales from those JSONs in seasonal flu across multiple rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for the context @huddlej!
Grouping files together makes sense if expected to be used multiple times. I also see the other side of it being confusing to need to edit configs for the filter
rule under both files
and filter
. It's also not be immediately obvious looking at the config what rules use which files (unless you have detailed docs like ncov).
Currently, I see configs using a mix of top level files
key, top level file name keys, and rule specific file name keys. We should come to some consensus on this schema as a group...I'll write up an issue on this in pathogen-repo-template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also not be immediately obvious looking at the config what rules use which files (unless you have detailed docs like ncov).
I have spent plenty of time following the strings to work out which rules use which files (mainly in ncov), and often finding a bunch of files no longer used by any rule. Docs are great, but from experience they inevitably fall out of sync. How about something like the following (or a variant thereof)?
files:
exclude: &files_exclude 'config/exclude_accessions.txt'
filter:
exclude: *files_exclude
P.S. Thanks @joverlee521 for the links to examples in your comment, it makes a big difference ✨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can continue discussion in nextstrain/pathogen-repo-guide#26
Description of proposed changes
Use a config file to define hardcoded parameters and file paths
Checklist