Standardized composable configs #79

Open
joverlee521 opened this issue Dec 10, 2024 · 3 comments

@joverlee521
Contributor

joverlee521 commented Dec 10, 2024

This is a meta issue for tracking work around "composable configs".

Context

We don't have a centralized config schema for phylogenetic workflows because each pathogen runs different Augur commands and custom scripts that use different params. The config gets even more complicated when the workflow creates multiple builds (e.g. flu subtype x segment x time resolution). Since the workflows are authored by different people, we end up with varying config schemas that can be confusing to outside users. With the config file as the main UI for external users of workflows, we should make them easier to work with!

Documenting available config params and their default values

I don't think it's realistic to write and maintain detailed workflow config docs like we have for ncov, so I've been trying to find a way to have centralized documentation for config files.

Making it easy to override configs

With nextstrain build and the forthcoming workflows as programs, users can provide custom config files to override the default config params. This is relatively straightforward for single-build workflows (as long as the config params are well documented), but it can be tedious for multi-build workflows, as discussed on Slack. The path forward here is less clear, but here is some related work around this:

Validating configs during the workflow

It'd be useful to get immediate feedback during the workflow run by validating the user's custom config file. This should flag missing required config params and config params in the config file that are not being used in the workflow. This would require maintenance of a config schema that matches the use of configs in the workflow.
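As a rough illustration of the two checks described above (missing required params and params the workflow never reads), here is a minimal Python sketch. The `REQUIRED_PARAMS` and `OPTIONAL_PARAMS` sets and both function names are hypothetical stand-ins for whatever schema a workflow would actually maintain; they are not an existing Nextstrain schema or API.

```python
from typing import Any

# Hypothetical schema for one build's config. These names are illustrative
# assumptions, not an existing Nextstrain schema.
REQUIRED_PARAMS = {"filter.group_by", "filter.min_date"}
OPTIONAL_PARAMS = {"filter.min_length"}


def flatten_keys(config: dict[str, Any], prefix: str = "") -> set[str]:
    """Return dotted paths for every leaf value in a nested config dict."""
    keys = set()
    for name, value in config.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict) and value:
            keys |= flatten_keys(value, path)
        else:
            keys.add(path)
    return keys


def validate_config(config: dict[str, Any]) -> list[str]:
    """Collect every problem rather than stopping at the first one."""
    present = flatten_keys(config)
    known = REQUIRED_PARAMS | OPTIONAL_PARAMS
    problems = [
        f"missing required param: {key}"
        for key in sorted(REQUIRED_PARAMS - present)
    ]
    problems += [
        f"unused param (not read by the workflow): {key}"
        for key in sorted(present - known)
    ]
    return problems


# A user config with one required param missing and one misspelled param.
user_config = {"filter": {"group_by": "country year", "min_lenght": 9000}}
for problem in validate_config(user_config):
    print(problem)
```

Running a check like this at the start of the workflow would surface the misspelled `min_lenght` right away instead of silently falling back to a default. The maintenance cost noted above still applies: the schema has to be kept in sync with what the workflow reads.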

@tsibley
Member

tsibley commented Dec 11, 2024

I talked about this with @joverlee521 during our 1:1 yesterday, and I've mentioned it elsewhere too, but I think there's a lot to be said for approaching our configs as "small multiples":

One build (i.e. one set of Auspice JSONs) == one "small" config document and the config for multi-build workflows == a collection (dict/list) of these small config documents.

builds:
  zika:
    filter:
      group_by: 
      min_date: 
      min_length: 

builds:
  avian-flu/h5n1/ha/all-time:
    filter:
      group_by: 
      min_date: 
      min_length: 

  avian-flu/h5n1/ha/2y:
    filter:
      group_by: 
      min_date: 
      min_length: 

  avian-flu/h5n1/mp/all-time:
    filter:
      group_by: 
      min_date: 
      min_length: 

  

# or, maybe alternatively, nested: <https://github.com/joverlee521/nextstrain-testing/blob/cba0c7e5/configs/configs/avian-flu.yaml>

Benefits:

  • Easier to document/explain/teach: the "small" config that's repeated is simpler because it's for a single build and it's explainable in isolation; the overall config is then explainable as many "small" configs, one per build.
  • Works consistently for single-build or multi-build workflows.
  • All config fields are always settable for every build; we don't need to pick and choose the supported granularity for each field (and realize later we picked wrong).
  • Can be written by hand, with repetition optionally elided by YAML anchors.
  • Can be generated from a more concise config by other means (CUE, custom programs, whatever).
  • Simplifies workflow authoring: a build's entire config is accessed via config["builds"][f"avian-flu/{w.subtype}/{w.segment}/{w.resolution}"]. This is straightforward dictionary access without need for extra lookup functions.
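To make the last point concrete, here is a small Python sketch of that lookup outside of Snakemake. The parameter values are made up, `build_config` is a hypothetical helper name, and `SimpleNamespace` only stands in for the wildcards object that Snakemake passes to input/params functions.

```python
from types import SimpleNamespace

# Fully-expanded config: one "small" config per build, keyed by build name.
# The parameter values are made up for illustration.
config = {
    "builds": {
        "avian-flu/h5n1/ha/all-time": {
            "filter": {"min_length": 1500, "min_date": "1996"},
        },
        "avian-flu/h5n1/ha/2y": {
            "filter": {"min_length": 1500, "min_date": "2Y"},
        },
    }
}


def build_config(wildcards):
    """Return the whole config for the build identified by the wildcards."""
    name = f"avian-flu/{wildcards.subtype}/{wildcards.segment}/{wildcards.resolution}"
    return config["builds"][name]


# Stand-in for Snakemake's wildcards object (an assumption for this sketch).
w = SimpleNamespace(subtype="h5n1", segment="ha", resolution="2y")
print(build_config(w)["filter"]["min_date"])
```

Because every build carries its full config, no fallback or inheritance logic is needed at lookup time; a missing build name simply raises a `KeyError`.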

Taking this idea further, I see two main places of interaction with config:

  1. What the user writes (e.g. their config.yaml in their working analysis directory, written by modifying an example or from scratch).
  2. What the workflow reads (e.g. accessing the config variable, what the workflow author writes).

We've not treated those separately, i.e. the config is ~identical between the two, but I think we should start treating them separately:

  • The human-written config should be concise and expressive, e.g. supporting things like @jameshadfield's globbing.
  • The machine-read config should be verbose and fully-expanded (as with the "small multiples" idea above). It should be statically accessible without (or with very minimal) use of extra functions. This simplifies the config interface for the author of the workflow.
  • The human-written config should be expanded to the machine-read config either a) outside the workflow, before invocation or b) very early in the workflow's initialization (probably more practical). This can be via CUE or via other means.
  • Even if (b), the workflow should still be able to take a fully-expanded config generated by other means (e.g. custom programs, à la how Augur is fine with custom generated node data files).

A concise and expressive syntax such as globbing seems easier to explain/teach with the "small multiples" approach: the key concept is that the concise syntax is expanded to the collection of small configs, and this expansion can be previewed in advance of actually running the workflow.
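One way the expansion step could look is sketched below in Python. The concise format (a list of build names plus glob patterns mapped to overrides), the function names `expand` and `deep_update`, and all parameter values are assumptions made for this sketch; it is not the globbing syntax referenced above, only an illustration of expanding a concise config into a collection of small ones.

```python
import copy
import fnmatch
import json


def deep_update(base, overrides):
    """Recursively merge overrides into base; override values win."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            # Copy so that builds never share nested dicts with each other.
            base[key] = copy.deepcopy(value)
    return base


def expand(concise):
    """Expand a concise config into one fully-specified config per build.

    Patterns are applied in the order written, so later patterns override
    earlier ones for the builds they match.
    """
    expanded = {}
    for name in concise["builds"]:
        build = {}
        for pattern, overrides in concise["params"].items():
            if fnmatch.fnmatchcase(name, pattern):
                deep_update(build, overrides)
        expanded[name] = build
    return {"builds": expanded}


# Hypothetical human-written config (values are made up).
concise = {
    "builds": [
        "avian-flu/h5n1/ha/all-time",
        "avian-flu/h5n1/ha/2y",
        "avian-flu/h5n1/mp/all-time",
    ],
    "params": {
        "*": {"filter": {"group_by": "country year", "min_length": 1000}},
        "*/ha/*": {"filter": {"min_length": 1500}},
        "*/2y": {"filter": {"min_date": "2Y"}},
    },
}

# Preview the machine-read config before running the workflow.
print(json.dumps(expand(concise), indent=2))
```

Printing the expanded result is the "preview" step: a user can check exactly which values each build ends up with, and a workflow that only reads `config["builds"]` would accept this output whether it came from a function like this, from CUE, or from a hand-written file.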

@huddlej

huddlej commented Dec 11, 2024

We discussed CUE a bit back in Jan 2022 and I ended up testing CUE for seasonal flu's config, but the consensus at the time was summarized by @rneher's comment of "I don't think we should get to hung up on how to generate configs."

@tsibley
Member

tsibley commented Dec 16, 2024

@huddlej Thanks for digging up that previous discussion and example! I'd forgotten about that (and it's interesting to look at other examples in that Slack thread). I was advocating for a "small multiples" approach then too:

When you push up multi-config out of the config itself, you open up all sorts of possibilities for how to produce those configs (CUE, Python, copy/paste, whatever). and don't force someone to learn the bespoke hardcoded expansion rules or behaviour

In response to:

"I don't think we should get to hung up on how to generate configs."

I see the "small multiples" approach as intentionally not getting hung up on how configs are generated by making it possible to generate/produce them many ways. The alternative of a single complex config with bespoke composition methods more easily leads IMO to getting hung up on exactly what you can and can't compose and how. All this said though, I (still) disagree that "sweating the details" in the context of improving usability is "getting hung up on" them.
