Standardized composable configs #79

joverlee521 · 2024-12-10T18:46:14Z

This is a meta issue for tracking work around "composable configs"

Context

We don't have a centralized config schema for phylogenetic workflows because each pathogen runs different Augur commands and custom scripts that use different params. The config gets even more complicated when the workflow creates multiple builds (e.g. flu subtype x segment x time resolution). Since the workflows are authored by different people, we end up with varying config schemas that can be confusing to outside users. Since the config file is the main UI for external users of workflows, we should make configs easier to work with!

Documenting available config params and their default values

I don't think it's realistic to write and maintain detailed workflow config docs like we have for ncov, so I've been trying to find a way to have centralized documentation for config files.

Add standard logging of config values #18
How should we standardize config schema for files? #26
The October 2024 write-up on workflow configs discusses some standard guidelines that we can use when documenting workflow configs.
03 December 2024 lab meeting explores the idea of having a centralized composable config schema for the shared Augur commands. Each pathogen workflow would use a subset of the centralized schema and then extend it with pathogen specific config schemas. We can create docs for the centralized config schema and only maintain pathogen specific docs per repo.
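A rough sketch of how the "subset, then extend" idea could look, assuming schemas are plain Python dicts keyed by Augur command. `CENTRAL_SCHEMA`, `compose_schema`, the `exclude` field, and the `custom_script` section are made up for illustration; `group_by`, `min_date`, and `min_length` are borrowed from the filter examples further down this thread.

```python
# Hypothetical central schema: one section per shared Augur command,
# mapping each param name to its expected type.
CENTRAL_SCHEMA = {
    "filter": {"group_by": str, "min_date": str, "min_length": int},
    "refine": {"coalescent": str},
    "traits": {"columns": str},
}


def compose_schema(central, use, extensions=None):
    """Pick a subset of the central schema, then add pathogen-specific fields."""
    unknown = [name for name in use if name not in central]
    if unknown:
        raise KeyError(f"not in central schema: {unknown}")
    # Copy each section so the central schema is never mutated.
    schema = {name: dict(central[name]) for name in use}
    for name, fields in (extensions or {}).items():
        schema.setdefault(name, {}).update(fields)
    return schema


# A pathogen repo uses two shared sections and documents only its additions.
zika_schema = compose_schema(
    CENTRAL_SCHEMA,
    use=["filter", "refine"],
    extensions={
        "filter": {"exclude": str},            # extra field on a shared section
        "custom_script": {"threshold": float},  # pathogen-specific section
    },
)
print(sorted(zika_schema))
```

With this split, the docs for `filter` and `refine` would live with the central schema, and the pathogen repo would only need to describe what it passes as `extensions`.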

Making it easy to override configs

With nextstrain build and the forthcoming workflows as programs, users can provide custom config files to override the default config params. This is relatively straightforward for single-build workflows (as long as the config params are well documented), but it can be tedious for multi-build workflows, as discussed on Slack. The path forward here is less clear, but here is some related work:

implement composable auspice config file for augur export v2 augur#298 (although this issue is focused on the auspice config, there are similar ideas)
Rewrite config syntax avian-flu#104
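For reference, the single-build override case can be sketched as a recursive merge of a user config over the defaults. This is only one possible set of merge semantics, not a description of what nextstrain build or any existing workflow does; `deep_merge` and all param values here are hypothetical.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict with `overrides` layered on top of `defaults`.

    Nested dicts are merged key by key; any other value (including lists)
    is replaced wholesale. Neither input is modified.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder values, for illustration only.
defaults = {
    "filter": {"group_by": "country year", "min_length": 9000},
    "refine": {"coalescent": "opt"},
}
user = {"filter": {"min_length": 5000}}

config = deep_merge(defaults, user)
print(config["filter"])
```

The user only restates the param they want to change. The multi-build pain point is that, depending on how the workflow's config is shaped, the same override may have to be repeated or routed through workflow-specific lookup rules for each build.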

Validating configs during the workflow

It'd be useful to get immediate feedback during the workflow run by validating the user's custom config file. This should flag missing required config params and config params in the config file that are not being used in the workflow. This would require maintenance of a config schema that matches the use of configs in the workflow.
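A minimal sketch of the two checks described above (missing required params, and params the workflow never uses), assuming a toy schema format where each key maps to `True` (required), `False` (optional), or a nested dict. The schema format and the `validate_config` helper are invented for this example; a real workflow would more likely lean on an existing JSON Schema validator.

```python
def _has_required(schema):
    """True if any param in this (possibly nested) schema is required."""
    return any(
        rule is True or (isinstance(rule, dict) and _has_required(rule))
        for rule in schema.values()
    )


def validate_config(config, schema, path=""):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    for key, rule in schema.items():
        where = f"{path}{key}"
        if key not in config:
            # A missing section only matters if something inside it is required.
            if rule is True or (isinstance(rule, dict) and _has_required(rule)):
                problems.append(f"missing required param: {where}")
        elif isinstance(rule, dict):
            if isinstance(config[key], dict):
                problems.extend(validate_config(config[key], rule, where + "."))
            else:
                problems.append(f"expected a mapping at: {where}")
    for key in config:
        if key not in schema:
            problems.append(f"unused param (not in schema): {path}{key}")
    return problems


schema = {"filter": {"group_by": True, "min_date": False, "min_length": True}}
user_config = {"filter": {"group_by": "country", "min_lenght": 5000}}  # note the typo

problems = validate_config(user_config, schema)
for problem in problems:
    print(problem)
# missing required param: filter.min_length
# unused param (not in schema): filter.min_lenght
```

Collecting every problem before reporting, rather than failing on the first, lets the user fix a typo like the one above in a single pass.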

tsibley · 2024-12-11T18:14:10Z

I talked about this with @joverlee521 during our 1:1 yesterday, and I've mentioned it elsewhere too, but I think there's a lot to be said for approaching our configs as "small multiples":

One build (i.e. one set of Auspice JSONs) == one "small" config document and the config for multi-build workflows == a collection (dict/list) of these small config documents.

builds:
  zika:
    filter:
      group_by: …
      min_date: …
      min_length: …

builds:
  avian-flu/h5n1/ha/all-time:
    filter:
      group_by: …
      min_date: …
      min_length: …

  avian-flu/h5n1/ha/2y:
    filter:
      group_by: …
      min_date: …
      min_length: …

  avian-flu/h5n1/mp/all-time:
    filter:
      group_by: …
      min_date: …
      min_length: …

  …

# or, maybe alternatively, nested: <https://github.com/joverlee521/nextstrain-testing/blob/cba0c7e5/configs/configs/avian-flu.yaml>

Benefits:

Easier to document/explain/teach: the "small" config that's repeated is simpler because it's for a single build and it's explainable in isolation; the overall config is then explainable as many "small" configs, one per build.
Works consistently for single-build or multi-build workflows.
All config fields are always settable for every build; we don't need to pick and choose the supported granularity for each field (and realize later we picked wrong).
Can be written by hand, with repetition optionally elided by YAML anchors.
Can be generated from a more concise config by other means (CUE, custom programs, whatever).
Simplifies workflow authoring: a build's entire config is accessed via config["builds"][f"avian-flu/{w.subtype}/{w.segment}/{w.resolution}"]. This is straightforward dictionary access without the need for extra lookup functions.
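To illustrate the "generated by other means" and "plain dictionary access" points together, here is a small sketch that produces the fully expanded `builds` dict from a concise description. The `expand_builds` helper and every param value are placeholders invented for the example; only the build naming pattern comes from the config above.

```python
import copy
from itertools import product


def expand_builds(pathogen, subtypes, segments, resolutions, build_config):
    """Return {"builds": {...}} with one independent small config per build."""
    builds = {}
    for subtype, segment, resolution in product(subtypes, segments, resolutions):
        name = f"{pathogen}/{subtype}/{segment}/{resolution}"
        # Deep copy so that editing one build never leaks into another.
        builds[name] = copy.deepcopy(build_config)
    return {"builds": builds}


# Placeholder values; the real ones are elided in the examples above.
small_config = {
    "filter": {"group_by": "region year", "min_date": "2000-01-01", "min_length": 1000}
}

config = expand_builds(
    "avian-flu", ["h5n1"], ["ha", "mp"], ["all-time", "2y"], small_config
)

# Every field stays settable per build, at full granularity.
config["builds"]["avian-flu/h5n1/ha/2y"]["filter"]["min_date"] = "2023-01-01"

# What a workflow rule would do: direct access, no lookup helper.
subtype, segment, resolution = "h5n1", "ha", "2y"
build = config["builds"][f"avian-flu/{subtype}/{segment}/{resolution}"]
print(build["filter"]["min_date"])
```

The generator is entirely outside the workflow here, so it could just as well be CUE, a hand-written YAML file with anchors, or copy/paste; the workflow only ever sees the expanded dict.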

Taking this idea further, I see two main places of interaction with config:

What the user writes (e.g. their config.yaml in their working analysis directory, written by modifying an example or from scratch).
What the workflow reads (e.g. accessing the config variable, what the workflow author writes).

We've not treated those separately, i.e. the config is ~identical between the two, but I think we should start treating them separately:

The human-written config should be concise and expressive, e.g. supporting things like @jameshadfield's globbing.
The machine-read config should be verbose and fully-expanded (as with the "small multiples" idea above). It should be statically accessible without (or with very minimal) use of extra functions. This simplifies the config interface for the author of the workflow.
The human-written config should be expanded to the machine-read config either a) outside the workflow, before invocation or b) very early in the workflow's initialization (probably more practical). This can be via CUE or via other means.
Even if (b), the workflow should still be able to take a fully-expanded config generated by other means (e.g. custom programs, à la how Augur is fine with custom generated node data files).

A concise and expressive syntax such as globbing seems easier to explain/teach with the "small multiples" approach: the key concept is that the concise syntax is expanded to the collection of small configs, and this expansion can be previewed in advance of actually running the workflow.
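A sketch of what that expansion step could look like, assuming shell-style glob patterns as keys in the human-written config and "later patterns win" as the precedence rule. Both are assumptions made for this example, not the globbing syntax proposed elsewhere; `expand_globs` and all values are hypothetical.

```python
import copy
import json
from fnmatch import fnmatchcase


def expand_globs(build_names, concise):
    """Expand {pattern: partial_config} into {"builds": {name: full_config}}.

    Patterns are applied in the order written, so later (typically more
    specific) patterns override earlier ones for the builds they match.
    """
    expanded = {}
    for name in build_names:
        merged = {}
        for pattern, partial in concise.items():
            if fnmatchcase(name, pattern):
                for section, params in partial.items():
                    merged.setdefault(section, {}).update(copy.deepcopy(params))
        expanded[name] = merged
    return {"builds": expanded}


build_names = [
    "avian-flu/h5n1/ha/all-time",
    "avian-flu/h5n1/ha/2y",
    "avian-flu/h5n1/mp/all-time",
]

# Human-written config: concise, with placeholder values.
concise = {
    "avian-flu/*": {"filter": {"group_by": "region year", "min_length": 1000}},
    "avian-flu/*/2y": {"filter": {"min_date": "2023-01-01"}},
    "avian-flu/*/mp/*": {"filter": {"min_length": 900}},
}

machine_config = expand_globs(build_names, concise)

# Preview the machine-read config before running anything.
print(json.dumps(machine_config, indent=2))
```

Note that `*` in `fnmatch` also matches `/`, which is what lets `avian-flu/*` cover every build here. The printed JSON is the preview: a user can check which small config each build ends up with without needing to learn the expansion rules first.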

huddlej · 2024-12-11T21:01:00Z

We discussed CUE a bit back in Jan 2022 and I ended up testing CUE for seasonal flu's config, but the consensus at the time was summarized by @rneher's comment of "I don't think we should get to hung up on how to generate configs."

tsibley · 2024-12-16T21:47:40Z

@huddlej Thanks for digging up that previous discussion and example! I'd forgotten about that (and it's interesting to look at other examples in that Slack thread). I was advocating for a "small multiples" approach then too:

When you push up multi-config out of the config itself, you open up all sorts of possibilities for how to produce those configs (CUE, Python, copy/paste, whatever). and don't force someone to learn the bespoke hardcoded expansion rules or behaviour

In response to:

"I don't think we should get to hung up on how to generate configs."

I see the "small multiples" approach as intentionally not getting hung up on how configs are generated by making it possible to generate/produce them many ways. The alternative of a single complex config with bespoke composition methods more easily leads IMO to getting hung up on exactly what you can and can't compose and how. All this said though, I (still) disagree that "sweating the details" in the context of improving usability is "getting hung up on" them.
