
Pipeline principles and best-practice #214

Closed
SamuelBrand1 opened this issue May 9, 2024 · 4 comments
Labels: help wanted, pipeline

Comments

@SamuelBrand1 (Collaborator)

There are a number of potential problems with the analysis pipeline as currently defined in PR #213.

This issue aims to bring together a best-practice guide for building an analysis pipeline:

Tensions:

  • Re-use vs safety. To me, re-using a piece of code from outside the inference loop seems less safe. For example, reusing an EpiData object rather than re-declaring it with parameters from inside the loop over specifications seems to open up an opportunity for bugs such as accidentally passing in a mis-specified object (see the sketch after this list).
  • Saving vs declaring. When should objects be saved to disk and reloaded, and when should the script that created them be rerun?
  • Use of checkpointing. I assume that checkpointing should only occur in the heavily computational parts of the code. Otherwise, you are just creating an opportunity for the wrong variables to get semi-permanently stuck in the pipeline. What am I missing?
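
As a rough illustration of the re-use vs safety point, here is a minimal sketch with stand-in names (not the actual EpiData constructor or EpiAware inference calls):

```julia
# Hypothetical sketch (stand-in names, not the actual EpiAware API) of the
# "re-declare inside the loop" pattern described in the first bullet.
struct ModelConfig
    gen_int::Vector{Float64}
    transformation::Function
end

# Dummy stand-in for an inference call.
run_inference(config::ModelConfig) = sum(map(config.transformation, config.gen_int))

specifications = [(gen_int = [0.2, 0.5, 0.3],), (gen_int = [0.1, 0.6, 0.3],)]

# Safer: each iteration constructs its own config directly from the
# specification it is handling, so a mis-specified object declared outside the
# loop cannot leak in.
results = map(specifications) do spec
    config = ModelConfig(spec.gen_int, exp)
    run_inference(config)
end

# Riskier alternative: declare one shared config outside the loop and reuse it
# across specifications, which is the opportunity for bugs described above.
```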

Does anyone have a link to a decent analysis-pipeline best-practice guide?

SamuelBrand1 added the help wanted and pipeline labels May 9, 2024
@seabbs (Collaborator) commented May 9, 2024

Re-use vs safety. To me, re-using a piece of code from outside the inference loop seems less safe. For example, reusing an EpiData object rather than re-declaring it with parameters from inside the loop over specifications seems to open up an opportunity for bugs such as accidentally passing in a mis-specified object.

If this is true, why do we feel comfortable designing EpiAware around these kinds of structs?

@seabbs (Collaborator) commented May 9, 2024

Some notes:

I think the first thing to ask is what we want from a good pipeline:

  • Logical structure
  • Limited computational waste
  • When we change something, anything that depends on it should update
  • Ability to freely scale backend compute
  • Ability to add new components without rerunning large chunks of the code base
  • Minimal pipeline infrastructure, i.e. something close to what a user of the underlying package would expect to see

We could do much of this manually, but ideally we would have tools that cover as much of it as possible.
Examples include make (which handles the structure, roughly, and caching), targets in R (which does all of it, depending on how much work you put in), and Snakemake and Nextflow (which again do all of it).

From what we have seen of the Julia ecosystem, it looks like none of the tools do all the things we might like, but some combination of DrWatson, Dagger, JobScheduler, and Pipelines might.
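
As a rough illustration only (none of these tools has been chosen yet), a Dagger.jl-style task graph with stand-in functions might look like:

```julia
# A rough sketch, assuming Dagger.jl, of expressing the pipeline as a task
# graph: each step only depends on its declared inputs, and independent steps
# can be scheduled across whatever compute is available. Stand-in functions,
# not EpiAware code.
using Dagger

load_data()     = rand(100)                 # stand-in for data ingestion
fit_model(data) = sum(data) / length(data)  # stand-in for inference
summarise(fit)  = "posterior mean ≈ $fit"   # stand-in for post-processing

data   = Dagger.@spawn load_data()
fit    = Dagger.@spawn fit_model(data)      # depends on `data`
report = Dagger.@spawn summarise(fit)       # depends on `fit`

println(fetch(report))  # forces the graph to execute
```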

@seabbs (Collaborator) commented May 9, 2024

I assume that checkpointing should only occur in the heavily computational parts of the code. Otherwise, you are just creating an opportunity for the wrong variables to get semi-permanently stuck in the pipeline. What am I missing?

I think good checkpointing (which may not be what we actually have) is aware of all the things that go into it, so that it can't "get stuck with the wrong things in it" - this is how things work in targets or a well-designed make pipeline, for example.

@seabbs (Collaborator) commented May 9, 2024

Saving vs declaring. When should objects be saved to disk and reloaded, and when should the script that created them be rerun?

I think this is part of the issue, but the main problem with this approach is global variables.
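
A minimal sketch of the contrast, with dummy stand-in functions:

```julia
# Sketch of the hazard: in the "script" style, the save step depends on
# whatever happens to be bound in global scope, which may be stale if an
# earlier step was skipped. Stand-in functions, not pipeline code.
using Serialization

fit(spec) = sum(spec.values)  # dummy stand-in for an expensive fit

# Script style: `result` is a global; if the line that refreshes it is not
# rerun, the serialize call silently saves an out-of-date object.
result = fit((values = [1.0, 2.0],))
serialize("results_global.jls", result)

# Safer: each step is a function of explicit arguments, so the saved object
# can only come from the inputs the caller actually passed in.
function run_and_save(spec, path)
    out = fit(spec)
    serialize(path, out)
    return out
end

run_and_save((values = [3.0, 4.0],), "results_explicit.jls")
```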

CDCgov locked and limited conversation to collaborators May 10, 2024
SamuelBrand1 converted this issue into discussion #216 May 10, 2024
