
Pipeline principles and best-practice #214

Closed
SamuelBrand1 opened this issue May 9, 2024 · 4 comments
Labels: help wanted, pipeline

Comments

@SamuelBrand1 (Collaborator)

There are a number of potential problems with the analysis pipeline as currently defined in PR #213.

This issue aims to bring together a best-practice guide for building an analysis pipeline:

Tensions:

  • Re-use vs safety. To me, re-using a piece of code from outside the inference loop seems less safe. For example, reusing an EpiData object rather than re-declaring it with parameters from inside the loop over specifications seems to open up an opportunity for bugs such as accidentally passing in a mis-specified object (see the sketch after this list).
  • Saving vs declaring. When should objects be saved to disk and reloaded, and when should the script that created them be rerun?
  • Use of checkpointing. I assume that checkpointing should only occur in the heavily computational parts of the code. Otherwise, you are just creating an opportunity for the wrong variables to get semi-permanently stuck in the pipeline. What am I missing?
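
As a rough illustration of the re-use vs safety point, here is a minimal sketch with stand-in names (not the actual EpiData constructor or EpiAware inference calls):

```julia
# Hypothetical sketch (stand-in names, not the actual EpiAware API) of the
# "re-declare inside the loop" pattern described in the first bullet.
struct ModelConfig
    gen_int::Vector{Float64}
    transformation::Function
end

# Dummy stand-in for an inference call.
run_inference(config::ModelConfig) = sum(map(config.transformation, config.gen_int))

specifications = [(gen_int = [0.2, 0.5, 0.3],), (gen_int = [0.1, 0.6, 0.3],)]

# Safer: each iteration constructs its own config directly from the
# specification it is handling, so a mis-specified object declared outside the
# loop cannot leak in.
results = map(specifications) do spec
    config = ModelConfig(spec.gen_int, exp)
    run_inference(config)
end

# Riskier alternative: declare one shared config outside the loop and reuse it
# across specifications, which is the opportunity for bugs described above.
```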

Does anyone have a link to a decent analysis-pipeline best-practice guide?

SamuelBrand1 added the help wanted and pipeline labels May 9, 2024
@seabbs (Collaborator) commented May 9, 2024

Re-use vs safety. To me, re-using a piece of code from outside the inference loop seems less safe. For example, reusing an EpiData object rather than re-declaring it with parameters from inside the loop over specifications seems to open up an opportunity for bugs such as accidentally passing in a mis-specified object.

If this is true, why do we feel comfortable designing EpiAware around these kinds of structs?

@seabbs (Collaborator) commented May 9, 2024

Some notes:

I think the first thing to ask is what we want from a good pipeline:

  • Logical structure
  • Limited computational waste
  • When we change something, anything that depends on it should update
  • Ability to freely scale backend compute
  • Ability to add new components without rerunning large chunks of the code base
  • Minimal pipeline infrastructure, i.e. something close to what a user of the underlying package would expect to see

We could do much of this manually, but ideally we would have tools that cover as much of it as possible.
Examples include make (which handles the structure, roughly, and caching), targets in R (which does all of it, depending on how much work you put in), and Snakemake and Nextflow (which again do all of it).

From what we have seen of the Julia ecosystem, it looks like none of the tools do all the things we might like, but some combination of DrWatson, Dagger, JobScheduler, and Pipelines might.
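
As a rough illustration only (none of these tools has been chosen yet), a Dagger.jl-style task graph with stand-in functions might look like:

```julia
# A rough sketch, assuming Dagger.jl, of expressing the pipeline as a task
# graph: each step only depends on its declared inputs, and independent steps
# can be scheduled across whatever compute is available. Stand-in functions,
# not EpiAware code.
using Dagger

load_data()     = rand(100)                 # stand-in for data ingestion
fit_model(data) = sum(data) / length(data)  # stand-in for inference
summarise(fit)  = "posterior mean ≈ $fit"   # stand-in for post-processing

data   = Dagger.@spawn load_data()
fit    = Dagger.@spawn fit_model(data)      # depends on `data`
report = Dagger.@spawn summarise(fit)       # depends on `fit`

println(fetch(report))  # forces the graph to execute
```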

@seabbs (Collaborator) commented May 9, 2024

I assume that checkpointing should only occur in the heavily computational parts of the code. Otherwise, you are just creating an opportunity for the wrong variables to get semi-permanently stuck in the pipeline. What am I missing?

I think good checkpointing (which may not be what we actually have) is aware of all the things that go into it, so that it can't "get stuck with the wrong things in it" - this is how things work in targets or a well-designed make pipeline, for example.

@seabbs (Collaborator) commented May 9, 2024

Saving vs declaring. When should objects be saved to disk and reloaded, and when should the script that created them be rerun?

I think this is part of the issue, but the main problem with this approach is global variables.
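
A minimal sketch of the contrast, with dummy stand-in functions:

```julia
# Sketch of the hazard: in the "script" style, the save step depends on
# whatever happens to be bound in global scope, which may be stale if an
# earlier step was skipped. Stand-in functions, not pipeline code.
using Serialization

fit(spec) = sum(spec.values)  # dummy stand-in for an expensive fit

# Script style: `result` is a global; if the line that refreshes it is not
# rerun, the serialize call silently saves an out-of-date object.
result = fit((values = [1.0, 2.0],))
serialize("results_global.jls", result)

# Safer: each step is a function of explicit arguments, so the saved object
# can only come from the inputs the caller actually passed in.
function run_and_save(spec, path)
    out = fit(spec)
    serialize(path, out)
    return out
end

run_and_save((values = [3.0, 4.0],), "results_explicit.jls")
```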

CDCgov locked and limited conversation to collaborators May 10, 2024
SamuelBrand1 converted this issue into discussion #216 May 10, 2024
