Automate ingest and phylogenetic workflows #38

Merged: 9 commits merged on Apr 9, 2024

@j23414 j23414 commented Apr 5, 2024

Description of proposed changes

Coordinated with @joverlee521 to copy commits from zika PR: nextstrain/zika#52

Adds a single GH Action workflow to automate the ingest and phylogenetic workflows, set to run daily at the same time as the automated mpox ingest.

Uses the GH Actions cache to store a hash of the ingest results' `Metadata.sha256sum` values, which `upload-to-s3` adds to the S3 object metadata. If the cache contains a match from a previous run of the GH Action workflow, the workflow skips the phylogenetic job.

See commits for details.

Related issue(s)

Based on discussion in nextstrain/pathogen-repo-guide#25

Checklist

The manual run completed successfully, although it does not push to the live site because the output filenames lack the "_genome" suffix:

nextstrain remote upload s3://nextstrain-data auspice/dengue_all.json auspice/dengue_denv1.json auspice/dengue_denv2.json auspice/dengue_denv3.json auspice/dengue_denv4.json
        
Uploading auspice/dengue_all.json as dengue_all.json
Uploading auspice/dengue_denv1.json as dengue_denv1.json
Uploading auspice/dengue_denv2.json as dengue_denv2.json
Uploading auspice/dengue_denv3.json as dengue_denv3.json
Uploading auspice/dengue_denv4.json as dengue_denv4.json

Currently just runs the ingest workflow and uploads the results
to AWS S3. Subsequent commits will add automation for the phylogenetic
workflow.

Copied commit from Zika PR #52

nextstrain/zika@d44f2ae
@j23414 j23414 changed the title Automate workflows Automate ingest and phylogenetic workflows Apr 5, 2024
j23414 added 4 commits April 5, 2024 08:26
The phylogenetic workflow will run after the ingest workflow has
completed successfully to use the latest available data.

Subsequent commits will check if the ingest results included new
data to only run the phylogenetic workflow when there's new data.

Copied commit from Zika PR #52

nextstrain/zika@2c415e7
Uses GitHub Actions cache to store a file that contains the
`Metadata.sha256sum` of the ingest files on S3 and uses
the `hashFiles` function to create a unique cache key.
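
To make the idea concrete, here is a minimal Python sketch of how a cache key derived from the ingest checksums can gate the phylogenetic job. It is an illustration only, not the workflow's actual implementation: the function names, the `ingest-results-` prefix, and the use of a plain SHA-256 digest as a stand-in for `hashFiles` are all assumptions.

```python
import hashlib
from collections.abc import Iterable


def ingest_cache_key(checksums: Iterable[str], prefix: str = "ingest-results-") -> str:
    """Build a key that changes whenever any ingest file checksum changes.

    `checksums` stands in for the Metadata.sha256sum values read from S3.
    The prefix is a hypothetical name chosen for this sketch.
    """
    # Mimic writing the checksums to a single file, one per line.
    file_contents = "".join(f"{value}\n" for value in checksums).encode("utf-8")
    # Stand-in for hashFiles(): any stable digest of the file contents works here.
    return prefix + hashlib.sha256(file_contents).hexdigest()


def should_run_phylogenetic(key: str, existing_keys: set[str]) -> bool:
    """Run the phylogenetic job only when no earlier run stored the same key."""
    return key not in existing_keys
```

If the checksums are unchanged between runs, the key is identical, the lookup hits, and the phylogenetic job can be skipped.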

The existence of the cache key then indicates that the ingest
file contents have not been updated since a previous run on GH Actions.
This does come with a big caveat: GH will remove any cache entries
that have not been accessed in over 7 days.¹ If the workflow is not
run automatically at least once every 7 days, the cache entry will be
gone and the phylogenetic job will always run.
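
The eviction caveat can be illustrated with a small Python sketch. It models only the 7-day rule described in this commit message; `cache_hit` is a hypothetical helper, and real eviction also depends on the repository's overall cache size limits, which are ignored here.

```python
from datetime import datetime, timedelta
from typing import Optional

# Entries not accessed in over 7 days are removed (per the caveat above).
EVICTION_WINDOW = timedelta(days=7)


def cache_hit(last_accessed: Optional[datetime], now: datetime) -> bool:
    """Return True if an entry last accessed at `last_accessed` would still exist."""
    if last_accessed is None:
        # The key was never stored, so there is nothing to hit.
        return False
    return now - last_accessed <= EVICTION_WINDOW
```

With a daily schedule the entry is touched every day and stays alive. With a gap of more than 7 days the lookup misses, so the phylogenetic job runs even when the ingest data is unchanged.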

If this works well, then we may want to consider moving this within
the `pathogen-repo-build` reusable workflow to have the same
functionality across pathogen automation workflows.

¹ https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy

Copied commit from Zika PR #52

nextstrain/zika@eb5e76d
Add individual inputs per workflow to override the default Docker image
used by `nextstrain build`. Having this input has been extremely helpful
for keeping pathogen workflows running when we hit new bugs that
are not present in older nextstrain-base images.

There are separate image inputs for the two workflows because they use
different tools and may require different versions of images.
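
As a rough sketch of the override behaviour, the Python below resolves one image per workflow and falls back to a default when an input is left blank. The function name, parameter names, and the `<default image>` placeholder are hypothetical and are not taken from the workflow file.

```python
def resolve_images(
    ingest_image: str = "",
    phylogenetic_image: str = "",
    default: str = "<default image>",  # placeholder, not a real image name
) -> dict[str, str]:
    """Pick the Docker image for each workflow.

    An empty string models a manual-run input that was left blank.
    """
    return {
        "ingest": ingest_image or default,
        "phylogenetic": phylogenetic_image or default,
    }
```

Keeping the two inputs separate means one workflow can be pinned to an older image without affecting the other.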

Copied commit from Zika PR #52

nextstrain/zika@65a8acc
Copied daily schedule of mpox ingest
https://github.com/nextstrain/mpox/blob/e439235ff1c1d66e7285b774e9536e2896d9cd2f/.github/workflows/fetch-and-ingest.yaml#L4-L21

Daily runs seem fine since the ingest workflow currently takes less
than 2 minutes to complete and it will not trigger the phylogenetic
workflow if there's no new data.

We can bring this down to once a week if it seems like overkill.

Copied commit from Zika PR #52

nextstrain/zika@77ca1d4
@j23414 j23414 force-pushed the automate-workflows branch from 642a310 to 795546d Compare April 5, 2024 15:26
@j23414 j23414 requested a review from a team April 5, 2024 20:05
@joverlee521

Yay, the test run completed with plenty of time to spare for the phylogenetic workflow 🎉

> The manual run completed successfully, although it does not push to the live site because the output filenames lack the "_genome" suffix

I know this will be fixed in #18, but I would just update the phylo outputs with the hardcoded _genome filename in this PR so that the automated builds get surfaced through nextstrain.org.
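
For illustration, the renaming suggested here could look like the following Python sketch. The helper name is hypothetical, and the real fix would change the output filenames in the phylogenetic workflow itself rather than rename files afterwards.

```python
from pathlib import PurePosixPath


def with_genome_suffix(path: str) -> str:
    """Insert "_genome" before the file extension, unless it is already there."""
    p = PurePosixPath(path)
    if p.stem.endswith("_genome"):
        return str(p)
    return str(p.with_name(f"{p.stem}_genome{p.suffix}"))
```

Applied to the uploads listed in the description, `auspice/dengue_all.json` becomes `auspice/dengue_all_genome.json`, and likewise for the four serotype files.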

@joverlee521

FYI, after the initial trigger with `pull_request` you should be able to trigger the workflow manually with:

gh workflow run ingest-to-phylogenetic.yaml --ref automate-workflows


@joverlee521 joverlee521 left a comment

Changes look good to me 👍

A couple things to follow up on before merging:

@j23414 j23414 force-pushed the automate-workflows branch 2 times, most recently from 9839842 to 02504d6 Compare April 9, 2024 23:23
@j23414 j23414 merged commit df504f3 into main Apr 9, 2024
32 checks passed
@j23414 j23414 deleted the automate-workflows branch April 9, 2024 23:32
@joverlee521

The dengue workflow is now showing up on the pathogen workflow status page

[Screenshot 2024-04-10 at 10:55:24 AM]

joverlee521 added a commit that referenced this pull request Apr 11, 2024
I missed this in review of #38.
The phylogenetic workflow was still pulling from old S3 URLs
and not the ingest workflow output data.

This commit corrects the S3 URL to the ingest output files and updates
the `strain_id_field` config param to use the appropriate ID column
from the ingest output.
@joverlee521 joverlee521 mentioned this pull request Apr 11, 2024
1 task
@j23414 j23414 mentioned this pull request May 23, 2024
4 tasks