-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pathogen-repo-build: Wait for AWS Batch jobs to finish #54
pathogen-repo-build: Wait for AWS Batch jobs to finish #54
Conversation
275daac
to
7864f50
Compare
4e26422
to
fbff47b
Compare
7803f9c
to
045c171
Compare
No harm in describing it in the README, esp. as other development programs are to be added imminently.
f93f183
to
92458ee
Compare
92458ee
to
323e376
Compare
I'll review this when I'm back from vacation, but no problem if it is merged before then. |
49247cb
to
b8383d3
Compare
…hored input YAML This allows us to author the workflow using YAML anchors/references (especially with merge keys) since GitHub Actions doesn't otherwise support those YAML features. The lack of support is a shame because GitHub Actions workflows can be very repetitive and anchors/references/merges are a decent solution to that. I'm about to add substantial conceptual replication that we won't want to maintain concretely replicated in the file. Since the generated file must be checked in, a new CI step ensures the generated file matches the authored file and the generated file is excluded from git diffs by default. An optional local pre-commit hook is also available for making sure you craft commits that won't run afoul of the CI check later. Or you can, as necessary, run `make` manually before committing.
This lets us more easily test it in development.
This avoids the disconnect between the success/failure status of a GitHub Actions workflow run and the actual AWS Batch job, which makes for easier reporting and debugging and generally less cognitive dissonance. GitHub Actions workflow _runs_ have a very high max timeout of 35 days, but each _job_ in a workflow has a much lower max timeout of 6 hours. Many of our builds should be less than 6 hours¹, but here we support builds up to 24 hours by chaining together 4 GitHub Actions jobs. We can add more jobs to the chain if we need to, but I don't foresee that. Nearly all of our builds are in public repos, which means they won't consume usage minutes from our GitHub Actions quota. However, they _will_ consume concurrency limits from the quota. We were already frequently bumping into the default free-tier quota of 20 concurrent jobs, so adding more long-running jobs (to wait around for the AWS Batch jobs) was a nonstarter until we upgraded to a Team plan with its corresponding quota of 60 concurrent jobs.² ¹ As of 24 Jan, all Batch jobs except one in the prior week were sub 12 hours. The exception was the GISAID ncov-ingest job launched on 16 Jan. Next longest jobs were 10.5 hours, all GenBank ncov-ingest jobs. <nextstrain/private#95 (comment)> ² <https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1705024616778389> <https://github.com/tsibley/blab-standup/blob/HEAD/2023-08-22.md>
b8383d3
to
57eb5e5
Compare
Repushed to fix a linting issue (but not bug per se) that shellcheck in CI found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool use of YAML anchors and references, I will need to spend more time to become familiar with it. I was able to successfully run make
and use the pre-commit hook.
The actual changes to the pathogen-repo-build workflow looks good by inspection. I think reasonable to test it out with the ncov-ingest workflows next?
In our discussion yesterday about standardizing GH Actions for pathogen repos, @jameshadfield raised a good question about whether this will work with your proposed pathogen-repo-automation workflow...My gut feeling is it will not work, but will have to try it out.
I assume you're referring to this bit in yesterday's meeting notes?
I left a comment in the meeting notes, but since those can get lost, I'll copy it here: As far as I understand it, there is no "calling job", per se. For a reusable workflow, the calling job exists only in notation. At execution time, the called workflow's jobs become part of the caller's jobs. Or perhaps you might say the called workflow's jobs replace the caller job. If you look at the GitHub Actions web UI or the data model exposed by the GitHub API, this understanding lines up. |
I agree testing with ncov-ingest or something like that seems like a good next step. Next week, I'll merge this and arrange for that testing. |
Hmm, actually, ncov-ingest and ncov aren't already using this Merging, as I suggested quickly last Friday, will cause all callers of this workflow to immediately start using the new version in this PR. So either we can test in production by merging, fixing anything that crops up… which is probably fine but maybe also breaks the world. Or I can temporarily modify some subset of callers to point to the workflow on this branch instead and start with more limited testing of real work loads. I'll do that. |
Extracted some data from calling workflows. |
…workflow Using mpox here to test changes to how we have GitHub Actions interact with AWS Batch, in advance of merging and releasing those changes to every pathogen repo. "Testing in production, just not all at once." Related-to: <nextstrain/.github#54>
I modified mpox to use the workflow from this branch. It works for CI. Then I triggered a manual (trial run) test: https://github.com/nextstrain/mpox/actions/runs/7879309419/job/21499345399 We'll also see what happens tomorrow after daily ingest is run. |
Ingest worked this morning. |
Merging this today so it's used for all workflows tomorrow morning. Things look good with mpox, so let's go. |
…ld.yaml workflow" This reverts commit 34feb4a. After successful runs, the preview version is now merged to master, so switch back. Related-to: <nextstrain/.github#54>
Changed the mpox workflows back: nextstrain/mpox@f608152 |
Before this point, the status of a GitHub Actions workflow run isn't reflective of the the actual AWS Batch job outcome. Related-to: <nextstrain/.github#54>
Before this point, the status of a GitHub Actions workflow run isn't reflective of the the actual AWS Batch job outcome. Related-to: <nextstrain/.github#54>
Before this point, the status of a GitHub Actions workflow run isn't reflective of the the actual AWS Batch job outcome. Related-to: <nextstrain/.github#54>
Before this point, the status of a GitHub Actions workflow run isn't reflective of the the actual AWS Batch job outcome. Related-to: <nextstrain/.github#54>
This avoids the disconnect between the success/failure status of a GitHub Actions workflow run and the actual AWS Batch job, which makes for easier reporting and debugging and generally less cognitive dissonance.
GitHub Actions workflow runs have a very high max timeout of 35 days, but each job in a workflow has a much lower max timeout of 6 hours. Many of our builds should be less than 6 hours¹, but here we support builds up to 24 hours by chaining together 4 GitHub Actions jobs. We can add more jobs to the chain if we need to, but I don't foresee that.
Nearly all of our builds are in public repos, which means they won't consume usage minutes from our GitHub Actions quota. However, they will consume concurrency limits from the quota. We were already frequently bumping into the default free-tier quota of 20 concurrent jobs, so adding more long-running jobs (to wait around for the AWS Batch jobs) was a nonstarter until we upgraded to a Team plan with its corresponding quota of 60 concurrent jobs.²
¹ As of 24 Jan, all Batch jobs except one in the prior week were sub 12 hours. The exception was the GISAID ncov-ingest job launched on 16 Jan. Next longest jobs were 10.5 hours, all GenBank ncov-ingest jobs. https://github.com/nextstrain/private/issues/95#issuecomment-1909155013
² https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1705024616778389 https://github.com/tsibley/blab-standup/blob/HEAD/2023-08-22.md
See also the commit messages of supporting commits leading up to the final commit (above).
Checklist