Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditionally create the pool if non-existing #67

Merged
merged 29 commits into from
Oct 15, 2024
Merged

Conversation

gvegayon
Copy link
Member

@gvegayon gvegayon commented Oct 9, 2024

This pull request includes several enhancements and modifications to the GitHub Actions workflow file .github/workflows/1_pre-Test-Model-Image-Build.yaml. The changes focus on improving the build process, adding commit message handling, and enhancing the Azure Batch pool creation logic.

Azure Batch pool creation:

  • Added a conditional step to check if the Azure Batch Pool already exists before attempting to create it, preventing redundant pool creation.
  • Pools can be re-built via commit message by passing the tag [re-build pool].
  • Pools can be deleted via commit message by passing adding the tag [delete pool]. Will only work if a pool with the branch name was created.
  • Tags [re-build pool] and [delete pool] can be combined.

Important

Trying to pre-clean the job may make the workflow fail b/c the pool takes longer to be deleted than the time between submitting the deletion instruction and trying to build it. Pools are deleted only at the end step (if found), so deletion cannot happen in the first commit to the branch.

Enhancements to build process:

  • Added name attributes to various jobs for better identification in the workflow runs. (jobs.Job01-build_image_dependencies, jobs._01_build-model-image, jobs._02_create-batch-pool-and-submit-jobs) [1] [2] [3]

Commit message handling:

  • Added commit-msg output to capture the latest commit message and pass it between jobs. [1] [2] [3]

@gvegayon gvegayon linked an issue Oct 9, 2024 that may be closed by this pull request
Copy link

codecov bot commented Oct 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

📢 Thoughts on this report? Let us know!

@gvegayon gvegayon changed the title First run with pool show Conditionally create the pool if non-existing Oct 9, 2024
@gvegayon gvegayon marked this pull request as ready for review October 9, 2024 05:22
@gvegayon gvegayon requested review from zsusswein and jkislin October 9, 2024 05:22
@gvegayon
Copy link
Member Author

@zsusswein and @natemcintosh, this should fix the errors you are getting in other PRs (or reduce them!)

@zsusswein
Copy link
Collaborator

Will get to this today! Got caught up in production yesterday

@jkislin
Copy link
Contributor

jkislin commented Oct 11, 2024

@gvegayon this is tremendous. Thanks again! I don't have time to do a proper review until next week (I'd want to test and watch Azure Batch Explorer and Github Actions a few times, which I can do), but if all checks are passing and Zach and Nate think it looks good, I'll defer to them.

@natemcintosh
Copy link
Collaborator

This is going to make future scheduled work with Azure Batch so much simpler!

Looking beyond the scope of this PR, maybe we could generalize some of these steps. It quite likely that we'll need to do very similar things in the future for other repos, which would probably require a lot of very careful copy and paste. To reduce worry about getting something wrong in that process, maybe we could make a reusable Github action for doing these steps? Something like

- name: Refresh pool
  uses: CDCGov/refresh-pool@v1
  with: |
    arg1: ...
    arg2: ... 

That said, I've never created an action before, and don't know what the process might look like.

@zsusswein
Copy link
Collaborator

zsusswein commented Oct 11, 2024

Looking beyond the scope of this PR, maybe we could generalize some of these steps.

I like this idea, but think it should be beyond the scope of this PR. Nate, would you mind making this an issue instead?

EDIT: #72

Copy link
Collaborator

@zsusswein zsusswein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

George this PR rocks. It's awesome.

Getting a bit greedy with some additional asks:

  • How would this play with Clean up on PR close #62? I still want to leave that functionality for a separate PR but would it be as simple as adding an additional or clause to the pool delete condition? If it's harder, are there any changes that can be made here to make implementing it simpler?
  • How easy do you think it would be to add polling for pool deletion completion via az batch pool show? I think we'd want the runner to hang until deletion is successful so that the runner doesn't report success unless the pool is actually deleted. Feel free to shunt this to a new issue.
  • Would it be possible to have a bot leave a comment on the PR with the current state of the linked pool?
  • Can you add a description of your new functionality and how to trigger pool deletion/recreation to the readme?

Feel free to turn the above asks into new issues instead of doing them here. This PR is already going above and beyond.

@zsusswein
Copy link
Collaborator

I also want to flag for @natemcintosh that we should think about how to handle the case of the pool running a stale version of the tagged image. As in:

  1. We push branch feat-xyz
  2. Actions creates image cfa-epinow2-pipeline:feat-xyz
  3. Actions creates pool cfa-epinow2-pipeline:feat-xyz (which in the future will be linked to the image from (2) but is not yet pending Link built pool to image in ACR #59)
  4. We make some commits and push
  5. Actions updates image build and pushes new layers, pointing image with tag cfa-epinow2-pipeline:feat-xyz to the new layers
  6. We don't tear down the pool. On running a job, does the node in the pool from (3) spin up container layers from (2) or from (5)? If we want (5) do we need to tear down and rebuild?

I think this is unanswerable in our current setup (pending #59). So I'm mainly flagging this is a weak spot for us to keep an eye on.

@natemcintosh
Copy link
Collaborator

Good question. Once things get merged, I'd say we should try it out by making some change that we can easily check the output of.

@gvegayon
Copy link
Member Author

gvegayon commented Oct 15, 2024

@zsusswein, here are some answers to your questions:

  • How would this play with Clean up on PR close #62? I still want to leave that functionality for a separate PR but would it be as simple as adding an additional or clause to the pool delete condition? If it's harder, are there any changes that can be made here to make implementing it simpler?

I would add that as a separate workflow. I can deal with #62 after this is merged.

  • How easy do you think it would be to add polling for pool deletion completion via az batch pool show? I think we'd want the runner to hang until deletion is successful so that the runner doesn't report success unless the pool is actually deleted. Feel free to shunt this to a new issue.

I think this could be done, but it sounds like a separate action for CDCgov/cfa-actions! Will add the issue there. Mostly because it will involve writing a program that loops checking whether the pool deletion was complete, and that could be something useful for others.

  • Would it be possible to have a bot leave a comment on the PR with the current state of the linked pool?

Not with the current, but with the last run. It could be something similar to what's going on in CDCgov/cfa-actions#1. I suggest creating a separate issue for this.

  • Can you add a description of your new functionality and how to trigger pool deletion/recreation to the readme?

Sure. I'll add that to the readme (or a readme).

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
@zsusswein zsusswein enabled auto-merge (squash) October 15, 2024 23:20
@zsusswein zsusswein merged commit cb17a98 into main Oct 15, 2024
9 checks passed
@zsusswein zsusswein deleted the gvegayon-delete-pools branch October 15, 2024 23:26
@gvegayon
Copy link
Member Author

I was just adding a mermaid diagram, @zsusswein. Here it is:

flowchart LR

  START((Start))---DEPS_CACHED

  DEPS_CACHED{Deps<br>cached?}---|No|DEPS
  DEPS_CACHED---|Yes|IMG

  subgraph DEPS[Job01-build_image_dependencies]
    direction TB
    Dockerfile-dependencies---|Generates|DEPS_IMAGE[Dependencies<br>Image]
  end

  DEPS---IMG

  subgraph IMG[_01_build-model-image]
    direction TB
    Dockerfile---|Generates|PKG_IMG[Package<br>Image]
  end

  IMG---POOL

  subgraph POOL[_02_create-batch-pool-and-submit-jobs]
    direction TB

    POOL_EXISTS{Is the pool<br>up?}
    POOL_EXISTS---|No|CREATE_POOL[Create the pool]
    POOL_EXISTS---|Yes|DELETE_POOL{Commit includes<br>'delete pool'}
    DELETE_POOL---END_POOL((End))
    CREATE_POOL---END_POOL

  end
Loading

If you like it, I can PR it

@zsusswein
Copy link
Collaborator

Love it! Please go for the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete pools and images prior to recreating instead of failing
4 participants