Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch flux adapter to enable pickling to save study state and update it's resource/uri handling #415

Merged
merged 21 commits into from
Aug 22, 2023

Conversation

jwhite242
Copy link
Collaborator

@jwhite242 jwhite242 commented Mar 8, 2023

Casts flux's JobID objects into string form to enable the dag to be pickled when conductor saves state. JobID objects are tied to flux's use of cffi and are not picklable themselves.

Base58 string form is used instead of the integer it's derived from so that logging is consistent with the view users have on the commandline with flux reporting job id's in that format. This requires a separate re-instantiation of JobID objects upon waking to check the job status'.

Additional fixes:

  • Improve the flexibility of resource specification in the flux adapter to align with the other adapters (i.e. not requiring nodes be specified).
  • Change uri handling so env vars are not the only way to schedule jobs
  • Updates to get working on flux only machines that require no bootstrapping or uri
    • Changes job submit call to be non-waitable by default on adapter 0.49 and newer as only owner of instance can do so (which is rarely the case on flux managed machines)
  • Adds documentation and recipes for how to use Maestro with flux machines and flux instances running on non-flux scheduled machines
  • Pins poetry version to 1.4.2 in CI to maintain support for python 3.7
  • Fixes parsing of flux broker version to properly handle non-release versions of flux and logs that version into the flux batch script header comments and the maestro logs.

Resolves #416

@jwhite242 jwhite242 self-assigned this Mar 8, 2023
@jwhite242 jwhite242 added Bugfix Discussion/Implementation of a solution to a bug (usually a PR) Adapters Items that are directly related to Maestro's adapter structure. labels Mar 8, 2023
@vsoch
Copy link

vsoch commented Jun 17, 2023

@jwhite242 I am using this branch for container builds for the examples I'm working on - do you want any feedback / help / what is the plan for merging this in (and then I can fall back to using the main branch!)

@jwhite242 jwhite242 force-pushed the bugfix/flux_adapter_pickling branch from de390f6 to bbec50b Compare August 2, 2023 00:00
@jwhite242 jwhite242 marked this pull request as ready for review August 2, 2023 23:52
@jwhite242
Copy link
Collaborator Author

@jwhite242 I am using this branch for container builds for the examples I'm working on - do you want any feedback / help / what is the plan for merging this in (and then I can fall back to using the main branch!)

@vsoch Apologies for letting this languish for a bit. Should be merging this one soon now that the last blocker for submitting jobs so flux managed systems has been removed. Hopefully the updated docs also help get non-flux experts up and running a little quicker too; the docs you've been adding to new flux docs pages on this have been super helpful there as well! If you see anything missing in the docs here let me know.

@jwhite242 jwhite242 merged commit 194f2c2 into LLNL:develop Aug 22, 2023
11 checks passed
jwhite242 added a commit that referenced this pull request Dec 12, 2023
1.1.10 Release (#432)

* Sync up read the docs config with dev environments using poetry (#399)
* Print usage on command line when no args are provided (#404)
* Add sacct fallback to slurm adapter to improve robustness of job tracking (#405)
* Update Flurm Job State mappings for flux versions >= 0.26 (#407)
* Bump certifi from 2021.10.8 to 2022.12.7 to address security issue (#409)
* Bump cryptography from 37.0.1 to 38.0.3 to address security issue (#410)
* Add missing shbang in unscheduled scripts from lsf adapter (#411)
* Update poetry lockfile to address dependabot flagged security issues (#412)
* Fix for Dockerfile smell DL3006 (#418)
* Port Maestro documentation to mkdocs and expand coverage of features and tutorials (#403)
* Update version info to be driven from pyproject.toml exclusively, and hook up to command line (#419)
* Pin mermaid to < 10.x due to api change (#422)
* Bump lock file certifi from 2022.12.7 to 2023.7.22 to address security issue (#426)
* Refactor flux adapter to avoid using pickle to talk to flux brokers installed in external environments (#415)
   Also adds flux integration tests to exercise against real flux brokers
* Add pager functionality to status command (#420)
* Patch broken flux job cancellation (#428)
* Insulate slurm adapters from user customization of squeue and sacct output formats (#431)
   Also adds live unit and integration tests for slurm adapter

---------

Co-authored-by: Francesco Di Natale <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Bruno P. Kinoshita <[email protected]>
Co-authored-by: Charles Doutriaux <[email protected]>
Co-authored-by: Giovanni Rosa <[email protected]>
Co-authored-by: Brian Gunnarson <[email protected]>
jwhite242 added a commit that referenced this pull request Feb 6, 2024
1.1.10 Release (#432)

* Sync up read the docs config with dev environments using poetry (#399)
* Print usage on command line when no args are provided (#404)
* Add sacct fallback to slurm adapter to improve robustness of job tracking (#405)
* Update Flurm Job State mappings for flux versions >= 0.26 (#407)
* Bump certifi from 2021.10.8 to 2022.12.7 to address security issue (#409)
* Bump cryptography from 37.0.1 to 38.0.3 to address security issue (#410)
* Add missing shbang in unscheduled scripts from lsf adapter (#411)
* Update poetry lockfile to address dependabot flagged security issues (#412)
* Fix for Dockerfile smell DL3006 (#418)
* Port Maestro documentation to mkdocs and expand coverage of features and tutorials (#403)
* Update version info to be driven from pyproject.toml exclusively, and hook up to command line (#419)
* Pin mermaid to < 10.x due to api change (#422)
* Bump lock file certifi from 2022.12.7 to 2023.7.22 to address security issue (#426)
* Refactor flux adapter to avoid using pickle to talk to flux brokers installed in external environments (#415)
   Also adds flux integration tests to exercise against real flux brokers
* Add pager functionality to status command (#420)
* Patch broken flux job cancellation (#428)
* Insulate slurm adapters from user customization of squeue and sacct output formats (#431)
   Also adds live unit and integration tests for slurm adapter

---------

Co-authored-by: Francesco Di Natale <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Bruno P. Kinoshita <[email protected]>
Co-authored-by: Charles Doutriaux <[email protected]>
Co-authored-by: Giovanni Rosa <[email protected]>
Co-authored-by: Brian Gunnarson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Adapters Items that are directly related to Maestro's adapter structure. Bugfix Discussion/Implementation of a solution to a bug (usually a PR)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflow examples with Flux?
2 participants