feat: agent resource sync API #2180

Draft
wants to merge 6 commits into
base: topic/07-22-feat_schedule_function_returns_kernel-agent_binding

Conversation

fregataa
Member

@fregataa fregataa commented May 26, 2024

resolves #2142 https://github.com/lablup/giftbox/issues/262

Agent's sync-and-get-kernels() API

This API synchronizes an agent's kernels with the kernel information given in the API parameters (preparing_kernels, pulling_kernels, running_kernels, terminating_kernels). It treats the kernel information given in the parameters as the "truth".
If the kernel information in kernel_registry does not match running_kernels (or terminating_kernels), the agent injects a termination event to terminate the mismatched kernel.

The sync-and-get-kernels() API returns the actual { running, terminating, terminated } kernels (the return value is not used for now). actual_terminated_kernels contains kernels that the API parameter listed in running_kernels but that are already terminated.
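
The reconciliation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function signature, the `SyncResult` type, the use of plain string kernel IDs, and the exact mismatch rule (a kernel in the registry that the manager did not list at all gets a termination event) are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SyncResult:
    # Hypothetical container for the values the API is described as returning.
    actual_running: set[str] = field(default_factory=set)
    actual_terminating: set[str] = field(default_factory=set)
    actual_terminated: set[str] = field(default_factory=set)
    # Kernels for which the agent would inject a termination event.
    termination_events: set[str] = field(default_factory=set)


def sync_and_get_kernels(
    kernel_registry: set[str],
    preparing_kernels: set[str],
    pulling_kernels: set[str],
    running_kernels: set[str],
    terminating_kernels: set[str],
) -> SyncResult:
    """Treat the manager-provided kernel sets as the truth (assumed rule)."""
    result = SyncResult()
    expected = preparing_kernels | pulling_kernels | running_kernels | terminating_kernels
    for kernel_id in kernel_registry:
        if kernel_id in running_kernels:
            result.actual_running.add(kernel_id)
        elif kernel_id in terminating_kernels:
            result.actual_terminating.add(kernel_id)
        elif kernel_id not in expected:
            # The agent knows a kernel the manager does not: terminate it.
            result.termination_events.add(kernel_id)
    # Kernels the manager believes are running but the agent no longer has.
    result.actual_terminated = running_kernels - kernel_registry
    return result
```

For example, with a registry of {"a", "b", "x"}, running_kernels of {"a", "c"} and terminating_kernels of {"b"}, this sketch reports "c" as actually terminated and schedules a termination event for "x".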

How to use

  1. Call the POST /session/_/sync-agent-resource manager API.
  2. Set a config to trigger resource sync. It accepts any of [after-scheduling, before-kernel-creation, on-creation-failure].
  • on-creation-failure: Set by default. Calls resource sync when kernel creation fails with an InsufficientResource exception.
  • after-scheduling: Calls resource sync right after scheduling on a scaling group.
  • before-kernel-creation: Calls resource sync before calling the create-kernels agent API.
# manager.toml

[manager]
agent-resource-sync-trigger = ["after-scheduling"]
# Default is ["on-creation-failure"]
# agent-resource-sync-trigger = ["after-scheduling", "on-creation-failure"]
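
As a rough illustration of how the trigger setting above might be validated, the sketch below maps the three documented values onto an enum and falls back to the documented default. The enum name, the `parse_triggers` helper, and the assumption that the `[manager]` table arrives as a plain dict are hypothetical and are not taken from this PR.

```python
from __future__ import annotations

import enum


class AgentResourceSyncTrigger(enum.Enum):
    # The three values documented in the PR description.
    AFTER_SCHEDULING = "after-scheduling"
    BEFORE_KERNEL_CREATION = "before-kernel-creation"
    ON_CREATION_FAILURE = "on-creation-failure"


# Documented default: ["on-creation-failure"].
DEFAULT_TRIGGERS = (AgentResourceSyncTrigger.ON_CREATION_FAILURE,)


def parse_triggers(manager_config: dict) -> tuple[AgentResourceSyncTrigger, ...]:
    """Read `agent-resource-sync-trigger` from the (assumed) [manager] table."""
    raw = manager_config.get("agent-resource-sync-trigger")
    if raw is None:
        return DEFAULT_TRIGGERS
    # Enum lookup by value raises ValueError for an unknown trigger name.
    return tuple(AgentResourceSyncTrigger(value) for value in raw)


triggers = parse_triggers({"agent-resource-sync-trigger": ["after-scheduling"]})
should_sync_after_scheduling = AgentResourceSyncTrigger.AFTER_SCHEDULING in triggers
```

Rejecting unknown values at load time keeps a typo in manager.toml from silently disabling resource sync.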

Note

The on-creation-failure option cannot handle an ExceptionGroup that includes multiple InsufficientResource exceptions, which is raised when creation of a multi-kernel session fails. It covers only creation failures of single-kernel sessions.
This will be resolved after lablup/callosum#30 is merged.
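
For illustration only, here is one way a check could look once grouped exceptions reach the manager. It is a sketch under stated assumptions rather than the planned fix: `InsufficientResource` below is a local stand-in class, and the helper name is hypothetical. It relies only on the `exceptions` attribute that Python 3.11+ exception groups expose.

```python
class InsufficientResource(Exception):
    """Local stand-in for the real exception type (assumption)."""


def contains_insufficient_resource(exc: BaseException) -> bool:
    """Return True if exc is, or (recursively) wraps, an InsufficientResource."""
    if isinstance(exc, InsufficientResource):
        return True
    # Exception groups expose their members via `.exceptions`;
    # ordinary exceptions do not have the attribute, so this yields ().
    members = getattr(exc, "exceptions", ())
    return any(contains_insufficient_resource(member) for member in members)
```

With a check like this, a multi-kernel creation failure could trigger resource sync if any member of the group is an InsufficientResource error.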

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--2180.org.readthedocs.build/en/2180/


📚 Documentation preview 📚: https://sorna-ko--2180.org.readthedocs.build/ko/2180/

graphite-app bot commented May 26, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “flow:merge-queue” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “flow:hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@github-actions github-actions bot added comp:manager Related to Manager component comp:agent Related to Agent component comp:common Related to Common component size:L 100~500 LoC labels May 26, 2024
Member Author

fregataa commented May 26, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @fregataa and the rest of your teammates on Graphite

@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 50ab7d0 to 964c9f3 Compare May 26, 2024 14:32
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from e8d2106 to 5dd7136 Compare May 26, 2024 14:33
@fregataa fregataa changed the title feature: sync mismatch between db and containers feature: agent resource sync API May 26, 2024
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from 5dd7136 to 94d419d Compare May 28, 2024 07:18
@github-actions github-actions bot added area:docs Documentations comp:client Related to Client component comp:cli Related to CLI component comp:installer Related to Installer comp:storage-proxy Related to Storage proxy component require:db-migration Automatically set when alembic migrations are added or updated size:XL 500~ LoC and removed size:L 100~500 LoC labels May 28, 2024
@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 7ca0eb4 to 58002f0 Compare May 28, 2024 07:23
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from 94d419d to fdd53c9 Compare May 28, 2024 07:23
@github-actions github-actions bot added size:L 100~500 LoC and removed size:XL 500~ LoC labels May 28, 2024
@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 58002f0 to 09893dc Compare May 29, 2024 08:04
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from fdd53c9 to a771f22 Compare May 29, 2024 08:04
@fregataa fregataa force-pushed the topic/fix-context-indent-when-call-create-kernel branch from 5b49ca6 to 6830de1 Compare July 19, 2024 14:27
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from 808b99f to 466ff9d Compare July 19, 2024 14:27
Base automatically changed from topic/fix-context-indent-when-call-create-kernel to main July 19, 2024 14:37
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch 2 times, most recently from 5c5886a to fc28048 Compare July 19, 2024 15:44
@fregataa fregataa removed the require:db-migration Automatically set when alembic migrations are added or updated label Jul 20, 2024
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from fc28048 to bbbd88e Compare July 20, 2024 07:54
@fregataa fregataa changed the title feature: agent resource sync API feat: agent resource sync API Jul 20, 2024
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch 2 times, most recently from 0594b41 to ef6e3eb Compare July 22, 2024 09:53
@fregataa fregataa changed the base branch from main to topic/07-22-feat_schedule_function_returns_kernel-agent_binding July 22, 2024 09:53
@fregataa fregataa force-pushed the topic/07-22-feat_schedule_function_returns_kernel-agent_binding branch from e0d34df to 116fabd Compare August 1, 2024 07:37
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from ef6e3eb to e49073f Compare August 1, 2024 07:37
@fregataa fregataa force-pushed the topic/07-22-feat_schedule_function_returns_kernel-agent_binding branch from 116fabd to 0b69f29 Compare August 2, 2024 06:40
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from e49073f to 372049e Compare August 2, 2024 06:40
@fregataa fregataa force-pushed the topic/07-22-feat_schedule_function_returns_kernel-agent_binding branch from 0b69f29 to 2d124a5 Compare August 8, 2024 03:59
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from 372049e to caa2113 Compare August 8, 2024 04:00
@fregataa fregataa force-pushed the topic/07-22-feat_schedule_function_returns_kernel-agent_binding branch from 2d124a5 to fdca36a Compare August 10, 2024 13:48
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from caa2113 to 5cb8bc8 Compare August 10, 2024 13:48
@fregataa fregataa force-pushed the topic/07-22-feat_schedule_function_returns_kernel-agent_binding branch from fdca36a to 11f6c3f Compare August 25, 2024 06:23
@fregataa fregataa force-pushed the topic/05-23-feature_sync_mismatch_between_db_and_agent_kernels branch from 5cb8bc8 to d778970 Compare August 25, 2024 06:23
Labels
area:docs Documentations area:ux UI / UX issue. comp:agent Related to Agent component comp:cli Related to CLI component comp:client Related to Client component comp:common Related to Common component comp:installer Related to Installer comp:manager Related to Manager component comp:storage-proxy Related to Storage proxy component platform:enterprise Backend.AI Enterprise support. size:L 100~500 LoC type:enhance Enhance component, behavior, internals without user-facing features urgency:4 As soon as feasible, implementation is essential.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve auto-healing of "insufficient amount of resource" errors
1 participant