fix: sync agent's kernel-registry to actual container periodically #2179


@fregataa fregataa commented May 26, 2024

Intro

In the Backend.AI system, the status of a container is tracked in three places: the DB on the manager side, the agent's kernel registry, and the actual container. This PR concerns the agent's kernel registry and the actual container.

Problem

In the current implementation, kernel data is inserted into and removed from the agent's kernel registry inside the tasks that create and destroy containers. During container creation, if an unhandled error occurs, the kernel data that was inserted into the kernel registry is removed. This cleanup is not reliable: other unpredictable errors can still cause a mismatch between the kernel registry and the actual container state.
So, let's sync the kernel registry to the actual container state in a periodic loop.
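
A minimal sketch of what such a periodic sync could look like, not the code in this PR. It assumes the registry is a plain dict keyed by kernel ID and that the caller supplies a coroutine returning the IDs of containers that actually exist; `reconcile`, `sync_loop`, `list_live_ids`, `interval`, and `stop` are all hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable


def reconcile(registry: dict[str, dict], live_ids: set[str]) -> tuple[set[str], set[str]]:
    """Make the registry match the set of containers that actually exist.

    Returns (stale, untracked): IDs removed from the registry because no
    container exists for them, and IDs of containers the registry does not know.
    """
    stale = set(registry) - live_ids
    untracked = live_ids - set(registry)
    for kernel_id in stale:
        # The registry claims a kernel that has no backing container.
        del registry[kernel_id]
    return stale, untracked


async def sync_loop(
    registry: dict[str, dict],
    list_live_ids: Callable[[], Awaitable[set[str]]],  # hypothetical container query
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Reconcile the registry every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        live_ids = await list_live_ids()
        reconcile(registry, live_ids)
        try:
            # Sleep for one interval, but wake up early if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

One caveat of this sketch: because kernel data is inserted during the container creation task, a real loop would presumably need to skip kernels whose creation is still in progress, so that they are not treated as stale.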

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version


graphite-app bot commented May 26, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “flow:merge-queue” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “flow:hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@github-actions github-actions bot added comp:agent Related to Agent component size:M 30~100 LoC labels May 26, 2024

fregataa commented May 26, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 50ab7d0 to 964c9f3 Compare May 26, 2024 14:32
@fregataa fregataa added this to the 24.09 milestone May 26, 2024
@fregataa fregataa added urgency:3 Must be finished within a certain time frame. urgency:2 With time limit, it should be finished within it; otherwise, resolve it when no other chores. and removed urgency:3 Must be finished within a certain time frame. labels May 26, 2024
@fregataa fregataa marked this pull request as ready for review May 26, 2024 15:17
@fregataa fregataa changed the title fix: hard-sync kernel_registry to real containers fix: sync agent's kernel-registry to actual container periodically May 26, 2024
@fregataa fregataa marked this pull request as draft May 26, 2024 21:54
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch from 6698e2a to 11f42db Compare May 28, 2024 07:23
@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 7ca0eb4 to 58002f0 Compare May 28, 2024 07:23
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch from 11f42db to c1cd4fa Compare May 29, 2024 08:03
@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 58002f0 to 09893dc Compare May 29, 2024 08:04
@fregataa fregataa marked this pull request as ready for review May 29, 2024 08:05
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch from c1cd4fa to 20a9b6c Compare June 4, 2024 06:52
@fregataa fregataa force-pushed the topic/05-23-fix_hard-sync_kernel_registry_to_real_containers branch from 09893dc to b351f26 Compare June 4, 2024 06:52

fregataa commented Jun 4, 2024

Closing this PR since the idea behind it has not been confirmed.

Labels
comp:agent Related to Agent component size:M 30~100 LoC urgency:2 With time limit, it should be finished within it; otherwise, resolve it when no other chores.