fusemanager: fix container fail after ttl timeout in detach mode #1905
snapshot/snapshot.go
Outdated
// In detach mode, rs is taken over by fusemanager,
// and there may be running containers, so we skip clean
No need to mention fusemanager here, as snapshot/snapshot.go is agnostic about fusemanager.
done
snapshot/snapshot.go
Outdated
return fmt.Errorf("failed to unmount %s: %w", m.Mountpoint, err)
// In detach mode, rs is taken over by fusemanager,
// and there may be running containers, so we skip clean
if !o.detach {
Why not just return from this function?
Done. If o.detach is set, the function now returns directly.
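The early return agreed on in this thread can be sketched as follows. This is a minimal, self-contained illustration, not the actual snapshot/snapshot.go code: `mount`, `cleanupOptions`, `cleanupMounts`, and the injected `unmount` callback are hypothetical names introduced here. Only the `detach` flag and the error message format come from the diff under review.

```go
package main

import "fmt"

// mount is a hypothetical stand-in for a mount table entry.
type mount struct {
	Mountpoint string
}

// cleanupOptions is a hypothetical option set; only detach mirrors the
// o.detach flag discussed in the review.
type cleanupOptions struct {
	detach bool
}

// cleanupMounts unmounts every given mountpoint unless detach is set.
// unmount is injected so the sketch runs without touching real mounts.
func cleanupMounts(o cleanupOptions, mounts []mount, unmount func(string) error) error {
	if o.detach {
		// Running containers may still depend on these mountpoints,
		// so return before any cleanup happens.
		return nil
	}
	for _, m := range mounts {
		if err := unmount(m.Mountpoint); err != nil {
			return fmt.Errorf("failed to unmount %s: %w", m.Mountpoint, err)
		}
	}
	return nil
}

func main() {
	mounts := []mount{{Mountpoint: "/tmp/snap/1"}, {Mountpoint: "/tmp/snap/2"}}
	var unmounted []string
	record := func(p string) error {
		unmounted = append(unmounted, p)
		return nil
	}

	_ = cleanupMounts(cleanupOptions{detach: true}, mounts, record)
	fmt.Println("unmounted in detach mode:", len(unmounted))

	_ = cleanupMounts(cleanupOptions{detach: false}, mounts, record)
	fmt.Println("unmounted in normal mode:", len(unmounted))
}
```

Returning early keeps the non-detach path flat instead of wrapping the whole cleanup loop in an `if !o.detach` block.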
docs/overview.md
Outdated
@@ -118,6 +118,10 @@ When upgrading the fuse manager, it's recommended to follow these steps:

This ensures a clean upgrade without impacting running containers.

### Important Considerations

Before restarting the `containerd-stargz-grpc` process, if there are running containers, it is crucial to use `SIGKILL` to terminate the `containerd-stargz-grpc` process. This approach prevents the normal shutdown sequence from attempting to clean up the mountpoints of the running containers, which could lead to disruptions in their availability. By using `SIGKILL`, you ensure that the process is forcefully terminated without affecting the ongoing operations of the containers.
Could you add a doc about when the user should use SIGTERM?
done
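As a rough illustration of the signal choice discussed for docs/overview.md, the sketch below picks a stop signal from two inputs. `chooseStopSignal` is a hypothetical helper, not part of the project. The rule that `SIGTERM` is appropriate when no containers are running is an assumption inferred from the proposed doc text; this thread does not state it.

```go
package main

import (
	"fmt"
	"syscall"
)

// chooseStopSignal is a hypothetical helper for illustration only.
// With detach mode on and containers still running, it returns SIGKILL
// so that the normal shutdown sequence, which cleans up mountpoints,
// never runs. Otherwise it returns SIGTERM (assumption: a graceful
// shutdown with cleanup is fine when nothing depends on the mounts).
func chooseStopSignal(detach bool, runningContainers int) syscall.Signal {
	if detach && runningContainers > 0 {
		return syscall.SIGKILL
	}
	return syscall.SIGTERM
}

func main() {
	fmt.Println("running containers:", chooseStopSignal(true, 3) == syscall.SIGKILL)
	fmt.Println("no containers:", chooseStopSignal(true, 0) == syscall.SIGTERM)
}
```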
all done
Signed-off-by: abushwang <[email protected]>
Thanks
In detach mode, when `containerd-stargz-grpc` exits normally, it sends an Unmount request to the fuse manager. Additionally, during the startup of `containerd-stargz-grpc`, the `restoreRemoteSnapshot` function cleans up previous mountpoints. If containers are still running at that time, they start to behave abnormally once the TTL cache expires.

I considered several solutions:

1. Terminate `containerd-stargz-grpc` using `SIGKILL`, and then skip the cleanup step in `restoreRemoteSnapshot`.
2. Set `ResolveResultEntryTTLSec` to an infinitely large value to leverage the TTL cache for ensuring the normal operation of containers. However, this would still lead to failures if the containers attempt to access uncached content.

After careful consideration, I have decided to proceed with the first approach.