Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fusemanager: fix container fail after ttl timeout in detach mode #1905

Merged
merged 1 commit into from
Jan 7, 2025

Conversation

wswsmao
Copy link
Contributor

@wswsmao wswsmao commented Dec 16, 2024

In detach mode, when containerd-stargz-grpc exits normally, it sends an Unmount request to the fuse manager. Additionally, during the startup of containerd-stargz-grpc, the restoreRemoteSnapshot function cleans up previous mountpoints. If there are still running containers at this time, it can lead to issues when the TTL cache expires, resulting in abnormal behavior of the containers.

I considered several solutions:

  1. The approach in the current PR, where users should restart containerd-stargz-grpc using SIGKILL, and then skip the cleanup step in restoreRemoteSnapshot.
  2. Setting ResolveResultEntryTTLSec to an infinitely large value to leverage the TTL cache for ensuring the normal operation of containers. However, this would still lead to failures if the containers attempt to access uncached content.
  3. Implementing a complex mechanism to determine if any running containers are using the mountpoints, and if so, skipping the cleanup.

After careful consideration, I have decided to proceed with the first approach.

Comment on lines 747 to 748
// In detach mode, rs is taken over by fusemanager,
// and there may be running containers, so we skip clean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention about fusemanager as snapshot/snapshot.go is agnostic about fusemanager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return fmt.Errorf("failed to unmount %s: %w", m.Mountpoint, err)
// In detach mode, rs is taken over by fusemanager,
// and there may be running containers, so we skip clean
if !o.detach {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just retruning from this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done,if o.detach will return directly

docs/overview.md Outdated
@@ -118,6 +118,10 @@ When upgrading the fuse manager, it's recommended to follow these steps:

This ensures a clean upgrade without impacting running containers.

### Important Considerations

Before restarting the `containerd-stargz-grpc` process, if there are running containers, it is crucial to use `SIGKILL` to terminate the `containerd-stargz-grpc` process. This approach prevents the normal shutdown sequence from attempting to clean up the mountpoints of the running containers, which could lead to disruptions in their availability. By using `SIGKILL`, you ensure that the process is forcefully terminated without affecting the ongoing operations of the containers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a doc about when the user should use SIGTERM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wswsmao
Copy link
Contributor Author

wswsmao commented Dec 26, 2024

all done

@wswsmao
Copy link
Contributor Author

wswsmao commented Jan 6, 2025

@ktock

Copy link
Member

@ktock ktock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@ktock ktock merged commit 928a4dd into containerd:main Jan 7, 2025
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants