fusemanager: fix container fail after ttl timeout in detach mode #1905

wswsmao · 2024-12-16T11:31:28Z

In detach mode, when containerd-stargz-grpc exits normally, it sends an Unmount request to the fuse manager. In addition, when containerd-stargz-grpc starts up, the restoreRemoteSnapshot function cleans up previous mountpoints. If containers are still running at that point, they can start to misbehave once the TTL cache expires.

I considered several solutions:

1. The approach in the current PR, where users should restart containerd-stargz-grpc using SIGKILL, and then skip the cleanup step in restoreRemoteSnapshot.
2. Setting ResolveResultEntryTTLSec to an infinitely large value to leverage the TTL cache for ensuring the normal operation of containers. However, this would still lead to failures if the containers attempt to access uncached content.
3. Implementing a complex mechanism to determine if any running containers are using the mountpoints, and if so, skipping the cleanup.

After careful consideration, I have decided to proceed with the first approach.

ktock · 2024-12-23T14:20:52Z

snapshot/snapshot.go

+	// In detach mode, rs is taken over by fusemanager,
+	// and there may be running containers, so we skip clean


No need to mention fusemanager here, as snapshot/snapshot.go is agnostic about fusemanager.

ktock · 2024-12-23T14:23:09Z

snapshot/snapshot.go

-				return fmt.Errorf("failed to unmount %s: %w", m.Mountpoint, err)
+	// In detach mode, rs is taken over by fusemanager,
+	// and there may be running containers, so we skip clean
+	if !o.detach {


Why not just return from this function?

Done. If o.detach is set, the function now returns directly.

ktock · 2024-12-23T14:40:50Z

docs/overview.md

@@ -118,6 +118,10 @@ When upgrading the fuse manager, it's recommended to follow these steps:

 This ensures a clean upgrade without impacting running containers.

+### Important Considerations
+
+Before restarting the `containerd-stargz-grpc` process, if there are running containers, it is crucial to use `SIGKILL` to terminate the `containerd-stargz-grpc` process. This approach prevents the normal shutdown sequence from attempting to clean up the mountpoints of the running containers, which could lead to disruptions in their availability. By using `SIGKILL`, you ensure that the process is forcefully terminated without affecting the ongoing operations of the containers.


Could you add a doc about when the user should use SIGTERM?

wswsmao · 2024-12-26T06:25:36Z

all done

wswsmao · 2025-01-06T09:48:24Z

@ktock

Signed-off-by: abushwang <[email protected]>

ktock

Thanks

ktock reviewed Dec 23, 2024

wswsmao force-pushed the main branch from a079b40 to 889a10e Compare December 25, 2024 02:02

fusemanager: fix container fail after ttl timeout in detach mode

61ed6ff

Signed-off-by: abushwang <[email protected]>

wswsmao force-pushed the main branch from 889a10e to 61ed6ff Compare January 7, 2025 01:46

wswsmao mentioned this pull request Jan 7, 2025

fuse passthrough: fix oom when running huge images #1923

Merged

ktock approved these changes Jan 7, 2025

ktock merged commit 928a4dd into containerd:main Jan 7, 2025
31 checks passed
