mountinfo: linux procfs traversing method may lead to mountinfo loss #161
Here is a PoC for this problem (a very simple and crude script):
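#!/bin/bash
# Create the bind-mount source and the two parent directories.
mkdir -p /mnt/{runc_a,runc_b,runc_test}
# (Re)create 100 bind mounts under /mnt/runc_a, dropping any left over from a previous run.
for i in {1..100}; do
    mkdir -p "/mnt/runc_a/${i}"
    umount "/mnt/runc_a/${i}" 2>/dev/null
    mount --bind /mnt/runc_test "/mnt/runc_a/${i}"
done
# Do the same under /mnt/runc_b.
for i in {1..100}; do
    mkdir -p "/mnt/runc_b/${i}"
    umount "/mnt/runc_b/${i}" 2>/dev/null
    mount --bind /mnt/runc_test "/mnt/runc_b/${i}"
done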
After the test finishes, the output is chaotic, like below, rather than the original content of mountinfo at the moment the program started. I never unmount any
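One way to observe the loss from the reader side is to count the entries under a given prefix on every read and watch for a drop while the script above is running. The following is only a sketch: `count_mounts_under` is a hypothetical helper, not the test program used in this report, and it assumes that field 5 of each mountinfo line is the mount point.

```bash
# count_mounts_under PREFIX [FILE]
# Hypothetical helper: print how many mountinfo entries have a mount point
# equal to PREFIX or located below it. FILE defaults to /proc/self/mountinfo.
count_mounts_under() {
    local prefix="$1" file="${2:-/proc/self/mountinfo}"
    # Field 5 of a mountinfo line is the mount point.
    awk -v p="$prefix" '$5 == p || index($5, p "/") == 1 { n++ } END { print n + 0 }' "$file"
}
```

For example, running `count_mounts_under /mnt/runc_b` in a loop while the PoC is churning mounts under /mnt/runc_a should keep printing the same number on a kernel that is not affected.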
In real-world, large-scale container deployments, a single mount entry can be extremely large, so the chance of losing entries to a concurrent unmount while traversing mountinfo is also greater.
Yes, I am well aware of this kernel issue. In fact, I have a repo devoted to it (https://github.com/kolyshkin/procfs-test). To the best of my knowledge, this was fixed in kernel 5.8 by this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f For older kernels, the only workaround is to re-read /proc/self/mountinfo. We don't want the workaround in this package or in runc because it makes the whole thing much slower. Eventually, distro vendors should upgrade their kernels and/or backport the above patch. IOW, the fix belongs in the kernel.
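For readers stuck on older kernels, the re-read workaround could look roughly like the sketch below. The strategy of reading until two consecutive snapshots are identical is an assumption, not something this package or runc implements, and `read_mountinfo_stable` is a hypothetical name. Two matching snapshots make a torn read less likely but do not strictly rule it out, and every call costs at least two full reads, which is the slowdown mentioned above.

```bash
# read_mountinfo_stable [FILE] [MAX_TRIES]
# Hypothetical helper: re-read FILE until two consecutive snapshots match,
# then print the snapshot. Returns non-zero if no stable snapshot is seen
# within MAX_TRIES re-reads (default 5) or if FILE cannot be read.
read_mountinfo_stable() {
    local file="${1:-/proc/self/mountinfo}" tries="${2:-5}"
    local prev cur i
    prev=$(cat "$file") || return 1
    i=0
    while [ "$i" -lt "$tries" ]; do
        cur=$(cat "$file") || return 1
        if [ "$cur" = "$prev" ]; then
            printf '%s\n' "$cur"
            return 0
        fi
        # The snapshots differ: keep the newer one and try again.
        prev=$cur
        i=$((i + 1))
    done
    return 1
}
```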
I think I know how to fix this. Just an idea. Instead of reading mountinfo to get the parent mount, traverse up the directory tree until we find a directory on a different device. The previous directory is the mount root. Then we have two options:
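The directory walk described in this comment could be sketched as follows. This is an illustration only: `find_mount_root` is a hypothetical name, and the sketch assumes GNU coreutils (`stat -c %d` and `realpath`). It only detects a boundary where the device number changes, so a mount that shares a device with its parent directory, such as a bind mount from the same filesystem, is not detected.

```bash
# find_mount_root PATH
# Hypothetical helper: walk up from PATH until the parent directory is on a
# different device, and print the last directory that is still on PATH's device.
find_mount_root() {
    local dir parent dev pdev
    dir=$(realpath -- "$1") || return 1
    dev=$(stat -c %d -- "$dir") || return 1
    while [ "$dir" != "/" ]; do
        parent=$(dirname -- "$dir")
        pdev=$(stat -c %d -- "$parent") || return 1
        # A device change means the walk crossed a mount boundary.
        if [ "$pdev" != "$dev" ]; then
            break
        fi
        dir=$parent
    done
    printf '%s\n' "$dir"
}
```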
This check may not be enough: if the rootfs comes from a bind mount, it will have the same device as its parent directory. I have thought about this problem but have not come up with a good solution yet. On the runc side, I have thought of an incomplete workaround: first check whether the rootfs itself is a mountpoint. If it is, skip obtaining the complete mountinfo and directly return the rootfs path and its mount options. In a regular production environment, most rootfs are overlayfs mounts, so the mountinfo traversal would be skipped. However, checking whether a directory is a mountpoint on Linux still indirectly depends on traversing mountinfo ... The kernel bug you mentioned, which I had never noticed, is very useful to me! I will run some tests on the fixed kernel. However, I still have a doubt. As far as I know, traversing procfs currently relies on the
With help and analysis from a kernel colleague, I can see that this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f does fix the current mount leak issue in runc. Thanks!
Currently, on Linux, obtaining mount information depends entirely on traversing procfs (/proc/self/mountinfo, /proc/<pid>/mountinfo, etc.). However, this method is not safe: an ongoing unmount by another process on the system can race with the current read, especially when the mountinfo content is large or the traversal is slow. This is because the procfs file interface for mountinfo implemented in Linux only guarantees atomicity within a single read syscall, even though all of these read calls happen within the context of a single open call.
I have tried to avoid this problem by increasing the read size so that the whole of mountinfo is read in a single read call. However, each read call can return at most one page of data. This is also a limitation of how procfs is implemented.
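The single-read limit can be checked on a given system with a sketch like the one below. It relies on `dd` issuing exactly one read(2) when `count=1` is given without `iflag=fullblock`; `single_read_size` is a hypothetical helper, and the sketch only measures what one read returns, it does not assume a particular limit.

```bash
# single_read_size [FILE] [BUFSIZE]
# Hypothetical helper: print how many bytes one read(2) of up to BUFSIZE
# bytes (default 1 MiB) returns from FILE (default /proc/self/mountinfo).
single_read_size() {
    local file="${1:-/proc/self/mountinfo}" bs="${2:-1048576}"
    # count=1 makes dd perform a single read of at most bs bytes.
    dd if="$file" bs="$bs" count=1 2>/dev/null | wc -c | tr -d ' '
}
```

Comparing the result with `cat /proc/self/mountinfo | wc -c` on a host with many mounts shows whether a single large read can cover the whole file there.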
I have also noticed the two new interfaces, listmount(2) and statmount(2), provided in newer kernels and mentioned in #139. However, listmount only returns a list of mount IDs. To get the equivalent of traversing the mountinfo file, we would need many additional system calls, and both interfaces require a kernel version that is too new to become widespread in the short term. So I would like to know whether there are currently any other promising solutions (for example, using eBPF to inject a filter into kernel space? I have not yet worked out when and how to trigger it).
Because of this problem, we have encountered a mount leak issue in runc. I did not add this information to #139; instead, I opened a separate issue because I think this problem is better treated as a bug than as a performance improvement.