Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mountinfo: linux procfs traversing method may leads to mountinfo loss #161

Open
zhaodiaoer opened this issue Sep 26, 2024 · 5 comments
Open

Comments

@zhaodiaoer
Copy link

Currently, on Linux, the way to obtain mount information entirely depends on traversing the procfs (such as /proc/self/mountinfo, /proc/<pid>/mountinfo, etc.). However, this method is not safe. An ongoing unmount event from other process on the system may cause the current read request to be subject to a race condition, especially when the mountinfo content is relatively large or the traversal process is relatively slow, this is because the procfs file interface for mountinfo implemented in linux can only guarantee atomicity within a single read syscall, although theses read calls are still within the context of a single open call.

I have tried to avoid this problem by increasing the read size to try to read all contents of mountinfo in just once-only read call. However, each read call can only read data up to the size of one pagesize at most. This is also a limitation of the implementation principle of procfs.

I have also noticed the two new interfaces listmount(2)/statmount(2) provided in the new kernel mentioned in #139 . However, the listmount interface can only return a list of all mount IDs. If we want to achieve an effect equivalent to traversing the mountinfo file, we may need to rely on adding many new system calls, and these two new interfaces rely on a kernel version that is too new, It is difficult to promote their full popularity in a short time, so I want to know if there are currently any other possible solutions worthy of expectation (such as use eBPF to inject some filter into kernel space? but i haven't thought when and how to trigger it)

Due to this problem, we have encountered a mount leak issue in the runc. Therefore, I did not add these informations in #139 . Instead, I opened a separate issue because I think this problem is more suitable to be treated as a bug compared to performance improvement.

@zhaodiaoer
Copy link
Author

zhaodiaoer commented Sep 26, 2024

Provide a PoC for this problem (very simple and crude script):

  1. prepare 100 bind mount with mount target in format "/mnt/runc_a/{index}", and another 100 bind mount with mount target in format "/mnt/runc_b/{index}", then there will be 200 new mount items exist in /proc/self/mountinfo after system default mounts.
#!/bin/bash

mkdir -p /mnt/{runc_a,runc_b,runc_test}

for i in {1..100}; do
    mkdir -p /mnt/runc_a/${i}
    umount /mnt/runc_a/${i} 2>/dev/null
    mount --bind /mnt/runc_test /mnt/runc_a/${i}
done

for i in {1..100}; do
    mkdir -p /mnt/runc_b/${i}
    umount /mnt/runc_b/${i} 2>/dev/null
    mount --bind /mnt/runc_test /mnt/runc_b/${i}
done
  1. use golang program to scan /proc/self/mountinfo like that mountinfo/GetMountsFromReader() does, and print out every scanned text line, in middle of scan process, inject a job that unmount all mount that use /mnt/**runc_a**/ as mountpoint prefix
package main

import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
        "syscall"
)

var (
        actorMoutns = []string{}
)

func unmountActor() {
        for i := 1; i <= 100; i++ {
                path := fmt.Sprintf("/mnt/runc_a/%s", strconv.Itoa(i))
                err := syscall.Unmount(path, 0)
                if err != nil {
                        fmt.Printf("Error unmounting %s: %v\n", path, err)
                }
        }
}

func main() {
        f, err := os.Open("/proc/self/mountinfo")
        if err != nil {
                panic(err)
        }
        defer f.Close()

        unmounted := false
        s := bufio.NewScanner(f)
        for s.Scan() {
                text := s.Text()
                fields := strings.Split(text, " ")
                if strings.HasPrefix(fields[4], "/mnt/runc_a") && !unmounted {
                        unmountActor()
                        unmounted = true
                }
                fmt.Println(text)
        }
}

after testing done, there will be some chaotic output like below, but not the origin content of mountinfo at the program start moment, i never unmount any /mnt/runc_b/* , but some items lost in the result.

...
354 26 254:1 /mnt/runc_test /mnt/runc_a/1 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
365 26 254:1 /mnt/runc_test /mnt/runc_a/2 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
437 26 254:1 /mnt/runc_test /mnt/runc_a/3 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
448 26 254:1 /mnt/runc_test /mnt/runc_a/4 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
459 26 254:1 /mnt/runc_test /mnt/runc_a/5 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
470 26 254:1 /mnt/runc_test /mnt/runc_a/6 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
481 26 254:1 /mnt/runc_test /mnt/runc_a/7 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
492 26 254:1 /mnt/runc_test /mnt/runc_a/8 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1603 26 254:1 /mnt/runc_test /mnt/runc_b/9 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1614 26 254:1 /mnt/runc_test /mnt/runc_b/10 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1625 26 254:1 /mnt/runc_test /mnt/runc_b/11 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1636 26 254:1 /mnt/runc_test /mnt/runc_b/12 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1647 26 254:1 /mnt/runc_test /mnt/runc_b/13 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1658 26 254:1 /mnt/runc_test /mnt/runc_b/14 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
1669 26 254:1 /mnt/runc_test /mnt/runc_b/15 rw,relatime shared:1 - ext4 /dev/vda1 rw,errors=remount-ro
...

In actual messive container business scenarios, the content of one mount entry may be extremely large. During the traversal of mountinfo, the possibility of entry loss due to simultaneous unmount occurs is also greater.

@kolyshkin
Copy link
Collaborator

Yes, I am well aware of this kernel issue. In fact, I have a repo devoted to it (https://github.com/kolyshkin/procfs-test).

To the best of my knowledge, this was fixed in kernel 5.8 by this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f

For older kernels, the only workaround is to re-read /proc/mountinfo. We don't want the workaround in this package or in runc because it makes the whole thing much slower. Eventually, distro vendors should either upgrade their kernels and/or backport the above patch. IOW, the fix belongs to the kernel.

@kolyshkin
Copy link
Collaborator

I think I know how to fix this. Just an idea.

Instead of reading mountinfo to get the parent mount, traverse up the directory tree until we find a directory different device. The previous directory is a mount root.

When we have two options:

  • still read mountinfo to get mount options, or
  • remount it with MS_PRIVATE unconditionally.

@zhaodiaoer
Copy link
Author

Instead of reading mountinfo to get the parent mount, traverse up the directory tree until we find a directory different
device. The previous directory is a mount root.

This check may not enough, if rootfs comes from bind mount, they will have the same device as parent directory.

I have indeed thought about this problem, but I haven't come up with a good solution yet. On the runc side, I have thought of an incomplete workaround which is to first check if the rootfs itself is a mountpoint, If it is, then skip the process of obtaining the complete mountinfo and directly return the rootfs path and its mount options, In a regular production environment scenario, most rootfs are mounts of the overlayfs type, traverse mountinfo will skip. However, the method of checking whether a directory is a mountpoint on Linux still indirectly depends on traversing mountinfo ...

The kernel bug you mentioned that i never noticed is very useful for me ! i will build some tests on the repaired kernel.

However, I still have a doubt. As far as I know, currently traversing procfs relies on the seq_file mechanism to generate a stream and return memory data. This mechanism limits a single read request to only return one page of data. When mountinfo is relatively large (currently in our environment, it is generally several hundred KB), traversing once requires multiple reads. Is there still a race condition problem: the mount tree changes between two reads? I feel that completely solving this problem depends on the Linux kernel providing atomic system calls to obtain complete mountinfo, such as some thing like Getfsstat implemented in BSD.

@zhaodiaoer
Copy link
Author

However, I still have a doubt. As far as I know, currently traversing procfs relies on the seq_file mechanism to generate a stream and return memory data. This mechanism limits a single read request to only return one page of data. When mountinfo is relatively large (currently in our environment, it is generally several hundred KB), traversing once requires multiple reads. Is there still a race condition problem: the mount tree changes between two reads? I feel that completely solving this problem depends on the Linux kernel providing atomic system calls to obtain complete mountinfo, such as some thing like Getfsstat implemented in BSD.

After the help and analysis of kernel colleague, i see this patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f can actually fix current mount leak issue in runc. thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants