Linux sandboxing
Linux 3.12 is assumed, along with the following configuration options:
- CONFIG_USER_NS=y
- CONFIG_NAMESPACES=y
Namespaces are an isolation hierarchy for kernel resources. A full set of fresh namespaces is comparable to a virtual machine, with a shared kernel. A sandbox will likely desire full isolation:
unshare(CLONE_NEWUSER|CLONE_NEWIPC|CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWUTS|CLONE_NEWNET);
spawn(sandboxed_process)
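A minimal C sketch of the call above, assuming glibc on Linux. The function names `sandbox_namespace_flags` and `enter_sandbox_namespaces` are hypothetical, chosen only for illustration. The call can fail, for instance when the kernel lacks CONFIG_USER_NS or when the calling process is multithreaded, so the result is checked.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* The six namespace types listed in the text, combined into one mask. */
int sandbox_namespace_flags(void)
{
    return CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNS |
           CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNET;
}

/* Detach the calling process from its current namespaces.
 * Returns 0 on success, -1 on failure (errno is set by unshare). */
int enter_sandbox_namespaces(void)
{
    if (unshare(sandbox_namespace_flags()) < 0) {
        perror("unshare");
        return -1;
    }
    return 0;
}
```

The sandboxed process would be spawned only after `enter_sandbox_namespaces()` returns 0.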
A mount namespace is the file hierarchy available to a process, consisting of the tree of mounts with ownership over their submounts. A mount and the owned submounts can be marked shared, private, slave or unbindable.
- shared: changes propagate to and from the other mounts in the same peer group, including peers in other namespaces
- private: changes do not propagate
- slave: changes propagate from the master, but not vice-versa
- unbindable: private, and cannot be cloned through a bind operation
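The four propagation types correspond to flags accepted by mount(2). The sketch below shows that mapping; the enum and both function names are hypothetical helpers, not part of any existing API.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical names for the four propagation types described above. */
enum propagation { PROP_SHARED, PROP_PRIVATE, PROP_SLAVE, PROP_UNBINDABLE };

/* Translate a propagation type into the matching mount(2) flag. */
unsigned long propagation_flag(enum propagation p)
{
    switch (p) {
    case PROP_SHARED:     return MS_SHARED;
    case PROP_PRIVATE:    return MS_PRIVATE;
    case PROP_SLAVE:      return MS_SLAVE;
    case PROP_UNBINDABLE: return MS_UNBINDABLE;
    }
    return 0;
}

/* Change the propagation type of an existing mount point.
 * With recursive != 0 the change also applies to all submounts.
 * Returns 0 on success, -1 on failure (errno is set by mount). */
int set_propagation(const char *target, enum propagation p, int recursive)
{
    unsigned long flags = propagation_flag(p) | (recursive ? MS_REC : 0);
    return mount(NULL, target, NULL, flags, NULL);
}
```

For example, `set_propagation("/", PROP_PRIVATE, 1)` would mark the whole tree private, which requires the appropriate privileges in the current mount namespace.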
A fully private mount namespace works well for an application sandbox. It allows for having a hidden lightweight read-only directory to chroot into with only the necessary devices (/dev/urandom) and mounts (/proc, and maybe a tmpfs).
Obtaining isolation in a mount namespace (root is a directory to chroot into, with a proc sub-directory):
// avoid propagating mounts to or from the real root
if mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL) < 0 {
    fail!("mount /")
}
// turn directory into a bind mount
if mount(root, root, "bind", MS_BIND|MS_REC, NULL) < 0 {
    fail!("bind mount")
}
// re-mount as read-only
if mount(root, root, "bind", MS_BIND|MS_REMOUNT|MS_RDONLY|MS_REC, NULL) < 0 {
    fail!("remount bind mount")
}
if chroot(root) < 0 {
    fail!("chroot")
}
if chdir("/") < 0 {
    fail!("chdir")
}
// mount a fresh /proc inside the chroot
if mount(NULL, "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL) < 0 {
    fail!("mount /proc")
}
An isolated network namespace starts with only a loopback device. Virtual network devices can be given to the namespace, but that functionality should be unnecessary for Servo.
In an isolated process (PID) namespace, the initial process is considered init and has PID 1. A remount of /proc is required to update it for the new namespace. Since a sandbox will usually involve a chroot, that's a given. Note that the PID namespace is entered upon forking a child process, not immediately like the others.
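The point about forking can be illustrated with a short C sketch, assuming glibc on Linux. The function name `run_in_new_pid_namespace` is hypothetical. A user namespace is unshared in the same call so that the sketch does not need CAP_SYS_ADMIN; on systems where user namespaces are disabled or restricted the call simply fails.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Unshare a PID namespace, then fork. The caller keeps its old PID;
 * only the first child becomes PID 1 of the new namespace.
 * Returns 0 if the child saw itself as PID 1, 1 if it did not,
 * and -1 if unshare, fork or waitpid failed. */
int run_in_new_pid_namespace(void)
{
    if (unshare(CLONE_NEWUSER | CLONE_NEWPID) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(getpid() == 1 ? 0 : 1); /* runs inside the new namespace */

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a real sandbox the child would go on to set up the mount namespace and exec the sandboxed program instead of exiting immediately.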
In an isolated user/group namespace, UID/GID values do not correspond to values outside of the namespace even when equal. This is the most essential namespace, as it's a requirement for using the others without CAP_SYS_ADMIN.
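The relationship between IDs inside and outside the namespace is set by writing to /proc/<pid>/uid_map and /proc/<pid>/gid_map, which the text above does not cover. The following is a minimal sketch under that assumption, mapping a single ID; `format_id_map` and `write_id_map` are hypothetical helper names. Kernels newer than the 3.12 baseline assumed here additionally require writing "deny" to /proc/<pid>/setgroups before an unprivileged process may write gid_map.

```c
#include <stddef.h>
#include <stdio.h>

/* Format one map line: "<id inside> <id outside> <count>\n" with count 1.
 * Returns the line length, or -1 if the buffer is too small. */
int format_id_map(char *buf, size_t len, unsigned inside, unsigned outside)
{
    int n = snprintf(buf, len, "%u %u 1\n", inside, outside);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write a single-ID mapping to a uid_map or gid_map file.
 * Returns 0 on success, -1 on failure. */
int write_id_map(const char *path, unsigned inside, unsigned outside)
{
    char line[64];
    if (format_id_map(line, sizeof line, inside, outside) < 0)
        return -1;

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(line, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, `write_id_map("/proc/self/uid_map", 0, outer_uid)` would make the process appear as UID 0 inside the namespace, where `outer_uid` is the UID it had before unsharing.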
To reduce the kernel attack surface, it will obviously be a good idea to drop from the pseudo-root user immediately. This will mean having the bare essentials of a user database in /etc for the chroot.
User namespaces were automatically disabled if XFS was enabled before Linux 3.12, so that is essentially going to be the soft minimum requirement. However, distributions still need to enable the CONFIG_USER_NS switch, and they may not want to do it right away due to security risks.
It seems Fedora is starting off with it enabled, but with a patch that restricts it to processes with CAP_SYS_ADMIN. User namespaces were primarily added to allow for unprivileged containers, so the restriction should go away eventually.
The UTS namespace is essentially just an isolated domain name and host name. There's no harm in hiding this information!
The IPC namespace is an isolated view of System V IPC and POSIX message queues. Again, not very interesting, but obviously a good idea since there's nothing to lose.
seccomp-bpf is essentially iptables for system calls. It allows building a whitelist of allowed system calls, and adding arbitrary integer comparison checks for each of the parameters. For Servo, this will primarily be useful for reducing the kernel attack surface.
The value of seccomp for isolation approaches zero as new system calls are required, because the parameters cannot usually be restricted much. There are at least a few system calls with information leaks, like the ability to obtain kernel logs if the kernel.dmesg_restrict sysctl is unset.
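A whitelist of the kind described above can be sketched directly against the kernel's seccomp-bpf interface in C. The four allowed system calls are an arbitrary illustration, far fewer than a real runtime needs, and the names `SANDBOX_ARCH`, `ALLOW`, `sandbox_filter` and `install_sandbox_filter` are hypothetical. Argument checks are left out; they would load values from the `args` array of `struct seccomp_data` in the same way the system call number is loaded here.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define SANDBOX_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SANDBOX_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* If the loaded syscall number equals nr, return ALLOW; else skip ahead. */
#define ALLOW(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

struct sock_filter sandbox_filter[] = {
    /* Kill the caller if the syscall ABI is not the expected one. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SANDBOX_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* Load the system call number and compare it against the whitelist. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    ALLOW(SYS_read),
    ALLOW(SYS_write),
    ALLOW(SYS_exit),
    ALLOW(SYS_exit_group),
    /* Anything not whitelisted is fatal. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

/* Install the filter for the calling thread. It cannot be removed
 * afterwards. Returns 0 on success, -1 on failure. */
int install_sandbox_filter(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(sandbox_filter) / sizeof(sandbox_filter[0])),
        .filter = sandbox_filter,
    };
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) < 0)
        return -1;
    return 0;
}
```

The filter should be installed as the last step before running untrusted code, since every setup call made afterwards has to be on the whitelist as well.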
https://github.com/thestinger/rust-seccomp
The runtime alone needs a large number of system calls, so namespaces are going to be much more valuable as a starting point.
- use setsid to make a fresh session
- use setresgid/setresuid for dropping pseudo-root
- make sure inherited file descriptors aren't breaking the sandbox
- make sure to wipe out the environment
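The checklist above could be sketched in C roughly as follows, assuming glibc on Linux. All four function names are hypothetical, and the target UID/GID are assumed to be mapped in the user namespace already.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Start a fresh session. This fails if the caller is a process group
 * leader, so it is meant to run in the forked child. */
int new_session(void)
{
    return setsid() < 0 ? -1 : 0;
}

/* Drop real, effective and saved IDs. The group goes first: once the
 * UID is dropped, the process may no longer be allowed to change GIDs. */
int drop_privileges(uid_t uid, gid_t gid)
{
    if (setresgid(gid, gid, gid) < 0)
        return -1;
    if (setresuid(uid, uid, uid) < 0)
        return -1;
    return 0;
}

/* Close every descriptor numbered `lowest` or higher. */
void close_inherited_fds(int lowest)
{
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0)
        max = 1024; /* fallback guess when the limit is indeterminate */
    for (long fd = lowest; fd < max; fd++)
        close((int)fd);
}

/* Remove all environment variables. */
int wipe_environment(void)
{
    return clearenv() != 0 ? -1 : 0;
}
```

A child process might call `new_session()`, `close_inherited_fds(3)` and `wipe_environment()` first, and `drop_privileges()` last, once the steps that need pseudo-root are done.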
Control groups are very interesting for restricting resource usage, but they are a system administrator feature at this point and stopping a denial of service via remote code execution in the sandbox is a low priority. Rather than using control groups, resource limits can be set for the process and seccomp can be used to prevent spawning new processes or changing the limits.
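Setting a resource limit for the process can be sketched with setrlimit(2). The helper name `set_hard_limit` is hypothetical; which resources to limit and to what values is left open here.

```c
#include <sys/resource.h>

/* Set the soft and hard limit of a resource to the same value, so an
 * unprivileged process cannot raise it again later.
 * Returns 0 on success, -1 on failure (errno is set by setrlimit). */
int set_hard_limit(int resource, rlim_t value)
{
    struct rlimit lim = { .rlim_cur = value, .rlim_max = value };
    return setrlimit(resource, &lim);
}
```

For example, `set_hard_limit(RLIMIT_CORE, 0)` disables core dumps, and a small value for `RLIMIT_NOFILE` caps the number of open file descriptors.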
Not portable, and also essentially a system administrator feature. There's not much that can be done here anyway because users will expect a browser as a whole to have access to the filesystem.