Mount namespaces isolate filesystem resources.
This pretty much covers everything that has to do with files on the system.
Among the encapsulated resources is a file containing the list of mount points that are visible to a process. As we hinted at in the intro post, isolation can enforce that changing the list (or any other file) within some mount namespace instance M does not affect that list in a different instance (so that only the processes in M observe the changes).
You might be wondering why we just zoomed in on a seemingly random file that contains a list - what’s so special about it?
The list of mount points determines a process’s entire view of the available filesystems on the system, and since we’re in Linux land with the “everything is a file” mantra, the visibility of pretty much every resource is dictated by this view - from actual files and devices to information about which other processes are also running in the system.
So it’s a huge security win for isolate to be able to dictate exactly what parts of the system we want commands that we run to be aware of. Mount namespaces combined with mount points are a very powerful tool that lets us achieve this.
We can see the mount points visible to a process with id $pid via the /proc/$pid/mounts file - its contents are the same for all processes that belong to the same mount namespace as $pid.
Spotted somewhere in the list returned on my system is the /dev/sda1 device mounted at / (yours might differ). This is the disk device hosting the root filesystem that contains all the good stuff needed for the system to start and run properly, so it would be great if isolate could run commands without them knowing about filesystems like these.
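To poke at this list from a shell, a minimal sketch along the following lines may help. The helper names parse_mounts and list_mounts are made up for illustration; the only thing taken from the system is the /proc/$pid/mounts file, where each line starts with the device, the mount point and the filesystem type.

```shell
# parse_mounts (hypothetical helper): read mounts-style lines on stdin and
# print only the device, mount point and filesystem type of each entry.
parse_mounts() {
    while read -r device mountpoint fstype _rest; do
        printf '%s %s %s\n' "$device" "$mountpoint" "$fstype"
    done
}

# list_mounts (hypothetical helper): show the mounts visible to a process.
# Defaults to the current shell when no PID is given.
list_mounts() {
    parse_mounts < "/proc/${1:-$$}/mounts"
}
```

Calling list_mounts with no argument prints the mounts of the current shell; on a system like the one described above, one of the lines would be the /dev/sda1 entry for /.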
Let’s start by running a terminal in its own mount namespace:
Strictly speaking, we don’t need superuser access to work with new mount namespaces as long as we include the user namespace setup procedures of the previous post. For that reason, the only assumption we make in this post is that unshare commands within the terminal are running as superuser. isolate doesn’t need this assumption.
Hmmm, we can still see the same list as in the root mount namespace.
Especially after witnessing in the previous post that a new user namespace begins with a clean slate, it may seem that the -m flag we passed to unshare didn’t have any effect.
The shell process is in fact running in a different mount namespace (we can verify this by comparing the symlink target shown by ls -l /proc/$$/ns/mnt with that of another shell running in the root mount namespace).
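The comparison can also be scripted. The sketch below is only an illustration: mntns_of and same_mntns are hypothetical helper names, and inspecting a process owned by another user typically requires superuser access.

```shell
# mntns_of (hypothetical helper): print the mount namespace identifier of a
# process, i.e. the target of its /proc/<pid>/ns/mnt symlink.
mntns_of() {
    readlink "/proc/$1/ns/mnt"
}

# same_mntns (hypothetical helper): succeed only if both PIDs can be
# inspected and belong to the same mount namespace.
same_mntns() {
    a="$(mntns_of "$1")" || return 1
    b="$(mntns_of "$2")" || return 1
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

From the shell started with unshare -m, comparing $$ with the PID of a shell in the root mount namespace should fail, while comparing two shells that both live in the root mount namespace should succeed.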
The reason we still see the same list is that whenever we create a new mount namespace (child), a copy of the mount points of the mount namespace where the creation took place (parent) is used as the child’s list.
Now any changes we make to this list (e.g. by mounting a filesystem) will be invisible to processes outside this mount namespace.
However, changing pretty much any other file at this point will affect other processes because we are still referencing the exact same files (Linux only makes copies of special files like the mount points list).
This means that we currently have minimal isolation. If we want to limit what our command process will see, we must update this list ourselves.
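One way to watch this behaviour is to mount something inside a fresh mount namespace and then look for it from the outside. This is a rough sketch rather than part of isolate: demo_private_mount is a hypothetical name, it assumes superuser access, and it relies on unshare from util-linux making mounts private in the new namespace by default.

```shell
# demo_private_mount (hypothetical helper): mount a tmpfs inside a new mount
# namespace and count how often it appears in the mount list, first from
# inside that namespace and then from the calling shell.
# Assumes it runs as superuser, like the unshare commands in this post.
demo_private_mount() {
    dir="$(mktemp -d)" || return 1
    unshare -m sh -c '
        mount -t tmpfs tmpfs "$1" || exit 1
        grep -c " $1 " /proc/self/mounts
    ' sh "$dir"
    status=$?
    # Back outside: the mount list of this shell was never touched.
    grep -c " $dir " /proc/self/mounts
    rmdir "$dir"
    return "$status"
}
```

Under those assumptions the function should print 1 (the mount as seen from inside the new namespace) followed by 0 (the same lookup from the original namespace).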
Now, on one extreme, since we’re trying to be security conscious, we could just say F* it and have isolate clear the entire list before executing the command. But that would render the command useless, since every program at least depends on resources like operating system files, which in turn are backed by some filesystem.
On the other extreme, we could just execute the command as is, sharing with it the same filesystems that contain the system files it requires, but this obviously defeats the purpose of this isolation thing that we have going on.
The sweet spot would provide the program with its very own copy of the dependencies and system files that it requires to run, all sandboxed so that it can make any changes to them without affecting other programs on the system. In the best case scenario, we would wrap these files in a filesystem and mount it as the root filesystem (at the root directory /) before executing the unsuspecting program.
The idea is, because everything reachable by a process must go via the root filesystem and because we will know exactly what files we put in there for the command process, we will rest easy knowing that it is properly isolated from the rest of the system.
Alright, this sounds good in theory and in order to pull it off, we will do the following:
- Create a copy of the dependencies and system files needed by the command.
- Create a new mount namespace.
- Replace the root filesystem in the new mount namespace with one that is made up of our system files copy.
- Execute the program inside the new mount namespace.
A question that already arises at step 1 is: which system files are even needed by the command we want to run? We could rummage in our own root filesystem, ask this question for every file that we encounter, and only include the ones where the answer is yes, but that sounds painful and unnecessary. Also, we don’t even know what command isolate will be executing to begin with.
If only other people had run into this same issue and gathered a set of system files generic enough to serve as a base, right out of the box, for a majority of programs out there. Luckily, there are many projects that do this! One of them is the Alpine Linux project (this is its main function when you start with FROM alpine:xxx in your Dockerfile).
Alpine provides root filesystems that we can use for our purposes. If you are following along, you can get a copy of their minimal root filesystem (
MINI ROOT FILESYSTEM) for
x86_64 here. The latest version at the time of writing and that we will use in this post is
rootfs directory has familiar files just like our own root filesystem at
/ but check out how minimal it is - quite a few of these directories are empty.
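If you are following along, unpacking the downloaded archive into a rootfs directory and listing it could look roughly like this. The sketch assumes the archive is a gzipped tarball; extract_rootfs is a hypothetical helper, and both of its arguments are placeholders for whatever file you downloaded and wherever you want it unpacked.

```shell
# extract_rootfs (hypothetical helper): unpack a gzipped root filesystem
# tarball into a directory and list its top-level entries.
# Both arguments are placeholders; substitute the archive you downloaded.
extract_rootfs() {
    tarball="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    tar -xzf "$tarball" -C "$dest" || return 1
    ls "$dest"
}
```

For example, extract_rootfs path/to/minirootfs.tar.gz rootfs would create the rootfs directory used in the rest of this post and print its top-level directories.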
This is great! We can give the command that we launch a copy of this and it could sudo rm -rf / for all we care; no one else will be bothered.
Given our new mount namespace and a copy of system files, we would like to mount those files on the root directory of the new mount namespace without pulling the rug from under our feet.
Linux has us covered here with the pivot_root system call (there is an associated command) that allows us to control what a process sees as the root filesystem. The command takes two arguments, pivot_root new_root put_old, where new_root is the path to the filesystem containing the soon-to-be root filesystem and put_old is a path to a directory. It works by:
- Mounting the root filesystem of the calling process on put_old.
- Mounting the filesystem pointed to by new_root as the current root filesystem at /.
Let’s see this in action. In our new mount namespace, we start by creating a filesystem out of our alpine files:
Next we pivot root:
Finally, we unmount the old filesystem from put_old so that the nested shell cannot access it.
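The three steps just described (turning the Alpine files into a filesystem, pivoting the root, and unmounting the old root) could be sketched as a single shell function. This is an illustration under stated assumptions, not the exact commands used in this post: enter_rootfs is a hypothetical name, the bind mount is one common way to turn a directory into a mount point, and the function expects to run as superuser inside a shell that is already in its own mount namespace.

```shell
# enter_rootfs (hypothetical helper): make the given directory the root
# filesystem of the current mount namespace.
# Assumes it is called as superuser from a shell that already lives in its
# own mount namespace (for example one started with `unshare -m`), and that
# the new root provides its own `umount` binary.
enter_rootfs() {
    rootfs="$1"
    [ -d "$rootfs" ] || return 1
    # pivot_root expects new_root to be a mount point, so bind-mount the
    # directory onto itself to turn it into one.
    mount --bind "$rootfs" "$rootfs" || return 1
    cd "$rootfs" || return 1
    mkdir -p put_old || return 1
    # Swap roots: the old root filesystem ends up under /put_old.
    pivot_root . put_old || return 1
    cd / || return 1
    # Lazily detach the old root so it can no longer be reached from here.
    umount -l /put_old
}
```

Inside the new mount namespace, something like enter_rootfs rootfs && exec /bin/sh would then continue with a shell taken from the new root filesystem.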
With that, we can run any command in our shell and they will run using our custom alpine root filesystem, unaware of the orchestration that led up to their execution. And our precious files on the old filesystem are safe beyond their reach.
The source code for this post can be found here.
We can replicate what we just accomplished in code, swapping the pivot_root command for the corresponding system call.
First, we create our command process in a new mount namespace by adding the CLONE_NEWNS flag to the call that creates it.
Next, we create a function prepare_mntns that, given a path to a directory containing system files (rootfs), sets up the current mount namespace by pivoting the root of the current process to rootfs as we did earlier.
We need to call this function from our code, and it must be done by our command process in cmd_exec (since it’s the one running within the new mount namespace) before the actual command begins execution.
Let’s try it out:
This output shows something strange - we’re unable to verify the mount list that we have fought so hard for, and ps tells us that there are no processes running on the system (not even the current process or ps itself).
It’s more likely that we broke something while setting up the mount namespace.
We’ve mentioned the /proc directory a few times so far in this series, and if you are familiar with it, then you’re probably not surprised that ps came up empty, since we saw earlier that the directory was empty within this mount namespace (when we got it from the Alpine root filesystem).
The /proc directory in Linux is usually used to expose a special filesystem (called the proc filesystem) that is managed by Linux itself.
Linux uses it to expose information about all processes running in the system as well as other system information with regards to devices, interrupts etc.
Whenever we run a command like ps, which accesses information about processes in the system, it looks to this filesystem to fetch that information.
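As a rough illustration of why an empty /proc leaves ps with nothing to show, the sketch below lists processes the naive way, by looking for entries in a proc directory whose names start with a digit. It is not how ps is implemented; list_pids is a hypothetical helper, and the optional argument exists only so that it can be pointed at another directory.

```shell
# list_pids (hypothetical helper): print the names of all directories in a
# proc-style directory that start with a digit. In a mounted proc
# filesystem these are the IDs of the running processes.
list_pids() {
    procdir="${1:-/proc}"
    for entry in "$procdir"/[0-9]*; do
        # When nothing matches, the pattern is left as is and is skipped here.
        [ -d "$entry" ] && printf '%s\n' "${entry##*/}"
    done
    return 0
}
```

On a regular system list_pids prints one line per process, whereas pointing it at an empty directory, like the /proc of our new root filesystem, prints nothing at all.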
In other words, we need to spin up a proc filesystem. Luckily, this basically involves telling Linux that we need one, preferably mounted at /proc. But we can’t do so just yet since our command process is still dependent on the same proc filesystem as isolate and every other regular process in the system - to cut this dependency, we need to run it inside its own PID namespace.
The PID namespace isolates process IDs in the system. One effect is that processes running in different PID namespaces can have the same process ID without conflicting with each other. Granted, we’re isolating this namespace because we want to give as much isolation as we can to our running command, but a more interesting reason to show it here is that mounting the proc filesystem requires root privileges, and the current PID namespace is owned by the root user namespace, where we do not have sufficient permissions (if you remember from the previous post, root to the command process isn’t really root).
So, we must be running within a PID namespace owned by the user namespace that recognizes our command process as root.
We can create a new PID namespace by passing the CLONE_NEWPID flag when creating the command process.
Next, we add a function prepare_procfs that sets up the proc filesystem by mounting one within the current mount and PID namespaces.
Finally, we call the function right before unmounting put_old in our prepare_mntns function, after we have set up the mount namespace and changed to the root directory.
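The same idea can be tried from a terminal with unshare. The sketch below is a shell analogue rather than isolate’s prepare_procfs: run_with_fresh_proc is a hypothetical name, it assumes superuser access, and it mounts the new proc filesystem over the existing /proc of the new mount namespace instead of pivoting to a different root first.

```shell
# run_with_fresh_proc (hypothetical helper): run a command in new mount and
# PID namespaces with a proc filesystem of its own mounted at /proc.
# Assumes superuser privileges, like the other unshare commands in this post.
run_with_fresh_proc() {
    # -m: new mount namespace, -p: new PID namespace.
    # -f: fork before running the command, so that the command becomes the
    #     first process (PID 1) of the new PID namespace.
    unshare -m -p -f sh -c '
        mount -t proc proc /proc || exit 1
        exec "$@"
    ' sh "$@"
}
```

Under those assumptions, run_with_fresh_proc ps should list only processes from the new PID namespace, and run_with_fresh_proc sh -c 'echo $$' should print 1.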
We can take isolate for another spin:
This looks much better! The shell sees itself as the only process running on the system and running as PID 1 (since it was the first process to start in this new PID namespace).
This post covered two namespaces and isolate racked up two new features as a result. In the next post, we will be looking at isolation via network namespaces. There, we will have to deal with some intricate, low-level network configuration in an attempt to enable network communication between processes in different network namespaces.