Advancing the State of the Art of Container Storage With Titus, Part 3
Disclaimer: This blog post is a deep dive into the topic of Linux container storage, specifically looking at Netflix’s open source Titus container platform. Netflix happens to be my employer, but nothing in this blog post is secret or talks about anything that isn’t already open source.
In Part 1, I discussed the current state of the art of container storage with the CSI + Kubernetes, and its limitations.
In Part 2, I discussed the problem of mounting storage inside running containers, especially when user namespaces are involved.
In this Part 3, I’ll discuss how Titus (titus-storage) is able to separate the attaching of storage from the container lifecycle (how to attach storage after a container is running), all while respecting Linux namespaces, and while keeping the container completely unprivileged and in its user namespace.
What We Are Up Against
We have a running container.
We want to mount something in it.
That “something” could be a network filesystem, a block device, a bind mount, overlayfs, tmpfs, who knows.
Each situation requires a unique solution.
We know that as soon as we try to switch into the user namespace of the container, we can no longer use the mount syscall reliably.
Is there any other way to “inject” a mount?
How Titus (titus-storage) Does It
Remember from Part 1 that I decided to give up on the CSI and its limitations.
Instead, we are just going to build some standalone binaries, in the spirit of mount.nfs.
Each is a binary that can run at any time and mount storage in a container, even after it has been created!
We will run this mount binary outside of the container, where we have privileges. When we are done, we want a mount set up inside the container, with all the namespaces correctly set, all without giving the container additional privileges.
If you would rather read C than my sequence diagrams, just go straight for the code.
Using New Mount APIs Instead of mount
Thanks to the new Linux mount APIs, which provide fine-grained control over the mount process, we can split up the mounting process.
Half of the mount process can happen inside (some of) the namespaces of the container.
The other half can happen outside the user namespace, where our CAP_SYS_ADMIN still works.
The key trick is to create a “superblock” (noted as fsfd) inside the namespaces (mount, net, user) of the container, and then fsmount that superblock on behalf of the container.
Here is a quick comparison between the classic and new syscall APIs:
| Syscall Name | Privileges required (usually) | Namespace interaction | Effect |
|---|---|---|---|
| mount (classic) | CAP_SYS_ADMIN | None (assumes you are already in all namespaces) | Mounts in whatever namespaces you are in. |
| fsopen (new) | None | Takes on the user namespace when called | Returns a file descriptor ready to be configured. |
| fsconfig (new) | None | None | Configures an input file descriptor. |
| fsmount (new) | CAP_SYS_ADMIN | Takes on the current mount+net namespace when called | Mounts an input file descriptor, returns a mount fd. |
| move_mount (new) | CAP_SYS_ADMIN | Uses the mount namespace | Actually puts a mount fd onto the filesystem. |
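To make the new API concrete, here is a minimal sketch (mine, not from titus-storage) that mounts a tmpfs at /mnt/example the new way. It calls the syscalls directly via syscall(2), since older libcs don’t wrap them, and assumes kernel headers >= 5.2 and that /mnt/example exists:

```c
#define _GNU_SOURCE
#include <fcntl.h>         /* AT_FDCWD */
#include <linux/mount.h>   /* FSOPEN_*, FSCONFIG_*, FSMOUNT_*, MOVE_MOUNT_* */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(1); }

int main(void) {
    /* 1. fsopen: create a filesystem context (a superblock in progress) */
    int fsfd = syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
    if (fsfd < 0) die("fsopen");

    /* 2. fsconfig: set options, then create the superblock */
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "size", "1M", 0) < 0)
        die("fsconfig size");
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
        die("fsconfig create");

    /* 3. fsmount: turn the superblock into a detached mount fd */
    int mfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
    if (mfd < 0) die("fsmount");

    /* 4. move_mount: attach the detached mount onto the filesystem */
    if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, "/mnt/example",
                MOVE_MOUNT_F_EMPTY_PATH) < 0)
        die("move_mount");
    return 0;
}
```

Note that every step hands back a plain file descriptor, and file descriptors are exactly the kind of thing Linux lets us pass between processes.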
Combined with SCM_RIGHTS, we can get the right syscalls in the right namespaces to achieve what we want.
Using SCM_RIGHTS to Pass File Descriptors
Let’s say that we did nsenter --user into a container and used fsopen to get a file descriptor (fd).
How would we get it “back out” of the container so that something outside the container can use it?
Answer: SCM_RIGHTS.
SCM_RIGHTS is a method for processes to share file descriptors (the superblock fd, in this case) over a Unix socket.
We are not just transferring the file descriptor number here, we are passing the actual open file description!
If we can pass the fd back and forth between processes, we will be able to mount storage inside containers, even block devices, even though the container can’t “see” them from the inside.
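Here is a minimal sketch of the two helpers involved; the names send_fd and recv_fd are mine, but the cmsg plumbing is the standard SCM_RIGHTS pattern from unix(7):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected Unix socket. */
static int send_fd(int sock, int fd) {
    char dummy = 'x';  /* must carry at least one byte of ordinary data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u = { 0 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;   /* "install this fd in the peer" */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receive one file descriptor; the kernel creates a fresh fd in our table. */
static int recv_fd(int sock) {
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u = { 0 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    if (recvmsg(sock, &msg, 0) < 0) return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```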
This does require that the Linux namespaces, indeed the whole container, exist before we can do this procedure (contrast with the CSI, where the container has to be created after the mount happens, so that the mount can be bind-mounted in).
This is a feature, not a bug!
It means we can mount storage in Titus containers whenever we want, just like you can attach storage on demand to any other normal server!
(See Part 4 for how Titus is able to pause workloads at first launch, to give titus-storage time to mount things first.)
Putting It All Together: NFS (EFS)
Here is an animation that demonstrates the use of these new syscalls, in combination with SCM_RIGHTS, to mount an NFS (EFS) volume in a container.
This demonstrates the titus-mount-nfs binary.
Sorry the video says /ebs, I meant for it to say /efs:
Here is the procedure in sequence diagram form:
This works because the non-forked version of titus-mount-nfs, which actually ends up calling fsmount, never enters the user namespace!
But we still get the benefits of the user namespace (UIDs are correct), because we called fsopen while we were in there.
All of this complexity is contained within the standalone binary. The binary just takes standard arguments, like the NFS hostname and mount path, plus a container PID so it knows which container to enter.
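For illustration, here is a rough sketch of that flow in one function. The send_fd/recv_fd helpers are from the SCM_RIGHTS example above, join_ns is a hypothetical helper of mine, and the filesystem type, option names, and exact ordering of namespace switches are simplifications of what the real titus-mount-nfs does:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Join one of the container's namespaces via /proc/PID/ns/<name>. */
static void join_ns(pid_t pid, const char *name, int nstype) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, name);
    int fd = open(path, O_RDONLY);
    setns(fd, nstype);
    close(fd);
}

/* source is something like "fs-12345.efs.us-east-1.amazonaws.com:/" */
int mount_nfs_in_container(pid_t pid, const char *source, const char *target) {
    int sk[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sk);

    if (fork() == 0) {
        /* Child: take on the container's user namespace and create the
         * superblock fd *there*, so the UID/GID mappings come out right. */
        join_ns(pid, "user", CLONE_NEWUSER);
        int fsfd = syscall(SYS_fsopen, "nfs4", FSOPEN_CLOEXEC);
        send_fd(sk[1], fsfd);   /* hand the fd back out via SCM_RIGHTS */
        _exit(0);
    }

    int fsfd = recv_fd(sk[0]); /* parent: never entered the user namespace */
    wait(NULL);
    syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", source, 0);

    /* Join the container's net + mount namespaces, but NOT its user
     * namespace, so our CAP_SYS_ADMIN still works. */
    join_ns(pid, "net", CLONE_NEWNET);
    join_ns(pid, "mnt", CLONE_NEWNS);
    syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
    int mfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
    return syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
                   MOVE_MOUNT_F_EMPTY_PATH);
}
```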
Putting It All Together: Host Bind
A host bind mount can be set up in a similar way, but using far fewer syscalls and tricks.
It takes a path on the host, and makes it appear inside the container.
This is a traditional bind mount, but it can be done after a container is created.
All that is required is that a mount is open_tree’d on the host, and then move_mount’d into the container’s filesystem.
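Here is a sketch of that two-step dance, reusing the hypothetical join_ns helper from the NFS sketch above:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

int bind_into_container(pid_t pid, const char *host_path, const char *target) {
    /* Grab a detached copy of the host path while we can still see it. */
    int tree = syscall(SYS_open_tree, AT_FDCWD, host_path,
                       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    if (tree < 0)
        return -1;

    /* Hop into the container's mount namespace... */
    join_ns(pid, "mnt", CLONE_NEWNS);   /* helper from the NFS sketch */

    /* ...and attach the detached tree at the target path. */
    return syscall(SYS_move_mount, tree, "", AT_FDCWD, target,
                   MOVE_MOUNT_F_EMPTY_PATH);
}
```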
This demonstrates the titus-mount-bind binary:
Putting It All Together: Block Device
This is an example of mounting a traditional block device from the host into a container.
This is useful in AWS for EBS, which shows up as an NVMe device, like /dev/nvme0n1.
Normally this device file is not visible to the container.
But with the right tricks, we can configure the mount while we are on the “outside” of the container, where we can still see it.
This demonstrates the titus-mount-block-device binary:
(Sequence diagram: fsopen and fsconfig on the filesystem fd happen before switching namespaces, otherwise we can’t see the actual /dev/ file; only then does the binary switch into the container’s mount namespace to fsmount and move_mount.)
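A sketch of that ordering, as a fragment (same includes and join_ns helper as the NFS sketch above; ext4, /dev/nvme1n1, and /ebs are illustrative):

```c
/* Host side: we can still see the device file here. */
int fsfd = syscall(SYS_fsopen, "ext4", FSOPEN_CLOEXEC);

/* Must happen BEFORE setns: the container cannot see /dev/nvme1n1. */
syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", "/dev/nvme1n1", 0);
syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

/* Only now enter the container's mount namespace to attach it. */
join_ns(pid, "mnt", CLONE_NEWNS);
int mfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
syscall(SYS_move_mount, mfd, "", AT_FDCWD, "/ebs", MOVE_MOUNT_F_EMPTY_PATH);
```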
Putting It All Together: Container To Container
I haven’t seen an example of what titus-mount-container-to-container does in the industry.
It takes a source container + directory and a destination container + directory, and bind mounts one onto the other.
This is useful with Kubernetes multi-container pods, except we are able to share folders directly from one container to another.
No intermediate emptyDir or other shared storage is required.
For example, a sidecar container may need to see a main container’s /data.
Or maybe a service mesh sidecar needs the main container’s certificate files.
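Here is a sketch of the idea; the function name and paths are illustrative, and the important detail is opening both namespace fds before the first setns:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000   /* from linux/fcntl.h, for older libcs */
#endif

int share_dir(pid_t src_pid, const char *src_dir,
              pid_t dst_pid, const char *dst_dir) {
    /* Open BOTH mount-namespace fds up front: once we setns away,
     * /proc paths would resolve in the wrong namespace. */
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/ns/mnt", (int)src_pid);
    int src_ns = open(path, O_RDONLY);
    snprintf(path, sizeof(path), "/proc/%d/ns/mnt", (int)dst_pid);
    int dst_ns = open(path, O_RDONLY);

    /* Clone the source directory as a detached tree, from inside the
     * source container's mount namespace. */
    setns(src_ns, CLONE_NEWNS);
    int tree = syscall(SYS_open_tree, AT_FDCWD, src_dir,
                       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE);

    /* Attach it inside the destination container's mount namespace. */
    setns(dst_ns, CLONE_NEWNS);
    return syscall(SYS_move_mount, tree, "", AT_FDCWD, dst_dir,
                   MOVE_MOUNT_F_EMPTY_PATH);
}
```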
This demonstrates the titus-mount-container-to-container binary:
Conclusion
These mount binaries are doing some creative things with syscalls to allow us to mount storage at will in containers, all while keeping them unprivileged.
But 99% of the time, users will want their storage ready at start, not after the container has started. See Part 4, where I demonstrate how we are able to control the startup timing of containers, to ensure that storage is mounted before they start.
[ Part 1 | Part 2 | Part 3 | Part 4 ]