
Building a Linux Container from Scratch in Rust (Part 1)

October 5, 2025

We use tools like Docker, Kubernetes, and Podman every day, typing docker run and watching as a fully self-contained world springs into existence. In the early days of my career, these tools, among others, were black boxes I wanted to tear wide open (even though I was genuinely scared I wouldn't understand the intents and implementation details). Now I'm older, and I've finally gotten around to prying open those black boxes, starting with the low-level details Docker was built on.

That's what this series is about. I'm going to deconstruct 'em boxes for myself, and hopefully for you too (apologies if that doesn't happen; you will be alright).

We're going to build our own simple container runtime from absolute scratch. No Docker, no runc, just you, me, the Rust compiler, and the raw power of the Linux kernel's syscalls. We'll call our little setup confine, and hopefully when we're done, you and I will have a better understanding of what a container actually is. Psst…this is not about building a production-ready Docker replacement. abegg ooh!

So, Namespaces, right?

Before any line of code, let's yap a little about the core concept behind any container setup: Linux Namespaces.

Stripped of all the jargon, a namespace is just the kernel's way of giving a process its own private, walled-off view of a specific system resource. It's a form of virtualization. Instead of virtualizing an entire machine, you're just virtualizing a piece of the system, like the network stack or the list of running processes. A "container" is really just a regular Linux process that has been walled off from the host system using a collection of these namespaces.

There are several types, each responsible for isolating a different resource.

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 net -> 'net:[4026531840]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 time -> 'time:[4026531834]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 user -> 'user:[4026531837]'
lrwxrwxrwx 1 0xull 0xull 0 Oct  5 21:11 uts -> 'uts:[4026531838]'
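If you'd rather poke at this from code, the same listing can be reproduced with nothing but the Rust standard library. A small sketch (Linux-only, and just an illustration, not part of confine):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Each entry in /proc/self/ns is a symlink whose target encodes the
    // namespace type and its inode number, e.g. "uts:[4026531838]".
    let mut entries: Vec<_> = fs::read_dir("/proc/self/ns")?
        .filter_map(|entry| entry.ok())
        .collect();
    entries.sort_by_key(|entry| entry.file_name());

    for entry in entries {
        let target = fs::read_link(entry.path())?;
        println!("{} -> {}", entry.file_name().to_string_lossy(), target.display());
    }
    Ok(())
}
```

Two processes share a namespace exactly when these symlink targets (the inode numbers) match.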

The plan is to get through them all, but if I don't, allow the agency in you to explore the rest. Here are the important ones:

  • UTS (UNIX Time-sharing System): Isolates the hostname. The simplest way to make a container feel like its own machine.

  • User: Isolates User and Group IDs. This is the security powerhouse that lets a process be root inside the container without being root on the host.

  • PID (Process ID): Isolates the process tree. This is what lets a container have its own PID 1 and be unable to see or signal processes on the host.

  • Mount: Isolates the filesystem view. This is how a container gets its own private root (/) directory and can't see the host's filesystem.

  • Network: Isolates the network stack. This gives the container its own private network interfaces, IP addresses, and routing tables.

  • Cgroup (Control Group): Isolates the view of a process's control groups and, more importantly, works with the Cgroup subsystem to limit resource usage (CPU, memory, etc.).

  • IPC (Inter-Process Communication): Isolates System V IPC objects and POSIX message queues.

  • Time: Isolates the system clocks, specifically CLOCK_MONOTONIC and CLOCK_BOOTTIME. This allows a container's processes to see a different system uptime than the host.

I'll progressively tackle each, one by one, building and refining as necessary the confine tool from a simple command runner into something that starts to look a lot like docker run.

In this post, we will use the command prompts $ and $# to denote a shell that is currently running within the host and child (confine) namespace respectively.

The UTS Namespace (new hostname)

I'm going to start with the simplest namespace of them all: UTS. This one is responsible for isolating just two things: the hostname and the NIS domain name. It's the perfect "Hello, World!" for this setup because its effect is immediate and easy to see. It'll give our container the first tiny illusion of being its own machine.

Before touching any Rust, I can prove the concept works using existing command-line tools. The unshare command lets you run a program in a new namespace. The -u flag (for UTS) is what we care about.

Open up your terminal, mate; check your machine's real hostname.

$ hostname
foothold-labs

Now, let's create a new UTS namespace and run a bash shell inside it. You'll need sudo because creating a UTS namespace is a privileged operation (user namespaces, which we'll meet later, are the exception).

$ sudo unshare -u bash

Inside this new shell, your hostname is initially the same. But because this is a new UTS namespace, any changes made will be isolated. Let's change it.

$# hostname confine-container
$# hostname
confine-container

It worked. But is the change isolated? Let's find out. Type exit to leave the unshared shell and return to your original terminal.

$# exit
$ hostname
foothold-labs

You see, the original hostname is untouched. We successfully created a temporary, isolated view of the system's hostname. That's a namespace in a nutshell.

The Rust Implementation

Seeing it in the terminal is cool, but I'm assuming most of you are here to see the Rust build, not just run commands. So that I don't have to fight with Rust-to-C FFI (as if the daily struggles with the borrow checker aren't enough), I will be using the nix crate, which provides Rust wrappers around the raw C-style syscalls.

The mission here is simple: write a program that takes a command (like sh) as an argument and runs it inside a new UTS namespace with a new hostname.

Here's the first version of confine:

use nix::sched::{CloneFlags, clone};
use nix::sys::wait::waitpid;
use nix::unistd::{execvpe, sethostname};
use std::env;
use std::ffi::CString;
use std::process::exit;

const STACK_SIZE: usize = 1024 * 1024; // 1 MB

fn child_main(args: &[CString]) -> isize {
    println!("--- Child process started ---");

    if let Err(e) = sethostname("confine-container") {
        eprintln!("[ERROR] sethostname failed: {}", e);
        return -1;
    }

    // execvpe replaces the current process, so it only returns on failure.
    // (On success, there is nothing left of this program to return to.)
    let err = execvpe(&args[0], &args, &[] as &[CString]).unwrap_err();
    eprintln!("[ERROR] Child process: execvpe failed with error: {err}");
    1
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() < 2 {
        eprintln!("Usage: {} <command> [args...]", args[0]);
        exit(1);
    }

    // Convert string arguments to C-style strings for the execvpe call.
    let c_args: Vec<CString> = args[1..]
        .iter()
        .map(|arg| CString::new(arg.as_bytes()).unwrap())
        .collect();

    let mut stack = [0u8; STACK_SIZE];
    let flags = CloneFlags::CLONE_NEWUTS;

    // The entry point for the child process.
    let child_func = Box::new(|| child_main(&c_args));

    let child_pid = unsafe {
        match clone(
            child_func,
            &mut stack,
            flags,
            Some(nix::sys::signal::Signal::SIGCHLD as i32),
        ) {
            Ok(pid) => pid,
            Err(e) => {
                eprintln!("[ERROR] clone failed: {}", e);
                exit(1);
            }
        }
    };

    println!("Cloned child process with PID {}", child_pid);

    if let Err(e) = waitpid(child_pid, None) {
        eprintln!("[ERROR] waitpid failed: {}", e);
    }
}

Let's break down the guts of this thing.

  • The Stack: The clone syscall is powerful but primitive. Unlike fork, it doesn't automatically copy the parent's stack. You have to manually provide a chunk of memory to serve as the new process's stack. For that, we just allocate a 1MB array on the main function's stack. Simple, but it will do for now. Trust me, mate. *Butcher's smile*

  • The clone Call: This is the heart of the operation. We're telling the kernel:

    • child_func: Here is the function I want the new process to start executing.

    • &mut stack: Here is the memory for the new process's stack. Note that stacks on x86 grow downwards, but nix's clone wrapper handles this detail for us.

    • flags: This is where we request the new namespace. CloneFlags::CLONE_NEWUTS tells the kernel to give this new process its own private UTS namespace.

    • Some(nix::sys::signal::Signal::SIGCHLD as i32): This tells the kernel to send a SIGCHLD signal to the parent when the child exits, which is standard practice for process management.

  • The Child Function (child_main):

    • sethostname("confine-container"): The very first thing we do is set the hostname. Because we're in a new UTS namespace, this call only affects us.

    • execvpe(...): This is the final act. The exec family of syscalls replaces the currently running program with a new one. The child_main Rust program ceases to exist, and its process is taken over by the command you provided (e.g., sh). If this call succeeds, it never returns.

Now, compile and run it. Psst..compile before running it. You'll need to run it as root to create the namespace.

$ cargo build
$ sudo ./target/debug/confine sh
--- Child process started ---
Cloned child process with PID 7678
$# hostname
confine-container
$# id
uid=0(root) gid=0(root) groups=0(root)

Hold my beer, it works! There, a process running with a custom hostname! But wait, why are we root? Glad you asked. Because the parent process (sudo ./target/debug/confine) was running as root, the child inherited those credentials. This is a security problem, because the whole point of a container is to not give it root access to the "host".

And thankfully, that leads to the next, and arguably most important, namespace: the User namespace.

The User Namespace

We're about to play God, safely at least. So far, we have been poking at the black box from the outside. Time to pry it open, but before that, another round of yapping.

The User Namespace is a critical piece of this whole setup. Its job is to isolate user and group IDs by creating a private mapping between the user IDs inside the container and the user IDs outside on the host.

The "why" isn't obvious unless spelled out. This is the feature that allows a process to have the full power of root (UID 0) inside its own little universe, able to install packages, bind to privileged ports, and manage its own users, while simultaneously looking like a regular, unprivileged user (like UID 1000) to the host system. You know the saying, "have your cake and eat it too" (I know that ain't exactly how it's said); yeah, this is that, but for privilege separation.

Let's see what happens when you just ask for a new user namespace. The flag is -U (uppercase, to distinguish it from the UTS flag -u).

# sudo isn't needed for this
$ unshare -Uu sh

# new user and UTS namespace
$# id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)

Wait, who? We've become nobody. In Linux, this is a deliberate, security-first design choice by the kernel. A new user namespace is created with a completely empty ID mapping table. Since the kernel has no entry in the table to map our original UID from the parent namespace, it falls back to a special, unmapped "overflow UID", which is 65534 and conventionally named nobody.

And so you know, a namespace without any valid mappings is severely restricted. We can't do anything that requires privileges. The fix is to provide that map. The kernel exposes this mapping mechanism via two special, write-once files in the /proc filesystem:

  • /proc/[pid]/uid_map

  • /proc/[pid]/gid_map

To bring the new namespace to life, a process in the parent namespace (our confine program) needs to write a mapping into these files for the newly created child process. The format of the string we write is simple:

ID_inside   ID_outside   length

Here's how these values translate:

  • ID_inside: This is the starting UID or GID that processes inside the new namespace will see. To become root inside the container, you'd start this at 0.

  • ID_outside: This is the corresponding starting UID or GID on the host system. This is the "real" user that the kernel will use for permission checks.

  • length: The number of consecutive IDs to include in this mapping range. For our simple case, we just want to map one user, so the length is 1.

So, if your user ID on the host is 1000, and you want to become root inside the container, you would write the following string to /proc/[child_pid]/uid_map:

"0 1000 1"

This tells the kernel: "A range of 1 ID, starting at UID 0 inside the container, corresponds to a range starting at UID 1000 on the host." Say you want more than one user ID: a line like "1 100000 65536" would map UIDs 1-65536 inside the container to UIDs 100000-165535 on the host, which is how you'd give a container a pool of unprivileged users.
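To make the range arithmetic concrete, here's a tiny, hypothetical helper (the MapEntry type and to_host function are my own illustration, not part of confine or any kernel API) that applies a single map entry the way the kernel would:

```rust
/// One line of /proc/[pid]/uid_map: "inside outside length".
/// (An illustrative type, not part of any crate.)
struct MapEntry {
    inside: u32,
    outside: u32,
    length: u32,
}

/// Translate a UID seen inside the namespace into the host UID.
/// Returns None when the ID falls outside the mapped range.
fn to_host(map: &MapEntry, inside_uid: u32) -> Option<u32> {
    if inside_uid >= map.inside && inside_uid < map.inside + map.length {
        Some(map.outside + (inside_uid - map.inside))
    } else {
        None
    }
}

fn main() {
    // "0 1000 1": container root is really host UID 1000.
    let single = MapEntry { inside: 0, outside: 1000, length: 1 };
    assert_eq!(to_host(&single, 0), Some(1000));
    assert_eq!(to_host(&single, 1), None); // unmapped

    // "1 100000 65536": a pool of unprivileged users.
    let pool = MapEntry { inside: 1, outside: 100_000, length: 65_536 };
    assert_eq!(to_host(&pool, 1), Some(100_000));
    assert_eq!(to_host(&pool, 65_536), Some(165_535));
}
```

The kernel does exactly this kind of offset arithmetic on every permission check that crosses the namespace boundary.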

So far, we have deliberately spoken of a user namespace → host (level 0) relationship. But user namespaces can be nested! You can create a new user namespace from within an existing one, which in turn could be nested inside another. This forms a parent-child hierarchy that can go up to 32 levels deep.

This image is from Quarkslab's blog.

Each level in this hierarchy has its own distinct set of UID and GID mappings, translating IDs from its own level to the level immediately above it. The diagram is a good visual example, showing a wide range of UIDs at level 0 mapped to a smaller, subordinate range at level 1, which in turn maps an even smaller range to level 2. A process's "real" identity on the host is found by translating its UID up this entire chain until it reaches the root. Please hold that last sentence in working memory as it will tie in nicely when we talk about translation of permissions and ownership of resources by the kernel.
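That upward walk can be sketched as a simple fold over the chain of maps. The two-level chain and its numbers below are made up for illustration:

```rust
/// One line of a uid_map: "inside outside length". (Illustrative type.)
struct MapEntry {
    inside: u32,
    outside: u32,
    length: u32,
}

/// Translate an ID one level up, to the immediate parent namespace.
fn to_parent(map: &MapEntry, id: u32) -> Option<u32> {
    (id >= map.inside && id < map.inside + map.length)
        .then(|| map.outside + (id - map.inside))
}

/// Walk the whole chain of maps until we reach level 0 (the host).
fn resolve_to_host(chain: &[MapEntry], start: u32) -> Option<u32> {
    chain.iter().try_fold(start, |id, map| to_parent(map, id))
}

fn main() {
    // Hypothetical two-level nesting: level 2 maps into level 1,
    // which in turn maps into level 0 (the host).
    let chain = [
        MapEntry { inside: 0, outside: 10, length: 100 },        // level 2 -> level 1
        MapEntry { inside: 0, outside: 100_000, length: 1_000 }, // level 1 -> level 0
    ];

    // Root (UID 0) at level 2 is UID 10 at level 1, UID 100010 on the host.
    assert_eq!(resolve_to_host(&chain, 0), Some(100_010));
}
```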

This nesting adds a layer of complexity to how we read the /proc/[pid]/uid_map file. The file's content is not static; what you see depends entirely on which namespace you are reading it from. It always shows a mapping between two namespaces, but which two depends on your perspective.

In scenario 1:

Let's say you have two distinct, non-nested containers, A and B, and you want to see how UIDs in container A map to container B.

If a process in container B reads the uid_map for a process in container A (e.g., /proc/7678/uid_map), and the output contains the line 25 32 7, the translation is between those two peers. It means:

UIDs 25 through 31 (a range of 7) in container A's namespace correspond to UIDs 32 through 38 in container B's namespace.

In scenario 2:

This is the more common and intuitive case. When a process reads the uid_map file for itself (/proc/$$/uid_map) or for any other process within its own namespace, the mapping is always between the current namespace and its immediate parent.

So, if you are inside the confine container (let's call it namespace C) and you read /proc/$$/uid_map, and the output contains 25 32 7, it means:

UIDs 25 through 31 in your current namespace C correspond to UIDs 32 through 38 in C's parent namespace.

These mappings create a two-way translation that governs how the kernel interprets permissions and ownership of resources for processes. Here's my attempt at explaining it; I hope it clicks:

  • Permission checks (Inside → Outside): When a process inside the container, running as what it thinks is UID 0 (root), tries to access a host file (e.g., cat /etc/hostname), the kernel intercepts the operation. It looks at the user namespace map and performs a translation: "Ah, the process thinks it's UID 0, but this namespace's map tells me that corresponds to UID 1000 on the host." The kernel then performs the actual permission check on the file using the host UID of 1000. If your host user 1000 can read that file, the operation succeeds. If not, it fails. This is why it is not a privilege escalation. You only have the permissions of your host user, but you appear as root inside the container.

  • Metadata display (Outside → Inside): It also works in reverse. When a process inside the container runs ls -l on a directory, it sees files owned by various host UIDs. For each file, the kernel takes the real host UID and tries to find a corresponding mapping into the container's namespace. If a file on the host is owned by your user (UID 1000), the kernel sees the 0 1000 1 mapping and displays the owner as root. If a file is owned by the host's actual root user (UID 0), and you haven't created a mapping for it, the kernel has no translation. In that case, it displays the owner as nobody (65534).
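That display direction, including the nobody fallback, can be sketched like this (again, a hypothetical helper of my own, not kernel code):

```rust
/// The kernel's overflow UID, shown when no mapping exists.
const OVERFLOW_UID: u32 = 65534; // "nobody"

/// One uid_map line: "inside outside length". (Illustrative type.)
struct MapEntry {
    inside: u32,
    outside: u32,
    length: u32,
}

/// Translate a real host UID into the namespace's view of it,
/// e.g. for display by `ls -l`. Unmapped host UIDs become `nobody`.
fn to_container(map: &MapEntry, host_uid: u32) -> u32 {
    if host_uid >= map.outside && host_uid < map.outside + map.length {
        map.inside + (host_uid - map.outside)
    } else {
        OVERFLOW_UID
    }
}

fn main() {
    // "0 1000 1": our host user appears as root inside the container.
    let map = MapEntry { inside: 0, outside: 1000, length: 1 };
    assert_eq!(to_container(&map, 1000), 0);         // our files -> owned by root
    assert_eq!(to_container(&map, 0), OVERFLOW_UID); // host root -> nobody
}
```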

Alright, so far mapping UIDs is straightforward. Mapping GIDs, however, comes with a dangerous trapdoor that we must slam shut. The complexity has everything to do with supplementary groups. In Linux, your process doesn't just have one group ID; it can be a member of many groups at once (e.g., audio, video, docker). And the setgroups(2) syscall is a privileged call that allows a process to change its list of supplementary groups. Herein lies the security risk:

  • We create a new user namespace and map our user to be root (UID 0).

  • As root inside the namespace, our process is granted a full set of capabilities, including CAP_SETGID, which is the capability needed to call setgroups().

  • What stops our newly-powerful process from calling setgroups() and adding a powerful host group ID (like the docker group or the sudo group) to its credentials?

If this were allowed, it would be a trivial container breakout, allowing the process to gain "God" privileges on the host system. As such, the kernel provides another file, /proc/[pid]/setgroups, to prevent this exact attack. Before an unprivileged user (on the host) is allowed to write to /proc/[pid]/gid_map, they must first write the string "deny" to the setgroups file. Doing so irreversibly disables the setgroups(2) syscall within that new user namespace. Only after closing and locking this trapdoor can the parent process proceed to write the GID map (e.g., "0 1000 1") to /proc/[pid]/gid_map, just as it did for the UID map.

Jesus! That's a long but sweet yapping!

The Rust Implementation

Finally, I get to show you more Rust, continuing from the initial setup. But before that, there's a bit of a synchronization problem.

Remember that a privileged process in the parent namespace must be the one to write the desired mappings into the uid_map and gid_map files for the child. This creates a tricky timing problem:

  • We need to clone the child process so that it exists and has a PID.

  • The parent process needs to use that PID to write to the child's map files.

  • The child process must wait for the parent to finish writing the maps before it tries to do anything, because until the maps are written, it's still nobody.

  • Once the maps are written, the child can finally exec the user's command.

A good old solution is to synchronize the two processes with a pipe. The parent will create a pipe, and the child will block by trying to read from it. Once the parent has finished setting everything up, it will write a single byte to the pipe, which unblocks the child and signals that it's safe to proceed.
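If you want to feel this handshake without any syscall plumbing, here's a stand-in sketch in plain std Rust: a UnixStream pair plays the role of pipe(2), and a thread plays the role of the cloned child. The names and the byte value are my own invention:

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

fn main() -> std::io::Result<()> {
    // A connected socket pair behaves like a bidirectional pipe.
    let (mut parent_end, mut child_end) = UnixStream::pair()?;

    let child = thread::spawn(move || {
        // "Child": block until the parent signals that setup is done.
        let mut buf = [0u8; 1];
        child_end.read_exact(&mut buf).expect("read from pipe failed");
        // Only now would the real child call setgid/setuid and exec.
        buf[0]
    });

    // "Parent": do the setup work here (writing uid_map / gid_map),
    // then write a single byte to unblock the child.
    parent_end.write_all(&[42])?;

    assert_eq!(child.join().expect("child panicked"), 42);
    println!("handshake complete");
    Ok(())
}
```

The blocking read is the whole trick: the child cannot race ahead of the parent, no matter how the scheduler interleaves them.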

Now, the code. We'll add the CLONE_NEWUSER flag, set up the pipe, and implement the parent/child synchronization logic.

use nix::sched::{CloneFlags, clone};
use nix::sys::wait::waitpid;
use nix::unistd::{Gid, Uid, close, execvpe, pipe, read, setgid, sethostname, setuid, write};
use std::env;
use std::ffi::CString;
use std::fs::File;
use std::io::Write;
use std::os::fd::{AsFd, AsRawFd, BorrowedFd, RawFd};
use std::process::exit;

const STACK_SIZE: usize = 1024 * 1024; // 1 MB

fn child_main(pipe_read_fd: BorrowedFd, pipe_write_fd: RawFd, args: &[CString]) -> isize {
    println!("[CHILD] Process started");

    if let Err(_) = close(pipe_write_fd) {
        eprintln!("[CHILD:ERROR] Failed to close write end of pipe");
        return -1;
    }

    if let Err(e) = sethostname("confine-container") {
        eprintln!("[CHILD:ERROR] sethostname failed: {}", e);
        return -1;
    }

    println!("[CHILD] Waiting on parent to set up UID/GID maps...");
    let mut buf = [0u8; 1];
    if let Err(_) = read(pipe_read_fd, &mut buf) {
        eprintln!("[CHILD:ERROR] Failed to read from pipe");
        return -1;
    }
    println!("[CHILD] Signal received.");

    // by now, the maps are written, we play god (safely) in this namespace.
    if let Err(err) = setgid(Gid::from_raw(0)) {
        eprintln!("[CHILD:ERROR] setgid failed: {err}");
        return -1;
    }

    if let Err(err) = setuid(Uid::from_raw(0)) {
        eprintln!("[CHILD:ERROR] setuid failed: {err}");
        return -1;
    }

    // execvpe replaces the current process, so it only returns on failure.
    let err = execvpe(&args[0], &args, &[] as &[CString]).unwrap_err();
    eprintln!("[CHILD:ERROR] execvpe failed: {err}");
    1
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() < 2 {
        eprintln!("Usage: {} <command> [args...]", args[0]);
        exit(1);
    }

    // Convert string arguments to C-style strings for the execvpe call.
    let c_args: Vec<CString> = args[1..]
        .iter()
        .map(|arg| CString::new(arg.as_bytes()).unwrap())
        .collect();

    let (pipe_read_fd, pipe_write_fd) = match pipe() {
        Ok(fds) => fds,
        Err(err) => {
            eprintln!("[PARENT:ERROR] Failed to create pipe: {err}");
            exit(1);
        }
    };

    // The entry point for the child process.
    let child_func =
        Box::new(|| child_main(pipe_read_fd.as_fd(), pipe_write_fd.as_raw_fd(), &c_args));
    let mut stack = [0u8; STACK_SIZE];
    let flags = CloneFlags::CLONE_NEWUTS | CloneFlags::CLONE_NEWUSER;

    let child_pid = unsafe {
        match clone(
            child_func,
            &mut stack,
            flags,
            Some(nix::sys::signal::Signal::SIGCHLD as i32),
        ) {
            Ok(pid) => pid,
            Err(e) => {
                eprintln!("[PARENT:ERROR] clone failed: {}", e);
                exit(1);
            }
        }
    };

    println!("[PARENT] Cloned child process with PID {}", child_pid);

    if let Err(_) = close(pipe_read_fd) {
        eprintln!("[PARENT:ERROR] failed to close read end of pipe");
        exit(1);
    }

    let host_uid = Uid::current();
    let host_gid = Gid::current();

    println!("[PARENT] Setting up UID/GID maps for child...");

    let gid_map_path = format!("/proc/{}/gid_map", child_pid);
    let setgroups_path = format!("/proc/{}/setgroups", child_pid);

    if let Ok(mut file) = File::create(&setgroups_path) {
        if let Err(err) = file.write_all(b"deny") {
            eprintln!("[PARENT:ERROR] Failed to write to {setgroups_path}: {err}");
        }
    } else {
        eprintln!("[PARENT:ERROR] Failed to open {setgroups_path}");
    }

    let gid_map_content = format!("0 {} 1", host_gid);
    if let Ok(mut file) = File::create(&gid_map_path) {
        if let Err(err) = file.write_all(gid_map_content.as_bytes()) {
            eprintln!("[PARENT:ERROR] Failed to write to {gid_map_path}: {err}");
        }
    } else {
        eprintln!("[PARENT:ERROR] Failed to open {gid_map_path}");
    }

    let uid_map_path = format!("/proc/{}/uid_map", child_pid);
    let uid_map_content = format!("0 {} 1", host_uid);
    if let Ok(mut file) = File::create(&uid_map_path) {
        if let Err(err) = file.write_all(uid_map_content.as_bytes()) {
            eprintln!("[PARENT:ERROR] Failed to write to {uid_map_path}: {err}");
        }
    } else {
        eprintln!("[PARENT:ERROR] Failed to open {uid_map_path}");
    }

    // Signal to child
    println!("[PARENT] Signaling child to continue.");
    if let Err(_) = write(pipe_write_fd.as_fd(), &[0]) {
        eprintln!("[PARENT:ERROR] Failed to write to pipe");
    }

    if let Err(_) = close(pipe_write_fd) {
        eprintln!("[PARENT:ERROR] Failed to close write end of pipe");
    }

    if let Err(err) = waitpid(child_pid, None) {
        eprintln!("[PARENT:ERROR] waitpid failed: {err}");
    }
}

Alright, core things to look out for are:

  • The Pipe: nix::unistd::pipe() creates the two-ended pipe. We pass both file descriptors to the child. The parent closes the read end and the child closes the write end. This is standard practice.

  • Child blocks on read: The read(pipe_read_fd, ...) call in the child is the synchronization point. It will simply pause execution and wait until there is at least one byte of data to be read from the pipe.

  • Parent writes maps:

    • We get the current user's real UID and GID to create the mapping 0 [host_id] 1.

    • The setgroups file: This permanently disables the setgroups(2) syscall in the new namespace, preventing the child process from re-adding any privileged groups from the host.

    • We then write the GID and UID maps.

  • Parent unblocks child: The write(pipe_write_fd, &[0]) call in the parent writes a single byte to the pipe. This satisfies the child's read call, unblocking it.

  • Child becomes root: Once unblocked, the child immediately calls setgid(0) and setuid(0). Because the maps are now in place, the kernel sees these calls and understands that UID/GID 0 inside this namespace corresponds to our unprivileged user outside.

Compile and run it again. This time, drop the sudo: creating a user namespace is an unprivileged operation (on most modern distros, anyway), and that's the whole point.

$ ./target/debug/confine sh
[CHILD] Process started
[CHILD] Waiting on parent to set up UID/GID maps...
[PARENT] Cloned child process with PID 22883
[PARENT] Setting up UID/GID maps for child...
[PARENT] Signaling child to continue.
[CHILD] Signal received.
$# id
uid=0(root) gid=0(root) groups=0(root)
$# cat /proc/$$/setgroups
deny

Hold my beer again! We have achieved privilege separation. Inside the container, the process believes it is the all-powerful root. But from the host's perspective, it's just a regular process running as our normal, unprivileged user.

The complete source code for this post can be found here.

Conclusion

I suck at writing conclusions, but so far we have a working isolated process that believes it's on a machine with the hostname "confine-container", operating as the root user.

But there's more. The isolated process can still see the host's entire filesystem and all its processes. So next, we'll isolate it from the host's filesystem and process tree by tackling the Mount and PID namespaces.