I tried to build a container from scratch using only chroot, unshare, and overlayfs. I almost got it working, but PID isolation broke me
Posted by Abject-Hat-4633@reddit | linuxadmin | 18 comments
I have been learning how containers actually work under the hood. I wanted to move beyond Docker and understand the core Linux primitives—namespaces, cgroups, and overlayfs—that make it all possible.
So I tried to build it all from scratch (the way I imagine sysadmins might have before Docker normalized it all), using the raw isolation and namespace primitives ...
What I got working perfectly (rough sketch below):
- Creating an isolated root filesystem with debootstrap.
- Using OverlayFS to have an immutable base image with a writable layer.
- Isolating the filesystem, network, UTS, and IPC namespaces with `unshare`.
- Setting up a cgroup to limit memory and CPU.
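Roughly the shape of that working part, as a simplified sketch (not my exact script; the paths, cgroup name, and limits are placeholders, and it assumes a cgroup v2 host):

```
# base image built once with: debootstrap stable /srv/container/base
BASE=/srv/container/base        # read-only lower layer
UPPER=/srv/container/upper      # writable layer
WORK=/srv/container/work        # overlayfs scratch dir
ROOTFS=/srv/container/merged    # what the container actually sees

mkdir -p "$UPPER" "$WORK" "$ROOTFS"

# immutable base + writable layer via OverlayFS
sudo mount -t overlay overlay \
    -o lowerdir="$BASE",upperdir="$UPPER",workdir="$WORK" "$ROOTFS"

# cgroup v2: cap memory and CPU for whatever ends up in this group
sudo mkdir -p /sys/fs/cgroup/mycontainer
echo 256M           | sudo tee /sys/fs/cgroup/mycontainer/memory.max
echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max   # 50% of one CPU
# (the container's PID gets written into .../cgroup.procs later)
```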
-->$ cat problem
PID namespace isolation. I can't get it to work reliably. I've tried everything:
- Using `unshare --pid --fork --mount-proc`
- Manually mounting a new procfs with `mount -t proc proc /proc` from inside the chroot
- Complex shell scripts to try and get the timing right
It was showing me all of the host's processes, when it should only show 1-2 processes. I tried to follow what the runc runtime does.
I'm using OverlayFS with a rootfs (it's Debian for now; later I'll switch to Alpine like Docker, but first I want to fix this error).
I have learned more about kernel namespaces from this failure than any success, but I'm stumped.
Has anyone else tried this deep dive? How did you achieve stable PID isolation without a full-blown runtime like runc?
Here is the GitHub link: https://github.com/VAibhav1031/Scripts/tree/main/Container_Setup
Sad_Dust_9259@reddit
I tried the same rabbit hole once, and PID namespaces were the wall I crashed into too.
Magneon@reddit
Before Docker there was a bigger jump for most sysadmins. On the basic side you had chroot jails, then you jumped to virtualization hosts, with not a lot in between. You wouldn't bother with thin layers over everything; you just handled the 1-2 things you needed, or put everything in virt.
It really was a big game changer, and to this day people still assume it's got virtualization levels of overhead and avoid it due to misunderstandings.
Ssakaa@reddit
VServer, OpenVZ, and LXC were all there years before Docker... and since you know the history back to Process Containers bringing forth the tooling that became LXC, it seems silly to leave LXC itself out of it.
Docker just had better marketing.
Magneon@reddit
LXC is fine but was always clunkier to use (in my opinion). I've used it from time to time over the years but it's a vestigial betamax/hd-dvd at this point.
I was a sysadmin around the time Docker took off, and the critical mass it gained was huge. I don't think I've used a VM in production since (directly, anyway), although to be fair I haven't done sysadmin work at any serious scale for half a decade at least.
Ssakaa@reddit
The biggest difference... LXC, OpenVZ, and VServer were all made to behave much more like VMs without the weight of VMs, trading full hardware virtualization for the thin containerization type shim, just a separate userspace under the same kernel. They were built very much from a sysadmin perspective... while Docker was shaped much more towards (and pitched heavily to) developers as an escape from dependency hell and a way to bypass pesky sysadmins, and literally packaging up the "works for me on my box" environment from the dev point of view, removing the need to provide support for variable systems.
Magneon@reddit
That's fair. The shift from sysadmin to devops to whatever mess it is now highlights that.
Docker can work fine as long as you push for true reproducible builds. That means a lot of annoying things, like setting up your own apt mirrors, and managing your own layer of security patching. Repeat per dependency management system... And it's tedious even if it's doable for a company with a few dedicated people on the task.
For smaller companies, you're right: docker is often used as a way to ship your dev environment as a snapshot. It works really well for that even if that's not a good long term strategy. It's kind of come full circle on the servers as cattle thing, where now your docker image is the pet or cattle, and the rest of the system is (hopefully) fairly disposable.
I work on robotics, with Linux based machines, so my requirements are a bit weird compared to most (notably: extremely limited bandwidth a lot of the time, and "offline" is not a failure state, just an annoying one).
I'd argue that for most people shipping a poorly planned container is probably still a better idea than a poorly planned bare metal install.
Docker is more of a buffet-style abstraction system. If you want abstracted disk, host networking, direct GPU access, and access to only 2 CPU cores half the time... that's easy. Nearly every other permutation is similarly simple.
There's still some really cool stuff I managed to do back in the VM days that containers don't really address (for example moving a VM between hosts without interrupting networking or stopping any processes... which KVM can do if you're careful). I don't really think a well designed modern system needs to get that fancy, but it was very cool!
dhsjabsbsjkans@reddit
Word. I was using LXC before docker.
Abject-Hat-4633@reddit (OP)
Thank you for your insights on this topic. I also found that Solaris Zones and FreeBSD jails provided some kind of containerization earlier than Linux did, but that software was expensive.
If you have any ideas that could help me, or a resource where I can learn more, please share. Also, take a little peek at my code if you can.
aquaherd@reddit
Maybe you can read it up here:
https://github.com/p8952/bocker
Abject-Hat-4633@reddit (OP)
Thank you, I will get some ideas from this. It is a bit of an old repo but still gold for me.
Ty!
Cody_Learner@reddit
Have you looked into or considered systemd-nspawn containers yet?
It's a very minimal container system that abstracts away some of the underlying components you're working with. I use them all the time, both for temp/testing and set up as persistent on boot, e.g. a local repo host. I also use them exclusively in my AUR helper for compiling packages.
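For a quick feel, with any debootstrapped tree (the path here is just an example):

```
# run a shell inside the tree, no boot
sudo systemd-nspawn -D /srv/debian

# or boot it like a lightweight machine (needs an init such as systemd inside the tree)
sudo systemd-nspawn -bD /srv/debian
```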
Abject-Hat-4633@reddit (OP)
No, I haven't used it yet, but I read up on it. It's more like a machine container (it can run a whole OS inside it, with logins and all that),
whereas Docker/Podman are application containers (package and run a single application).
But yeah, for normal testing and other tasks it's not bad.
Cody_Learner@reddit
Sure. You can use them to only run commands, or optionally boot them up. They share the host kernel, etc. They're OCI standards compliant.
Abject-Hat-4633@reddit (OP)
👍🏻
michaelpaoli@reddit
So, how 'bout SELinux? The typical default in the land of *nix is that all users/PIDs can get quite a bit of information about other PIDs. With SELinux (and possibly some similar mechanisms), that can be changed, e.g. such that a user may only be able to get information about their own PIDs, and nothing about any other PIDs on the host. And, I don't know if it exists, but I'd think a similar restriction on a PID may be a feature that exists, where that PID could only get information about just itself, or only itself and its children, or only itself and its descendants.
Anyway, may be other approaches, but that might be at least one possible approach (also possible some may utilize same underlying mechanisms by the time one gets down to the system call level).
Skahldera@reddit
When you DIY containers with `unshare` and `chroot`, you need a proper PID namespace and a `/proc` mounted from inside it, or `ps` just reads the host's procfs and shows every process. Having a minimal init to reap zombies also helps; otherwise your orphaned processes bubble up to PID 1. Tools like `setns` or runc handle those fiddly bits for a reason!
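Something along these lines usually behaves (just a sketch, with a made-up rootfs path; point it at wherever your overlay merge dir lives):

```
ROOTFS=/srv/container/merged

sudo unshare --pid --fork --mount --uts --ipc --net \
    chroot "$ROOTFS" /bin/sh -c '
        # this shell is PID 1 in the new PID namespace, inside a new mount
        # namespace, so the proc mount below only sees our own processes
        # and does not leak onto the host
        mount -t proc proc /proc

        # a real container would exec a tiny init here (tini, dumb-init, ...)
        # to reap orphans; for poking around, bash is fine
        exec /bin/bash
    '
```

The `--mount-proc=$ROOTFS/proc` shortcut on newer util-linux does roughly the same mount for you. The part that matters is that /proc gets mounted by a process that is already *inside* the new PID namespace; `unshare` without `--fork` leaves the calling process in the old namespace, which is why you keep seeing the host's processes.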
Abject-Hat-4633@reddit (OP)
Thank you 👍, I will try what you said. But what about bubblewrap? Some folks say to use that instead of unshare.
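From what I've read, the bwrap equivalent of my setup would be something like this (untested on my side, just pieced together from the docs):

```
bwrap --bind /srv/container/merged / \
      --proc /proc \
      --dev /dev \
      --unshare-pid --unshare-uts --unshare-ipc --unshare-net \
      /bin/bash
```

It seems to handle the proc/dev mounts and the fork-into-the-new-PID-namespace ordering for you.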
chock-a-block@reddit
In case it’s not clear, no systemd, and isolate proc and dev.