Fix issue 5763
- Discourage `--oci-worker-no-process-sandbox`, due to the leakage of the processes (by design). Instead, encourage setting `systempaths=unconfined` in `docker run`. This corresponds to `securityContext.procMount: Unmasked` in Kubernetes, however, the configuration is hard on Kubernetes, as it has to be used in conjunction with `hostUsers: false`.
- Remove `--device /dev/fuse`, as fuse-overlayfs is no longer used typically.
- Use the new Kubernetes struct for AppArmor
- Add a hint about `kernel.apparmor_restrict_unprivileged_userns`
- Remove `$` from command snippets for ease of copypasting
- Make `job.*.yaml` more practical
- Add `*.userns.yaml`. Needs `UserNamespaceSupport` feature gate to be enabled.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Rootless mode
Rootless mode allows running the BuildKit daemon as a non-root user.
Known limitations
- Using the `overlayfs` snapshotter requires kernel >= 5.11 or Ubuntu kernel. On kernel >= 4.18, the `fuse-overlayfs` snapshotter is used instead of `overlayfs`. On kernel < 4.18, the `native` snapshotter is used.
- Network mode is always set to `network.host`.
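The kernel-version rule in the first limitation can be sketched as a small shell helper. This is only an illustration of the thresholds listed above: `expected_snapshotter` is a hypothetical name, it ignores the Ubuntu kernel exception, and it is not how BuildKit itself selects a snapshotter.

```shell
#!/bin/sh
# Hypothetical helper: map a kernel release string (e.g. "5.15.0-generic")
# to the snapshotter named in the list above.
# Assumption: the release starts with "MAJOR.MINOR"; the Ubuntu kernel
# exception is deliberately not modelled.
expected_snapshotter() {
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 11 ]; }; then
        echo overlayfs
    elif [ "$major" -eq 5 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 18 ]; }; then
        echo fuse-overlayfs
    else
        echo native
    fi
}

# Report the expectation for the running kernel.
expected_snapshotter "$(uname -r)"
```

If the result differs from what the daemon actually uses, the Troubleshooting section describes how to pick a snapshotter explicitly with `--oci-worker-snapshotter`.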
Running BuildKit in Rootless mode (OCI worker)
RootlessKit needs to be installed.
rootlesskit buildkitd
buildctl --addr unix:///run/user/$UID/buildkit/buildkitd.sock build ...
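`$UID` is set by bash but not by every POSIX shell. As a minimal sketch, the address can be derived from `id -u` instead; `buildkit_addr` is a hypothetical helper, and the socket path layout is simply the one used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the buildctl address for a rootless daemon
# owned by the given UID (defaults to the current user).
# Assumption: the socket lives at the path shown in the command above.
buildkit_addr() {
    uid=${1:-$(id -u)}
    printf 'unix:///run/user/%s/buildkit/buildkitd.sock\n' "$uid"
}

buildkit_addr
```

It could then be used as `buildctl --addr "$(buildkit_addr)" build ...`.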
Tip
To isolate BuildKit daemon's network namespace from the host (recommended):
rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback buildkitd
Running BuildKit in Rootless mode (containerd worker)
RootlessKit needs to be installed.
Run containerd in rootless mode using RootlessKit, following containerd's documentation.
containerd-rootless.sh
CONTAINERD_NAMESPACE=default containerd-rootless-setuptool.sh install-buildkit-containerd
Advanced guide
Alternatively, you can specify the full command line flags as follows:
containerd-rootless.sh --config /path/to/config.toml
containerd-rootless-setuptool.sh nsenter -- buildkitd --oci-worker=false --containerd-worker=true
Containerized deployment
Kubernetes
Docker
docker run \
--name buildkitd \
-d \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--security-opt systempaths=unconfined \
moby/buildkit:rootless
buildctl --addr docker-container://buildkitd build ...
Tip
If you don't mind using `--privileged` (almost safe for rootless), the `docker run` flags can be shortened as follows:
docker run --name buildkitd -d --privileged moby/buildkit:rootless
Justification of the `--security-opt` flags:
- `seccomp=unconfined`: For allowing several syscalls such as `unshare` (used by runc) and `mount` (used by snapshotters, etc).
- `apparmor=unconfined`: For allowing mounting filesystems, etc. This flag is not needed when the host operating system does not use AppArmor.
- `systempaths=unconfined`: For disabling the masks for the `/proc` mount in the container, so that each `ExecOp` (corresponds to a `RUN` instruction in Dockerfile) can have a dedicated `/proc` filesystem. `systempaths=unconfined` potentially allows reading and writing dangerous kernel files from a container, but it is safe when you are running `buildkitd` as non-root.
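Since `apparmor=unconfined` is only needed on hosts that use AppArmor, the flag list can be assembled conditionally. The following is a sketch, not part of BuildKit: `rootless_security_opts` is a hypothetical helper, and reading `/sys/module/apparmor/parameters/enabled` is assumed here as the way to detect AppArmor on the host.

```shell
#!/bin/sh
# Hypothetical helper: print the --security-opt flags justified above.
# $1 is "Y" when the host uses AppArmor, anything else otherwise.
rootless_security_opts() {
    printf '%s' '--security-opt seccomp=unconfined'
    if [ "$1" = "Y" ]; then
        printf ' %s' '--security-opt apparmor=unconfined'
    fi
    printf ' %s\n' '--security-opt systempaths=unconfined'
}

# Assumption: on Linux hosts with AppArmor loaded, this file contains "Y".
apparmor=$(cat /sys/module/apparmor/parameters/enabled 2>/dev/null || echo N)

# Only prints the resulting command; nothing is started.
echo "docker run --name buildkitd -d $(rootless_security_opts "$apparmor") moby/buildkit:rootless"
```

Keeping `apparmor=unconfined` unconditionally, as in the command above, is the simpler choice; the conditional form only makes explicit which flag depends on the host.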
Tip
Instead of `--security-opt systempaths=unconfined`, `buildkitd` can also be executed with `--oci-worker-no-process-sandbox` (a flag of `buildkitd`, not `docker`) to avoid creating a new PID namespace and mounting a new `/proc` for it.
Using `--oci-worker-no-process-sandbox` is discouraged, as it cannot terminate processes that did not exit during an `ExecOp`. Also, `--oci-worker-no-process-sandbox` allows `ExecOp` containers to `kill` (and potentially `ptrace`, depending on the seccomp configuration) an arbitrary process in the BuildKit daemon container.
Despite these caveats, the Kubernetes examples use `--oci-worker-no-process-sandbox`, as Kubernetes lacks the equivalent of `systempaths=unconfined`. (`securityContext.procMount=Unmasked` is similar, but different in the sense that it depends on `hostUsers: false`.)
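Because `--oci-worker-no-process-sandbox` is a `buildkitd` flag, it goes after the image name rather than among the `docker run` options. The sketch below only prints the two command lines for comparison; `buildkitd_run_cmd` is a hypothetical helper, and it assumes that the image's entrypoint forwards trailing arguments to `buildkitd`.

```shell
#!/bin/sh
# Hypothetical helper that only prints a command line; it runs nothing.
# Mode "systempaths" (default) keeps the process sandbox;
# mode "no-process-sandbox" is the discouraged fallback.
buildkitd_run_cmd() {
    base='docker run --name buildkitd -d --security-opt seccomp=unconfined --security-opt apparmor=unconfined'
    case "${1:-systempaths}" in
        systempaths)
            echo "$base --security-opt systempaths=unconfined moby/buildkit:rootless" ;;
        no-process-sandbox)
            # Assumption: arguments after the image name reach buildkitd, not docker.
            echo "$base moby/buildkit:rootless --oci-worker-no-process-sandbox" ;;
        *)
            echo "unknown mode: $1" >&2
            return 1 ;;
    esac
}

buildkitd_run_cmd no-process-sandbox
```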
Change UID/GID
The `moby/buildkit:rootless` image has the following UID/GID configuration:
| Actual ID (shown in the host and the BuildKit daemon container) | Mapped ID (shown in build executor containers) |
|---|---|
| 1000 | 0 |
| 100000 | 1 |
| ... | ... |
| 165535 | 65536 |
$ docker exec buildkitd id
uid=1000(user) gid=1000(user)
$ docker exec buildkitd ps aux
PID USER TIME COMMAND
1 user 0:00 rootlesskit buildkitd --addr tcp://0.0.0.0:1234
13 user 0:00 /proc/self/exe buildkitd --addr tcp://0.0.0.0:1234
21 user 0:00 buildkitd --addr tcp://0.0.0.0:1234
29 user 0:00 ps aux
$ docker exec buildkitd cat /etc/subuid
user:100000:65536
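The mapping in the table follows from the daemon UID and the `/etc/subuid` entry: mapped ID 0 is the daemon user, and mapped IDs 1 to 65536 are taken in order from the subordinate range. A worked sketch of that arithmetic, where `actual_id` is a hypothetical helper and the defaults are the values shown above:

```shell
#!/bin/sh
# Hypothetical helper: translate an ID seen inside a build executor container
# into the actual ID on the host.
# Defaults are the values shown above: daemon UID 1000 and the
# /etc/subuid entry user:100000:65536.
actual_id() {
    mapped=$1
    daemon_uid=${2:-1000}
    subuid_start=${3:-100000}
    subuid_count=${4:-65536}
    if [ "$mapped" -eq 0 ]; then
        # Root inside the executor is the unprivileged daemon user outside.
        echo "$daemon_uid"
    elif [ "$mapped" -ge 1 ] && [ "$mapped" -le "$subuid_count" ]; then
        # Mapped ID n corresponds to the n-th ID of the subordinate range.
        echo $((subuid_start + mapped - 1))
    else
        echo "unmapped ID: $mapped" >&2
        return 1
    fi
}

actual_id 0      # prints 1000
actual_id 1      # prints 100000
actual_id 65536  # prints 165535
```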
To change the UID/GID configuration, you need to modify and build the BuildKit image manually.
vi Dockerfile
make images
docker run ... moby/buildkit:local-rootless ...
Troubleshooting
Error related to overlayfs
Try running `buildkitd` with `--oci-worker-snapshotter=fuse-overlayfs`:
rootlesskit buildkitd --oci-worker-snapshotter=fuse-overlayfs
Error related to fuse-overlayfs
Run `docker run` with `--device /dev/fuse`.
Also try running `buildkitd` with `--oci-worker-snapshotter=native`:
rootlesskit buildkitd --oci-worker-snapshotter=native
Error related to `newuidmap` or `/etc/subuid`
See https://rootlesscontaine.rs/getting-started/common/subuid/
Error Options:[rbind ro]}]: operation not permitted
Make sure to mount an `emptyDir` volume on `/home/user/.local/share/buildkit`.
Error `fork/exec /proc/self/exe: no space left on device` with `level=warning msg="/proc/sys/user/max_user_namespaces needs to be set to non-zero."`
Run `sysctl -w user.max_user_namespaces=N` (N=positive integer, like 63359) on the host nodes.
See `../examples/kubernetes/sysctl-userns.privileged.yaml`.
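Before starting the daemon, the limit can be checked on the host. This is a minimal preflight sketch: `check_max_user_namespaces` is a hypothetical helper, and the optional file argument exists only so the check can be exercised against a file other than the real procfs entry.

```shell
#!/bin/sh
# Hypothetical preflight check. Reads the user namespace limit from the given
# file (default: the real procfs path) and fails when it is zero or unreadable.
check_max_user_namespaces() {
    file=${1:-/proc/sys/user/max_user_namespaces}
    value=$(cat "$file" 2>/dev/null) || value=0
    if [ "${value:-0}" -gt 0 ]; then
        echo "ok: user.max_user_namespaces=$value"
    else
        echo "user.max_user_namespaces is ${value:-0}; run: sysctl -w user.max_user_namespaces=N" >&2
        return 1
    fi
}

# Do not abort the calling script; just report.
check_max_user_namespaces || true
```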
Error `fork/exec /proc/self/exe: permission denied` with `This error might have happened because /proc/sys/kernel/apparmor_restrict_unprivileged_userns is set to 1`
Add `kernel.apparmor_restrict_unprivileged_userns=0` to `/etc/sysctl.conf` (or `/etc/sysctl.d`) and run `sudo sysctl -p`.
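Whether this restriction is active can be read from the file named in the error message. The sketch below is an assumption-labelled check rather than anything BuildKit provides: `userns_restricted_by_apparmor` is a hypothetical helper, and a missing file is treated as "not restricted" on the assumption that the kernel then has no such setting.

```shell
#!/bin/sh
# Hypothetical check: succeeds (exit 0) when the AppArmor restriction on
# unprivileged user namespaces is active, i.e. the file contains 1.
# Assumption: a missing file means the kernel has no such setting.
userns_restricted_by_apparmor() {
    file=${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}
    [ -r "$file" ] && [ "$(cat "$file")" = "1" ]
}

if userns_restricted_by_apparmor; then
    echo "restricted: set kernel.apparmor_restrict_unprivileged_userns=0"
else
    echo "not restricted"
fi
```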
Error mount proc:/proc (via /proc/self/fd/6), flags: 0xe: operation not permitted
This error is known to happen when BuildKit is executed in a container without the `--security-opt systempaths=unconfined` flag.
Make sure to specify it (see above).
Distribution-specific hint
Using Ubuntu kernel is recommended.
Ubuntu, 24.04 or later
Add `kernel.apparmor_restrict_unprivileged_userns=0` to `/etc/sysctl.conf` (or `/etc/sysctl.d`) and run `sudo sysctl -p`.
Container-Optimized OS from Google
Make sure to have an `emptyDir` volume below:
spec:
  containers:
    - name: buildkitd
      volumeMounts:
        # Dockerfile has `VOLUME /home/user/.local/share/buildkit` by default too,
        # but the default VOLUME does not work with rootless on Google's Container-Optimized OS
        # as it is mounted with `nosuid,nodev`.
        # https://github.com/moby/buildkit/issues/879#issuecomment-1240347038
        - mountPath: /home/user/.local/share/buildkit
          name: buildkitd
  volumes:
    - name: buildkitd
      emptyDir: {}
See also the example manifests.
Bottlerocket OS
The maximum number of user namespaces needs to be set to a positive integer through the API settings:
[settings.kernel.sysctl]
"user.max_user_namespaces" = "16384"
See `../examples/eksctl/bottlerocket.yaml` for an example to configure a Node Group in EKS.
Old distributions
Debian GNU/Linux 10
Add `kernel.unprivileged_userns_clone=1` to `/etc/sysctl.conf` (or `/etc/sysctl.d`) and run `sudo sysctl -p`.
This step is not needed for Debian GNU/Linux 11 and later.
RHEL/CentOS 7
Add `user.max_user_namespaces=28633` to `/etc/sysctl.conf` (or `/etc/sysctl.d`) and run `sudo sysctl -p`.
This step is not needed for RHEL/CentOS 8 and later.
Fedora, before kernel 5.13
You may have to disable SELinux, or run BuildKit with `--oci-worker-snapshotter=fuse-overlayfs`.