From 3a91b50be17b36ecf3ce27ba730d025378f31071 Mon Sep 17 00:00:00 2001
From: Akihiro Suda
Date: Fri, 21 Feb 2025 14:50:24 +0900
Subject: [PATCH] rootless: update docs and examples

Fix issue 5763

- Discourage `--oci-worker-no-process-sandbox`, due to the leakage of the
  processes (by design). Instead, encourage setting `systempaths=unconfined`
  in `docker run`. This corresponds to `securityContext.procMount: Unmasked`
  in Kubernetes; however, this is hard to configure on Kubernetes, as it has
  to be used in conjunction with `hostUsers: false`.
- Remove `--device /dev/fuse`, as fuse-overlayfs is no longer used typically.
- Use the new Kubernetes struct for AppArmor
- Add a hint about `kernel.apparmor_restrict_unprivileged_userns`
- Remove `$` from command snippets for ease of copy-pasting
- Make `job.*.yaml` more practical
- Add `*.userns.yaml`. Needs `UserNamespacesSupport` feature gate to be enabled.

Signed-off-by: Akihiro Suda
---
 docs/rootless.md | 113 +++++++++++-------
 examples/kubernetes/README.md | 56 +++++----
 .../deployment+service.rootless.yaml | 5 +-
 .../kubernetes/deployment+service.userns.yaml | 77 ++++++++++++
 examples/kubernetes/job.privileged.yaml | 4 +-
 examples/kubernetes/job.rootless.yaml | 10 +-
 examples/kubernetes/job.userns.yaml | 47 ++++++++
 examples/kubernetes/pod.rootless.yaml | 5 +-
 examples/kubernetes/pod.userns.yaml | 29 +++++
 examples/kubernetes/statefulset.rootless.yaml | 5 +-
 examples/kubernetes/statefulset.userns.yaml | 42 +++++++
 11 files changed, 318 insertions(+), 75 deletions(-)
 create mode 100644 examples/kubernetes/deployment+service.userns.yaml
 create mode 100644 examples/kubernetes/job.userns.yaml
 create mode 100644 examples/kubernetes/pod.userns.yaml
 create mode 100644 examples/kubernetes/statefulset.userns.yaml

diff --git a/docs/rootless.md b/docs/rootless.md
index b2a05883e..aafbdf52d 100644
--- a/docs/rootless.md
+++ b/docs/rootless.md
@@ -12,18 +12,19 @@ Rootless mode allows running BuildKit daemon as a non-root user.
[RootlessKit](https://github.com/rootless-containers/rootlesskit/) needs to be installed. -```console -$ rootlesskit buildkitd +```bash +rootlesskit buildkitd ``` -```console -$ buildctl --addr unix:///run/user/$UID/buildkit/buildkitd.sock build ... +```bash +buildctl --addr unix:///run/user/$UID/buildkit/buildkitd.sock build ... ``` -To isolate BuildKit daemon's network namespace from the host (recommended): -```console -$ rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback buildkitd -``` +> [!TIP] +> To isolate BuildKit daemon's network namespace from the host (recommended): +> ```bash +> rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback buildkitd +> ``` ## Running BuildKit in Rootless mode (containerd worker) @@ -31,15 +32,28 @@ $ rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback buildkitd Run containerd in rootless mode using rootlesskit following [containerd's document](https://github.com/containerd/containerd/blob/main/docs/rootless.md). -``` -$ containerd-rootless.sh +```bash +containerd-rootless.sh + +CONTAINERD_NAMESPACE=default containerd-rootless-setuptool.sh install-buildkit-containerd ``` -Then let buildkitd join the same namespace as containerd. +
+Advanced guide +

+ + +Alternatively, you can specify the full command line flags as follows: +```bash +containerd-rootless.sh --config /path/to/config.toml + +containerd-rootless-setuptool.sh nsenter -- buildkitd --oci-worker=false --containerd-worker=true ``` -$ containerd-rootless-setuptool.sh nsenter -- buildkitd --oci-worker=false --containerd-worker=true --containerd-worker-snapshotter=native -``` + +

+ +
## Containerized deployment @@ -48,36 +62,45 @@ See [`../examples/kubernetes`](../examples/kubernetes). ### Docker -```console -$ docker run \ +```bash +docker run \ --name buildkitd \ -d \ --security-opt seccomp=unconfined \ --security-opt apparmor=unconfined \ - --device /dev/fuse \ - moby/buildkit:rootless --oci-worker-no-process-sandbox -$ buildctl --addr docker-container://buildkitd build ... + --security-opt systempaths=unconfined \ + moby/buildkit:rootless + +buildctl --addr docker-container://buildkitd build ... ``` -If you don't mind using `--privileged` (almost safe for rootless), the `docker run` flags can be shorten as follows: +> [!TIP] +> If you don't mind using `--privileged` (almost safe for rootless), the `docker run` flags can be shortened as follows: +> +> ```bash +> docker run --name buildkitd -d --privileged moby/buildkit:rootless +> ``` -```console -$ docker run --name buildkitd -d --privileged moby/buildkit:rootless -``` +Justification of the `--security-opt` flags: -#### About `--device /dev/fuse` -Adding `--device /dev/fuse` to the `docker run` arguments is required only if you want to use `fuse-overlayfs` snapshotter. +* `seccomp=unconfined`: For allowing several syscalls such as `unshare` (used by runc) and `mount` (used by snapshotters, etc). -#### About `--oci-worker-no-process-sandbox` +* `apparmor=unconfined`: For allowing mounting filesystems, etc. + This flag is not needed when the host operating system does not use AppArmor. -By adding `--oci-worker-no-process-sandbox` to the `buildkitd` arguments, BuildKit can be executed in a container without adding `--privileged` to `docker run` arguments. -However, you still need to pass `--security-opt seccomp=unconfined --security-opt apparmor=unconfined` to `docker run`. +* `systempaths=unconfined`: For disabling the masks for the `/proc` mount in the container, so that each `ExecOp` + (which corresponds to a `RUN` instruction in a Dockerfile) can have a dedicated `/proc` filesystem. 
+ `systempaths=unconfined` potentially allows reading and writing dangerous kernel files from a container, but it is safe when you are running `buildkitd` as non-root. -Note that `--oci-worker-no-process-sandbox` allows build executor containers to `kill` (and potentially `ptrace` depending on the seccomp configuration) an arbitrary process in the BuildKit daemon container. - -To allow running rootless `buildkitd` without `--oci-worker-no-process-sandbox`, run `docker run` with `--security-opt systempaths=unconfined`. (For Kubernetes, set `securityContext.procMount` to `Unmasked`.) - -The `--security-opt systempaths=unconfined` flag disables the masks for the `/proc` mount in the container and potentially allows reading and writing dangerous kernel files, but it is safe when you are running `buildkitd` as non-root. +> [!TIP] +> Instead of `--security-opt systempaths=unconfined`, `buildkitd` can also be executed with `--oci-worker-no-process-sandbox` (flag of `buildkitd`, not `docker`) +> to avoid creating a new PID namespace and mounting a new `/proc` for it. +> +> Using `--oci-worker-no-process-sandbox` is discouraged, as it cannot terminate processes that did not exit during an `ExecOp`. +> Also, `--oci-worker-no-process-sandbox` allows `ExecOp` containers to `kill` (and potentially `ptrace` depending on the seccomp configuration) an arbitrary process in the BuildKit daemon container. +> +> Despite these caveats, the [Kubernetes examples](../examples/kubernetes) use `--oci-worker-no-process-sandbox`, as Kubernetes lacks the equivalent of `systempaths=unconfined`. +> (`securityContext.procMount=Unmasked` is similar, but differs in that it depends on `hostUsers: false`.) ### Change UID/GID @@ -90,7 +113,7 @@ Actual ID (shown in the host and the BuildKit daemon container)| Mapped ID (show ... | ... 
165535 | 65536 -``` +```console $ docker exec buildkitd id uid=1000(user) gid=1000(user) $ docker exec buildkitd ps aux @@ -99,15 +122,16 @@ PID USER TIME COMMAND 13 user 0:00 /proc/self/exe buildkitd --addr tcp://0.0.0.0:1234 21 user 0:00 buildkitd --addr tcp://0.0.0.0:1234 29 user 0:00 ps aux + $ docker exec cat /etc/subuid user:100000:65536 ``` To change the UID/GID configuration, you need to modify and build the BuildKit image manually. -``` -$ vi Dockerfile -$ make images -$ docker run ... moby/buildkit:local-rootless ... +```bash +vi Dockerfile +make images +docker run ... moby/buildkit:local-rootless ... ``` ## Troubleshooting @@ -120,7 +144,9 @@ $ rootlesskit buildkitd --oci-worker-snapshotter=fuse-overlayfs ``` ### Error related to `fuse-overlayfs` -Try running `buildkitd` with `--oci-worker-snapshotter=native`: +Run `docker run` with `--device /dev/fuse`. + +Also try running `buildkitd` with `--oci-worker-snapshotter=native`: ```console $ rootlesskit buildkitd --oci-worker-snapshotter=native @@ -137,12 +163,19 @@ Run `sysctl -w user.max_user_namespaces=N` (N=positive integer, like 63359) on t See [`../examples/kubernetes/sysctl-userns.privileged.yaml`](../examples/kubernetes/sysctl-userns.privileged.yaml). +### Error `fork/exec /proc/self/exe: permission denied` with `This error might have happened because /proc/sys/kernel/apparmor_restrict_unprivileged_userns is set to 1` +Add `kernel.apparmor_restrict_unprivileged_userns=0` to `/etc/sysctl.conf` (or a file under `/etc/sysctl.d`) and run `sudo sysctl -p` (`sudo sysctl --system` for files under `/etc/sysctl.d`). + ### Error `mount proc:/proc (via /proc/self/fd/6), flags: 0xe: operation not permitted` -This error is known to happen when BuildKit is executed in a container without the `--oci-worker-no-sandbox` flag. -Make sure that `--oci-worker-no-process-sandbox` is specified (See [below](#docker)). +This error is known to happen when BuildKit is executed in a container without the `--security-opt systempaths=unconfined` flag. 
+Make sure to specify it (See [above](#docker)). ## Distribution-specific hint Using Ubuntu kernel is recommended. + +### Ubuntu, 24.04 or later +Add `kernel.apparmor_restrict_unprivileged_userns=0` to `/etc/sysctl.conf` (or a file under `/etc/sysctl.d`) and run `sudo sysctl -p` (`sudo sysctl --system` for files under `/etc/sysctl.d`). + ### Container-Optimized OS from Google Make sure to have an `emptyDir` volume below: ```yaml diff --git a/examples/kubernetes/README.md b/examples/kubernetes/README.md index c8973dc56..4d263d16c 100644 --- a/examples/kubernetes/README.md +++ b/examples/kubernetes/README.md @@ -6,16 +6,26 @@ This directory contains Kubernetes manifests for `Pod`, `Deployment` (with `Serv * `StateFulset`: good for client-side load balancing, without registry-side cache * `Job`: good if you don't want to have daemon pods -Using Rootless mode (`*.rootless.yaml`) is recommended because Rootless mode image is executed as non-root user (UID 1000) and doesn't need `securityContext.privileged`. -See [`../../docs/rootless.md`](../../docs/rootless.md). +## Variants -See also ["Building Images Efficiently And Securely On Kubernetes With BuildKit" (KubeCon EU 2019)](https://kccnceu19.sched.com/event/MPX5). +- `*.privileged.yaml`: Launches the Pod as the fully privileged root user. +- `*.rootless.yaml`: Launches the Pod as a non-root user, whose UID is 1000. +- `*.userns.yaml`: Launches the Pod as a non-root user. The UID is determined by kubelet. + Needs kubelet and kube-apiserver to be reconfigured to enable the + [`UserNamespacesSupport`](https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/) feature gate. + +It is recommended to use `*.rootless.yaml` to minimize the chance of container breakout attacks. + +See also: +- [`../../docs/rootless.md`](../../docs/rootless.md). +- ["Building Images Efficiently And Securely On Kubernetes With BuildKit" (KubeCon EU 2019)](https://kccnceu19.sched.com/event/MPX5). 
## `Pod` -```console -$ kubectl apply -f pod.rootless.yaml -$ buildctl \ +```bash +kubectl apply -f pod.rootless.yaml + +buildctl \ --addr kube-pod://buildkitd \ build --frontend dockerfile.v0 --local context=/path/to/dir --local dockerfile=/path/to/dir ``` @@ -29,25 +39,27 @@ If rootless mode doesn't work, try `pod.privileged.yaml`. Setting up mTLS is highly recommended. `./create-certs.sh SAN [SAN...]` can be used for creating certificates. -```console -$ ./create-certs.sh 127.0.0.1 +```bash +./create-certs.sh 127.0.0.1 ``` The daemon certificates is created as `Secret` manifest named `buildkit-daemon-certs`. -```console -$ kubectl apply -f .certs/buildkit-daemon-certs.yaml +```bash +kubectl apply -f .certs/buildkit-daemon-certs.yaml ``` Apply the `Deployment` and `Service` manifest: -```console -$ kubectl apply -f deployment+service.rootless.yaml -$ kubectl scale --replicas=10 deployment/buildkitd +```bash +kubectl apply -f deployment+service.rootless.yaml + +kubectl scale --replicas=10 deployment/buildkitd ``` Run `buildctl` with TLS client certificates: -```console -$ kubectl port-forward service/buildkitd 1234 -$ buildctl \ +```bash +kubectl port-forward service/buildkitd 1234 + +buildctl \ --addr tcp://127.0.0.1:1234 \ --tlscacert .certs/client/ca.pem \ --tlscert .certs/client/cert.pem \ @@ -58,10 +70,10 @@ $ buildctl \ ## `StatefulSet` `StatefulSet` is useful for consistent hash mode. -```console -$ kubectl apply -f statefulset.rootless.yaml -$ kubectl scale --replicas=10 statefulset/buildkitd -$ buildctl \ +```bash +kubectl apply -f statefulset.rootless.yaml +kubectl scale --replicas=10 statefulset/buildkitd +buildctl \ --addr kube-pod://buildkitd-4 \ build --frontend dockerfile.v0 --local context=/path/to/dir --local dockerfile=/path/to/dir ``` @@ -70,8 +82,8 @@ See [`./consistenthash`](./consistenthash) for how to use consistent hashing. 
## `Job` -```console -$ kubectl apply -f job.rootless.yaml +```bash +kubectl apply -f job.rootless.yaml ``` To push the image to the registry, you also need to mount `~/.docker/config.json` diff --git a/examples/kubernetes/deployment+service.rootless.yaml b/examples/kubernetes/deployment+service.rootless.yaml index 0b554096f..c82ff9820 100644 --- a/examples/kubernetes/deployment+service.rootless.yaml +++ b/examples/kubernetes/deployment+service.rootless.yaml @@ -13,8 +13,6 @@ spec: metadata: labels: app: buildkitd - annotations: - container.apparmor.security.beta.kubernetes.io/buildkitd: unconfined # see buildkit/docs/rootless.md for caveats of rootless mode spec: containers: @@ -54,6 +52,9 @@ spec: # Needs Kubernetes >= 1.19 seccompProfile: type: Unconfined + # Needs Kubernetes >= 1.30 + appArmorProfile: + type: Unconfined # To change UID/GID, you need to rebuild the image runAsUser: 1000 runAsGroup: 1000 diff --git a/examples/kubernetes/deployment+service.userns.yaml b/examples/kubernetes/deployment+service.userns.yaml new file mode 100644 index 000000000..acf19937e --- /dev/null +++ b/examples/kubernetes/deployment+service.userns.yaml @@ -0,0 +1,77 @@ +# Depends on feature gate UserNamespacesSupport +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: buildkitd + name: buildkitd +spec: + replicas: 1 + selector: + matchLabels: + app: buildkitd + template: + metadata: + labels: + app: buildkitd + spec: + hostUsers: false + containers: + - name: buildkitd + image: moby/buildkit:master + args: + - --addr + - unix:///run/buildkit/buildkitd.sock + - --addr + - tcp://0.0.0.0:1234 + - --tlscacert + - /certs/ca.pem + - --tlscert + - /certs/cert.pem + - --tlskey + - /certs/key.pem + # the probe below will only work after Release v0.6.3 + readinessProbe: + exec: + command: + - buildctl + - debug + - workers + initialDelaySeconds: 5 + periodSeconds: 30 + # the probe below will only work after Release v0.6.3 + livenessProbe: + exec: + command: + - buildctl + - 
debug + - workers + initialDelaySeconds: 5 + periodSeconds: 30 + securityContext: + # Not really privileged + privileged: true + ports: + - containerPort: 1234 + volumeMounts: + - name: certs + readOnly: true + mountPath: /certs + volumes: + # buildkit-daemon-certs must contain ca.pem, cert.pem, and key.pem + - name: certs + secret: + secretName: buildkit-daemon-certs +--- +apiVersion: v1 +kind: Service +metadata: + labels: + app: buildkitd + name: buildkitd +spec: + ports: + - port: 1234 + protocol: TCP + selector: + app: buildkitd diff --git a/examples/kubernetes/job.privileged.yaml b/examples/kubernetes/job.privileged.yaml index 352180efa..472fc0d35 100644 --- a/examples/kubernetes/job.privileged.yaml +++ b/examples/kubernetes/job.privileged.yaml @@ -8,11 +8,11 @@ spec: restartPolicy: Never initContainers: - name: prepare - image: alpine:3.10 + image: busybox command: - sh - -c - - "echo FROM hello-world > /workspace/Dockerfile" + - "echo -e 'FROM alpine\nRUN apk add gcc\n' > /workspace/Dockerfile" volumeMounts: - name: workspace mountPath: /workspace diff --git a/examples/kubernetes/job.rootless.yaml b/examples/kubernetes/job.rootless.yaml index 06e608c6a..a3904f8d6 100644 --- a/examples/kubernetes/job.rootless.yaml +++ b/examples/kubernetes/job.rootless.yaml @@ -4,19 +4,16 @@ metadata: name: buildkit spec: template: - metadata: - annotations: - container.apparmor.security.beta.kubernetes.io/buildkit: unconfined # see buildkit/docs/rootless.md for caveats of rootless mode spec: restartPolicy: Never initContainers: - name: prepare - image: alpine:3.10 + image: busybox command: - sh - -c - - "echo FROM hello-world > /workspace/Dockerfile" + - "echo -e 'FROM alpine\nRUN apk add gcc\n' > /workspace/Dockerfile" securityContext: runAsUser: 1000 runAsGroup: 1000 @@ -45,6 +42,9 @@ spec: # Needs Kubernetes >= 1.19 seccompProfile: type: Unconfined + # Needs Kubernetes >= 1.30 + appArmorProfile: + type: Unconfined # To change UID/GID, you need to rebuild the image 
runAsUser: 1000 runAsGroup: 1000 diff --git a/examples/kubernetes/job.userns.yaml b/examples/kubernetes/job.userns.yaml new file mode 100644 index 000000000..9305bce14 --- /dev/null +++ b/examples/kubernetes/job.userns.yaml @@ -0,0 +1,47 @@ +# Depends on feature gate UserNamespacesSupport +apiVersion: batch/v1 +kind: Job +metadata: + name: buildkit +spec: + template: + spec: + hostUsers: false + restartPolicy: Never + initContainers: + - name: prepare + image: busybox + command: + - sh + - -c + - "echo -e 'FROM alpine\nRUN apk add gcc\n' > /workspace/Dockerfile" + volumeMounts: + - name: workspace + mountPath: /workspace + containers: + - name: buildkit + image: moby/buildkit:master + command: + - buildctl-daemonless.sh + args: + - build + - --frontend + - dockerfile.v0 + - --local + - context=/workspace + - --local + - dockerfile=/workspace + # To push the image to a registry, add + # `--output type=image,name=docker.io/username/image,push=true` + securityContext: + # Not really privileged + privileged: true + volumeMounts: + - name: workspace + readOnly: true + mountPath: /workspace + # To push the image, you also need to create `~/.docker/config.json` secret + # and set $DOCKER_CONFIG to `/path/to/.docker` directory. 
+ volumes: + - name: workspace + emptyDir: {} diff --git a/examples/kubernetes/pod.rootless.yaml b/examples/kubernetes/pod.rootless.yaml index 130ea4363..4f9864594 100644 --- a/examples/kubernetes/pod.rootless.yaml +++ b/examples/kubernetes/pod.rootless.yaml @@ -2,8 +2,6 @@ apiVersion: v1 kind: Pod metadata: name: buildkitd - annotations: - container.apparmor.security.beta.kubernetes.io/buildkitd: unconfined # see buildkit/docs/rootless.md for caveats of rootless mode spec: containers: @@ -31,6 +29,9 @@ spec: # Needs Kubernetes >= 1.19 seccompProfile: type: Unconfined + # Needs Kubernetes >= 1.30 + appArmorProfile: + type: Unconfined # To change UID/GID, you need to rebuild the image runAsUser: 1000 runAsGroup: 1000 diff --git a/examples/kubernetes/pod.userns.yaml b/examples/kubernetes/pod.userns.yaml new file mode 100644 index 000000000..085c3cde0 --- /dev/null +++ b/examples/kubernetes/pod.userns.yaml @@ -0,0 +1,29 @@ +# Depends on feature gate UserNamespacesSupport +apiVersion: v1 +kind: Pod +metadata: + name: buildkitd +spec: + hostUsers: false + containers: + - name: buildkitd + image: moby/buildkit:master + readinessProbe: + exec: + command: + - buildctl + - debug + - workers + initialDelaySeconds: 5 + periodSeconds: 30 + livenessProbe: + exec: + command: + - buildctl + - debug + - workers + initialDelaySeconds: 5 + periodSeconds: 30 + securityContext: + # Not really privileged + privileged: true diff --git a/examples/kubernetes/statefulset.rootless.yaml b/examples/kubernetes/statefulset.rootless.yaml index 0533d2a10..caf7dde3c 100644 --- a/examples/kubernetes/statefulset.rootless.yaml +++ b/examples/kubernetes/statefulset.rootless.yaml @@ -15,8 +15,6 @@ spec: metadata: labels: app: buildkitd - annotations: - container.apparmor.security.beta.kubernetes.io/buildkitd: unconfined # see buildkit/docs/rootless.md for caveats of rootless mode spec: containers: @@ -44,6 +42,9 @@ spec: # Needs Kubernetes >= 1.19 seccompProfile: type: Unconfined + # Needs Kubernetes 
>= 1.30 + appArmorProfile: + type: Unconfined # To change UID/GID, you need to rebuild the image runAsUser: 1000 runAsGroup: 1000 diff --git a/examples/kubernetes/statefulset.userns.yaml b/examples/kubernetes/statefulset.userns.yaml new file mode 100644 index 000000000..98af0aad9 --- /dev/null +++ b/examples/kubernetes/statefulset.userns.yaml @@ -0,0 +1,42 @@ +# Depends on feature gate UserNamespacesSupport +apiVersion: apps/v1 +kind: StatefulSet +metadata: + labels: + app: buildkitd + name: buildkitd +spec: + serviceName: buildkitd + podManagementPolicy: Parallel + replicas: 1 + selector: + matchLabels: + app: buildkitd + template: + metadata: + labels: + app: buildkitd + spec: + hostUsers: false + containers: + - name: buildkitd + image: moby/buildkit:master + readinessProbe: + exec: + command: + - buildctl + - debug + - workers + initialDelaySeconds: 5 + periodSeconds: 30 + livenessProbe: + exec: + command: + - buildctl + - debug + - workers + initialDelaySeconds: 5 + periodSeconds: 30 + securityContext: + # Not really privileged + privileged: true
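
The troubleshooting entry this patch adds to `docs/rootless.md` refers to `/proc/sys/kernel/apparmor_restrict_unprivileged_userns`. The current value can be inspected before changing any sysctl configuration. The following is only a minimal sketch, assuming a POSIX shell; the function name `userns_restricted` is hypothetical and is not part of BuildKit or RootlessKit.

```shell
#!/bin/sh
# Hypothetical helper (not part of BuildKit): succeeds when the AppArmor
# restriction on unprivileged user namespaces is switched on.
userns_restricted() {
    # $1: file to read; defaults to the kernel path quoted in docs/rootless.md
    f="${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}"
    # If the file is missing, the kernel does not expose this setting,
    # so treat the restriction as inactive.
    [ -r "$f" ] || return 1
    [ "$(cat "$f")" = "1" ]
}

if userns_restricted; then
    echo "restricted: rootless buildkitd may fail with 'permission denied'"
else
    echo "not restricted"
fi
```

The optional path argument is there only so the check can be exercised against a scratch file; on a real host the function would be called without arguments.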