1
0
mirror of https://github.com/opencontainers/runc.git synced 2025-08-01 05:06:52 +03:00
Commit Graph

1574 Commits

Author SHA1 Message Date
1cd71dfd71 systemd properties: support for *Sec values
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.

This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUsec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.

This turned out a bit more tricky to implement when I was
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
4c5c3fb960 Support for setting systemd properties via annotations
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).

This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.

Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.

Example usage: add the following to runtime spec (config.json):

```
	"annotations": {
		"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
		"org.systemd.property.CollectMode":"'inactive-or-failed'"
	},
```

and start the container (e.g. `runc --systemd-cgroup run $ID`).

The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".

The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.

In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.

NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.

[1] https://developer.gnome.org/glib/stable/gvariant-text.html

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
81ef5024f8 Merge pull request #2213 from Zyqsempai/2166-convert-cpu-weight-poperly
Added conversion for cpu.weight v2
2020-02-17 07:49:39 -08:00
7c439cc6f6 Added conversion for cpu.weight v2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-12 11:32:34 +02:00
269ea385a4 restore: fix a race condition in process.Wait()
Adrian reported that the checkpoint test stated failing:
=== RUN   TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
    checkpoint_test.go:297: Did not restore the pipe correctly:

The problem here is when we start exec.Cmd, we don't call its wait
method. This means that we don't wait cmd.goroutines ans so we don't
know when all data will be read from process pipes.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2020-02-10 10:21:08 -08:00
3b992087b8 Fix skip message for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-03 14:27:12 +02:00
2fc03cc11c Merge pull request #2207 from cyphar/fix-double-volume-attack
rootfs: do not permit /proc mounts to non-directories
2020-01-22 08:06:10 -08:00
3291d66b98 rootfs: do not permit /proc mounts to non-directories
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).

This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypased by someone having "/" be a volume controlled
by another container.

Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-01-17 14:00:30 +11:00
f6fb7a0338 merge branch 'pr-2133'
Julia Nedialkova (1):
  Handle ENODEV when accessing the freezer.state file

LGTMs: @crosbymichael @cyphar
Closes #2133
2020-01-17 02:07:19 +11:00
5b96f314ba Exchanged deprecated systemd resources with the appropriate for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 18:09:33 +02:00
cf9b7c33e1 Fix MAJ:MIN io.stat parsing order
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 14:39:14 +02:00
55f8c254be temporarily disable CRIU tests
Ubuntu kernel is temporarily broken: https://github.com/opencontainers/runc/pull/2198#issuecomment-571124087

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-14 11:18:44 +09:00
5c20ea1472 fix merging #2177 and #2169
A new method was added to the cgroup interface when #2177 was merged.

After #2177 got merged, #2169 was merged without rebase (sorry!) and compilation was failing:

  libcontainer/cgroups/fs2/fs2.go:208:22: container.Cgroup undefined (type *configs.Config has no field or method Cgroup)

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-14 11:13:25 +09:00
5cc0deaf7a Merge pull request #2169 from AkihiroSuda/split-fs
cgroup2: split fs2 from fs
2020-01-13 16:23:27 -08:00
2b52db7527 Merge pull request #2177 from devimc/topic/libcontainer/kata-containers
libcontainer: export and add new methods to allow cgroups manipulation
2020-01-02 11:47:12 -05:00
8541d9cf3d Fix race checking for process exit and waiting for exec fifo
Signed-off-by: Jordan Liggitt <liggitt@google.com>
2019-12-18 18:48:18 +00:00
8ddd892072 libcontainer: add method to get cgroup config from cgroup Manager
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add method to get cgroup configuration from cgroup Manager to allow API users
save it to disk and restore a cgroup manager later.

fixes #2176

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
cd7c59d042 libcontainer: export createCgroupConfig
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using directly libcontainer API.

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
7496a96825 merge branch 'pr-2086'
* Kurnia D Win (1):
  fix permission denied

LGTMs: @crosbymichael @cyphar
Closes #2086
2019-12-17 20:49:52 +11:00
201b063745 merge branch 'pr-2141'
Radostin Stoyanov (1):
  criu: Ensure other users cannot read c/r files

LGTMs: @crosbymichael @cyphar
Closes #2141
2019-12-07 09:32:58 +11:00
ec49f98d72 fs2: support legacy device spec (to pass CI)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:53:07 +09:00
88e8350de2 cgroup2: split fs2 from fs
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.

Inspired by containerd/cgroups#109

Fix #2157

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:42:10 +09:00
5e63695384 merge branch 'pr-2174'
Sascha Grunert (1):
  Expose network interfaces via runc events

LGTMs: @cyphar @mrunalp
Closes #2174
2019-12-06 13:07:44 +11:00
8bb10af481 Merge pull request #2165 from AkihiroSuda/travis-f31
.travis.yml: add Fedora 31 vagrant box (for cgroup2)
2019-12-05 16:26:51 -05:00
41a20b5852 Expose network interfaces via runc events
The libcontainer network statistics are unreachable without manually
creating a libcontainer instance. To retrieve them via the CLI interface
of runc, we now expose them as well.

Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2019-12-05 13:20:51 +01:00
faf1e44ea9 cgroup2: ebpf: increase RLIM_MEMLOCK to avoid BPF_PROG_LOAD error
Fix #2167

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-11-07 15:43:27 +09:00
46def4cc4c Merge pull request #2154 from jpeach/2008-remove-static-build-tag
Remove the static_build build tag.
2019-11-04 17:10:59 -08:00
ccd4436fc4 .travis.yml: add Fedora 31 vagrant box (for cgroup2)
As the baby step, only unit tests are executed.

Failing tests are currently skipped and will be fixed in follow-up PRs.

Fix #2124

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 16:53:01 +09:00
faf673ee45 cgroup2: port over eBPF device controller from crun
The implementation is based on https://github.com/containers/crun/blob/0.10.2/src/libcrun/ebpf.c

Although ebpf.c is originally licensed under LGPL-3.0-or-later, the author
Giuseppe Scrivano agreed to relicense the file in Apache License 2.0:
https://github.com/opencontainers/runc/issues/2144#issuecomment-543116397

See libcontainer/cgroups/ebpf/devicefilter/devicefilter_test.go for tested configurations.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 14:01:46 +09:00
e57a774066 Merge pull request #2149 from AkihiroSuda/cgroup2-ps
cgroup2: implement `runc ps`
2019-10-31 09:44:39 +08:00
d239ca8425 Merge pull request #2148 from AkihiroSuda/cg2-ignore-cpuset-when-no-config
cgroup2: cpuset_v2: skip Apply when no limit is specified
2019-10-29 21:57:58 +08:00
03cf145f5a Merge pull request #2159 from AkihiroSuda/cgroup2-mount-in-userns
cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
2019-10-28 19:19:09 -07:00
74a3fe5d1b cgroup2: do not parse /proc/cgroups
/proc/cgroups is meaningless for v2 and should be ignored.

https://github.com/torvalds/linux/blob/v5.3/Documentation/admin-guide/cgroup-v2.rst#deprecated-v1-core-features

* Now GetAllSubsystems() parses /sys/fs/cgroup/cgroup.controller, not /proc/cgroups.
  The function result also contains "pseudo" controllers: {"devices", "freezer"}.
  As it is hard to detect availability of pseudo controllers, pseudo controllers
  are always assumed to be available.

* Now IOGroupV2.Name() returns "io", not "blkio"

Fix #2155 #2156

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-28 00:00:33 +09:00
9c81440fb5 cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared,
because we cannot mount cgroup2.

This behavior correspond to crun v0.10.2.

Fix #2158

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-27 23:09:41 +09:00
13919f5dfd Remove the static_build build tag.
The `static_build` build tag was introduced in e9944d0f
to remove build warnings related to systemd cgroup driver
dependencies. Since then, those dependencies have changed and
building the systemd cgroup driver no longer imports dlopen.

After this change, runc builds will always include the systemd
cgroup driver.

This fixes #2008.

Signed-off-by: James Peach <jpeach@apache.org>
2019-10-26 08:28:45 +11:00
c4d8e1688c Merge pull request #2140 from crosbymichael/fs-unified
Set unified mountpoint in find mnt func
2019-10-24 15:20:47 -04:00
dbd771e475 cgroup2: implement runc ps
Implemented `runc ps` for cgroup v2 , using a newly added method `m.GetUnifiedPath()`.
Unlike the v1  implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 01:59:24 +09:00
d918e7f408 cpuset_v2: skip Apply when no limit is specified
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 00:33:31 +09:00
033936ef76 io_v2.go: remove blkio v1 code
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-18 21:33:48 +09:00
a610a84821 criu: Ensure other users cannot read c/r files
No checkpoint files should be readable by
anyone else but the user creating it.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-17 07:49:38 +01:00
b28f58f31b Set unified mountpoint in find mnt func
This is needed for the fsv2 cgroups to work when there is a unified mountpoint.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-10-15 15:40:03 -04:00
f017e0f9e1 checkpoint: Set descriptors.json file mode to 0600
Prevent unprivileged users from being able to read descriptors.json

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-12 19:29:44 +01:00
1b8a1eeec3 merge branch 'pr-2132'
Support different field counts of cpuaact.stats

LGTMs: @crosbymichael @cyphar
Closes #2132
2019-10-02 01:50:47 +10:00
d463f6485b *: verify that operations on /proc/... are on procfs
This is an additional mitigation for CVE-2019-16884. The primary problem
is that Docker can be coerced into bind-mounting a file system on top of
/proc (resulting in label-related writes to /proc no longer happening).

While we are working on mitigations against permitting the mounts, this
helps avoid our code from being tricked into writing to non-procfs
files. This is not a perfect solution (after all, there might be a
bind-mount of a different procfs file over the target) but in order to
exploit that you would need to be able to tweak a config.json pretty
specifically (which thankfully Docker doesn't allow).

Specifically this stops AppArmor from not labeling a process silently
due to /proc/self/attr/... being incorrectly set, and stops any
accidental fd leaks because /proc/self/fd/... is not real.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-09-30 09:06:48 +10:00
28e58a0f6a Support different field counts of cpuaact.stats
Signed-off-by: skilxnTL <tylxltt@gmail.com>
2019-09-29 10:20:58 +08:00
e63b797f38 Handle ENODEV when accessing the freezer.state file
...when checking if a container is paused

Signed-off-by: Julia Nedialkova <julianedialkova@hotmail.com>
2019-09-27 17:02:56 +03:00
84373aaa56 Add SCMP_ACT_LOG as a valid Seccomp action (#1951)
Signed-off-by: blacktop <blacktop@users.noreply.github.com>
2019-09-26 11:03:03 -04:00
331692baa7 Only allow proc mount if it is procfs
Fixes #2128

This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.

```bash
> sudo docker run --rm  apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.

> sudo docker run --rm -v /proc:/proc apparmor

docker-default (enforce)        root     18989  0.9  0.0   1288     4 ?
Ss   16:47   0:00 sleep 20
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-09-24 11:00:18 -04:00
af7b6547ec libcontainer/nsenter: Don't import C in non-cgo file
Signed-off-by: Jonathan Rudenberg <jonathan@titanous.com>
2019-09-11 17:03:07 +00:00
718a566e02 cgroup: support mount of cgroup2
convert a "cgroup" mount to "cgroup2" when the system uses cgroups v2
unified hierarchy.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-06 17:57:14 +02:00