mirror of https://github.com/opencontainers/runc.git synced 2025-08-08 12:42:06 +03:00
Commit Graph

4449 Commits

Author SHA1 Message Date
lifubang
a67dab0ac2 Revert "CreateCgroupPath: only enable needed controllers"
1. Partially revert "CreateCgroupPath: only enable needed controllers"
If we update a resource that was not limited in the beginning,
the update has no effect.
2. Return an error if a non-enabled controller is used;
otherwise the user may assume success when the update actually has no effect.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-21 12:17:46 +08:00
Mrunal Patel
3c8da9dae0 Merge pull request #2422 from kolyshkin/criu-j
Dockerfile: speed up criu build
2020-05-20 17:43:43 -07:00
Kir Kolyshkin
d57f5bb286 cgroupv1: don't ignore MemorySwap if Memory==-1
Commit 18ebc51b3cc3 "Reset Swap when memory is set to unlimited (-1)"
added handling of the case when a user updates the container limits
to set memory to unlimited (-1) but does not set any other limits.
Apparently, in this case, if swap limit was previously set, kernel fails
to set memory.limit_in_bytes to -1 if memory.memsw.limit_in_bytes is
not set to -1.

What the above commit fails to handle correctly is the request when
Memory is set to -1 and MemorySwap is set to some specific limit N
(where N > 0). In this case, the value of N is silently discarded
and MemorySwap is set to -1 instead.

This is the wrong thing to do: a limit that was set, even if incorrectly,
should not be ignored.

Fix this by only assigning MemorySwap == -1 in case it was not
explicitly set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 17:23:40 -07:00
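The rule described in this commit can be sketched as follows. This is a minimal illustration, not runc's actual code: the `memLimits` type and `normalize` function are hypothetical names, and the sketch assumes that 0 means "not explicitly set" and -1 means "unlimited".

```go
package main

import "fmt"

// memLimits is a hypothetical, simplified stand-in for the memory fields of a
// cgroup resource config. Assumption: 0 means "not set", -1 means "unlimited".
type memLimits struct {
	Memory     int64
	MemorySwap int64
}

// normalize forces swap to unlimited only when memory is unlimited and the
// caller left swap unset. An explicitly requested swap limit is kept as-is.
func normalize(l memLimits) memLimits {
	if l.Memory == -1 && l.MemorySwap == 0 {
		l.MemorySwap = -1
	}
	return l
}

func main() {
	// Swap not set: it follows memory and becomes unlimited.
	fmt.Printf("%+v\n", normalize(memLimits{Memory: -1}))
	// Swap explicitly set to N > 0: the value is no longer discarded.
	fmt.Printf("%+v\n", normalize(memLimits{Memory: -1, MemorySwap: 1 << 30}))
}
```

Whether the kernel then accepts the explicit combination is a separate question; the point of the sketch is only that the requested value is passed through rather than silently replaced.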
Aleksa Sarai
21cb2360b6 merge branch 'pr-2427'
Akihiro Suda (1):
  README.md: fix a dead link

LGTMs: @kolyshkin @cyphar
Closes #2427
2020-05-21 10:06:32 +10:00
Mrunal Patel
6a6ba0c036 Merge pull request #2423 from kolyshkin/systemd-v2-pids-max
Fix setting some systemd limits, add more tests
2020-05-20 16:33:46 -07:00
Akihiro Suda
8cd84e35f8 Merge pull request #2333 from opencontainers/add-cii-badge
Add CII Badge to README
2020-05-21 07:45:35 +09:00
Kir Kolyshkin
59897367c4 cgroups/systemd: allow to set -1 as pids.limit
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.

Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:

 # runc --systemd-cgroup update $CT --pids-limit 42
 # systemctl show runc-$CT.scope | grep TasksMax
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 42

 # ./runc --systemd-cgroup update $CT --pids-limit -1
 # systemctl show runc-$CT.scope | grep TasksMax=
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 max

Fix by changing the condition to allow -1 as a valid value.

NOTE: other negative values are still ignored by the systemd drivers
(as before). I am not sure whether this is correct, or whether we
should return an error instead.

A test case is added.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:20:04 -07:00
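The changed condition can be illustrated with a small sketch. The function name `tasksMax` is hypothetical, and the sketch assumes that systemd treats the maximum unsigned 64-bit value as "infinity" for TasksMax; it is not a copy of the runc code.

```go
package main

import (
	"fmt"
	"math"
)

// tasksMax maps a pids limit to a value for systemd's TasksMax property.
// ok is false when the property should not be set at all.
// Assumption: math.MaxUint64 stands for "infinity" on the systemd side.
func tasksMax(pidsLimit int64) (value uint64, ok bool) {
	switch {
	case pidsLimit > 0:
		return uint64(pidsLimit), true
	case pidsLimit == -1:
		// Previously skipped, which left the old TasksMax value in place.
		return math.MaxUint64, true
	default:
		// 0 and other negative values are still ignored.
		return 0, false
	}
}

func main() {
	for _, limit := range []int64{42, -1, 0, -5} {
		v, ok := tasksMax(limit)
		fmt.Println(limit, v, ok)
	}
}
```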
Kir Kolyshkin
95413ecdb0 tests/int/update: add cgroupv1 systemd CPU checks
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:19:03 -07:00
Kir Kolyshkin
06d7c1d261 systemd+cgroupv1: fix updating CPUQuotaPerSecUSec
1. do not allow to set quota without period or period without quota, as we
   won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.

2. do not ignore setting quota to -1 when a period is not set.

3. update the test case accordingly.

Note that systemd value checks will be added in the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:17:18 -07:00
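A rough sketch of the three rules above, assuming quota and period are both given in microseconds and that the unlimited case is represented by the maximum unsigned 64-bit value. The function name and error text are hypothetical, and any rounding the real driver may apply is left out.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cpuQuotaPerSecUSec converts a CFS quota/period pair (both in microseconds)
// into microseconds of CPU time per second of wall-clock time.
// set is false when neither value was given, so there is nothing to update.
func cpuQuotaPerSecUSec(quota int64, period uint64) (usec uint64, set bool, err error) {
	switch {
	case quota == 0 && period == 0:
		return 0, false, nil
	case quota == -1:
		// Unlimited; honored even when no period is given.
		return math.MaxUint64, true, nil
	case quota <= 0 || period == 0:
		// Without both values the per-second figure cannot be computed.
		return 0, false, errors.New("need both a positive quota and a period")
	}
	return uint64(quota) * 1000000 / period, true, nil
}

func main() {
	// Half a CPU: 50ms of quota per 100ms period.
	fmt.Println(cpuQuotaPerSecUSec(50000, 100000))
	// Quota without period is rejected.
	fmt.Println(cpuQuotaPerSecUSec(50000, 0))
	// Quota of -1 without period removes the limit.
	fmt.Println(cpuQuotaPerSecUSec(-1, 0))
}
```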
Kir Kolyshkin
7abd93d156 tests/integration/update.bats: more systemd checks
1. add missing checks for systemd's MemoryMax / MemoryLimit.

2. add checks for systemd's MemoryLow and MemorySwapMax.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:16:50 -07:00
Kir Kolyshkin
e4a84bea99 cgroupv2+systemd: set MemoryLow
For some reason, this was not set before.

Test case is added by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:15:29 -07:00
Kir Kolyshkin
4fc9fa05da tests/int: simplify check_systemd_value use
...so it will be easier to write more tests

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:15:11 -07:00
Kir Kolyshkin
716079f95b Merge pull request #2406 from cyphar/devices-cgroup-header
cgroups: add copyright header to devices.Emulator implementation
2020-05-20 13:01:34 -07:00
Akihiro Suda
5b601c66d0 README.md: fix a dead link
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-21 02:31:33 +09:00
Akihiro Suda
cd4b71c27a Merge pull request #2409 from adrianreber/go-criu-4-0-0
Update to latest go-criu
2020-05-21 01:39:09 +09:00
Kir Kolyshkin
28cd9d9c18 Merge pull request #2419 from tianon/buildmode-arch-toggle
Remove "-buildmode=pie" from platforms that don't support it

LGTMs: AkihiroSuda, kolyshkin
2020-05-20 09:15:55 -07:00
Mrunal Patel
9a808dd014 Merge pull request #2424 from giuseppe/errno-ret
libcontainer: honor seccomp errnoRet
2020-05-20 07:41:01 -07:00
Adrian Reber
944e057025 Update to latest go-criu (4.0.2)
This updates to the latest version of go-criu (4.0.2) which is based on
CRIU 3.14.

As go-criu provides an existing way to query the CRIU binary for its
version this also removes all the code from runc to handle CRIU version
checking and now relies on go-criu.

An important side effect of this change is that this raises the minimum
CRIU version to 3.0.0 as that is the first CRIU version that supports
CRIU version queries via RPC in contrast to parsing the output of
'criu --version'

CRIU 3.0 was released in April 2017.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-05-20 13:49:38 +02:00
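The minimum-version requirement can be pictured with the sketch below. It deliberately avoids the go-criu API: the helper names are hypothetical, and the integer encoding (major*10000 + minor*100 + sublevel) is an assumption made for the illustration.

```go
package main

import "fmt"

// minCriuVersion encodes 3.0.0 using the assumed scheme
// major*10000 + minor*100 + sublevel.
const minCriuVersion = 30000

// versionInt packs a three-part version into a single comparable integer.
func versionInt(major, minor, sublevel int) int {
	return major*10000 + minor*100 + sublevel
}

// checkCriuVersion rejects CRIU versions older than 3.0.0.
func checkCriuVersion(major, minor, sublevel int) error {
	if versionInt(major, minor, sublevel) < minCriuVersion {
		return fmt.Errorf("CRIU %d.%d.%d is older than the required 3.0.0",
			major, minor, sublevel)
	}
	return nil
}

func main() {
	fmt.Println(checkCriuVersion(3, 14, 0)) // accepted
	fmt.Println(checkCriuVersion(2, 12, 1)) // rejected
}
```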
Giuseppe Scrivano
41aa19662b libcontainer: honor seccomp errnoRet
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:55 +02:00
Giuseppe Scrivano
510c79f9cf vendor: update runtime-specs to 237cc4f519e
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:54 +02:00
Kir Kolyshkin
236ec04599 Dockerfile: speed up criu build
... in case we have more than one CPU, that is.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-19 17:19:14 -07:00
Tianon Gravi
be66519c26 Remove "-buildmode=pie" from platforms that don't support it
This sequence (and syntax) is inspired by containerd's implementation of the same:
4e08c2de67/Makefile.linux (L21-L26)

Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
2020-05-19 16:00:37 -07:00
Kir Kolyshkin
b207d578ec Merge pull request #2418 from AkihiroSuda/fix-bad-rebase-2413
fix "libcontainer/cgroups/fs/cpuset.go:63:14: undefined: fmt"
2020-05-19 11:28:09 -07:00
Akihiro Suda
2fa3c286b5 fix "libcontainer/cgroups/fs/cpuset.go:63:14: undefined: fmt"
The compilation error occurred because of a bad rebase during #2401 and #2413

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-19 23:38:20 +09:00
Akihiro Suda
f369199ff6 Merge pull request #2413 from JFHwang/2392-spec-check
Add nil check of spec.Process in validateProcessSpec()
2020-05-19 08:11:22 +09:00
Mrunal Patel
53a4649776 Merge pull request #2401 from kolyshkin/fs-cpuset-mountinfo
libct/cgroup: rm GetClosestMountpointAncestor using moby/sys/mountinfo parser
2020-05-18 10:43:55 -07:00
Mrunal Patel
825e91ada6 Merge pull request #2341 from kolyshkin/test-cpt-lazy
runc checkpoint: fix --status-fd to accept fd
2020-05-18 10:43:24 -07:00
Mrunal Patel
67fac528d0 Merge pull request #2410 from lifubang/swap0patch
cgroupv2: never write empty string to memory.swap.max
2020-05-18 10:42:17 -07:00
John Hwang
5aa0601a59 validateProcessSpec: prevent SEGV when config is valid json, but invalid.
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-18 09:38:22 -07:00
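The failure mode named in the subject line (a config that parses as JSON but has no process section) can be reproduced with a small sketch. The `spec` and `process` types and the error messages are hypothetical simplifications, not the runtime-spec types.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// process and spec are hypothetical, trimmed-down stand-ins for the real
// spec types; only the fields needed for the illustration are kept.
type process struct {
	Cwd  string   `json:"cwd"`
	Args []string `json:"args"`
}

type spec struct {
	Process *process `json:"process"`
}

// validateProcess checks the pointer before dereferencing it, so a config
// without a "process" object yields an error instead of a nil dereference.
func validateProcess(s *spec) error {
	if s == nil || s.Process == nil {
		return errors.New("process property must not be empty")
	}
	if s.Process.Cwd == "" {
		return errors.New("cwd property must not be empty")
	}
	if len(s.Process.Args) == 0 {
		return errors.New("args must not be empty")
	}
	return nil
}

func main() {
	var s spec
	// Valid JSON, but not a usable config: Process stays nil.
	if err := json.Unmarshal([]byte(`{}`), &s); err != nil {
		fmt.Println("unmarshal:", err)
		return
	}
	fmt.Println(validateProcess(&s))
}
```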
John Hwang
7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
lifubang
9ad1beb40f never write empty string to memory.swap.max
Because the empty string means set swap to 0.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-16 06:52:14 +08:00
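One way to picture this fix is a helper that decides whether anything should be written at all. The name `swapMaxValue` is hypothetical, and the sketch assumes 0 means "not set" and -1 means "unlimited"; other negative values are not validated here.

```go
package main

import (
	"fmt"
	"strconv"
)

// swapMaxValue returns the string to write to memory.swap.max and whether a
// write should happen at all. An unset limit produces no write, so an empty
// string never reaches the file.
func swapMaxValue(swap int64) (value string, write bool) {
	switch {
	case swap == 0:
		// Not set: skip the write instead of writing "".
		return "", false
	case swap == -1:
		return "max", true
	default:
		return strconv.FormatInt(swap, 10), true
	}
}

func main() {
	for _, swap := range []int64{0, -1, 1 << 20} {
		v, write := swapMaxValue(swap)
		fmt.Printf("swap=%d write=%v value=%q\n", swap, write, v)
	}
}
```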
Aleksa Sarai
dc9a7879f9 cgroups: add copyright header to devices.Emulator implementation
I forgot to include this in the original patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-15 11:29:51 +10:00
Akihiro Suda
3f1e886991 Merge pull request #2391 from cyphar/devices-cgroup
cgroup: devices: major cleanups and minimal transition rules
2020-05-14 09:57:06 +09:00
Kir Kolyshkin
2db3240f35 libct/cgroups: rm GetClosestMountpointAncestor
The function GetClosestMountpointAncestor is not very efficient,
does not really belong to cgroup package, and is only used once
(from fs/cpuset.go).

Remove it, replacing with the implementation based on moby/sys/mountinfo
parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-13 17:32:06 -07:00
Kir Kolyshkin
f160352682 libct/cgroup: prep to rm GetClosestMountpointAncestor
This function is not very efficient, does not really belong to cgroup
package, and is only used once (from fs/cpuset.go).

Prepare to remove it by replacing with the implementation based on
the parser from github.com/moby/sys/mountinfo.

This commit is here to make sure the proposed replacement passes the
unit test.

Funny, but the unit test needs to be slightly modified since it
supplies the wrong mountinfo (space as the first character, empty line
at the end).

Validated by

 $ go test -v -run Ance
 === RUN   TestGetClosestMountpointAncestor
 --- PASS: TestGetClosestMountpointAncestor (0.00s)
 PASS
 ok  	github.com/opencontainers/runc/libcontainer/cgroups	0.002s

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-13 16:26:16 -07:00
Kir Kolyshkin
85d4264d8a Merge pull request #2390 from lifubang/threadedordomain
cgroupv2: don't enable threaded mode by default

LGTMs: AkihiroSuda, cyphar, kolyshkin
2020-05-13 14:30:25 -07:00
Kir Kolyshkin
4b71877f99 Merge pull request #2292 from Creatone/creatone/extend-intelrdt
Add RDT Memory Bandwidth Monitoring (MBM) and Cache Monitoring Technology (CMT) statistics.
2020-05-13 13:33:55 -07:00
Kir Kolyshkin
41855317b6 Merge pull request #2271 from katarzyna-z/kk-cpuacct-usage-all
Add reading of information from cpuacct.usage_all
2020-05-13 13:33:05 -07:00
lifubang
fe0669b26d don't enable threaded mode by default
Because in threaded mode, we can't enable the memory controller -- it isn't thread-aware.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-13 16:27:36 +08:00
Aleksa Sarai
ba6eb28229 tests: add integration test for paused-and-updated containers
Such containers should remain paused after the update. This has
historically been true, but this helps ensure that the systemd cgroup
changes (freezing the container during SetUnitProperties) don't break
this behaviour.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:44:11 +10:00
Aleksa Sarai
4438eaa5e4 tests: add integration test for devices transition rules
Unfortunately, runc update doesn't support setting devices rules
directly so we have to trigger it by modifying a different rule (which
happens to trigger a devices update).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:44:11 +10:00
Aleksa Sarai
b810da1490 cgroups: systemd: make use of Device*= properties
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:43:56 +10:00
Aleksa Sarai
afe83489d4 cgroupv1: devices: use minimal transition rules with devices.Emulator
Now that all of the infrastructure for devices.Emulator is in place, we
can finally implement minimal transition rules for devices cgroups. This
allows for minimal disruption to running containers if a rule update is
requested. Only in very rare circumstances (black-list cgroups and mode
switching) will a clear-all rule be written. As a result, containers
should no longer see spurious errors.

A similar issue affects the cgroupv2 devices setup, but that is a topic
for another time (as the solution is drastically different).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:42:43 +10:00
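The idea of minimal transition rules can be sketched as a set difference between the current and the requested allow-list. This is a heavily simplified illustration and not devices.Emulator itself: rules are treated as opaque strings, the `ruleSet` type and `transition` function are hypothetical, and wildcards, partial permissions, and black-list mode are ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// ruleSet is a hypothetical allow-list keyed by rule text such as "c 1:3 rwm".
type ruleSet map[string]bool

// transition returns only the rules that have to change: entries to revoke
// (present now, absent in the target) and entries to grant (absent now,
// present in the target). Unchanged rules are never touched, so no deny-all
// rule is needed for an ordinary update.
func transition(current, target ruleSet) (revoke, grant []string) {
	for r := range current {
		if !target[r] {
			revoke = append(revoke, r)
		}
	}
	for r := range target {
		if !current[r] {
			grant = append(grant, r)
		}
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Strings(revoke)
	sort.Strings(grant)
	return revoke, grant
}

func main() {
	current := ruleSet{"c 1:3 rwm": true, "c 1:5 rwm": true}
	target := ruleSet{"c 1:3 rwm": true, "c 1:8 rwm": true}
	revoke, grant := transition(current, target)
	fmt.Println("write to devices.deny: ", revoke)
	fmt.Println("write to devices.allow:", grant)
}
```

In this example the rule for `c 1:3` is left alone, which is the property that keeps a running container from briefly losing access during an update.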
Aleksa Sarai
2353ffec2b cgroups: implement a devices cgroupv1 emulator
Okay, this requires a bit of explanation.

The reason for this emulation is to allow us to have seamless updates of
the devices cgroup for running containers. This was triggered by several
users having issues where our initial writing of a deny-all rule (in all
cases) results in spurious errors.

The obvious solution would be to just remove the deny-all rule, right?
Well, it turns out that runc doesn't actually control the deny-all rule
because all users of runc have explicitly specified their own deny-all
rule for many years. This appears to have been done to work around a bug
in runc (which this series has fixed in [1]) where we would actually act
as a black-list despite this being a violation of the OCI spec.

This means that not adding our own deny-all rule in the case of updates
won't solve the issue. However, it will also not solve the issue in
several other cases (the most notable being where a container is being
switched between default-permission modes).

So in order to handle all of these cases, a way of tracking the relevant
internal cgroup state (given a certain state of "devices.list" and a set
of rules to apply) is necessary. That is the purpose of DevicesEmulator.
Reading "devices.list" is quite important because that's the only way we
can tell if it's safe to skip the troublesome deny-all rules without
making potentially-dangerous assumptions about the container.

We also are currently bug-compatible with the devices cgroup (namely,
removing rules that don't exist or having superfluous rules all works as
with the in-kernel implementation). The only exception to this is that
we give an error if a user requests to revoke part of a wildcard
exception, because allowing such configurations could result in security
holes (cgroupv1 silently ignores such rules, meaning in white-list mode
that the access is still permitted).

[1]: b2bec9806f ("cgroup: devices: eradicate the Allow/Deny lists")

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:42:20 +10:00
Aleksa Sarai
24388be71e configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
60e21ec26e specconv: remove default /dev/console access
/dev/console is a host resource which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
b2bec9806f cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
859a780d6f cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
a79fa7caa0 contrib: recvtty: add --no-stdin flag
This is mostly just useful for testing with the "single" mode, since it
allows you to run recvtty in the background without the console being
closed.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Mrunal Patel
df3d7f673a Merge pull request #2393 from kolyshkin/criu-pi
Vagrantfile: use criu from stable repo
2020-05-12 17:48:34 -07:00