The errrolint linter can finally ignore errors from Close,
and it also ignores direct comparisons of errors from x/sys/unix.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.
Mostly generated by:
git rm -f libcontainer/cgroups
find . -type f -name "*.go" -exec sed -i \
's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
{} +
go get github.com/opencontainers/cgroups@v0.0.1
make vendor
gofumpt -w .
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):
> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.
Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.
Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).
What happens next is:
1. runc start/run/exec converts those credentials to strings (a User
string containing "UID:GID", and a []string for additional GIDs) and
passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
possible (parsing container's /etc/passwd and /etc/group).
All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).
The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).
Let's remove this back and forth translation, hopefully increasing
runc exec performance.
The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).
PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.
Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.
For 3998.
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use the old package name as an alias to minimize the patch.
No functional change; this just eliminates a bunch of deprecation
warnings.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make CommandHook.Command a pointer, which reduces the amount of data
being copied when using hooks, and allows to modify command hooks.
2. Add SetDefaultEnv, which is to be used by the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a slight refactor of TestExecInEnvironment, making it more
strict wrt checking the exec output.
1. Explain why DEBUG is added twice to the env.
2. Reuse the execEnv for the check.
3. Make the check more strict -- instead of looking for substrings,
check line by line.
4. Add a check for extra environment variables.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Here's what it shows on my laptop (with -count 10 -benchtime 10s,
summarized by benchstat):
│ sec/op │
ExecTrue-20 8.477m ± 2%
ExecInBigEnv-20 61.53m ± 1%
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 98fe566c removed inheritable capabilities from the example spec
(used by runc spec) and from the libcontainer/integration test config,
but neglected to also remove ambient capabilities.
An ambient capability could only be set if the same inheritable
capability is set, so as a result of the above change ambient
capabilities were not set (but due to a bug in gocapability package,
those errors are never reported).
Once we start using a library with the fix [1], that bug will become
apparent (both bats-based and libct/int tests will fail).
[1]: https://github.com/kolyshkin/capability/pull/3
Fixes: 98fe566c ("runc: do not set inheritable capabilities")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In all the three cases, we check that the program returned non-zero exit
code. This can be done in a much simpler manner.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).
Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.
This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
By definition, every container has only 1 init (i.e. PID 1) process.
Apparently, libcontainer API supported running more than 1 init, and
at least one tests mistakenly used it.
Let's not allow that, erroring out if we already have init. Doing
otherwise _probably_ results in some confusion inside the library.
Fix two cases in libct/int which ran two inits inside a container.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since we're now testing on ARM, the test case fails when trying to do
pre-dump since MemTrack is not available.
Skip the pre-dump part if so.
This also reverts part of commit 3f4a73d6 as it is no longer needed
(now, instead of skipping the whole test, we're just skipping the
pre-dump).
[Review with --ignore-all-space]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since commit c77aaa3f the tail of criu log is printed by runc, so
there's no need to do the same thing in tests.
This also fixes a test failure on ARM where showLog fails (because
there's no log file) and thus the conditional t.Skip is not called.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
criu check --feature userns also tests for the /proc/self/ns/user
presense, so remove the redundant check, and simplify the error message.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If a file descriptor of a directory in the host's mount namespace is
leaked to runc init, a malicious config.json could use /proc/self/fd/...
as a working directory to allow for host filesystem access after the
container runs. This can also be exploited by a container process if it
knows that an administrator will use "runc exec --cwd" and the target
--cwd (the attacker can change that cwd to be a symlink pointing to
/proc/self/fd/... and wait for the process to exec and then snoop on
/proc/$pid/cwd to get access to the host). The former issue can lead to
a critical vulnerability in Docker and Kubernetes, while the latter is a
container breakout.
We can (ab)use the fact that getcwd(2) on Linux detects this exact case,
and getcwd(3) and Go's Getwd() return an error as a result. Thus, if we
just do os.Getwd() after chdir we can easily detect this case and error
out.
In runc 1.1, a /sys/fs/cgroup handle happens to be leaked to "runc
init", making this exploitable. On runc main it just so happens that the
leaked /sys/fs/cgroup gets clobbered and thus this is only consistently
exploitable for runc 1.1.
Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[refactored the implementation and added more comments]
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).
In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.
It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.
The file descriptor passing through the synchronisation system feature
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit does two things:
1. Consolidate StartInitialization calling logic into Init().
2. Fix init error handling logic.
The main issues at hand are:
- the "unable to convert _LIBCONTAINER_INITPIPE" error from
StartInitialization is never shown;
- errors from WriteSync and WriteJSON are never shown;
- the StartInit calling code is triplicated;
- using panic is questionable.
Generally, our goals are:
- if there's any error, do our best to show it;
- but only show each error once;
- simplify the code, unify init implementations.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As of previous commit, this is implied in a particular scenario. In
fact, this is the one and only scenario that justifies the use of -a.
Drop the option from the documentation. For backward compatibility, do
recognize it, and retain the feature of ignoring the "container is
stopped" error when set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.
Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).
The solution to the above problem is killing all the container
processes, not just init.
The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
- func destroy(c *Container) calls signalAllProcesses;
- runc kill implements -a flag.
This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.
In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.
The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.
Fix the issue, add a test case (which fails without the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are two very distinct usage scenarios for signalAllProcesses:
* when used from the runc binary ("runc kill" command), the processes
that it kills are not the children of "runc kill", and so calling
wait(2) on each process is totally useless, as it will return ECHLD;
* when used from a program that have created the container (such as
libcontainer/integration test suite), that program can and should call
wait(2), not the signalling code.
So, the child reaping code is totally useless in the first case, and
should be implemented by the program using libcontainer in the second
case. I was not able to track down how this code was added, my best
guess is it happened when this code was part of dockerd, which did not
have a proper child reaper implemented at that time.
Remove it, and add a proper documentation piece.
Change the integration test accordingly.
PS the first attempt to disable the child reaping code in
signalAllProcesses was made in commit bb912eb00c, which used a
questionable heuristic to figure out whether wait(2) should be called.
This heuristic worked for a particular use case, but is not correct in
general.
While at it:
- simplify signalAllProcesses to use unix.Kill;
- document (container).Signal.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, TestInit sets up logrus, and init uses it to log an error
from StartInitialization(). This is solely used by TestExecInError
to check that error returned from StartInitialization is the one it
expects.
Note that the very same error is communicated to the runc init parent
and is ultimately returned by container.Run(), so checking what
StartInitialization returned is redundant.
Remove logrus setup and use from TestMain/init.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The purpose of this test is to check that there are no extra file
descriptors left open after repeated calls to runContainer. In fact,
the first call to runContainer leaves a few file descriptors opened,
and this is by design.
Previously, this test relied on two things:
1. some other tests were run before it (and thus all such opened-once
file descriptors are already opened);
2. explicitly excluding fd opened to /sys/fs/cgroup.
Now, if we run this test separately, it will fail (because of 1 above).
The same may happen if the tests are run in a random order.
To fix this, add a container run before collection the initial fd list,
so those fds that are opened once are included and won't be reported.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit separates the functionality of setting cgroup device
rules out of libct/cgroups to libct/cgroups/devices package. This
package, if imported, sets the function variables in libct/cgroups and
libct/cgroups/systemd, so that a cgroup manager can use those to manage
devices. If those function variables are nil (when libct/cgroups/devices
are not imported), a cgroup manager returns the ErrDevicesUnsupported
in case any device rules are set in Resources.
It also consolidates the code from libct/cgroups/ebpf and
libct/cgroups/ebpf/devicefilter into libct/cgroups/devices.
Moved some tests in libct/cg/sd that require device management to
libct/sd/devices.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do not set inheritable capabilities in runc spec, runc exec --cap,
and in libcontainer integration tests.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.
Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.
Rename linuxContainer to Container.
Adopt users from using interface to using struct.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since LinuxFactory has become the means to specify containers state
top directory (aka --root), and is only used by two methods (Create
and Load), it is easier to pass root to them directly.
Modify all the users and the docs accordingly.
While at it, fix Create and Load docs (those that were originally moved
from the Factory interface docs).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
StartInitialization does not have to be a method of Factory (while
it is clear why it was done that way initially, now we only have
Linux containers so it does not make sense).
Fix callers and docs accordingly.
No change in functionality.
Also, since this was the only user of libcontainer.New with the empty
string as an argument, the corresponding check can now be removed
from it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Runtime spec says:
> sysctl (object, OPTIONAL) allows kernel parameters to be modified at
> runtime for the container. For more information, see the sysctl(8)
> man page.
and sysctl(8) says:
> variable
> The name of a key to read from. An example is
> kernel.ostype. The '/' separator is also accepted in place of a '.'.
Apparently, runc config validator do not support sysctls with / as a
separator. Fortunately this is a one-line fix.
Add some more test data where / is used as a separator.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>