Add a parameter for enabling per-container resctrl monitoring.
This supersedes and replaces the previous "enableCMT" and "enableMBM"
settings whose functionality was very vaguely specified. A separate
parameter for every monitoring metric does not seem to make much sense, in
particular because in the resctrl filesystem it is not possible to
selectively enable a subset of the monitoring features. You always get
all the metrics that the system provides. Also, with separate settings
(and corresponding check if the specific metric is available) the user
cannot specify "enable whatever is available" - setting everything to
"true" might fail because one of the metrics is not available on the
platform. In addition, having separate parameters is not
future-proof, making support for new monitoring metrics unnecessarily
cumbersome to add. New metrics are certain to be added in new hardware
generations, e.g. perf/energy monitoring in the near future
(https://lkml.org/lkml/2025/5/21/1631), and requiring an update to the
runtime-spec for each one of them feels like overkill without much
benefit. It is easier to have one switch for "enable container-specific
metrics" and let the user read whatever metrics the platform provides.
Moreover, it is not even possible to turn off monitoring (from the
resctrl filesystem). For example, you always get the metrics for all
CTRL_MON groups (closIDs). However, that is not always very useful as
there likely are a lot of applications packed in the same group. The new
intelRdt.enableMonitoring parameter will enable creation of a MON group
specific to a single container allowing monitoring of resctrl metrics on
per-container granularity.
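
As a rough illustration of the layout described above, the sketch below
builds the resctrl directory a per-container MON group would live in,
nested under the CTRL_MON group (closID) directory. The helper names and
the choice to name the MON group after the container ID are assumptions
made for this example, not something this commit mandates.

```python
from pathlib import PurePosixPath

# Default resctrl mount point on Linux.
RESCTRL_ROOT = PurePosixPath("/sys/fs/resctrl")


def mon_group_dir(clos_id: str, container_id: str) -> PurePosixPath:
    """Return the directory of a MON group nested in a CTRL_MON group.

    Assumption: the MON group is named after the container ID.
    """
    return RESCTRL_ROOT / clos_id / "mon_groups" / container_id


def intel_rdt_config(clos_id: str, enable_monitoring: bool) -> dict:
    """Build a hypothetical linux.intelRdt fragment for config.json."""
    return {"closID": clos_id, "enableMonitoring": enable_monitoring}


if __name__ == "__main__":
    print(intel_rdt_config("batch", True))
    # Metrics for the container would then be read from mon_data/ below:
    print(mon_group_dir("batch", "container-1") / "mon_data")
```

Which metric files appear under mon_data/ depends on what the platform
provides, which is the point of having a single switch.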
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
* config-linux: add schemata field to IntelRdt
Add a new "schemata" field to the Linux IntelRdt configuration. This
addresses the complexity of separate schema fields and resolves the
issue of supporting currently uncovered RDT features like L2 cache
allocation and CDP (Code and Data Prioritization).
The new field is for specifying the complete schemata (all schemas) to
be written to the schemata file in Linux resctrl fs. The aim is for
simple usage and runtime implementation (by not requiring any
parsing/filtering of data or otherwise re-implement parsing or
validation of the Linux resctrl interface) and also to support all RDT
features now and in the future (i.e. schemas like L2, L2CODE, L2DATA,
L3CODE and L3DATA and who knows L4 or something else in the future).
Behavior of existing fields is not changed but it is required that the
new schemata field is applied last.
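
A minimal sketch of the ordering requirement, assuming that "schemata"
holds a list of strings with one schema line each and that the older
fields are named l3CacheSchema and memBwSchema. The function name is
hypothetical; a real runtime would write the resulting lines to the
schemata file of the container's resctrl group.

```python
from typing import Dict, List


def schemata_lines(intel_rdt: Dict) -> List[str]:
    """Return the schema lines in the order they would be written.

    The pre-existing fields come first; the lines from the new
    "schemata" field are appended last, unparsed and unfiltered.
    """
    lines: List[str] = []
    for key in ("l3CacheSchema", "memBwSchema"):
        value = intel_rdt.get(key)
        if value:
            lines.append(value)
    lines.extend(intel_rdt.get("schemata") or [])
    return lines


if __name__ == "__main__":
    cfg = {
        "closID": "guaranteed",
        "l3CacheSchema": "L3:0=ffff",
        "schemata": ["L2:0=ff", "MB:0=50"],
    }
    for line in schemata_lines(cfg):
        print(line)
```

Because the lines are passed through as-is, schemas such as L2CODE or
L3DATA need no special handling in this sketch.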
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
* Add linux.intelRdt.schemata to features.md
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
---------
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Enable setting a NUMA memory policy for the container. New
linux.memoryPolicy object contains inputs to the set_mempolicy(2)
syscall.
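
To illustrate how such inputs might map onto set_mempolicy(2), the
sketch below converts a node list string into the node bitmask the
syscall expects. The field names ("mode", "nodes", "flags") and the
"0-3,7" list syntax are assumptions for this example, and the helper is
hypothetical.

```python
def nodes_to_mask(nodes: str) -> int:
    """Convert a node list such as "0-3,7" into a nodemask integer.

    Bit N of the result is set when NUMA node N is in the list.
    """
    mask = 0
    for part in nodes.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        first = int(lo)
        last = int(hi) if hi else first
        if last < first:
            raise ValueError(f"invalid node range: {part!r}")
        for node in range(first, last + 1):
            mask |= 1 << node
    return mask


if __name__ == "__main__":
    # Hypothetical linux.memoryPolicy fragment.
    memory_policy = {"mode": "MPOL_INTERLEAVE", "nodes": "0-3,7"}
    print(bin(nodes_to_mask(memory_policy["nodes"])))
```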
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
The proposed "netdevices" field provides a declarative way to
specify which host network devices should be moved into a container's
network namespace.
This approach is similar to the existing "devices" field used for block
devices but uses a dictionary keyed by the interface name instead.
The proposed scheme is based on the kernel's existing representation of
a network device, `struct net_device`:
https://docs.kernel.org/networking/netdevices.html.
This proposal focuses solely on moving existing network devices into
the container namespace. It does not cover the complexities of
network configuration or network interface creation, emphasizing the
separation of device management and network configuration.
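
A small sketch of how a dictionary keyed by host interface name could
be consumed. The "netDevices" key and the optional "name" property
(the interface name inside the container) are assumptions for
illustration, as is the helper function.

```python
from typing import Dict, Optional


def container_interface_names(net_devices: Dict[str, Optional[dict]]) -> Dict[str, str]:
    """Map each host interface name to its name inside the container.

    If an entry carries no "name", the host name is kept unchanged.
    """
    result = {}
    for host_name, props in net_devices.items():
        result[host_name] = (props or {}).get("name") or host_name
    return result


if __name__ == "__main__":
    # Hypothetical linux.netDevices fragment.
    net_devices = {"eth0": {"name": "container_eth0"}, "ens4": {}}
    for host, inside in container_interface_names(net_devices).items():
        print(f"move {host} into the container netns as {inside}")
```

Nothing here configures addresses or routes; it only decides which
existing devices move and under what name.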
Signed-off-by: Antonio Ojea <aojea@google.com>
This PR proposes updates to the OCI runtime spec with
z/OS platform-specific details, including adding
namespaces, adding the noNewPrivileges flag, and removing
devices. These changes are currently in use by the
IBM z/OS Container Platform (zOSCP) product - details
can be found here:
https://www.ibm.com/products/zos-container-platform.
Signed-off-by: Neil Johnson <najohnsn@us.ibm.com>
Signed-off-by: Kershaw Mehta <kershaw@us.ibm.com>
Most of these either redirect (so changing saves an extra redirect),
or have a TLS version available.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
High level container runtimes sometimes need to know if the OCI runtime
supports idmap mounts or not, as the OCI runtime silently ignores
unknown fields.
This means that if it doesn't support idmap mounts, a container with
userns will be started, without idmap mounts, and the files created on
the volumes will have a "garbage" owner/group. Furthermore, as the
userns mapping is not guaranteed to be stable over time, it will be
completely unusable.
Let's expose idmap support in the features subcommand, so high level
container runtimes can use the feature safely.
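
A hedged sketch of how a high level runtime might consume this,
assuming the features JSON reports idmap support under
linux.mountExtensions.idmap.enabled. The function name is hypothetical.

```python
import json


def supports_idmap_mounts(features_json: str) -> bool:
    """Return True only if the features JSON explicitly enables idmap."""
    features = json.loads(features_json)
    linux = features.get("linux") or {}
    extensions = linux.get("mountExtensions") or {}
    idmap = extensions.get("idmap") or {}
    # A missing field is treated as "unsupported", since an older
    # runtime would silently ignore idmap mounts in config.json.
    return idmap.get("enabled") is True


if __name__ == "__main__":
    sample = '{"linux": {"mountExtensions": {"idmap": {"enabled": true}}}}'
    print(supports_idmap_mounts(sample))
```

Treating an absent field as "no" is the conservative choice here: it
avoids starting a userns container whose volumes end up with unusable
ownership.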
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Extend the process struct to represent scheduling attributes for a
process based on the sched_setattr(2) syscall.
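
As an illustration, the sketch below checks the ordering that
sched_setattr(2) requires for SCHED_DEADLINE (runtime <= deadline <=
period) before a config would be handed to the kernel. The field names
("policy", "runtime", "deadline", "period") are assumptions modelled on
struct sched_attr, and the helper is hypothetical.

```python
def check_deadline_params(scheduler: dict) -> None:
    """Raise ValueError if SCHED_DEADLINE parameters are inconsistent.

    Other policies are not checked by this sketch.
    """
    if scheduler.get("policy") != "SCHED_DEADLINE":
        return
    runtime = scheduler.get("runtime", 0)
    deadline = scheduler.get("deadline", 0)
    # A period of 0 is treated as "same as deadline" in this sketch.
    period = scheduler.get("period") or deadline
    if not (0 < runtime <= deadline <= period):
        raise ValueError(
            "SCHED_DEADLINE requires 0 < runtime <= deadline <= period"
        )


if __name__ == "__main__":
    # Values are in nanoseconds, as in struct sched_attr.
    check_deadline_params({
        "policy": "SCHED_DEADLINE",
        "runtime": 10_000_000,
        "deadline": 30_000_000,
        "period": 30_000_000,
    })
    print("ok")
```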
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Add `features.md` and `features-linux.md`, to formalize the `runc features` JSON that was introduced in runc v1.1.0.
A runtime caller MAY use this JSON to detect the features implemented by the runtime.
The spec corresponds to https://github.com/opencontainers/runc/blob/v1.1.0/types/features/features.go
(opencontainers/runc PR 3296, opencontainers/runc PR 3310)
Differences since runc v1.1.0:
- Add `.linux.intelRdt.enabled` field
- Add `.linux.cgroup.rdma` field
- Add `.linux.seccomp.knownFlags` and `.linux.seccomp.supportedFlags` fields (Implemented in runc PR 3588)
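
For illustration, a runtime caller might use the JSON roughly as
follows to find out which seccomp flags it wants are not supported. The
function name is hypothetical, and the sample document is a trimmed,
made-up example rather than real `runc features` output.

```python
import json
from typing import List


def unsupported_seccomp_flags(features_json: str, wanted: List[str]) -> List[str]:
    """Return the wanted flags that are absent from supportedFlags."""
    features = json.loads(features_json)
    seccomp = (features.get("linux") or {}).get("seccomp") or {}
    supported = set(seccomp.get("supportedFlags") or [])
    return [flag for flag in wanted if flag not in supported]


if __name__ == "__main__":
    sample = json.dumps({
        "linux": {
            "seccomp": {
                "enabled": True,
                "supportedFlags": ["SECCOMP_FILTER_FLAG_LOG"],
            }
        }
    })
    print(unsupported_seccomp_flags(
        sample,
        ["SECCOMP_FILTER_FLAG_LOG", "SECCOMP_FILTER_FLAG_SPEC_ALLOW"],
    ))
```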
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The time namespace is a new kernel feature available in 5.6+ to
isolate the system monotonic and boot-time clocks.
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Linux 5.19 introduced a new seccomp flag:
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
It is useful for seccomp notify when handling notifications from Go
programs, which are often preempted by the Go runtime with SIGURG.
Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
The burstable CFS controller was introduced in Linux 5.14. It helps
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota, and they may be latency
sensitive at the same time, so throttling them is undesirable.
This feature borrows time now against future underrun, at the cost
of increased interference with other system users. It introduces
`cfs_burst_us` into CFS bandwidth control to cap the accumulation of
unused bandwidth, which is then used additionally for bursts.
This patch adds support for controlling the CFS bandwidth burst.
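
The accumulation described above can be sketched with a deliberately
simplified model: each period the group is refilled by its quota, and
unused runtime carries over up to the burst cap. This is an
illustration only, not the kernel's implementation; the function name
and the numbers are made up.

```python
from typing import List


def throttled_per_period(quota: int, burst: int, demands: List[int]) -> List[int]:
    """Return how much demanded runtime goes unserved in each period.

    quota, burst, and demands share one arbitrary time unit.
    """
    runtime = quota  # budget available in the first period
    throttled = []
    for demand in demands:
        used = min(demand, runtime)
        throttled.append(demand - used)
        # Refill by quota, but never hold more than quota + burst.
        runtime = min(runtime - used + quota, quota + burst)
    return throttled


if __name__ == "__main__":
    demands = [30, 70, 70]
    print("burst=0 :", throttled_per_period(50, 0, demands))
    print("burst=20:", throttled_per_period(50, 20, demands))
```

In this model the quiet first period lets the second period run over
quota when burst is non-zero, while the third period is throttled
either way because nothing was left to carry over.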
Fixes https://github.com/opencontainers/runtime-spec/issues/1119
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
This setting can be used to mimic cgroup v1 behavior on cgroup v2
when setting a new memory limit during an update operation.
In cgroup v1, a limit which is lower than the current usage is rejected.
In cgroup v2, such a low limit causes an OOM kill.
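
A minimal sketch of the check this implies, assuming the setting is a
boolean and that a negative limit means "unlimited". The function and
parameter names are hypothetical, and reading the current usage from
the cgroup is left out.

```python
def should_reject_update(new_limit: int, current_usage: int,
                         check_before_update: bool) -> bool:
    """Decide whether a memory limit update should be refused.

    Mimics cgroup v1: a limit below current usage is rejected instead
    of being applied and triggering an OOM kill.
    """
    if not check_before_update:
        return False
    if new_limit < 0:  # assumption: negative means no limit
        return False
    return new_limit < current_usage


if __name__ == "__main__":
    usage = 200 * 1024 * 1024
    print(should_reject_update(100 * 1024 * 1024, usage, True))
    print(should_reject_update(100 * 1024 * 1024, usage, False))
```

The check is inherently racy, since usage can grow between the read and
the write, so it reduces rather than eliminates the chance of an OOM
kill.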
Ref: https://github.com/opencontainers/runc/issues/3509
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add the domainname entity so that container runtimes can add special handling similar to hostname. The current workaround of adding a sysctl for kernel.domainname only works with rootful execution in most cases. This will allow for rootless execution.
Container runtimes will be able to add special handling as they do for hostname, using setdomainname to add the entry to /proc/sys/kernel/domainname.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Commit "Add Seccomp Notify support"
(58798e75e9) added
SECCOMP_FILTER_FLAG_NEW_LISTENER only to the schema and not to the list
of flags in config-linux.md. However, it was a mistake to add it to the
schema, as the user will never really need to specify that flag.
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
The spec already supports overriding the errno code for the syscalls,
but the default value is hardcoded to EPERM.
Add a new attribute to override the default value.
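
One possible reading of how the two levels interact is sketched below:
a matching rule uses its own errnoRet, and a syscall that matches no
rule falls through to the default action with the new default errno.
The attribute name "defaultErrnoRet" and the helper are assumptions for
illustration.

```python
import errno
from typing import Optional


def errno_for_syscall(seccomp: dict, name: str) -> Optional[int]:
    """Return the errno a syscall would fail with, or None if the
    applicable action is not SCMP_ACT_ERRNO."""
    for rule in seccomp.get("syscalls", []):
        if name in rule.get("names", []):
            if rule.get("action") != "SCMP_ACT_ERRNO":
                return None
            return rule.get("errnoRet", errno.EPERM)
    # No rule matched: the default action applies.
    if seccomp.get("defaultAction") != "SCMP_ACT_ERRNO":
        return None
    return seccomp.get("defaultErrnoRet", errno.EPERM)


if __name__ == "__main__":
    profile = {
        "defaultAction": "SCMP_ACT_ERRNO",
        "defaultErrnoRet": errno.ENOSYS,
        "syscalls": [
            {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
            {"names": ["ptrace"], "action": "SCMP_ACT_ERRNO",
             "errnoRet": errno.EACCES},
        ],
    }
    print(errno_for_syscall(profile, "bpf") == errno.ENOSYS)
```

Returning ENOSYS by default is a common motivation: callers such as
libc can then fall back to older syscalls instead of treating EPERM as
a hard failure.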
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The seccomp action was added to libseccomp a while ago, so I guess
the runtime spec should support it as well:
b2f15f3d02
Signed-off-by: Sascha Grunert <sgrunert@suse.com>