Hopefully, this ends the long story of CapabilityBoundingSet
in mariadb.service.
Started from MDEV-9095 (27e6fd9a59) which was supposed
to let --memlock work without root, but instead of
adding the necessary capability (CAP_IPC_LOCK) by putting it into
AmbientCapabilities it removed all other capabilities,
by putting CAP_IPC_LOCK into CapabilityBoundingSet
(which is the mask of allowed capabilities).
This broke pam plugin, which needed CAP_DAC_OVERRIDE,
it was fixed in MDEV-19878 (dd93028dae) by appending
CAP_DAC_OVERRIDE to CapabilityBoundingSet.
Obviously, memlock still didn't work, this was fixed
in MDEV-33301 (76a27155b4) by moving CAP_IPC_LOCK
to AmbientCapabilities.
Unfortunately, it moved too much (everything), so
MDEV-36229 (85ecb80fa3) fixed it moving CAP_DAC_OVERRIDE
back to CapabilityBoundingSet.
This caused MDEV-36591 (8925877dc8) triggering a bug in old
systemd versions. And it broke pam plugin on CentOS Stream 10,
where CAP_DAC_OVERRIDE alone was apparently not enough.
Let's finally fix this by removing CapabilityBoundingSet
completely and keeping CAP_IPC_LOCK in AmbientCapabilities,
which should've been the correct fix for MDEV-9095 from the start.
Combined AmbientCapabilities and CapabilityBoundingSet configuration
within a service file we have found by testing aren't supported in the
systemd v245 (Ubuntu 20.04) and v239 (RHEL8) for non-root users. This
resulted in a service start error EXIT_CAPABILITIES, a systemd limitation
of the version that we cannot work around consequences.
The systemd version 247 these combined capabilities have been tested to
work on Debian 11. No other supported major distros run systemd
version 246, and if they did, the missing capability of CAP_IPC_LOCK
won't be noticed as it was a convenience for --memlock users.
As such we disable the AmbientCapabilites for CAP_IPC_LOCK rather
that disabling the CapabilityBoundingSet, because doing the later
will disable authentication for MariaDB users that have configured PAM
with MariaDB.
Should a user require CAP_IPC_LOCK they can append in their own
systemd overlay file this configuration in the CapabilityBoundingSet
and configure the capability file attributes on the mariadbd executable
to have the IPC_LOCK capability. This isn't configured by default as the
presence of a capability in the MariaDB Server is detected by
openssl libraries as "insecure" which will then ignore any user configured TLS
configuration file passed though by the OPENSSL_CONF environment variable.
In resolving MDEV-33301 (76a27155b4) we
moved all the capabilities from CapabilityBoundingSet to AmbientCapabilities
where only add/moving CAP_IPC_LOCK was intended.
The effect of this is the defaulting running MariaDB HAS the capabiltiy
CAP_DAC_OVERRIDE CAP_AUDIT_WRITE allowing it to access any file,
even while running as a non-root user.
Resolve this by making CAP_IPC_LOCK apply to AmbientCapabilities and
leave the remaining CAP_DAC_OVERRIDE CAP_AUDIT_WRITE to CapabilityBoundingSet
for the use by auth_pam_tool.
Per https://github.com/systemd/systemd/issues/36529 OOM counts
as a on-abnormal condition. To ensure that MariaDB testart on
OOM the Restart is changes to on-abnormal which an extension
on the current on-abort condition.
.. even with MDEV-9095 fix
CapabilityBounding sets require filesystem setcap attributes
for the executable to gain privileges during execution.
A side effect of this however is the getauxvec(AT_SECURE) gets
set, and the secure_getenv from OpenSSL internals on
OPENSSL_CONF environment variable will get ignored (openssl gh issue
21770).
According to capabilities(7), Ambient capabilities don't trigger
ld.so triggering the secure execution mode.
Include SELinux and Apparmor capabilities for ipc_lock
Originally requested to be infinity, but rolled back to 99%
to allow for a remote ssh connection or the odd needed system
job. This is up from 15% which is the effective default of
DefaultTasksMax.
Thanks Rick Pizzi for the bug report.
Add SYSTEMD_READWRITEPATH-variable to mariadb{@,}.service.in to make sure that
if one is not building RPM or DEB packages then make sure there is ReadWritePaths
directive is defined in systemd service file.
This ensures that tar-ball installation has permissions to write database default
installation path (default: /usr/local/mysql/data) even if it's located
under /usr. Writing to that location is prevented by 'ProtectSystem=full'
systemd directive by default.
Prefixing the path with "-" in systemd causes there to not be an error if the
path doesn't exist. This may occur if the user has configured a datadir
elsewhere.
Reviewer: Daniel Black
Quoting MDEV reporter Daniel Lewart:
Starting MariaDB with default configuration causes the following problems:
"[Warning] Could not increase number of max_open_files to more than 16384 (request: 32186)"
silently reduces table_open_cache_instances from 8 (default) to 4
Default Server System Variables:
extra_max_connections = 1
max_connections = 151
table_open_cache = 2000
table_open_cache_instances = 8
thread_pool_size = 4
LimitNOFILE=16834 is in the following files:
support-files/mariadb.service.in
support-files/mariadb@.service.in
Looking at sql/mysqld.cc lines 3837-3917:
wanted_files= (extra_files + max_connections + extra_max_connections +
tc_size * 2 * tc_instances);
wanted_files+= threadpool_size;
Plugging in the default values:
wanted_files = (30 + 151 + 1 + 2000 * 2 * 8 + 4) = 32186
However, systemd configuration has LimitNOFILE = 16384, which is far smaller.
I suggest increasing LimitNOFILE to 32768.
liburing is a new optional dependency (WITH_URING=auto|yes|no)
that replaces libaio when it is available.
aio_uring: class which wraps io_uring stuff
aio_uring::bind()/unbind(): optional optimization
aio_uring::submit_io(): mutex prevents data race. liburing calls are
thread-unsafe. But if you look into it's implementation you'll see
atomic operations. They're used for synchronization between kernel and
user-space only. That's why our own synchronization is still needed.
For systemd, we add LimitMEMLOCK=524288 (ulimit -l 524288)
because the io_uring_setup system call that is invoked
by io_uring_queue_init() requests locked memory. The value
was found empirically; with 262144, we would occasionally
fail to enable io_uring when using the maximum values of
innodb_read_io_threads=64 and innodb_write_io_threads=64.
aio_uring::thread_routine(): Tolerate -EINTR return from
io_uring_wait_cqe(), because it may occur on shutdown
on Ubuntu 20.10 (Groovy Gorilla).
This was mostly implemented by Eugene Kosov. Systemd integration
and improved startup/shutdown error handling by Marko Mäkelä.
Replace all references to /usr/sbin/mysqld (and bin and libexec) with
mariadbd, so that the binary server will always be 'mariadbd'.
Also update all places that reference the server binary in other ways,
such as AppArmor profiles and scripts that previously expected to find
a 'mysqld' in process lists.
When trying to start mariadb via systemctl, WSREP failed
to start mysqld for wsrep recovery, because the binary
"galera-recovery" is neither searching the mysqld in the
same folder as the binary itself nor in the path variable
but instead expects the root to be /usr/local/mysql.
This fix changes the current directory to the desired
directory before starting mysqld.
The arg was introduced as part of 75bcf1f9ad
to fix a SELinux problem caused by mysqld_safe accessing files it should
not be via the my_which function.
The root cause for this was fixed in 10.3, via
355ee6877b which eliminated the my_which
function from mysqld_safe entirely. Thus, in 10.3, this --basedir flag
is not necessary.
The unit files made systemd print:
systemd[1]: Started MariaDB 10.3.13 database server (multi-instance).
Let's add the instance name, so starting mariadb@foo.service
makes it print:
systemd[1]: Started MariaDB 10.3.13 database server (multi-instance foo).