mirror of https://github.com/MariaDB/server.git synced 2025-12-07 17:42:39 +03:00
Commit Graph

2103 Commits

Author SHA1 Message Date
Marko Mäkelä
4ca355d863 MDEV-33894: Resurrect innodb_log_write_ahead_size
As part of commit 685d958e38 (MDEV-14425)
the parameter innodb_log_write_ahead_size was removed, because it was
thought that determining the physical block size would be a sufficient
replacement.

However, we can only determine the physical block size on Linux or
Microsoft Windows. On some file systems, the physical block size
is not relevant. For example, XFS uses a block size of 4096 bytes
even if the underlying block size may be smaller.

On Linux, we failed to determine the physical block size if
innodb_log_file_buffered=OFF was not requested or possible.
This will be fixed.

log_sys.write_size: The value of the reintroduced parameter
innodb_log_write_ahead_size. To keep it simple, this is read-only
and a power of two between 512 and 4096 bytes, so that the previous
alignment guarantees are fulfilled. This will replace the previous
log_sys.get_block_size().

log_sys.block_size, log_t::get_block_size(): Remove.

log_t::set_block_size(): Ensure that write_size will not be less
than the physical block size. There is no point to invoke this
function with 512 or less, because that is the minimum value of
write_size.
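The constraint on log_sys.write_size and the effect of log_t::set_block_size() described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not InnoDB code: `write_ahead_size_valid()` and the free-standing `set_block_size()` are hypothetical helpers operating on a plain integer instead of log_sys.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical check mirroring the stated rule: innodb_log_write_ahead_size
// is a power of two between 512 and 4096 bytes.
constexpr bool write_ahead_size_valid(size_t size) {
  return size >= 512 && size <= 4096 && (size & (size - 1)) == 0;
}

// Hypothetical stand-in for log_t::set_block_size(): never let the write
// size drop below the physical block size reported by the OS.
void set_block_size(size_t &write_size, size_t physical_block_size) {
  assert(write_ahead_size_valid(write_size));
  // Passing 512 or less can never change anything, because 512 is
  // already the minimum write_size.
  if (physical_block_size > write_size) {
    // This sketch assumes physical blocks of at most 4096 bytes.
    assert(write_ahead_size_valid(physical_block_size));
    write_size = physical_block_size;
  }
}
```

For example, a configured write size of 512 on a device reporting 4096-byte blocks is raised to 4096, while a configured 2048 on a 512-byte device stays at 2048.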

innodb_params_adjust(): Add some disabled code for adjusting
the minimum value and default value of innodb_log_write_ahead_size
to reflect the log_sys.write_size.

log_t::set_recovered(): Mark the recovery completed. This is the
place to adjust some things if we want to allow write_size>4096.

log_t::resize_write_buf(): Refer to write_size.

log_t::resize_start(): Refer to write_size instead of get_block_size().

log_write_buf(): Simplify some arithmetic and remove a goto.

log_t::write_buf(): Refer to write_size. If we are writing less than
that, do not switch buffers, but keep writing to the same buffer.
Move some code to improve the locality of reference.
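A simplified model of the write_buf() decision above: the I/O is padded to a whole multiple of write_size, and buffers are only switched when at least one full write_size block is written. This is an assumption-laden sketch; `plan_log_write()` and `log_write_plan` are hypothetical names, and it ignores details such as how the tail of a partially filled block is carried over.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one log write decision.
struct log_write_plan {
  size_t io_size;       // bytes submitted, padded to a multiple of write_size
  bool switch_buffers;  // whether to continue in the other buffer
};

log_write_plan plan_log_write(size_t pending, size_t write_size) {
  assert(write_size >= 512 && write_size <= 4096);
  assert((write_size & (write_size - 1)) == 0);  // power of two
  // Rounding up with a mask works because write_size is a power of two.
  size_t io_size = (pending + write_size - 1) & ~(write_size - 1);
  // Writing less than write_size: keep appending to the same buffer.
  return {io_size, pending >= write_size};
}
```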

recv_scan_log(): Refer to write_size instead of get_block_size().

os_file_create_func(): For type==OS_LOG_FILE on Linux, always invoke
os_file_log_maybe_unbuffered(), so that log_sys.set_block_size() will
be invoked even if we are not attempting to use O_DIRECT.

recv_sys_t::find_checkpoint(): Read the entire log header
in a single 12 KiB request into log_sys.buf.

Tested with:
./mtr --loose-innodb-log-write-ahead-size=4096
./mtr --loose-innodb-log-write-ahead-size=2048
2024-06-27 16:38:08 +03:00
Alexey Yurchenko
a1e5a284fc MDEV-31809 Automatic SST user account management
Implement automatic creation of temporary accounts for SST and pass
account credentials to SST script via socket as opposed to environment
variables. Delete the user after the SST script returns.

Respect wsrep_sst_auth set by the administrator in case some additional
privilege grants are needed for particular SST method.

mysqldump SST requires significant change to make use of the new
automatic user generation facility. For now just make it compatible
by ignoring automatically generated user and rely only on wsrep_sst_auth
setting on the joiner node to keep backward compatibility.

Adapt mysqldump SST to automatic SST user generation changes:
 - disable special treatment for mysqldump SST on donor
 - make mysqldump SST script compatible with the new SST script
   interface.

Differentiate user privileges for different SST methods:
 - grant minimum required privileges for clone and xtrabackup SST
   accounts
 - grant all privileges to custom SST accounts as it is not known what
   is needed.
 - disable SST account generation for rsync SST since it is not needed.

MTR tests:
 - add MTR tests for clone and xtrabackup SSTs without wsrep_sst_auth,
 - add MTR test for testing masking of wsrep_sst_auth.
 - don't attempt to restore original wsrep_sst_auth in MTR tests as it
   is always masked.

Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
2024-06-10 23:29:05 +02:00
Sergei Golubchik
26f01f8be5 11.6 branch 2024-05-28 10:07:05 +02:00
Monty
b9f5793176 MDEV-9101 Limit size of created disk temporary files and tables
Two new variables added:
- max_tmp_space_usage: Limits the temporary space allowance per user
- max_total_tmp_space_usage: Limits the temporary space allowance for
  all users.

New status variables: tmp_space_used & max_tmp_space_used
New field in information_schema.process_list: TMP_SPACE_USED

The temporary space is counted for:
- All SQL level temporary files. This includes files for filesort,
  transaction temporary space, analyze, binlog_stmt_cache etc.
  It does not include engine internal temporary files used for repair,
  alter table, index pre sorting etc.
- All internal on disk temporary tables created as part of resolving a
  SELECT, multi-source update etc.

Special cases:
- When doing a commit, the last flush of the binlog_stmt_cache
  will not cause an error even if the temporary space limit is exceeded.
  This is to avoid giving errors on commit. This means that a user
  can temporarily go over the limit by up to binlog_stmt_cache_size.

Noteworthy issue:
- One has to be careful when using small values for max_tmp_space_usage
  together with binary logging and with non-transactional tables.
  If the binary log entry for the query is bigger than
  binlog_stmt_cache_size and one hits the limit of max_tmp_space_usage
  when flushing the entry to disk, the query will abort and the
  binary log will not contain the last changes to the table.
  This will also stop the slave!
  This is also true for all Aria tables, as Aria cannot do rollback
  (except in case of crashes)!
  One way to avoid it is to use @@binlog_format=statement for
  queries that update a lot of rows.

Implementation:
- All writes to temporary files or internal temporary tables that
  increase the file size are routed through temp_file_size_cb_func(),
  which updates and checks the temp space usage.
- Most of the temporary file monitoring is done inside IO_CACHE.
  Temporary file monitoring is done inside the Aria engine.
- MY_TRACK and MY_TRACK_WITH_LIMIT are new flags for init_io_cache().
  MY_TRACK means that we track the file usage. MY_TRACK_WITH_LIMIT means
  that we track the file usage and we give an error if the limit is
  breached. This is used to not give an error on commit when
  binlog_stmt_cache is flushed.
- global_tmp_space_used contains the total tmp space used so far.
  This is needed to quickly check against max_total_tmp_space_usage.
- Temporary space errors are using EE_LOCAL_TMP_SPACE_FULL and
  handler errors are using HA_ERR_LOCAL_TMP_SPACE_FULL.
  This is needed until we move general errors to their own error space
  so that they cannot conflict with system error numbers.
- Return value of my_chsize() and mysql_file_chsize() has changed
  so that -1 is returned in the case my_chsize() could not decrease
  the file size (very unlikely and will not happen on modern systems).
  All calls to _chsize() are updated to check for > 0 as the error
  condition.
- At the destruction of THD we check that THD::tmp_file_space == 0
- At server end we check that global_tmp_space_used == 0
- As a precaution against errors in the tmp_space_used code, one can set
  max_tmp_space_usage and max_total_tmp_space_usage to 0 to disable
  the tmp space quota errors.
- truncate_io_cache() function added.
- Aria tables using static or dynamic row length are registered in 8K
  increments to avoid some calls to update_tmp_file_size().
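The accounting described in the Implementation notes can be modeled as a callback that adjusts a per-connection counter and a global counter, with a limit of 0 meaning "no limit". This is a hedged sketch, not the server's temp_file_size_cb_func(): `update_tmp_space()`, `tmp_space_limits` and `session_tmp_space` are hypothetical, and only the global counter is named after the text. The `with_limit` flag stands in for MY_TRACK_WITH_LIMIT versus plain MY_TRACK.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct tmp_space_limits {
  int64_t per_user = 0;  // stand-in for max_tmp_space_usage (0 = no limit)
  int64_t total = 0;     // stand-in for max_total_tmp_space_usage (0 = no limit)
};

// Server-wide counter, named after global_tmp_space_used in the text.
std::atomic<int64_t> global_tmp_space_used{0};

// Hypothetical per-connection counter (cf. THD::tmp_file_space).
struct session_tmp_space { int64_t used = 0; };

// Called with the growth (positive) or shrinkage (negative) of a tracked
// temporary file. with_limit == false models MY_TRACK, e.g. the last
// binlog_stmt_cache flush on commit, which must not fail.
bool update_tmp_space(session_tmp_space &s, const tmp_space_limits &lim,
                      int64_t delta, bool with_limit) {
  const int64_t new_user = s.used + delta;
  // Reserve globally first so concurrent sessions see each other's usage.
  const int64_t new_total = global_tmp_space_used.fetch_add(delta) + delta;
  if (with_limit && delta > 0 &&
      ((lim.per_user && new_user > lim.per_user) ||
       (lim.total && new_total > lim.total))) {
    global_tmp_space_used.fetch_sub(delta);  // undo the reservation
    return false;  // the caller would report EE_LOCAL_TMP_SPACE_FULL
  }
  s.used = new_user;
  return true;
}
```

Releasing space (negative delta) is never refused, which matches the end-of-session expectation that every counter drains back to 0.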

Other things:
- Ensure that all handler errors are registered.  Before, some engine
  errors could be printed as "Unknown error".
- Fixed bug in filesort() that caused an assert if there was an error
  when writing to the temporary file.
- Fixed that compute_window_func() now takes into account write errors.
- In case of parallel replication, rpl_group_info::cleanup_context()
  could call trans_rollback() with thd->error set, which would cause
  an assert. Fixed by resetting the error before calling trans_rollback().
- Fixed bug in subselect3.inc which caused the following test to use
  heap tables with a low value for max_heap_table_size
- Fixed bug in sql_expression_cache where it did not overflow
  heap table to Aria table.
- Added Max_tmp_disk_space_used to slow query log.
- Fixed some bugs in log_slow_innodb.test
2024-05-27 12:39:04 +02:00
Sergei Golubchik
9293d40fa7 MDEV-33145 support for old-mode=OLD_FLUSH_STATUS
add old-mode that restores inconsistent legacy behavior for FLUSH STATUS.
It doesn't affect FLUSH { SESSION | GLOBAL } STATUS.
2024-05-27 12:39:03 +02:00
Monty
775cba4d0f MDEV-33145 Add FLUSH GLOBAL STATUS
- FLUSH GLOBAL STATUS now resets most global_status_vars.
  At this stage, this is mainly to be used for testing.
- FLUSH SESSION STATUS added as an alias for FLUSH STATUS.
- FLUSH STATUS does not require any privilege (before required RELOAD).
- FLUSH GLOBAL STATUS requires RELOAD privilege.
- All global status reset moved to FLUSH GLOBAL STATUS.
- Replication semisync status variables are now reset by
  FLUSH GLOBAL STATUS.
- In test cases, the only changes are:
  - Replace FLUSH STATUS with FLUSH GLOBAL STATUS
  - Replace FLUSH STATUS with FLUSH STATUS; FLUSH GLOBAL STATUS.
    This was only done in a few tests where the test was using SHOW STATUS
    for both local and global variables.
- Uptime_since_flush_status is now always provided, regardless of whether
  ENABLED_PROFILING is enabled when compiling MariaDB.
- @@global.Uptime_since_flush_status is reset on FLUSH GLOBAL STATUS
  and @@session.Uptime_since_flush_status is reset on FLUSH SESSION STATUS.
- When connected, @@session.Uptime_since_flush_status is set to 0.
2024-05-27 12:39:03 +02:00
Sergei Golubchik
cc758332ba ER_VARIABLE_DELETED fix typos, adjust wording, fix plugins.
plugins can have unused variables too. If they use a literal "Unused"
string a compiler might or might not merge two identical strings into
one (-fmerge-constants) and depending on that the server will or will
not issue a "variable is ignored" warning.
2024-05-27 12:39:03 +02:00
Monty
2464ee758a MDEV-33655 Remove alter_algorithm
Remove alter_algorithm but keep the variable as no-op (with a warning).

The reasons for removing alter_algorithm are:
- alter_algorithm was introduced as a replacement for the
  old_alter_table that was used to force the usage of the original
  alter table algorithm (copy) in the cases where the new alter
  algorithm did not work. The new option was added as a way to force
  the usage of a specific algorithm when it should instead have made
  it possible to disable algorithms that would not work for some
  reason.
- alter_algorithm introduced some cases where ALTER TABLE would not
  work without specifying the ALGORITHM=XXX option together with
  ALTER TABLE.
- Having different values of alter_algorithm on master and slave could
  cause slave to stop unexpectedly.
- ALTER TABLE FORCE, as used by mariadb-upgrade, would not always work
  if alter_algorithm was set for the server.
- As part of MDEV-33449 "improving repair of tables" it became
  clear that alter_algorithm made it harder to provide a better and
  more consistent ALTER TABLE FORCE and REPAIR TABLE and it would be
  better to remove it.
2024-05-27 12:39:03 +02:00
Monty
8af7a99443 Fixed warnings when using deprecated variables
Also made all unused variables use the same variable comment.
The warning will be tested with the next commit that deprecates the
variable alter_algorithm.
2024-05-27 12:39:02 +02:00
Monty
dfdedd46e4 MDEV-32188 make TIMESTAMP use whole 32-bit unsigned range
This patch extends the timestamp from
2038-01-19 03:14:07.999999 to 2106-02-07 06:28:15.999999
for 64 bit hardware and OS where 'long' is 64 bits.
This is true for 64 bit Linux but not for Windows.

This is done by treating the 32 bit stored int as unsigned instead of
signed.  This is safe as MariaDB has never accepted dates before the epoch
(1970).
The benefit of this approach is that for normal timestamps the storage is
compatible with earlier versions.
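The two upper bounds follow directly from reading the same 4 stored bytes as signed or unsigned seconds since 1970-01-01 UTC. The standalone sketch below (not MariaDB code) checks that arithmetic; `civil_from_days()` is Howard Hinnant's well-known days-to-date algorithm, and `format_utc()`, `as_signed()` and `as_unsigned()` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Days since 1970-01-01 -> proleptic Gregorian date (Hinnant's algorithm).
static void civil_from_days(int64_t z, int64_t &y, unsigned &m, unsigned &d) {
  z += 719468;
  const int64_t era = (z >= 0 ? z : z - 146096) / 146097;
  const unsigned doe = static_cast<unsigned>(z - era * 146097);
  const unsigned yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
  const unsigned doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
  const unsigned mp = (5 * doy + 2) / 153;
  d = doy - (153 * mp + 2) / 5 + 1;
  m = mp < 10 ? mp + 3 : mp - 9;
  y = static_cast<int64_t>(yoe) + era * 400 + (m <= 2);
}

// Formats non-negative seconds since the epoch as "YYYY-MM-DD hh:mm:ss" UTC.
std::string format_utc(int64_t secs) {
  const int64_t days = secs / 86400, rem = secs % 86400;
  int64_t y;
  unsigned m, d;
  civil_from_days(days, y, m, d);
  char buf[32];
  std::snprintf(buf, sizeof buf, "%04lld-%02u-%02u %02lld:%02lld:%02lld",
                static_cast<long long>(y), m, d,
                static_cast<long long>(rem / 3600),
                static_cast<long long>(rem / 60 % 60),
                static_cast<long long>(rem % 60));
  return buf;
}

// The stored 32-bit value read the old way (signed) and the new way.
int64_t as_signed(uint32_t raw) { return static_cast<int32_t>(raw); }
int64_t as_unsigned(uint32_t raw) { return raw; }
```

Any stored value with the top bit set would have been negative (before 1970) under the signed reading, and such values were never accepted, so reinterpreting them as unsigned cannot change existing data.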

However, for tables using system versioning, we previously stored a
timestamp with the year 2038 as the 'max timestamp', which is used to
detect current values. This patch stores the new 2106 year max value
as the max timestamp. This means that old tables using system
versioning need to be updated with mariadb-upgrade when moving them
to 11.4. That will be done in a separate commit.
2024-05-27 12:39:02 +02:00
Sergei Golubchik
5296f908ed MDEV-28671 post-testing fixes
Various help message improvements:
* MySQL->MariaDB, mysqld->mariadbd, "mysqld daemon" -> "mariadbd process"
* typos
* don't specify defaults directly in the help message
* don't say that an option is deprecated, mark it as such
* missing spaces in the middle of the text
etc
2024-05-27 12:39:02 +02:00
Sergei Golubchik
df10a945fc MDEV-28671 post-merge fixes
* use new deprecated printer for all deprecated server options
* restore alphabetic option sorting order
* move deprecated printer from mysqld.cc to my_getopt.c
* in --help print deprecation message at the end of the option help
* move 'ALL' help text where it belongs - to other SET options, and
  with a correct indentation.
* consistently end all or none command-line option help strings
  with a dot - my_print_help() needs that.
  It's about 50/50 now, so let's do none, less line wraps in --help
* remove trailing spaces from command-line option help strings
2024-05-27 12:39:02 +02:00
Oleksandr Byelkin
6c323c7a03 Fix version 2024-05-23 21:54:29 +02:00
Oleksandr Byelkin
dd7d9d7fb1 Merge branch '11.4' into 11.5 2024-05-23 17:01:43 +02:00
Sergei Golubchik
19f7edf420 mysqltest: support MARIADB_OPT_RESTRICTED_AUTH
C/C 3.4 disables mysql_old_password by default, so

add an option for the `connect` command to support specifying
allowed authentication plugins (MARIADB_OPT_RESTRICTED_AUTH).

use it to enable mysql_old_password when needed for testing
2024-05-21 19:40:03 +02:00
Oleksandr Byelkin
99b370e023 Merge branch '11.2' into 11.4 2024-05-21 19:38:51 +02:00
Sergei Golubchik
bf5da43e50 Merge branch '11.1' into 11.2 2024-05-13 10:00:26 +02:00
Sergei Golubchik
f8621f2a16 remove redundant slow tests 2024-05-13 09:55:28 +02:00
Sergei Golubchik
f0a5412037 Merge branch '11.0' into 11.1 2024-05-13 09:52:30 +02:00
Sergei Golubchik
f9807aadef Merge branch '10.11' into 11.0 2024-05-12 12:18:28 +02:00
Yuchen Pei
b86a2f03b6 MDEV-32640 Reset thd->lex->mi.connection_name.str towards the end of mysql_execute_command
Reset the connection_name to contain a null string, if the pointer
points to the same space as that of the system variable
default_master_connection.

We do this because the system variable may be updated which could free
the pointer and create a new one, causing use-after-free for
re-execution of prepared statements and stored procedures where the
LEX may be reused.

This allows connection_name to be set again to the system variable
pointer in the next call of this function (see earlier in this
function), after any possible updates to the system variable.
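The pattern is: if a statement-level string still borrows the storage of a system variable, drop that borrowed pointer at the end of execution so a re-executed prepared statement cannot touch freed memory. A minimal sketch, assuming a LEX_CSTRING-like pointer/length pair; `lex_cstring` and `reset_borrowed_name()` are hypothetical names, not the server's.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a LEX_CSTRING-style pointer/length pair.
struct lex_cstring {
  const char *str;
  size_t length;
};

// If name points into the system variable's buffer (which may later be
// freed and reallocated by SET), reset it so the next execution re-reads
// the variable instead of dereferencing a stale pointer.
void reset_borrowed_name(lex_cstring &name, const char *sysvar_storage) {
  if (name.str == sysvar_storage) {
    name.str = nullptr;
    name.length = 0;
  }
}
```

Names that own their storage (for example, one given explicitly in the statement) are left untouched.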
2024-05-07 14:54:13 +10:00
Sergei Golubchik
018d537ec1 Merge branch '10.6' into 10.11 2024-04-22 15:23:10 +02:00
Sergei Golubchik
3f9182126c mysqltest: support MARIADB_OPT_RESTRICTED_AUTH
C/C 3.4 disables mysql_old_password by default, so

add an option for the `connect` command to support specifying
allowed authentication plugins (MARIADB_OPT_RESTRICTED_AUTH).

use it to enable mysql_old_password when needed for testing
2024-04-22 14:59:05 +02:00
Sergei Golubchik
41296a07c8 Merge branch '10.5' into 10.6 2024-04-11 13:58:22 +02:00
Alexander Barkov
9fb8881ef8 MDEV-28366 GLOBAL debug_dbug setting affected by collation_connection=utf16...
When the system variables @@debug_dbug was assigned to
some expression, Sys_debug_dbug::do_check() did not properly
convert the value from the expression character set to utf8.
So the value was erroneously reinterpreted as utf8 without
conversion. In case of a tricky expression character set
(e.g. utf16le), this led to unexpected results.

Fix:

Re-using Sys_var_charptr::do_string_check() in Sys_debug_dbug::do_check().
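The difference between reinterpreting and converting is easy to see with a DBUG-style control string supplied in utf16le. The sketch below is a hand-written converter used only for illustration, not Sys_var_charptr::do_string_check(); `utf16le_to_utf8()` is a hypothetical name. Reading the raw bytes as utf8 leaves NUL bytes in the value, while converting recovers the intended text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reads one little-endian 16-bit code unit starting at byte i.
static char32_t unit_at(const std::string &s, size_t i) {
  return static_cast<unsigned char>(s[i]) |
         (static_cast<unsigned char>(s[i + 1]) << 8);
}

// Converts UTF-16LE bytes to UTF-8. Surrogate pairs are combined; unpaired
// surrogates and a trailing odd byte are not validated (sketch only).
std::string utf16le_to_utf8(const std::string &in) {
  std::string out;
  for (size_t i = 0; i + 1 < in.size(); i += 2) {
    char32_t c = unit_at(in, i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 3 < in.size()) {
      c = 0x10000 + ((c - 0xD800) << 10) + (unit_at(in, i + 2) - 0xDC00);
      i += 2;  // the low surrogate has been consumed as well
    }
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else if (c < 0x800) {
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      out += static_cast<char>(0xE0 | (c >> 12));
      out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (c >> 18));
      out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```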
2024-04-10 06:09:45 +04:00
Yuchen Pei
e0b6db2de7 MDEV-31609 Send initial values of system variables in first OK packet
Values of all session tracking system variables will be sent in the
first ok packet upon connection after successful authentication.

Also updated mtr to print session track info on connection (h/t Sergei
Golubchik) so that we can write mtr tests for this change.
2024-04-10 11:13:46 +10:00
Oleksandr Byelkin
cd28b2479c Merge branch '11.1' into 11.2 2024-04-09 12:12:33 +02:00
Marko Mäkelä
0892e6d028 MDEV-33585 The maximum innodb_log_buffer_size is too large
On Microsoft Windows, ReadFile() as well as WriteFile() limit the size
of the request to DWORD, which is 32 bits (at most 4 GiB - 1) also on
64-bit systems.

On FreeBSD, sysctl debug.iosize_max_clamp could limit the size of a
write request to INT_MAX. The size of a read request is always limited
to INT_MAX. This would allow the request size to be 4095 bytes more than
the Linux limit (0x7ffff000 according to "man 2 read" and "man 2 write").

On OpenBSD, Solaris and possibly NetBSD, the read request size is limited
to SSIZE_T_MAX, which would be half the current maximum
innodb_log_buffer_size. This should not be much of an issue anyway,
because on contemporary 64-bit platforms, the virtual addresses are
limited to 48 bits.

IBM AIX documentation mentions OFF_MAX which would apply when
a 64-bit application is running on a 32-bit kernel.

Let us declare innodb_log_buffer_size as 32-bit unsigned and make the
maximum 0x7ffff000, to be compatible with the least common
denominator (Linux).
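The arithmetic behind choosing 0x7ffff000 can be checked at compile time. This sketch assumes a 32-bit `int`, as on all platforms discussed above; `max_io_request` and `io_request_size_ok()` are hypothetical names, the latter only loosely modeled on the assertion added to SyncFileIO::execute().

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// The cap chosen for innodb_log_buffer_size. Being a uint32_t, it
// automatically fits the Windows DWORD limit (4 GiB - 1).
constexpr uint32_t max_io_request = 0x7ffff000;

// It fits the FreeBSD INT_MAX clamp...
static_assert(max_io_request <= INT_MAX, "fits INT_MAX");
// ...and is exactly 4095 bytes below INT_MAX, as the message notes.
static_assert(INT_MAX - max_io_request == 4095, "INT_MAX - 4095");
// It is also a whole number of 4 KiB pages.
static_assert(max_io_request % 4096 == 0, "4 KiB aligned");

// Hypothetical size check for a single synchronous read or write.
bool io_request_size_ok(uint64_t n) { return n <= max_io_request; }
```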

The maximum innodb_sort_buffer_size already was 64 MiB,
which is not a problem.

SyncFileIO::execute(): Assert that the size of a synchronous read or
write request is limited to the maximum.

Reviewed by: Vladislav Vaintroub
2024-04-09 09:32:47 +03:00
Marko Mäkelä
1122ac978e MDEV-33545: Improve innodb_doublewrite to cover NO_FSYNC
In commit 24648768b4 (MDEV-30136)
the parameter innodb_flush_method was deprecated, with no direct
replacement for innodb_flush_method=O_DIRECT_NO_FSYNC.

Let us change innodb_doublewrite from Boolean to ENUM that can
be changed while the server is running:

OFF: Assume that writes of innodb_page_size are atomic
ON: Prevent torn writes (the default)
fast: Like ON, but avoid synchronizing writes to data files

The deprecated start-up parameter innodb_flush_method=NO_FSYNC will cause
innodb_doublewrite=ON to be changed to innodb_doublewrite=fast,
which will prevent InnoDB from making any durable writes to data files.
This would normally be done right before the log checkpoint LSN is updated.
Depending on the file systems being used and their configuration,
this may or may not be safe.
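The three values and the start-up mapping above can be summarized as a small policy table. This is a hedged model, not InnoDB code: `doublewrite_mode`, `flush_policy`, `policy_for()` and `apply_deprecated_flush_method()` are hypothetical. It assumes, from the surrounding text, that OFF and ON make data file writes durable at checkpoint time, while fast does not; leaving OFF unchanged under the deprecated parameter is an assumption of this sketch, since the text only describes the ON case.

```cpp
#include <cassert>

enum class doublewrite_mode { OFF, ON, FAST };

struct flush_policy {
  bool use_doublewrite_buffer;  // page writes go through the doublewrite buffer
  bool fsync_data_files;        // data files are synced before a checkpoint
};

flush_policy policy_for(doublewrite_mode m) {
  switch (m) {
  case doublewrite_mode::OFF:  return {false, true};
  case doublewrite_mode::ON:   return {true, true};
  case doublewrite_mode::FAST: return {true, false};
  }
  return {true, true};  // not reached; keeps compilers quiet
}

// Deprecated innodb_flush_method=O_DIRECT_NO_FSYNC turns ON into fast.
doublewrite_mode apply_deprecated_flush_method(doublewrite_mode m,
                                               bool o_direct_no_fsync) {
  return (o_direct_no_fsync && m == doublewrite_mode::ON)
             ? doublewrite_mode::FAST
             : m;
}
```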

The value innodb_doublewrite=fast differs from the previous combination of
innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always
invoking os_file_flush() on the doublewrite buffer itself
in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer
when there are multiple doublewrite batches between checkpoints.
Typically, once per second, buf_flush_page_cleaner() would write out
up to innodb_io_capacity pages and advance the log checkpoint.
Also typically, innodb_io_capacity>128, which is the size of the
doublewrite buffer in pages. Should os_file_flush_func() not be invoked
between doublewrite batches, writes could be reordered in an unsafe way.

The setting innodb_doublewrite=fast could be safe when the doublewrite
buffer (the first file of the system tablespace) and the data files
reside in the same file system.

This was tested by running "./mtr --rr innodb.alter_kill". On the first
server startup, with innodb_doublewrite=fast, os_file_flush_func()
would only be invoked on the ibdata1 file and possibly ib_logfile0.
On subsequent startups with innodb_doublewrite=OFF, os_file_flush_func()
will be invoked on the individual data files during log_checkpoint().

Note: The setting debug_no_sync (in the code, my_disable_sync) would
disable all durable writes to InnoDB files, which would be much less safe.

IORequest::Type: Introduce special values WRITE_DBL and PUNCH_DBL
for asynchronous writes that are submitted via the doublewrite buffer.
In this way, fil_space_t::use_doublewrite() or buf_dblwr.in_use()
will only be consulted during buf_page_t::flush() and the doublewrite
buffer can be enabled or disabled without any fear of inconsistency.

buf_dblwr_t::block_size: Replaces block_size().

buf_dblwr_t::flush_buffered_writes(): If !in_use() and the doublewrite
buffer is empty, just invoke fil_flush_file_spaces() and return. The
doublewrite buffer could have been disabled while a batch was in
progress.

innodb_init_params(): If innodb_flush_method=O_DIRECT_NO_FSYNC,
set innodb_doublewrite=fast or innodb_doublewrite=fearless.

Thanks to Mark Callaghan for reporting this, and Vladislav Vaintroub
for feedback.
2024-04-04 08:12:54 +03:00
Marko Mäkelä
683fbced6b Merge 11.0 into 11.1 2024-03-28 12:15:36 +02:00
Marko Mäkelä
fec2fd6add Merge 10.11 into 11.0 2024-03-28 10:51:36 +02:00
Marko Mäkelä
788953463d Merge 10.6 into 10.11
Some fixes related to commit f838b2d799 and
Rows_log_event::do_apply_event() and Update_rows_log_event::do_exec_row()
for system-versioned tables were provided by Nikita Malyavin.
This was required by test versioning.rpl,trx_id,row.
2024-03-28 09:16:57 +02:00
Sergei Golubchik
f50694c52b remove pointless test 2024-03-27 16:14:55 +01:00
Marko Mäkelä
bf0b82d24b MDEV-33515 log_sys.lsn_lock causes excessive context switching
The log_sys.lsn_lock is a very contended resource with a small
critical section in log_sys.append_prepare(). On many processor
microarchitectures, replacing the system call based log_sys.lsn_lock
with a pure spin lock would fare worse during high concurrency workloads,
wasting a significant amount of CPU cycles in the spin loop.

On other microarchitectures, we would see a significant amount of time
being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
plus context switching between user and kernel address space. This was
pointed out by Steve Shaw from Intel Corporation.

Depending on the workload and the hardware implementation, it may be
useful to use a pure spin lock in log_sys.append_prepare().
We will introduce a parameter. The statement

	SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;

would enable a spin lock that will execute that many MY_RELAX_CPU()
operations (such as the x86 PAUSE instruction) between successive
attempts of acquiring the spin lock. The use of a system call based
log_sys.lsn_lock (which is the default setting) can be enabled by

	SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;

This patch will also introduce #ifdef LOG_LATCH_DEBUG
(part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
tracking of log_sys.latch ownership and reorganize the fields of
log_sys to improve the locality of reference and to reduce the
chances of false sharing.

When a spin lock is being used, it will be maintained in the
most significant bit of log_sys.buf_free. This is useful, because that is
one of the fields that is covered by the lock. For IA-32 or AMD64, we
implement the spin lock specially via log_t::lsn_lock_bts(), employing the
i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
translate into an inefficient loop around LOCK CMPXCHG.
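A portable sketch of keeping a spin lock in the most significant bit of a word whose remaining bits hold a value protected by that lock. It deliberately uses std::atomic::fetch_or() instead of the hand-written LOCK BTS path of log_t::lsn_lock_bts(); the class `lsn_spin_word` and its members are hypothetical, and `spin_wait_delay` only mimics the role of innodb_log_spin_wait_delay.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

inline void cpu_relax() {
#if defined(__i386__) || defined(__x86_64__)
  __builtin_ia32_pause();  // the x86 PAUSE instruction (GCC/Clang builtin)
#endif
}

class lsn_spin_word {
  static constexpr size_t LOCK = size_t{1} << (sizeof(size_t) * 8 - 1);
  std::atomic<size_t> word_{0};

public:
  unsigned spin_wait_delay = 50;  // relax operations between attempts

  bool try_lock() {
    return !(word_.fetch_or(LOCK, std::memory_order_acquire) & LOCK);
  }
  void lock() {
    while (!try_lock())
      for (unsigned i = 0; i < spin_wait_delay; i++) cpu_relax();
  }
  void unlock() { word_.fetch_and(~LOCK, std::memory_order_release); }
  // Call only while holding the lock; the value must stay below LOCK.
  void add(size_t n) { word_.fetch_add(n, std::memory_order_relaxed); }
  size_t value() const { return word_.load(std::memory_order_relaxed) & ~LOCK; }
  bool locked() const { return word_.load(std::memory_order_relaxed) & LOCK; }
};
```

Because the lock bit lives in the same word as the protected value, acquiring the lock also brings that value's cache line into the acquiring core, which is the locality argument made above.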

mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.

mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
implementation. This avoids introducing conditional branches.
We no longer invoke log_sys.is_pmem() at the mini-transaction level,
but we would do that in log_write_up_to().

mtr_t::finisher_update(): Update finisher when spin_wait_delay is
changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
vice versa).
2024-03-22 12:29:01 +02:00
Marko Mäkelä
b8a6719889 MDEV-26642/MDEV-26643/MDEV-32898 Implement innodb_snapshot_isolation
https://jepsen.io/analyses/mysql-8.0.34 highlights that the
transaction isolation levels in the InnoDB storage engine do not
correspond to any widely accepted definitions, such as
"Generalized Isolation Level Definitions"
https://pmg.csail.mit.edu/papers/icde00.pdf
(PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
PL-3 = SERIALIZABLE).
Only READ UNCOMMITTED in InnoDB seems to match the above definition.

The issue is that InnoDB does not detect write/write conflicts
(Section 4.4.3, Definition 6) in the above.

It appears that as soon as we implement write/write conflict detection
(SET SESSION innodb_snapshot_isolation=ON), the default isolation level
(SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
"A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf

Locking reads inside InnoDB used to read the latest committed version,
ignoring what should actually be visible to the transaction.
The added test innodb.lock_isolation illustrates this. The statement
	UPDATE t SET a=3 WHERE b=2;
is executed in a transaction that was started before a read view or
a snapshot of the current transaction was created, and committed before
the current transaction attempts to execute
	UPDATE t SET b=3;
If SET innodb_snapshot_isolation=ON is in effect when the second
transaction was started, the second transaction will be aborted with
the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
the second transaction would execute inconsistently, displaying an
incorrect SELECT COUNT(*) FROM t in its read view.

If innodb_snapshot_isolation=ON and an attempt is made to acquire a lock
on a record that does not exist in the current read view, an error
DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
be raised. This error will be treated in the same way as a deadlock:
the transaction will be rolled back.

lock_clust_rec_read_check_and_lock(): If the current transaction has
a read view where the record is not visible and
innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
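The check can be modeled with a simplified read view: a record version is visible if it was written by the viewing transaction itself, or by a transaction that had committed before the view was created. This sketch greatly simplifies real InnoDB read views; `read_view`, `lock_result` and `check_and_lock()` are hypothetical names, and the actual lock acquisition is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

using trx_id_t = uint64_t;

struct read_view {
  trx_id_t low_limit;  // ids >= low_limit started after the view
  trx_id_t creator;    // the transaction that owns this view
  std::unordered_set<trx_id_t> active;  // uncommitted when the view was made
  bool sees(trx_id_t id) const {
    return id == creator || (id < low_limit && !active.count(id));
  }
};

enum class lock_result { GRANTED, RECORD_CHANGED };

// With snapshot isolation on, fail before locking a record whose latest
// version is invisible to the transaction's read view.
lock_result check_and_lock(const read_view *view, trx_id_t record_trx_id,
                           bool snapshot_isolation) {
  if (snapshot_isolation && view && !view->sees(record_trx_id))
    return lock_result::RECORD_CHANGED;  // DB_RECORD_CHANGED -> rollback
  return lock_result::GRANTED;
}
```

In the innodb.lock_isolation scenario, the view was created while the first transaction (id 5 below) was still active; after it commits its UPDATE, the second transaction's locking read of that row fails instead of silently using a version outside its snapshot.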

row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
disable the "semi-consistent read" logic that had been implemented by
myself at the direction of Heikki Tuuri in order to address
https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
wanting UPDATE to skip locked rows that do not match the WHERE condition.
It looks like my changes were included in the MySQL 5.1.5
commit ad126d90e019f223470e73e1b2b528f9007c4532; at that time, employees
of Innobase Oy (a recent acquisition of Oracle) had lost write access to
the repository.

The only reason why we set innodb_snapshot_isolation=OFF by default is
backward compatibility with applications, such as the one that motivated
the implementation of "semi-consistent read" back in 2005. In a later
major release, we can default to innodb_snapshot_isolation=ON.

Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
and some testing of this fix.

Thanks to Vladislav Lesin for the initial test for MDEV-26643,
as well as reviewing these changes.
2024-03-20 09:48:03 +02:00
Alexander Barkov
929c2e06aa MDEV-31531 Remove my_casedn_str() and my_caseup_str()
As part of MDEV-27490 we'll add support for non-BMP identifiers
and upgrade casefolding information to Unicode version 14.0.0.
In Unicode-14.0.0 conversion to lower and upper cases can increase octet length
of the string, so conversion won't be possible in-place any more.

This patch removes virtual functions performing in-place casefolding:
  - my_charset_handler_st::casedn_str()
  - my_charset_handler_st::caseup_str()
and fixes the code to use the non-inplace functions instead:
  - my_charset_handler_st::casedn()
  - my_charset_handler_st::caseup()
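Why in-place conversion cannot work: even with simple one-to-one case mappings, the UTF-8 length can change. For example, U+023A LATIN CAPITAL LETTER A WITH STROKE (2 bytes) lowercases to U+2C65 (3 bytes). The sketch below is not the my_charset_handler_st API; `casedn_utf8()` is a hypothetical non-in-place function with a deliberately tiny case table (ASCII plus that one pair).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical lowercase table: ASCII A-Z plus U+023A -> U+2C65.
static char32_t to_lower_cp(char32_t c) {
  if (c >= U'A' && c <= U'Z') return c + 32;
  if (c == 0x023A) return 0x2C65;
  return c;
}

static void append_utf8(std::string &out, char32_t c) {
  if (c < 0x80) {
    out += static_cast<char>(c);
  } else if (c < 0x800) {
    out += static_cast<char>(0xC0 | (c >> 6));
    out += static_cast<char>(0x80 | (c & 0x3F));
  } else if (c < 0x10000) {
    out += static_cast<char>(0xE0 | (c >> 12));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (c >> 18));
    out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  }
}

// Reads src and writes a separate result whose byte length may differ.
// Assumes well-formed UTF-8 input (no validation).
std::string casedn_utf8(const std::string &src) {
  std::string dst;
  for (size_t i = 0; i < src.size();) {
    const unsigned char b = static_cast<unsigned char>(src[i]);
    const size_t n = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    char32_t c = n == 1 ? b : n == 2 ? (b & 0x1F) : n == 3 ? (b & 0x0F) : (b & 0x07);
    for (size_t k = 1; k < n; k++)
      c = (c << 6) | (static_cast<unsigned char>(src[i + k]) & 0x3F);
    append_utf8(dst, to_lower_cp(c));
    i += n;
  }
  return dst;
}
```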
2024-02-28 22:20:29 +04:00
Marko Mäkelä
d73baa402a Merge 10.11 into 11.0 2024-02-20 12:02:01 +02:00
Sergei Golubchik
eeba940311 remove deprecated since 10.4 2024-02-17 17:10:25 +01:00
Sergei Golubchik
04f0504831 11.5 branch 2024-02-17 17:10:25 +01:00
Daniel Bartholomew
9b6e267bfd bump the VERSION 2024-02-17 15:30:50 +01:00
Oleksandr Byelkin
fa69b085b1 Merge branch '11.3' into 11.4 2024-02-15 13:53:21 +01:00
Sergei Golubchik
3ae6680eec update 32bit rdiffs 2024-02-15 00:03:27 +01:00
Marko Mäkelä
64cce8d5bf Merge 10.6 into 10.11 2024-02-14 16:12:53 +02:00
Monty
18dfcfdecf MDEV-31404 Implement binlog_space_limit
binlog_space_limit is a variable in Percona server used to limit the total
size of all binary logs.

This implementation is based on code from Percona server 5.7.

In MariaDB we decided to call the variable max-binlog-total-size to be
similar to max-binlog-size. This makes it easier to find in the output
from 'mariadbd --help --verbose'. MariaDB will also support
binlog_space_limit for compatibility with Percona.

Some internal notes explaining the implementation:

- While running, MariaDB does not delete binary logs that are either
  in use by slaves or contain active XIDs that are not yet committed.

Some implementation notes:

- max-binlog-total-size is by default 0 (no limit).
- max-binlog-total-size can be changed without server restart.
- Binlog file sizes are checked on startup, or if
  max-binlog-total-size is set to a value > 0, not for every log write.
  The total size of all binary logs is cached and dynamically updated
  when updating the binary log on binary log rotation.
- max-binlog-total-size is checked against existing log files during
  server start, binlog rotation, FLUSH LOGS, when writing to binary log
  or when max-binlog-total-size changes value.
- Option --slave-connections-needed-for-purge with 1 as default added.
  This allows one to ensure that we do not delete binary logs if there
  are fewer than 'slave-connections-needed-for-purge' slaves connected.
  Without this option max-binlog-total-size would potentially delete
  binlogs needed by slaves on server startup or when a slave disconnects
  as there are then no connected slaves to protect active binlogs.
- PURGE BINARY LOGS TO ... will be executed as if
  slave-connections-needed-for-purge were zero. In other words,
  it will do the purge even if there are no slaves connected. If there
  are connected slaves working on the logs, these will be protected.
- If the binary log is on and max-binlog-total-size <> 0, then the status
  variable 'Binlog_disk_use' shows the current size of all old binary
  logs + the state of the current one.
- Removed test of strcmp(log_file_name, log_info.log_file_name) in
  purge_logs_before_date() as this is tested in can_purge_logs()
- To avoid expensive calls of log_in_use() we cache the result for the
  last log that is in use by a slave. Future calls to can_purge_logs()
  for this binary log will be quickly detected and false will be returned
  until a slave starts working on a new log.
- Note that after a binary log rotation caused by max_binlog_size,
  the last log will not be purged directly as it is still in use
  internally. The next binary log write will purge binlogs if needed.
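The purge rule described in these notes can be sketched as: drop the oldest logs while the total exceeds the limit, but stop at the first log that must be kept, never touch the active (newest) log, and do nothing when too few slaves are connected. This is a hedged model, not the server's purge code; `binlog_file` and `purge_for_total_size()` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct binlog_file {
  std::string name;
  uint64_t size;
  bool in_use;  // needed by a slave or holds uncommitted XIDs
};

// logs are ordered oldest first; the last one is the active log.
// Returns the number of purged logs.
size_t purge_for_total_size(std::vector<binlog_file> &logs, uint64_t max_total,
                            unsigned connected_slaves, unsigned slaves_needed) {
  if (max_total == 0 || connected_slaves < slaves_needed)
    return 0;  // no limit configured, or too few slaves to purge safely
  uint64_t total = 0;
  for (const auto &f : logs) total += f.size;
  size_t purged = 0;
  while (total > max_total && logs.size() > 1 && !logs.front().in_use) {
    total -= logs.front().size;
    logs.erase(logs.begin());
    purged++;
  }
  return purged;
}
```

In this model, PURGE BINARY LOGS TO corresponds to calling the function with `slaves_needed = 0`, while logs in use by connected slaves remain protected by the `in_use` stop condition.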

Reviewer:Kristian Nielsen <knielsen@knielsen-hq.org>
2024-02-14 15:02:21 +01:00
Monty
3907345e22 MDEV-33306 Optimizer choosing incorrect index in 10.6, 10.5 but not in 10.4
In MariaDB up to 10.11, the test_if_cheaper_ordering() code (that tries
to optimize how GROUP BY is executed) assumes that if a table scan is used
and there is any index usable by GROUP BY, then that index will be used.

The reason MariaDB 10.4 provides a better plan is because of two differences:
- Plans using 'ref' have a cost of 1/10 of what it should be (as a
  protection against table scans). This is why 'ref' is used in 10.4
  and not in 10.5.
- When 'ref' is used, GROUP BY will not use an index.

In MariaDB 10.5 the chosen plan is a table scan (as it is calculated to be
faster), but as 'ref' is not used, the test_if_cheaper_ordering()
optimizer phase decides to use an index for GROUP BY,
which has bad performance.

Description of fix:
- All new code is protected by the "optimizer_adjust_secondary_key_costs"
  variable, which is now a bit map, and is only executed if the option
  "disable_forced_index_in_group_by" is set.
- Corrects GROUP BY handling in test_if_cheaper_ordering() by making
  the choice of using an index with GROUP BY cost based instead of rule
  based.
- Adds TIME_FOR_COMPARE to all costs, when using group by, to make
  read_time, index_scan_time and range_cost comparable.

Other things:
- Made optimizer_adjust_secondary_key_costs a bit map (compatible with old
  code).

Notes:
Current code ignores costs for the algorithm used when doing GROUP
BY on the first table:
  - Create an in-memory temporary table for handling group by and doing a
    filesort of the result file
We can probably in 10.6 continue to ignore this cost.

This patch should NOT be merged to 11.0 series (not needed in 11.0).
2024-02-12 16:43:00 +02:00
Marko Mäkelä
86c2c89743 Merge 10.6 into 10.11 2024-02-08 15:04:46 +02:00
Marko Mäkelä
91a2192bf2 Merge 10.5 into 10.6 2024-02-07 13:51:03 +02:00
Thirunarayanan Balathandayuthapani
c31b1ee26a MDEV-33341 innodb.undo_space_dblwr test case fails with Unknown Storage Engine InnoDB
- Failed to reset the innodb_fil_make_page_dirty_debug variable in
innodb_saved_page_number_debug_basic test case.
2024-02-07 12:35:18 +02:00
Oleksandr Byelkin
d21cb43db1 Merge branch '11.2' into 11.3 2024-02-04 16:42:31 +01:00
Sergei Golubchik
75bfb4b8a3 deprecate SQL_NOTES variable in favor of NOTE_VERBOSITY
as suggested by Monty
2024-02-03 11:22:20 +01:00