MySQL 5.5 in commit 177d8b0c12
introduced a configuration parameter innodb_force_load_corrupted
whose purpose was to allow a corrupted table to be dropped.
Given that MDEV-11412 in MariaDB 10.5.4 aims to allow any metadata
for a missing or corrupted table to be dropped, and given that
MDEV-17567 and MDEV-25506 and related tasks made DDL operations
crash-safe, the parameter no longer serves any purpose.
Because this obscure parameter was read-only (not settable by a client),
it seems that we can simply declare it with MARIADB_REMOVED_OPTION
(commit 1bc9cce702) without breaking
any upgrades.
DICT_ERR_IGNORE_INDEX: Replaces DICT_ERR_IGNORE_INDEX_ROOT and
DICT_ERR_IGNORE_CORRUPT, which were always set equally.
dict_load_indexes(): Report "No indexes found for table" in
a uniform way, and only when the DICT_ERR_IGNORE_INDEX flag is
not set.
If the clustered index is marked corrupted, and the operation
is DICT_ERR_IGNORE_DROP (we are about to drop the table), we will
load the metadata; else, we will return DB_INDEX_CORRUPT.
If SYS_INDEXES.PAGE is FIL_NULL, report an error or warning
unless we are about to drop the table.
dict_load_table_one(): Simplify the logic.
There is a server startup option --gdb a.k.a. --debug-gdb that requests
signals to be set for more convenient debugging. Most notably, SIGINT
(ctrl-c) will not be ignored, and you will be able to interrupt the
execution of the server while GDB is attached to it.
When we are debugging, the signal handlers that would normally display
a terse stack trace are useless.
When we are debugging with rr, the signal handlers may interfere with
a SIGKILL that could be sent to the process by the environment, and ruin
the rr replay trace, due to a Linux kernel bug
https://lkml.org/lkml/2021/10/31/311
To be able to diagnose bugs in kill+restart tests, we may really need
both a trace before the SIGKILL and a trace of the failure after a
subsequent server startup. So, we had better avoid hitting the problem
by simply not installing those signal handlers.
Dead code cleanup:
part_info->num_parts usage was wrong and working incorrectly in
mysql_drop_partitions() because num_parts is already updated in
prep_alter_part_table(). We don't have to update part_info->partitions
because part_info is destroyed at alter_partition_lock_handling().
Cleanups:
- DBUG_EVALUATE_IF() macro replaced by shorter form DBUG_IF();
- Typo in ER_KEY_COLUMN_DOES_NOT_EXITS.
Refactorings:
- Splitted write_log_replace_delete_frm() into write_log_delete_frm()
and write_log_replace_frm();
- partition_info via DDL_LOG_STATE;
- set_part_info_exec_log_entry() removed.
DBUG_EVALUATE removed
DBUG_EVALUTATE was only added for consistency together with
DBUG_EVALUATE_IF. It is not used anywhere in the code.
DBUG_SUICIDE() fix on release build
On release DBUG_SUICIDE() was statement. It was wrong as
DBUG_SUICIDE() is used in expression context.
The crash happened because my_isalnum() does not support character
sets with mbminlen>1.
The value of "ft_boolean_syntax" is converted to utf8 in do_string_check().
So calling my_isalnum() is combination with "default_charset_info" was wrong.
Adding new parameters (size_t length, CHARSET_INFO *cs) to
ft_boolean_check_syntax_string() and passing self->charset(thd)
as the character set.
This patch is the plan D variant for fixing potetial mutex locking
order exercised by BF aborting and KILL command execution.
In this approach, KILL command is replicated as TOI operation.
This guarantees total isolation for the KILL command execution
in the first node: there is no concurrent replication applying
and no concurrent DDL executing. Therefore there is no risk of
BF aborting to happen in parallel with KILL command execution
either. Potential mutex deadlocks between the different mutex
access paths with KILL command execution and BF aborting cannot
therefore happen.
TOI replication is used, in this approach, purely as means
to provide isolated KILL command execution in the first node.
KILL command should not (and must not) be applied in secondary
nodes. In this patch, we make this sure by skipping KILL
execution in secondary nodes, in applying phase, where we
bail out if applier thread is trying to execute KILL command.
This is effective, but skipping the applying of KILL command
could happen much earlier as well.
This patch also fixes mutex locking order and unprotected
THD member accesses on bf aborting case. We try to hold
THD::LOCK_thd_data during bf aborting. Only case where it
is not possible is at wsrep_abort_transaction before
call wsrep_innobase_kill_one_trx where we take InnoDB
mutexes first and then THD::LOCK_thd_data.
This will also fix possible race condition during
close_connection and while wsrep is disconnecting
connections.
Added wsrep_bf_kill_debug test case
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
- DISCARD/IMPORT TABLESPACE are the only tablespace commands left
- TABLESPACE arguments for CREATE TABLE and ALTER ... ADD PARTITION are
ignored.
- Tablespace names are not shown anymore in .frm and not shown in
information schema
Other things
- Removed end spaces from sql/CMakeList.txt
This adds following new thread states:
* waiting to execute in isolation - DDL is waiting to execute in TOI mode.
* waiting for TOI DDL - some other statement is waiting for DDL to complete.
* waiting for flow control - some statement is paused while flow control is in effect.
* waiting for certification - the transaction is being certified.
in about a hundred of users of MY_BITMAP, only two were using its
built-in mutex, and only one of those two was actually needing it.
Remove the mutex from MY_BITMAP, remove all associated conditions
and checks in bitmap functions. Use an external LOCK_temp_pool
mutex and temp_pool_set_next/temp_pool_clear_bit acccessors.
Remove bitmap_init/bitmap_free, always use my_* versions.
because the name was misleading, it counts not threads, but THDs,
and as THD_count is the only way to increment/decrement it, it
could as well be declared inside THD_count.
This is useful for thing like Item_true and Item_false that we
allocated and initalize once and want to ensure that nothing can
change them
Main changes:
- Memory protection is achived by allocating memory with mmap() and
protect it from write with mprotect()
- init_alloc_root(...,MY_ROOT_USE_MPROTECT) will create a
memroot that one can later use with protect_root() to turn it
read only or turn it back to read-write. All allocations to this
memroot is done with mmap() to ensure page alligned allocations.
- alloc_root() code was rearranged to combine normal and valgrind code.
- init_alloc_root() now changes block size to be power of 2's, to get less
memory fragmentation.
- Changed MEM_ROOT structure to make it smaller. Also renamed
MEM_ROOT m_psi_key to psi_key.
- Moved MY_THREAD_SPECIFIC marker in MEM_ROOT from block size (old hack)
to flags.
- Added global variable my_system_page_size. This is initialized at
startup.
The issue was that we sent two different signals to different threads
after each other. The DEBUG_SYNC functionality cannot handle this (as
the signal is stored in a global variable) and the first one can get
lost.
Fixed by using the same signal for both threads.
This fixed the MySQL bug# 20338 about misuse of double underscore
prefix __WIN__, which was old MySQL's idea of identifying Windows
Replace it by _WIN32 standard symbol for targeting Windows OS
(both 32 and 64 bit)
Not that connect storage engine is not fixed in this patch (must be
fixed in "upstream" branch)
On shutdown, to kill remaining connections, do the same thing that
server does during KILL CONNECTION, i.e thd->awake().
The stripped-down KILL version, that was used prior to this patch
for shutdown, missed the engine specific part (ha_kill_query)
before change test:
strace -fe trace=file -o /tmp/f.strace sql/mysqld --datadir=/tmp/d --log-bin=foo-bin --help --verbose && ls -la /tmp/
...
'mysqladmin variables' instead of 'mysqld --verbose --help'.
total 0
drwxrwxr-x. 2 dan dan 60 May 19 18:05 .
drwxrwxrwt. 27 root root 640 May 19 18:03 ..
-rw-rw----. 1 dan dan 0 May 19 18:05 foo-bin.index
MDEV-25604 Atomic DDL: Binlog event written upon recovery does not
have default database
The purpose of this task is to ensure that ALTER TABLE is atomic even if
the MariaDB server would be killed at any point of the alter table.
This means that either the ALTER TABLE succeeds (including that triggers,
the status tables and the binary log are updated) or things should be
reverted to their original state.
If the server crashes before the new version is fully up to date and
commited, it will revert to the original table and remove all
temporary files and tables.
If the new version is commited, crash recovery will use the new version,
and update triggers, the status tables and the binary log.
The one execption is ALTER TABLE .. RENAME .. where no changes are done
to table definition. This one will work as RENAME and roll back unless
the whole statement completed, including updating the binary log (if
enabled).
Other changes:
- Added handlerton->check_version() function to allow the ddl recovery
code to check, in case of inplace alter table, if the table in the
storage engine is of the new or old version.
- Added handler->table_version() so that an engine can report the current
version of the table. This should be changed each time the table
definition changes.
- Added ha_signal_ddl_recovery_done() and
handlerton::signal_ddl_recovery_done() to inform all handlers when
ddl recovery has been done. (Needed by InnoDB).
- Added handlerton call inplace_alter_table_committed, to signal engine
that ddl_log has been closed for the alter table query.
- Added new handerton flag
HTON_REQUIRES_NOTIFY_TABLEDEF_CHANGED_AFTER_COMMIT to signal when we
should call hton->notify_tabledef_changed() during
mysql_inplace_alter_table. This was required as MyRocks and InnoDB
needed the call at different times.
- Added function server_uuid_value() to be able to generate a temporary
xid when ddl recovery writes the query to the binary log. This is
needed to be able to handle crashes during ddl log recovery.
- Moved freeing of the frm definition to end of mysql_alter_table() to
remove duplicate code and have a common exit strategy.
-------
InnoDB part of atomic ALTER TABLE
(Implemented by Marko Mäkelä)
innodb_check_version(): Compare the saved dict_table_t::def_trx_id
to determine whether an ALTER TABLE operation was committed.
We must correctly recover dict_table_t::def_trx_id for this to work.
Before purge removes any trace of DB_TRX_ID from system tables, it
will make an effort to load the user table into the cache, so that
the dict_table_t::def_trx_id can be recovered.
ha_innobase::table_version(): return garbage, or the trx_id that would
be used for committing an ALTER TABLE operation.
In InnoDB, table names starting with #sql-ib will remain special:
they will be dropped on startup. This may be revisited later in
MDEV-18518 when we implement proper undo logging and rollback
for creating or dropping multiple tables in a transaction.
Table names starting with #sql will retain some special meaning:
dict_table_t::parse_name() will not consider such names for
MDL acquisition, and dict_table_rename_in_cache() will treat such
names specially when handling FOREIGN KEY constraints.
Simplify InnoDB DROP INDEX.
Prevent purge wakeup
To ensure that dict_table_t::def_trx_id will be recovered correctly
in case the server is killed before ddl_log_complete(), we will block
the purge of any history in SYS_TABLES, SYS_INDEXES, SYS_COLUMNS
between ha_innobase::commit_inplace_alter_table(commit=true)
(purge_sys.stop_SYS()) and purge_sys.resume_SYS().
The completion callback purge_sys.resume_SYS() must be between
ddl_log_complete() and MDL release.
--------
MyRocks support for atomic ALTER TABLE
(Implemented by Sergui Petrunia)
Implement these SE API functions:
- ha_rocksdb::table_version()
- hton->check_version = rocksdb_check_versionMyRocks data dictionary
now stores table version for each table.
(Absence of table version record is interpreted as table_version=0,
that is, which means no upgrade changes are needed)
- For inplace alter table of a partitioned table, call the underlying
handlerton when checking if the table is ok. This assumes that the
partition engine commits all changes at once.
- Major rewrite of ddl_log.cc and ddl_log.h
- ddl_log.cc described in the beginning how the recovery works.
- ddl_log.log has unique signature and is dynamic. It's easy to
add more information to the header and other ddl blocks while still
being able to execute old ddl entries.
- IO_SIZE for ddl blocks is now dynamic. Can be changed without affecting
recovery of old logs.
- Code is more modular and is now usable outside of partition handling.
- Renamed log file to dll_recovery.log and added option --log-ddl-recovery
to allow one to specify the path & filename.
- Added ddl_log_entry_phase[], number of phases for each DDL action,
which allowed me to greatly simply set_global_from_ddl_log_entry()
- Changed how strings are stored in log entries, which allows us to
store much more information in a log entry.
- ddl log is now always created at start and deleted on normal shutdown.
This simplices things notable.
- Added probes debug_crash_here() and debug_simulate_error() to simply
crash testing and allow crash after a given number of times a probe
is executed. See comments in debug_sync.cc and rename_table.test for
how this can be used.
- Reverting failed table and view renames is done trough the ddl log.
This ensures that the ddl log is tested also outside of recovery.
- Added helper function 'handler::needs_lower_case_filenames()'
- Extend binary log with Q_XID events. ddl log handling is using this
to check if a ddl log entry was logged to the binary log (if yes,
it will be deleted from the log during ddl_log_close_binlogged_events()
- If a DDL entry fails 3 time, disable it. This is to ensure that if
we have a crash in ddl recovery code the server will not get stuck
in a forever crash-restart-crash loop.
mysqltest.cc changes:
- --die will now replace $variables with their values
- $error will contain the error of the last failed statement
storage engine changes:
- maria_rename() was changed to be more robust against crashes during
rename.
This change removed 68 explict strlen() calls from the code.
The following renames was done to ensure we don't use the old names
when merging code from earlier releases, as using the new variables
for print function could result in crashes:
- charset->csname renamed to charset->cs_name
- charset->name renamed to charset->coll_name
Almost everything where mechanical changes except:
- Changed to use the new Protocol::store(LEX_CSTRING..) when possible
- Changed to use field->store(LEX_CSTRING*, CHARSET_INFO*) when possible
- Changed to use String->append(LEX_CSTRING&) when possible
Other things:
- There where compiler issues with ensuring that all character set names
points to the same string: gcc doesn't allow one to use integer constants
when defining global structures (constant char * pointers works fine).
To get around this, I declared defines for each character set name
length.
- Remove 'dummy_for_valgrind' overrun marker as this doesn't help much.
The element also distorts the sizes of objects a bit, which makes it
harder to calculate gain in object sizes when doing size optimizations.
- Replace usage of thd_get_current_thd() with _current_thd()
- Avoid one extra call indirection when using thd_get_current_thd(), which
is used by Sql_alloc, by replacing it with _current_thd()