database/mariadb

mirror of https://github.com/MariaDB/server.git synced 2025-08-08 11:22:35 +03:00

Commit Graph


Marko Mäkelä

1122ac978e

MDEV-33545: Improve innodb_doublewrite to cover NO_FSYNC

In commit 24648768b4 (MDEV-30136)
the parameter innodb_flush_method was deprecated, with no direct
replacement for innodb_flush_method=O_DIRECT_NO_FSYNC.

Let us change innodb_doublewrite from Boolean to ENUM that can
be changed while the server is running:

OFF: Assume that writes of innodb_page_size are atomic
ON: Prevent torn writes (the default)
fast: Like ON, but avoid synchronizing writes to data files

The deprecated start-up parameter innodb_flush_method=O_DIRECT_NO_FSYNC will cause
innodb_doublewrite=ON to be changed to innodb_doublewrite=fast,
which will prevent InnoDB from making any durable writes to data files.
This would normally be done right before the log checkpoint LSN is updated.
Depending on the file systems being used and their configuration,
this may or may not be safe.
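
The start-up mapping described above can be modelled roughly as follows. This is a minimal sketch in Python, not the InnoDB code: the `Doublewrite` enum and the `resolve_doublewrite()` helper are hypothetical names, and leaving OFF and fast untouched is an assumption, since the text only states that ON is changed to fast.

```python
from enum import Enum


class Doublewrite(Enum):
    OFF = "OFF"    # assume that writes of innodb_page_size are atomic
    ON = "ON"      # prevent torn writes (the default)
    FAST = "fast"  # like ON, but avoid synchronizing writes to data files


def resolve_doublewrite(doublewrite, flush_method=None):
    """Return the effective setting after applying the deprecated parameter.

    Hypothetical helper: only the ON -> fast rule comes from the text.
    """
    if flush_method == "O_DIRECT_NO_FSYNC" and doublewrite is Doublewrite.ON:
        return Doublewrite.FAST
    # Assumption: any other combination is left as configured.
    return doublewrite
```

For example, `resolve_doublewrite(Doublewrite.ON, "O_DIRECT_NO_FSYNC")` yields `Doublewrite.FAST`, while `Doublewrite.ON` without the deprecated parameter stays `Doublewrite.ON`.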

The value innodb_doublewrite=fast differs from the previous combination of
innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always
invoking os_file_flush() on the doublewrite buffer itself
in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer
when there are multiple doublewrite batches between checkpoints.
Typically, once per second, buf_flush_page_cleaner() would write out
up to innodb_io_capacity pages and advance the log checkpoint.
Also typically, innodb_io_capacity>128, which is the size of the
doublewrite buffer in pages. Should os_file_flush_func() not be invoked
between doublewrite batches, writes could be reordered in an unsafe way.
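
To see why several doublewrite batches can occur between checkpoints, the arithmetic can be sketched as below. The helper name is hypothetical; the 128-page batch size is the figure given in the text, and the page counts in the usage note are illustrative values only.

```python
import math

DOUBLEWRITE_BATCH_PAGES = 128  # size of the doublewrite buffer in pages, per the text


def doublewrite_batches(pages_to_flush):
    """Number of doublewrite batches needed to write out the given page count."""
    return math.ceil(pages_to_flush / DOUBLEWRITE_BATCH_PAGES)
```

With 200 pages written in one page cleaner iteration, `doublewrite_batches(200)` is 2; with 1000 pages it is 8. Each batch boundary is a point where a flush of the doublewrite buffer keeps the writes ordered.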

The setting innodb_doublewrite=fast could be safe when the doublewrite
buffer (the first file of the system tablespace) and the data files
reside in the same file system.

This was tested by running "./mtr --rr innodb.alter_kill". On the first
server startup, with innodb_doublewrite=fast, os_file_flush_func()
would only be invoked on the ibdata1 file and possibly ib_logfile0.
On subsequent startups with innodb_doublewrite=OFF, os_file_flush_func()
will be invoked on the individual data files during log_checkpoint().

Note: The setting debug_no_sync (in the code, my_disable_sync) would
disable all durable writes to InnoDB files, which would be much less safe.

IORequest::Type: Introduce special values WRITE_DBL and PUNCH_DBL
for asynchronous writes that are submitted via the doublewrite buffer.
In this way, fil_space_t::use_doublewrite() or buf_dblwr.in_use()
will only be consulted during buf_page_t::flush() and the doublewrite
buffer can be enabled or disabled without any fear of inconsistency.
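
The idea of recording the decision in the request type can be illustrated with a small model. This is a hedged sketch, not the actual IORequest code: `RequestType`, `DoublewriteBuffer` and `write_completed()` are hypothetical stand-ins. The point is that the enabled/disabled state is read once when the page is submitted, and completion handling relies only on the type carried by the request.

```python
from enum import Enum, auto


class RequestType(Enum):
    WRITE_ASYNC = auto()  # plain asynchronous write
    WRITE_DBL = auto()    # asynchronous write submitted via the doublewrite buffer


class DoublewriteBuffer:
    def __init__(self):
        self.in_use = True
        self.pending = []  # pages buffered for the current batch

    def flush_page(self, page_id):
        """Consult in_use exactly once, at submission time."""
        if self.in_use:
            self.pending.append(page_id)
            return RequestType.WRITE_DBL
        return RequestType.WRITE_ASYNC


def write_completed(dblwr, page_id, request_type):
    """Completion depends on the request type, not on the current in_use value."""
    if request_type is RequestType.WRITE_DBL:
        dblwr.pending.remove(page_id)
```

If the buffer is disabled between `flush_page()` and `write_completed()`, the page is still removed from the pending batch, because the request remembers how it was submitted.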

buf_dblwr_t::block_size: Replaces block_size().

buf_dblwr_t::flush_buffered_writes(): If !in_use() and the doublewrite
buffer is empty, just invoke fil_flush_file_spaces() and return. The
doublewrite buffer could have been disabled while a batch was in
progress.

innodb_init_params(): If innodb_flush_method=O_DIRECT_NO_FSYNC,
change innodb_doublewrite=ON to innodb_doublewrite=fast.

Thanks to Mark Callaghan for reporting this, and Vladislav Vaintroub
for feedback.

2024-04-04 08:12:54 +03:00

Marko Mäkelä

ac2410f6d8

Bug#19330255 WL#7142 - CRASH DURING ALTER TABLE LEADS TO DATA DICTIONARY INCONSISTENCY

The server crashes on a SELECT because of space id mismatch. The
mismatch happens if the server crashes during an ALTER TABLE.

There are actually two cases of inconsistency, and three fixes needed
for the InnoDB problems.

We have dictionary data (tablespace or table name) in 3 places:

(a) The *.frm file is for the old table definition.
(b) The InnoDB data dictionary is for the new table definition.
(c) The file system did not rename the tablespace files yet.

In this fix, we will not care if the *.frm file is in sync with the
InnoDB data dictionary and file system. We will concentrate on the
mismatch between (b) and (c).

Two scenarios have been mentioned in this bug report. The simpler one
first:

1. The changes to SYS_TABLES were committed, and MLOG_FILE_RENAME2
records were written in a single mini-transaction commit.
The files were not yet renamed in the file system.
2a. The server is killed, without making a log checkpoint.
3a. The server refuses to start up, because replaying MLOG_FILE_RENAME2
fails.

I failed to repeat this myself. I repeated step 3a with a saved
dataset. The problem seems to be that MLOG_FILE_RENAME2 replay is
incorrectly being skipped when there is no page-redo log or
MLOG_FILE_NAME record for the old name of the tablespace.

FIX#1: Recover the id-to-name mapping also from MLOG_FILE_RENAME2
records when scanning the redo log. It is not necessary to write
MLOG_FILE_NAME records in addition to MLOG_FILE_RENAME2 records for
renaming tablespace files.
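
FIX#1 can be pictured with a simplified scan over redo log records. This is a sketch under stated assumptions, not the recovery code: the tuple layout, the `recover_names()` helper and the file names in the usage note are hypothetical, and only the two record types needed for the illustration are handled.

```python
def recover_names(records):
    """Build a space_id -> file name map from scanned redo log records.

    Assumed record shapes (hypothetical):
      ("MLOG_FILE_NAME", space_id, name)
      ("MLOG_FILE_RENAME2", space_id, old_name, new_name)
    """
    names = {}
    for rec in records:
        kind, space_id = rec[0], rec[1]
        if kind == "MLOG_FILE_NAME":
            names[space_id] = rec[2]
        elif kind == "MLOG_FILE_RENAME2":
            # FIX#1: a rename record alone is enough to learn the old name,
            # so the rename is not skipped for lack of a MLOG_FILE_NAME record.
            names.setdefault(space_id, rec[2])
    return names
```

A log that contains only `("MLOG_FILE_RENAME2", 10, "a.ibd", "b.ibd")` now yields the mapping `{10: "a.ibd"}`, so the rename can be replayed.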

The scenario in the original Description involves a log checkpoint:
1. The changes to SYS_TABLES were committed, and MLOG_FILE_RENAME2
records were written in a single mini-transaction commit.
2. A log checkpoint and a server kill were injected.
3. Crash recovery will see no records (other than the MLOG_CHECKPOINT).
4. dict_check_tablespaces_and_store_max_id() will emit a message about
a non-found table #sql-ib22*.
5. A mismatch is triggering the assertion failure.

In my test, at step 4 the SYS_TABLES root page (0:8) contains these 3
records right before the page supremum:
* delete-marked (committed) name=#sql-ib21* record, with space=10.
* name=#sql-ib22*, space=9.
* name=t1, space=10.
space=10 is the rebuilt table (#sql-ib21*.ibd in the file system).
space=9 is the old table (t1.ibd in the file system).

The function dict_check_tablespaces_and_store_max_id() will enter
t1.ibd with space_id=10 into the fil_system cache without noticing
that t1.ibd contains space_id=9, because it invokes
fil_open_single_table_tablespace() with validate=false.

In MySQL 5.6, the space_id from all *.ibd files is read when
the redo log checkpoint LSN disagrees with the FIL_PAGE_FILE_FLUSH_LSN
in the system tablespace. This field is only updated during a clean
shutdown, after performing the final log checkpoint.

FIX#2: dict_check_tablespaces_and_store_max_id() should pass
validate=true to fil_open_single_table_tablespace() when a non-clean
shutdown is detected, forcing the first page of each *.ibd file to be
read. (We do not want to slow down startup after a normal shutdown.)
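
The effect of the validate flag in FIX#2 can be sketched as follows. The function names and dictionary shapes are hypothetical simplifications; the real code reads the first page of each *.ibd file, which is modelled here by a callable.

```python
def open_tablespace(expected_space_id, read_first_page_space_id, validate):
    """Return True if the file may be entered into the tablespace cache.

    read_first_page_space_id is a callable standing in for reading the
    space id from the first page of the file; it is only called when
    validate is set.
    """
    if not validate:
        return True  # trust the data dictionary (fast path after clean shutdown)
    return read_first_page_space_id() == expected_space_id


def check_tablespaces(dictionary, files, clean_shutdown):
    """dictionary: {table name: space id}; files: {table name: space id in file}."""
    validate = not clean_shutdown
    return {
        name: open_tablespace(space_id, lambda n=name: files[n], validate)
        for name, space_id in dictionary.items()
    }
```

With the situation from the test above, where the dictionary says t1 has space id 10 but t1.ibd contains 9, `check_tablespaces({"t1": 10}, {"t1": 9}, clean_shutdown=False)` reports `{"t1": False}`, whereas with `clean_shutdown=True` the mismatch goes unnoticed.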

With FIX#2, the SELECT would fail to find the table. This would
introduce a regression, because before WL#7142, a copy of the table
was accessible after recovery.

FIX#3: Maintain a list of MLOG_FILE_RENAME2 records that have been
written to the redo log, but not performed yet in the file system.
When performing a checkpoint, re-emit these records to the redo
log. In this way, a mismatch between (b) and (c) should be impossible.
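
The bookkeeping behind FIX#3 can be modelled like this. It is a simplified sketch, not the actual log_sys code: the `RedoLog` class is hypothetical, and a checkpoint is modelled as discarding all earlier records, which is an assumption made to keep the example short.

```python
class RedoLog:
    def __init__(self):
        self.records = []
        self.append_on_checkpoint = []  # renames logged but not yet done on disk

    def log_rename(self, space_id, old_name, new_name):
        rec = ("MLOG_FILE_RENAME2", space_id, old_name, new_name)
        self.records.append(rec)
        self.append_on_checkpoint.append(rec)
        return rec

    def rename_done(self, rec):
        """Call once the file has been renamed in the file system."""
        self.append_on_checkpoint.remove(rec)

    def checkpoint(self):
        # Records before the checkpoint are no longer seen by recovery,
        # so pending renames are re-emitted after it.
        self.records = [("MLOG_CHECKPOINT",)] + list(self.append_on_checkpoint)
```

A rename that is logged but not yet performed survives `checkpoint()` in the log; once `rename_done()` has been called, the next checkpoint no longer re-emits it.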

fil_name_process(): Refactored from fil_name_parse(). Adds an item to
the id-to-filename mapping.

fil_name_parse(): Parses and applies a MLOG_FILE_NAME,
MLOG_FILE_DELETE or MLOG_FILE_RENAME2 record. This implements FIX#1.

fil_name_write_rename(): A wrapper function for writing
MLOG_FILE_RENAME2 records.

fil_op_replay_rename(): Apply MLOG_FILE_RENAME2 records. Replaces
fil_op_log_parse_or_replay(), whose logic was moved to fil_name_parse().

fil_tablespace_exists_in_mem(): Return fil_space_t* instead of bool.

dict_check_tablespaces_and_store_max_id(): Add the parameter
"validate" to implement FIX#2.

log_sys->append_on_checkpoint: Extra log records to append in case of
a checkpoint. Needed for FIX#3.

log_append_on_checkpoint(): New function, to update
log_sys->append_on_checkpoint.

mtr_write_log(): New function, to append mtr_buf_t to the redo log.

fil_names_clear(): Append the data from log_sys->append_on_checkpoint
if needed.

ha_innobase::commit_inplace_alter_table(): Add any MLOG_FILE_RENAME2
records to log_sys->append_on_checkpoint, and remove them once the
files have been renamed in the file system.

mtr_buf_copy_t: A helper functor for copying a mini-transaction log.

rb#6282 approved by Jimmy Yang

2018-05-16 15:03:09 +05:30

2 Commits