In read-only mode, InnoDB does not allow a checkpoint to happen.
So InnoDB should issue a warning when it tries to
force a checkpoint while innodb_read_only = 1 or
innodb_force_recovery = 6.
As part of commit 685d958e38 (MDEV-14425)
the parameter innodb_log_write_ahead_size was removed, because it was
thought that determining the physical block size would be a sufficient
replacement.
However, we can only determine the physical block size on Linux or
Microsoft Windows. On some file systems, the physical block size
is not relevant. For example, XFS uses a block size of 4096 bytes
even when the underlying device block size is smaller.
On Linux, we failed to determine the physical block size if
innodb_log_file_buffered=OFF was not requested or possible.
This will be fixed.
log_sys.write_size: The value of the reintroduced parameter
innodb_log_write_ahead_size. To keep it simple, this is read-only
and a power of two between 512 and 4096 bytes, so that the previous
alignment guarantees are fulfilled. This will replace the previous
log_sys.get_block_size().
log_sys.block_size, log_t::get_block_size(): Remove.
log_t::set_block_size(): Ensure that write_size will not be less
than the physical block size. There is no point in invoking this
function with 512 or less, because that is the minimum value of
write_size.
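A minimal sketch of the constraints described above, using a
simplified stand-in for log_t (hypothetical, not the actual
definition):

    #include <cstdint>

    struct log_t_sketch
    {
      // innodb_log_write_ahead_size: read-only, power of 2 in [512, 4096]
      uint32_t write_size= 512;

      // Ensure that write_size is not less than the physical block size.
      void set_block_size(uint32_t physical_block_size) noexcept
      {
        if (physical_block_size > write_size)
          write_size= physical_block_size; // assumed power of 2, <= 4096
      }
    };

    // Assumed parameter validation: accept only powers of 2 in [512, 4096].
    static bool valid_write_ahead_size(uint32_t v)
    {
      return v >= 512 && v <= 4096 && (v & (v - 1)) == 0;
    }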
innodb_params_adjust(): Add some disabled code for adjusting
the minimum value and default value of innodb_log_write_ahead_size
to reflect log_sys.write_size.
log_t::set_recovered(): Mark the recovery completed. This is the
place to adjust some things if we want to allow write_size>4096.
log_t::resize_write_buf(): Refer to write_size.
log_t::resize_start(): Refer to write_size instead of get_block_size().
log_write_buf(): Simplify some arithmetic and remove a goto.
log_t::write_buf(): Refer to write_size. If we are writing less than
that, do not switch buffers, but keep writing to the same buffer.
Move some code to improve the locality of reference.
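A sketch of the alignment arithmetic implied here (write_size is a
power of 2, so the write length can be rounded up with a mask; the
function name is illustrative):

    #include <cstddef>

    // Round the number of buffered log bytes up to a multiple of
    // write_size (a power of 2). If the last block is only partially
    // filled, the next write_buf() call keeps writing to the same
    // buffer and rewrites that block in full.
    static size_t aligned_write_length(size_t used_bytes, size_t write_size)
    {
      return (used_bytes + write_size - 1) & ~(write_size - 1);
    }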
recv_scan_log(): Refer to write_size instead of get_block_size().
os_file_create_func(): For type==OS_LOG_FILE on Linux, always invoke
os_file_log_maybe_unbuffered(), so that log_sys.set_block_size() will
be invoked even if we are not attempting to use O_DIRECT.
recv_sys_t::find_checkpoint(): Read the entire log header
in a single 12 KiB request into log_sys.buf.
Tested with:
./mtr --loose-innodb-log-write-ahead-size=4096
./mtr --loose-innodb-log-write-ahead-size=2048
innodb_doublewrite_update(): Disallow any change if srv_read_only_mode
holds, that is, the server was started with innodb_read_only=ON or
innodb_force_recovery=6.
This fixes up commit 1122ac978e (MDEV-33545).
- A few test cases should make sure that InnoDB does hit
the debug sync point during the startup of the server.
The double quotes around the debug point in the restart
parameters can be removed.
The fixes in b8a6719889 did not disable
semi-consistent read for innodb_snapshot_isolation=ON mode; they just allowed
reading an uncommitted version of a record, which is why the test for
MDEV-26643 worked well.
Semi-consistent read should instead be disabled at the upper level, in
row_search_mvcc(), for the READ COMMITTED isolation level.
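As an illustrative sketch (the enum and field names are invented for
this example, not the actual trx_t layout):

    // Illustrative only: semi-consistent read applies to READ COMMITTED,
    // and must be disabled when snapshot isolation is enabled.
    enum isolation_level { READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ };

    struct trx_sketch
    {
      isolation_level isolation;
      bool snapshot_isolation; // innodb_snapshot_isolation=ON
    };

    static bool use_semi_consistent_read(const trx_sketch &trx)
    {
      return trx.isolation == READ_COMMITTED && !trx.snapshot_isolation;
    }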
Reviewed by Marko Mäkelä.
- InnoDB tries to write a FILE_CHECKPOINT marker during
early recovery when the log file size is insufficient.
While updating the log checkpoint at the end of recovery,
InnoDB must already have written out all pending changes
to the persistent files. To complete the checkpoint, InnoDB
has to write some log records for the checkpoint and to
update the checkpoint header. If the server gets killed
before updating the checkpoint header, it would leave
the log file unrecoverable.
- This patch avoids FILE_CHECKPOINT marker during early
recovery and narrows down the window of opportunity to
make the log file unrecoverable.
- The deadlock counter was moved from
Deadlock::find_cycle into Deadlock::report, because
the find_cycle method is called multiple times during the deadlock
detection flow, which means it should not have such side effects.
But report() can, as it is called only once for
a victim transaction.
- Also, the deadlock_detect.test and *.result test case
have been extended to cover the fix.
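A schematic sketch of the side-effect rule (simplified; not the
actual Deadlock class):

    // Schematic: find_cycle() may run several times per wait, so it
    // must stay free of side effects; report() runs once per victim,
    // so the counter increment belongs there.
    struct Deadlock_sketch
    {
      unsigned long deadlock_count= 0;

      bool find_cycle() const
      {
        // pure detection: may be called repeatedly, no side effects
        return false;
      }

      void report()
      {
        ++deadlock_count; // called exactly once for a victim transaction
      }
    };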
number of non-user tablespaces.
fil_space_t::try_to_close(): Don't try to close
a tablespace that has been acquired by the caller of
the function.
Added the suppression message to the open_files_limit test case.
number of non-user tablespaces.
- InnoDB only closes user tablespaces when the number of open
files exceeds the innodb_open_files limit. In that case, InnoDB should
make sure that the innodb_open_files value is greater
than the number of undo, system and temporary tablespace files.
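A minimal sketch of the implied lower bound (the functions are
hypothetical, not the actual server code):

    #include <algorithm>
    #include <cstdint>

    // Hypothetical: innodb_open_files must be strictly greater than the
    // number of files InnoDB never closes (undo tablespaces, system and
    // temporary tablespace), since only user tablespaces are ever closed.
    static uint32_t innodb_open_files_min(uint32_t n_undo_tablespaces)
    {
      const uint32_t non_user= n_undo_tablespaces
                               + 1   // system tablespace
                               + 1;  // temporary tablespace
      return non_user + 1;
    }

    static uint32_t adjust_innodb_open_files(uint32_t configured,
                                             uint32_t n_undo_tablespaces)
    {
      return std::max(configured, innodb_open_files_min(n_undo_tablespaces));
    }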
When the checkpoint age goes beyond the sync flush threshold and
buf_flush_sync_lsn is set, the page cleaner enters the "furious flush"
stage to aggressively flush dirty pages from the flush list and pull the
checkpoint LSN above a safe margin. In this stage, the page cleaner
skips LRU flushing and eviction.
In 10.6, all other threads rely entirely on the page cleaner to generate
free pages. If free pages run out while the page cleaner is busy in the
"furious flush" stage, a session thread could wait for a free page in the
middle of a mini-transaction (mtr) while holding latches on other pages.
That, in turn, can prevent the page cleaner from flushing such pages,
preventing the checkpoint LSN from moving forward and creating a deadlock.
Even otherwise, it could create a stall and hang-like situation for a
large BP with plenty of dirty pages to flush before the stage can finish.
Fix: During furious flush, check and evict LRU pages after each flush
iteration.
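A schematic sketch of the fixed loop (the helper names are
placeholders for the buf0flu.cc logic, not actual functions):

    #include <cstdint>

    extern uint64_t buf_flush_sync_lsn;  // furious-flush target
    uint64_t oldest_modification_lsn();  // oldest dirty page LSN
    void flush_list_batch();             // flush dirty pages in LSN order
    void evict_from_lru();               // evict clean pages from LRU tail

    // Schematic "furious flush" loop with the fix applied: after every
    // flush-list iteration, also check and evict LRU pages so that
    // sessions waiting for a free page inside a mini-transaction do
    // not stall while holding page latches.
    void furious_flush_sketch()
    {
      while (oldest_modification_lsn() < buf_flush_sync_lsn)
      {
        flush_list_batch();
        evict_from_lru(); // the fix: generate free pages each iteration
      }
    }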
- Added a counter innodb_num_bulk_insert_operation in
INFORMATION_SCHEMA.GLOBAL_STATUS. This counter is incremented
whenever InnoDB performs a bulk insert operation.
- Changed innodb_instant_alter_column to an atomic variable.
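A minimal sketch of such a counter (names are illustrative; the
server wires status counters up differently):

    #include <atomic>

    // Counters updated from concurrent threads are kept atomic (as done
    // for innodb_instant_alter_column above); a plain increment on a
    // shared variable would be a data race.
    static std::atomic<unsigned long> bulk_insert_operations{0};

    void count_bulk_insert()
    {
      bulk_insert_operations.fetch_add(1, std::memory_order_relaxed);
    }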
tablespace truncation
- InnoDB fails with an out-of-bounds write error after temporary
tablespace truncation. This issue was caused by
commit c507678b20 (MDEV-28699).
InnoDB fails to clear freed ranges if the shrunk size
is the last offset of the freed range.
Columns added to TABLE_STATISTICS
- ROWS_INSERTED, ROWS_DELETED, ROWS_UPDATED, KEY_READ_HITS and
KEY_READ_MISSES.
Columns added to CLIENT_STATISTICS and USER_STATISTICS:
- KEY_READ_HITS and KEY_READ_MISSES.
User visible changes (except new columns):
- CLIENT_STATISTICS and USER_STATISTICS have the columns KEY_READ_HITS and
KEY_READ_MISSES added after the column ROWS_UPDATED, before SELECT_COMMANDS.
Other changes:
- Do not collect table statistics for system tables like index_stats,
table_stats, performance_schema, information_schema etc., as the user
has no control over these and they generate noise in the statistics.
- All row variables that are part of user_stats are moved to
'struct rows_stats' to make it easy to clear all of them at once.
- ha_read_key_misses added to STATUS_VAR
Notes:
- userstat.result has a change of numbers of rows for handler_read_key.
This is because use-stat-tables is now disabled for the test.
This task is to ensure we have a clear definition and rules of how to
repair or optimize a table.
The rules are:
- REPAIR should be used with tables that are crashed and
unreadable (hardware issues with unreadable blocks, blocks with
'unexpected data' etc.)
- OPTIMIZE TABLE should be used to optimize the storage layout for the
table (recover space from deleted rows and optimize the index
structure).
- ALTER TABLE table_name FORCE should be used to rebuild the .frm file
(the table definition) and the table (with the original table row
format). If the table is from an older MariaDB/MySQL release with a
different storage format, it will convert the data to the new
format. ALTER TABLE ... FORCE is used as part of mariadb-upgrade.
Here follows some more background:
The 3 ways to repair a table are:
1) "ALTER TABLE table_name FORCE" (no other options).
As an alias we allow: "ALTER TABLE table_name ENGINE=original_engine"
2) "REPAIR TABLE" (without FORCE)
3) "OPTIMIZE TABLE"
All of the above commands will optimize row space usage (which means that
space will be needed to hold a temporary copy of the table) and
re-generate all indexes. They will also try to replicate the original
table definition as exactly as possible.
For ALTER TABLE and "REPAIR TABLE without FORCE", the following will hold:
If the table is from an older MariaDB version and data conversion is
needed (for example for old type HASH columns, MySQL JSON type or new
TIMESTAMP format) "ALTER TABLE table_name FORCE, algorithm=COPY" will be
used.
The differences between the commands are:
1) Will use the fastest algorithm the engine supports to do a full repair
of the table (except if data conversion is needed).
2) Will use the storage engine internal REPAIR facility (MyISAM, Aria).
If the engine does not support REPAIR then
"ALTER TABLE FORCE, ALGORITHM=COPY" will be used.
If there were data incompatibilities (which means that FORCE was used),
then there will be a warning after REPAIR that ALTER TABLE FORCE is
still needed.
The reason for this is that REPAIR may be able to work around data
errors (wrong incompatible data, crashed or unreadable sectors) that
ALTER TABLE cannot handle.
3) Will use the storage engine internal OPTIMIZE. If the engine does not
support OPTIMIZE, then "ALTER TABLE FORCE" is used.
The above will ensure that ALTER TABLE FORCE is able to
correct almost any errors in the row or index data. In case of
corrupted blocks, REPAIR, possibly followed by ALTER TABLE, is needed.
This is important as mariadb-upgrade executes ALTER TABLE table_name
FORCE for any table that must be re-created.
Bugs fixed with InnoDB tables when using ALTER TABLE FORCE:
- No error for INNODB_DEFAULT_ROW_FORMAT=COMPACT even if row length
would be too wide. (Independent of innodb_strict_mode).
- Tables using symlinks will be symlinked after any of the above commands
(Independent of the setting of --symbolic-links)
If one specifies an algorithm together with ALTER TABLE FORCE, things
will work as before (except if data conversion is required as then
the COPY algorithm is enforced).
ALTER TABLE .. OPTIMIZE ALL PARTITIONS will work as before.
Other things:
- FORCE argument added to REPAIR to allow one to first run internal
repair to fix damaged blocks and then follow it with ALTER TABLE.
- REPAIR will not update frm_version if ha_check_for_upgrade() finds
that the table is still incompatible with the current version. In this
case REPAIR will end with an error.
- REPAIR for storage engines that do not have native repair, like InnoDB,
now uses ALTER TABLE FORCE.
- REPAIR csv-table USE_FRM now works.
- It did not work before as CSV tables had the extension list in the
wrong order.
- The default error message length for %M was increased from 128 to 256
so as not to cut information from REPAIR.
- Documented HA_ADMIN_XX variables related to repair.
- Added HA_ADMIN_NEEDS_DATA_CONVERSION to signal that we have to
do data conversions when converting the table (and thus ALTER TABLE
copy algorithm is needed).
- Fixed typo in error message (caused test changes).
Remove alter_algorithm but keep the variable as no-op (with a warning).
The reasons for removing alter_algorithm are:
- alter_algorithm was introduced as a replacement for the
old_alter_table that was used to force the usage of the original
alter table algorithm (copy) in the cases where the new alter
algorithm did not work. The new option was added as a way to force
the usage of a specific algorithm when it should instead have made
it possible to disable algorithms that would not work for some
reason.
- alter_algorithm introduced some cases where ALTER TABLE would not
work without specifying the ALGORITHM=XXX option together with
ALTER TABLE.
- Having different values of alter_algorithm on master and slave could
cause slave to stop unexpectedly.
- ALTER TABLE FORCE, as used by mariadb-upgrade, would not always work
if alter_algorithm was set for the server.
- As part of MDEV-33449 "improving repair of tables" it became
clear that alter_algorithm made it harder to provide a better and
more consistent ALTER TABLE FORCE and REPAIR TABLE, and it would be
better to remove it.
Currently there are mechanisms to mark a system variable as
deprecated, but they are only used to print warning messages
when a deprecated variable is set.
Leverage the existing mechanisms in order to make the
deprecation information available at the --help output of mysqld by:
* Moving the deprecation information (i.e `deprecation_substitute`
attribute) from the `sys_var` class into the `my_option` struct.
As every `sys_var` contains its own `my_option` struct, the access
to the deprecation information remains available to `sys_var`
objects. `my_getopt` functions, which work directly with
`my_option` structs, gain access to this information while building
the --help output.
* For plugin variables, leverage the `PLUGIN_VAR_DEPRECATED` flag
and set the `deprecation_substitute` attribute accordingly when
building the `my_option` objects.
* Change the `option_cmp` function to use the `deprecation_substitute`
attribute instead of the name when sorting the options. This way
deprecated options and their substitutes will be grouped together.
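A simplified sketch of the grouping idea (a cut-down struct; the real
code lives in my_getopt):

    #include <cstring>

    struct my_option_sketch
    {
      const char *name;
      const char *deprecation_substitute; // nullptr if not deprecated
    };

    // Sort key: a deprecated option sorts under its substitute's name,
    // so both end up grouped together in the --help output.
    static const char *sort_name(const my_option_sketch &opt)
    {
      return opt.deprecation_substitute ? opt.deprecation_substitute
                                        : opt.name;
    }

    static int option_cmp_sketch(const my_option_sketch &a,
                                 const my_option_sketch &b)
    {
      return strcmp(sort_name(a), sort_name(b));
    }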
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer
Amazon Web Services, Inc.
- InnoDB page compression works only on COMPACT or DYNAMIC row
format tables. So InnoDB should throw an error when ALTER TABLE
tries to enable PAGE_COMPRESSED for a ROW_FORMAT=REDUNDANT table.
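A minimal sketch of the check (the enum and function are
illustrative, not the actual validation code):

    // Illustrative: PAGE_COMPRESSED requires COMPACT or DYNAMIC, so the
    // ALTER must be rejected for ROW_FORMAT=REDUNDANT tables.
    enum row_format { ROW_REDUNDANT, ROW_COMPACT, ROW_DYNAMIC };

    static bool page_compression_allowed(row_format fmt)
    {
      return fmt == ROW_COMPACT || fmt == ROW_DYNAMIC;
    }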
in buf_dblwr_t::init_or_load_pages()
- InnoDB fails to set the TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED
flag in the transaction system header page while recreating
the undo log tablespaces.
buf_dblwr_t::init_or_load_pages(): Tries to reset the
space id and to write into the doublewrite buffer even
when read_only mode is enabled.
In srv_all_undo_tablespaces_open(), InnoDB should try to
open the extra unused undo tablespaces instead of trying to
create them.
BUF_LRU_MIN_LEN (256) is too high a value for a small buffer pool (BP)
size. For example, for a BP size lower than 80M with a 16K page size,
the limit is more than 5% of the total BP, and for the lowest BP of 5M,
it is 80% of the BP.
Non-data objects like explicit locks could occupy part of the BP,
reducing the pages available for the LRU. If the LRU reaches the minimum
limit and no free pages are available, the server would hang, with the
page cleaner not able to free any more pages.
Fix: To avoid such a hang, we adjust the LRU limit to be lower than the
limit for data objects as checked in
buf_LRU_check_size_of_non_data_objects(), i.e. one page less than 5% of
the BP.
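A sketch of the adjusted limit, using the numbers from above (a 5M BP
with 16K pages holds 320 pages; 5% minus one page gives 15 instead of
256; the function name is illustrative):

    #include <algorithm>
    #include <cstddef>

    static const size_t BUF_LRU_MIN_LEN= 256;

    // Sketch: cap the minimum LRU length at one page less than 5% of
    // the buffer pool, matching the non-data-object limit checked in
    // buf_LRU_check_size_of_non_data_objects().
    static size_t adjusted_lru_min_len(size_t bp_size_in_pages)
    {
      const size_t five_percent= bp_size_in_pages / 20;
      return std::min(BUF_LRU_MIN_LEN,
                      five_percent ? five_percent - 1 : size_t{0});
    }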
trx_free_at_shutdown(): Similar to trx_t::commit_in_memory(),
clear the detailed_error (FOREIGN KEY constraint error) before
invoking trx_t::free(). We only do this on debug instrumented
builds in order to avoid a debug assertion failure on shutdown.
This regression was introduced in 10.6 by the following commit:
commit 898dcf93a8
(Cleanup the lock creation)
It removed one important optimization for lock bitmap pre-allocation.
We pre-allocate about 8 bytes of extra space along with every lock object
to adjust for similar locks on newly created records on the same page by
the same transaction. When it is exhausted, a new lock object is created
with a similar 8-byte pre-allocation. With this optimization removed, we
are left with only 1 byte of pre-allocation. When a large number of
records are inserted and locked in a single page, we end up creating too
many new locks, almost in O(n^2) order.
Fix-1: Bring back LOCK_PAGE_BITMAP_MARGIN for pre-allocation.
Fix-2: Use the extra space (40 bytes) for bitmap in trx->lock.rec_pool.
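A sketch of the pre-allocation arithmetic (LOCK_PAGE_BITMAP_MARGIN is
named in Fix-1 above; its exact value and the helper function here are
illustrative):

    #include <cstddef>

    // 64 extra bits = the "about 8 bytes" of margin described above, so
    // that locks on records inserted later into the same page by the
    // same transaction fit into the existing lock object instead of
    // forcing a new allocation.
    static const size_t LOCK_PAGE_BITMAP_MARGIN= 64;

    // Bitmap size in bytes for a record lock on a page that currently
    // needs n_bits bits, rounded up to whole bytes.
    static size_t lock_rec_bitmap_bytes(size_t n_bits)
    {
      return (n_bits + LOCK_PAGE_BITMAP_MARGIN + 7) / 8;
    }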