Parenthesis around table names and derived tables should be allowed
in FROM clauses and some other context as it was in earlier versions.
Returned test queries that used such parenthesis in 10.3 to their
original form. Adjusted test results accordingly.
Windows does atomic writes, as long as they are aligned and multiple
of sector size. this is documented in MSDN.
Fix innodb.doublewrite test to always use doublewrite buffer,
(even if atomic writes are autodetected)
InnoDB could return the same list again and again if the buffer
passed to trx_recover_for_mysql() is smaller than the number of
transactions that InnoDB recovered in XA PREPARE state.
We introduce the transaction state TRX_PREPARED_RECOVERED, which
is like TRX_PREPARED, but will be set during trx_recover_for_mysql()
so that each transaction will only be returned once.
Because init_server_components() is invoking ha_recover() twice,
we must reset the state of the transactions back to TRX_PREPARED
after returning the complete list, so that repeated traversals
will see the complete list again, instead of seeing an empty list.
Without this tweak, the test main.tc_heuristic_recover would hang
in MariaDB 10.1.
dict_create_foreign_constraints_low(): Tolerate the keywords
IGNORE and ONLINE between the keywords ALTER and TABLE.
We should really remove the hacky FOREIGN KEY constraint parser
from InnoDB.
DROP DATABASE would internally execute DROP TABLE on every contained
table and finally remove the directory. In InnoDB, DROP TABLE is
sometimes executed in the background. The table would be renamed to
a name that starts with #sql. The existence of these files would
prevent DROP DATABASE from succeeding.
CREATE OR REPLACE DATABASE can internally execute DROP DATABASE if
the directory already exists. This could fail due to the InnoDB
background DROP TABLE, possibly due to some tables that were
leftovers from earlier tests.
InnoDB crash recovery used to read every data page for which
redo log exists. This is unnecessary for those pages that are
initialized by the redo log. If a newly created page is corrupted,
recovery could unnecessarily fail. It would suffice to reinitialize
the page based on the redo log records.
To add insult to injury, InnoDB crash recovery could hang if it
encountered a corrupted page. We will fix also that problem.
InnoDB would normally refuse to start up if it encounters a
corrupted page on recovery, but that can be overridden by
setting innodb_force_recovery=1.
Data pages are completely initialized by the records
MLOG_INIT_FILE_PAGE2 and MLOG_ZIP_PAGE_COMPRESS.
MariaDB 10.4 additionally recognizes MLOG_INIT_FREE_PAGE,
which notifies that a page has been freed and its contents
can be discarded (filled with zeroes).
The record MLOG_INDEX_LOAD notifies that redo logging has
been re-enabled after being disabled. We can avoid loading
the page if all buffered redo log records predate the
MLOG_INDEX_LOAD record.
For the internal tables of FULLTEXT INDEX, no MLOG_INDEX_LOAD
records were written before commit aa3f7a107c.
Hence, we will skip these optimizations for tables whose
name starts with FTS_.
This is joint work with Thirunarayanan Balathandayuthapani.
fil_space_t::enable_lsn, file_name_t::enable_lsn: The LSN of the
latest recovered MLOG_INDEX_LOAD record for a tablespace.
mlog_init: Page initialization operations discovered during
redo log scanning. FIXME: This really belongs in recv_sys->addr_hash,
and should be removed in MDEV-19176.
recv_addr_state: Add the new state RECV_WILL_NOT_READ to
indicate that according to mlog_init, the page will be
initialized based on redo log record contents.
recv_add_to_hash_table(): Set the RECV_WILL_NOT_READ state
if appropriate. For now, we do not treat MLOG_ZIP_PAGE_COMPRESS
as page initialization. This works around bugs in the crash
recovery of ROW_FORMAT=COMPRESSED tables.
recv_mark_log_index_load(): Process a MLOG_INDEX_LOAD record
by resetting the state to RECV_NOT_PROCESSED and by updating
the fil_name_t::enable_lsn.
recv_init_crash_recovery_spaces(): Copy fil_name_t::enable_lsn
to fil_space_t::enable_lsn.
recv_recover_page(): Add the parameter init_lsn, to ignore
any log records that precede the page initialization.
Add DBUG output about skipped operations.
buf_page_create(): Initialize FIL_PAGE_LSN, so that
recv_recover_page() will not wrongly skip applying
the page-initialization record due to the field containing
some newer LSN as a leftover from a different page.
Do not invoke ibuf_merge_or_delete_for_page() during
crash recovery.
recv_apply_hashed_log_recs(): Remove some unnecessary lookups.
Note if a corrupted page was found during recovery.
After invoking buf_page_create(), do invoke
ibuf_merge_or_delete_for_page() via mlog_init.ibuf_merge()
in the last recovery batch.
ibuf_merge_or_delete_for_page(): Relax a debug assertion.
innobase_start_or_create_for_mysql(): Abort startup if
a corrupted page was found during recovery. Corrupted pages
will not be flagged if innodb_force_recovery is set.
However, the recv_sys->found_corrupt_fs flag can be set
regardless of innodb_force_recovery if file names are found
to be incorrect (for example, multiple files with the same
tablespace ID).
Similar to what was done in commit aa3f7a107c
for FULLTEXT INDEX, we must ensure that MLOG_INDEX_LOAD records will always
be written if redo logging was disabled.
row_merge_build_indexes(): Invoke row_merge_write_redo() also when
online operation is not being executed or an error occurs.
In case of an error, invoke flush_observer->interrupted() so that
the pages will not be flushed but merely evicted from the buffer pool.
Before resuming redo logging, it is crucial for the correctness of
mariabackup and InnoDB crash recovery to flush or evict all affected pages
and to write MLOG_INDEX_LOAD records.
With the MDEV-15562 instant DROP COLUMN, clustered index records
will contain traces of dropped columns, as follows:
In ROW_FORMAT=REDUNDANT, dropped columns will be stored as 0 bytes,
but they will consume 1 or 2 bytes per column in the record header.
In ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, dropped columns will
be stored as NULL if allowed. This will consume 1 bit per nullable
column.
In ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC, dropped NOT NULL columns
will be stored as 0 bytes if allowed. This will consume 1 byte per
NOT NULL variable-length column. Fixed-length columns will be stored
using the fixed number of bytes.
The metadata record will be 20 bytes larger than user records, because
it will contain a metadata BLOB pointer.
We must refuse ALGORITHM=INSTANT (and require a table rebuild) if
the metadata record would grow too big to fit in the index page.
If SQL_MODE includes STRICT_TRANS_TABLES or STRICT_ALL_TABLES, we should
refuse ALGORITHM=INSTANT if the maximum length of user records would
exceed the maximum size of an index page, similar to what
row_create_index_for_mysql() does during CREATE TABLE. This limit
would kick in when the default values for any instantly added columns
in the metadata record are NULL or short, but the allowed maximum values
are long.
instant_alter_column_possible(): Add the parameter "bool strict" to
enable checks for the user record size, and always check the metadata
record size.
This is a follow-up to MDEV-18733. As part of that fix, we made
dict_check_sys_tables() skip tables that would be dropped by
row_mysql_drop_garbage_tables().
DICT_ERR_IGNORE_DROP: A new mode where the file should not be attempted
to be opened.
dict_load_tablespace(): Do not try to load the tablespace if
DICT_ERR_IGNORE_DROP has been specified.
row_mysql_drop_garbage_tables(): Pass the DICT_ERR_IGNORE_DROP mode.
fil_space_for_table_exists_in_mem(): Remove a parameter.
The only caller that passed print_error_if_does_not_exist=true
was row_drop_single_table_tablespace().
If InnoDB crash recovery was needed, the InnoDB function srv_start()
would invoke extra validation, reading something from every InnoDB
data file. This should be unnecessary now that MDEV-14717 made
RENAME operations crash-safe inside InnoDB (which can be
disabled in MariaDB 10.2 by setting innodb_safe_truncate=OFF).
dict_check_sys_tables(): Skip tables that would be dropped by
row_mysql_drop_garbage_tables(). Perform extra validation only
if innodb_safe_truncate=OFF, innodb_force_recovery=0 and
crash recovery was needed.
dict_load_table_one(): Validate the root page of the table.
In this way, we can deny access to corrupted or mismatching tables
not only after crash recovery, but also after a clean shutdown.
Just rename index in data dictionary and in InnoDB cache when it's possible.
Introduce ALTER_INDEX_RENAME for that purpose so that engines can optimize
such operation.
Unused code between macro MYSQL_RENAME_INDEX was removed.
compare_keys_but_name(): compare index definitions except for index names
Alter_inplace_info::rename_keys:
ha_innobase_inplace_ctx::rename_keys: vector of rename indexes
fill_alter_inplace_info():: fills Alter_inplace_info::rename_keys
instant_alter_column_possible(): Do not support instantaneous removal
of NOT NULL if the table needs to be rebuilt for removing the hidden
FTS_DOC_ID column. This is not ideal and should ultimately be fixed
properly in MDEV-17459.
This basically is a duplicate of MDEV-18219, proving that the
assertion was not relaxed correctly.
dict_index_t::in_instant_init: A debug flag that will only be set in
btr_cur_instant_init_low() in order to suppress the assertion failure
in rec_init_offsets() for that code path only.
For tablespaces that do not reside on spinning storage, it does
not make sense to attempt to write nearby pages when writing out
dirty pages from the InnoDB buffer pool. It is actually detrimental
to performance and to the life span of flash ROM storage.
With this change, MariaDB will detect whether an InnoDB file resides
on solid-state storage. The detection has been implemented for Linux
and Microsoft Windows. For other systems, we will err on the safe side
and assume that files reside on SSD.
As part of this change, we will reduce the number of fstat() calls
when opening data files on POSIX systems and slightly clean up some
file I/O code.
FIXME: os_is_sparse_file_supported() on POSIX works in a destructive
manner. Thus, we can only invoke it when creating files, not when
opening them.
For diagnostics, we introduce the column ON_SSD to the table
INFORMATION_SCHEMA.INNODB_TABLESPACES_SCRUBBING. The table
INNODB_SYS_TABLESPACES might seem more appropriate, but its purpose
is to reflect the contents of the InnoDB system table SYS_TABLESPACES,
which we would like to remove at some point.
On Microsoft Windows, querying StorageDeviceSeekPenaltyProperty
sometimes returns ERROR_GEN_FAILURE instead of ERROR_INVALID_FUNCTION
or ERROR_NOT_SUPPORTED. We will silently ignore also this error,
and assume that the file does not reside on SSD.
On Linux, the detection will be based on the files
/sys/block/*/queue/rotational and /sys/block/*/dev.
Especially for USB storage, it is possible that
/sys/block/*/queue/rotational will wrongly report 1 instead of 0.
fil_node_t::on_ssd: Whether the InnoDB data file resides on
solid-state storage.
fil_system_t::ssd: Collection of Linux block devices that reside on
non-rotational storage.
fil_system_t::create(): Detect ssd on Linux based on the contents
of /sys/block/*/queue/rotational and /sys/block/*/dev.
fil_system_t::is_ssd(dev_t): Determine if a Linux block device is
non-rotational. Partitions will be identified with the containing
block device by assuming that the least significant 4 bits of the
minor number identify a partition, and that the "partition number"
of the entire device is 0.
The test case for reproducing MDEV-14126 demonstrates that InnoDB can
end up with an index tree where a non-leaf page has only one child page.
The test case innodb.innodb_bug14676111 demonstrates that such pages
are sometimes unavoidable, because InnoDB does not implement any sort
of B-tree rotation.
But, there is no reason to allow a root page with only one child page.
btr_cur_node_ptr_delete(): Replaces btr_node_ptr_delete().
btr_page_get_father(): Declare globally.
btr_discard_only_page_on_level(): Declare with ATTRIBUTE_COLD.
It turns out that this function is not covered by the
innodb.innodb_bug14676111 test case after all.
btr_discard_page(): If the root page ends up having only one child
page, shrink the tree by invoking btr_lift_page_up().
The test innodb.recovery_shutdown would occasionally fail,
because recovered incomplete transactions would be conflicting
with DROP TABLE, causing the background drop table queue to be invoked.
Add a slow shutdown before dropping the tables, so that the
recovered transactions will be rolled back. Starting with MDEV-14705,
normal shutdown would abort the rollback of recovered transactions.
The MDEV-17262 commit 26432e49d3
was skipped. In Galera 4, the implementation would seem to require
changes to the streaming replication.
In the tests archive.rnd_pos main.profiling, disable_ps_protocol
for SHOW STATUS and SHOW PROFILE commands until MDEV-18974
has been fixed.
Rather than add a small extra amount on the size of chunks, keep it
of the specified size. The rest of the chunk initialization code
adapts to this small size reduction. This has been made in the general
case, not just large pages, to keep it simple.
The chunks size is controlled by innodb-buffer-pool-chunk-size. In the
code increasing this by a descriptor table size length makes it
difficult with large pages. With innodb-buffer-pool-chunk-size set to 2M
the code before this commit would of added a small amount extra to this
value when it tried to allocate this. While not normally a problem it is
with large pages, it now requires addition space, a whole extra large
page. With a number of pools, or with 1G or 16G large pages this is
quite significant.
By removing this additional amount, DBAs can set
innodb-buffer-pool-chunk size to the large page size, or a multiple of
it, and actually get that amount allocated. Previously they had to fudge
a value less.
The innodb.test results show how this is fudged over a number of tests. With
this change the values are just between 488 and 500 depending on architecture
and build options.
Tested with --large-pages --innodb-buffer-pool-size=256M
--innodb-buffer-pool-chunk-size=2M on x86_64 with 2M default large page
size. Breaking before buf_pool init, one large page was allocated in
MyISAM, by the end of the function 128 huge pages where allocated as
expected. A further 16 pages where allocated for a 32M log buffer and
during startup 1 page was allocated briefly to the redo log.