1
0
mirror of https://github.com/MariaDB/server.git synced 2025-05-13 01:01:44 +03:00

44 Commits

Author SHA1 Message Date
Marko Mäkelä
f2b4972bd4 Merge 10.11 into 11.0 2023-07-26 15:13:06 +03:00
Marko Mäkelä
c358e216d9 MDEV-31642: Upgrade may crash if innodb_log_file_buffering=OFF
recv_log_recover_10_5(): Make reads aligned by 4096 bytes, to avoid
any trouble in case the file was opened in O_DIRECT mode and
the physical block size is larger than 512 bytes.
Because innodb_log_file_size used to be defined in whole megabytes,
reading multiples of 4096 bytes instead of 512 should not be an issue.
2023-07-10 11:14:54 +03:00
Marko Mäkelä
f27e9c8947 MDEV-29694 Remove the InnoDB change buffer
The purpose of the change buffer was to reduce random disk access,
which could be useful on rotational storage, but maybe less so on
solid-state storage.
When we wished to
(1) insert a record into a non-unique secondary index,
(2) delete-mark a secondary index record,
(3) delete a secondary index record as part of purge (but not ROLLBACK),
and the B-tree leaf page where the record belongs to is not in the buffer
pool, we inserted a record into the change buffer B-tree, indexed by
the page identifier. When the page was eventually read into the buffer
pool, we looked up the change buffer B-tree for any modifications to the
page, applied these upon the completion of the read operation. This
was called the insert buffer merge.

We remove the change buffer, because it has been the source of
various hard-to-reproduce corruption bugs, including those fixed in
commit 5b9ee8d8193a8c7a8ebdd35eedcadc3ae78e7fc1 and
commit 165564d3c33ae3d677d70644a83afcb744bdbf65 but not limited to them.

A downgrade will fail with a clear message starting with
commit db14eb16f9977453467ec4765f481bb2f71814ba (MDEV-30106).

buf_page_t::state: Merge IBUF_EXIST to UNFIXED and
WRITE_FIX_IBUF to WRITE_FIX.

buf_pool_t::watch[]: Remove.

trx_t: Move isolation_level, check_foreigns, check_unique_secondary,
bulk_insert into the same bit-field. The only purpose of
trx_t::check_unique_secondary is to enable bulk insert into an
empty table. It no longer enables insert buffering for UNIQUE INDEX.

btr_cur_t::thr: Remove. This field was originally needed for change
buffering. Later, its use was extended to cover SPATIAL INDEX.
Much of the time, rtr_info::thr holds this field. When it does not,
we will add parameters to SPATIAL INDEX specific functions.

ibuf_upgrade_needed(): Check if the change buffer needs to be updated.

ibuf_upgrade(): Merge and upgrade the change buffer after all redo log
has been applied. Free any pages consumed by the change buffer, and
zero out the change buffer root page to mark the upgrade completed,
and to prevent a downgrade to an earlier version.

dict_load_tablespaces(): Renamed from
dict_check_tablespaces_and_store_max_id(). This needs to be invoked
before ibuf_upgrade().

btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.

btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer
allocate any change buffer pages.

btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics.
The change buffer merge does not need this function anymore.

row_search_index_entry(), btr_lift_page_up(): Add a parameter thr
for the SPATIAL INDEX case.

rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().

rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().

Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0
change buffer format that predates the MySQL 4.1 introduction of
the option innodb_file_per_table was removed in MySQL 5.6.5
as part of mysql/mysql-server@69b6241a79
and MariaDB 10.0.11 as part of 1d0f70c2f894b27e98773a282871d32802f67964.

In the tests innodb.log_upgrade and innodb.log_corruption, we create
valid (upgraded) change buffer pages.

Tested by: Matthias Leich
2023-01-11 17:59:36 +02:00
Marko Mäkelä
0751bfbcaf Merge 10.7 into 10.8 2022-11-30 12:12:07 +02:00
Marko Mäkelä
b7ae4d442a Merge 10.6 into 10.7 2022-11-30 12:09:01 +02:00
Marko Mäkelä
c59985fcf5 Merge 10.5 into 10.6 2022-11-30 07:06:41 +02:00
Marko Mäkelä
846112ce36 MDEV-24412: Create a separate test
Some builders in our CI, most notably FreeBSD and IBM AIX, do not support
sparse files. Also, Microsoft Windows requires special means for creating
sparse files. Since these platforms do not run ./mtr --big-test, we will
for now simply move the test to a separate file that requires that option.
2022-11-30 06:57:32 +02:00
Marko Mäkelä
6f854d7cfe Merge 10.7 into 10.8 2022-11-28 13:11:43 +02:00
Marko Mäkelä
f124d71ab7 Merge 10.6 into 10.7 2022-11-28 12:22:12 +02:00
Marko Mäkelä
fdc582fd98 Merge 10.5 into 10.6 2022-11-28 12:20:17 +02:00
Marko Mäkelä
bd694bb7b2 MDEV-24412 InnoDB: Upgrade after a crash is not supported
recv_log_recover_10_4(): Widen the operand of bitwise and to 64 bits,
so that the upgrade check will work when the redo log record is located
more than 4 gigabytes from the start of the first file.
2022-11-28 11:56:09 +02:00
Marko Mäkelä
db14eb16f9 MDEV-30106 InnoDB fails to validate the change buffer on startup
ibuf_init_at_db_start(): Validate the change buffer root page.
A later version may stop creating a change buffer, and this
validation check will prevent a downgrade from such later versions.

ibuf_max_size_update(): If the change buffer was not loaded, do nothing.

dict_boot(): Merge the local variable "error" to "err". Ignore
failures of ibuf_init_at_db_start() if innodb_force_recovery>=4.
2022-11-28 11:34:22 +02:00
Marko Mäkelä
618d820646 Merge 10.7 into 10.8 2022-10-13 10:42:41 +03:00
Marko Mäkelä
588efca237 Merge 10.6 into 10.7 2022-10-13 10:05:29 +03:00
Marko Mäkelä
6dc157f8a6 Merge 10.5 into 10.6 2022-10-06 09:22:39 +03:00
Marko Mäkelä
de078e060e Merge 10.4 into 10.5 2022-10-06 08:29:56 +03:00
Marko Mäkelä
f600690c6b MDEV-29710: Skip some more tests on Valgrind 2022-10-05 20:37:54 +03:00
Marko Mäkelä
a635c40648 MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit()
A prominent bottleneck in mtr_t::commit() is log_sys.mutex between
log_sys.append_prepare() and log_close().

User-visible change: The minimum innodb_log_file_size will be
increased from 1MiB to 4MiB so that some conditions can be
trivially satisfied.

log_sys.latch (log_latch): Replaces log_sys.mutex and
log_sys.flush_order_mutex. Copying mtr_t::m_log to
log_sys.buf is protected by a shared log_sys.latch.
Writes from log_sys.buf to the file system will be protected
by an exclusive log_sys.latch.

log_sys.lsn_lock: Protects the allocation of log buffer
in log_sys.append_prepare().

sspin_lock: A simple spin lock, for log_sys.lsn_lock.

Thanks to Vladislav Vaintroub for suggesting this idea, and for
reviewing these changes.

mariadb-backup: Replace some use of log_sys.mutex with recv_sys.mutex.

buf_pool_t::insert_into_flush_list(): Implement sorting of flush_list
because ordering is otherwise no longer guaranteed. Ordering by LSN
is needed for the proper operation of redo log checkpoints.

log_sys.append_prepare(): Advance log_sys.lsn and log_sys.buf_free by
the length, and return the old values. Also increment write_to_buf,
which was previously done in log_close().

mtr_t::finish_write(): Obtain the buffer pointer from
log_sys.append_prepare().

log_sys.buf_free: Make the field Atomic_relaxed,
to simplify log_flush_margin(). Use only loads and stores
to avoid costly read-modify-write atomic operations.

buf_pool.flush_list_requests: Replaces
export_vars.innodb_buffer_pool_write_requests
and srv_stats.buf_pool_write_requests.
Protected by buf_pool.flush_list_mutex.

buf_pool_t::insert_into_flush_list(): Do not invoke page_cleaner_wakeup().
Let the caller do that after a batch of calls.

recv_recover_page(): Invoke a minimal part of
buf_pool.insert_into_flush_list().

ReleaseBlocks::modified: A number of pages added to buf_pool.flush_list.

ReleaseBlocks::operator(): Merge buf_flush_note_modification() here.

log_t::set_capacity(): Renamed from log_set_capacity().
2022-02-10 16:37:12 +02:00
Marko Mäkelä
685d958e38 MDEV-14425 Improve the redo log for concurrency
The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.

We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.

Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.

An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.

An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with with innodb_force_recovery=6, it is
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.

Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.

The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.

The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.

We extend the MDEV-12353 redo log record format as follows.

(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.

In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.

Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.

The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.

The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:

os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes

The following status variables will be removed:

Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)

log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.

If the file system buffers can be bypassed, a message like the following
will be issued:

InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)

This has been tested on Linux and Microsoft Windows with both sizes.

On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.

The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.

When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:

InnoDB: log sequence number 0 (memory-mapped); transaction id 3

On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().

mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.

backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.

xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.

trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).

recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.

recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.

recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.

recv_sys_t, log_sys_t: Removed many data members.

recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.

recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.

recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.

recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)

recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().

recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.

mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.

log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.

log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.

log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.

recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitary
order anyway, to be adjusted in recv_sys.apply(true).

We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.

Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.

Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.
2022-01-21 16:03:47 +02:00
Marko Mäkelä
979b23d5bf Add forgotten changes to the parent commit 2021-12-10 13:04:46 +02:00
Marko Mäkelä
993b8edf3f MDEV-26933 InnoDB fails to detect page number mismatch
mtr_t::page_lock(): Validate the page number.

ibuf_tree_root_get(): Remove assertions that became redundant.

The assertions in btr_validate_level() are kind of redundant as well,
but because they are ut_a(), they are also present in release builds,
while the ones in mtr_t::page_lock() are only present in debug builds.

btr_cur_position(): Do not duplicate an assertion that is part of
page_cur_position().

dict_load_tablespace(): Introduce a new option
DICT_ERR_IGNORE_TABLESPACE that will suppress loading a tablespace
when a table is going to be dropped.
2021-11-01 13:35:47 +02:00
Eugene Kosov
9ef2d29ff4 MDEV-14425 deprecate and ignore innodb_log_files_in_group
Now there can be only one log file instead of several which
logically work as a single file.

Possible names of redo log files: ib_logfile0,
ib_logfile101 (for just created one)

innodb_log_fiels_in_group: value of this variable is not used
by InnoDB. Possible values are still 1..100, to not break upgrade

LOG_FILE_NAME: add constant of value "ib_logfile0"
LOG_FILE_NAME_PREFIX: add constant of value "ib_logfile"

get_log_file_path(): convenience function that returns full
path of a redo log file

SRV_N_LOG_FILES_MAX: removed

srv_n_log_files: we can't remove this for compatibility reasons,
but now server doesn't use this variable

log_sys_t::file::fd: now just one, not std::vector

log_sys_t::log_capacity: removed word 'group'

find_and_check_log_file(): part of logic from huge srv_start()
moved here

recv_sys_t::files: file descriptors of redo log files.
There can be several of those in case we're upgrading
from older MariaDB version.

recv_sys_t::remove_extra_log_files: whether to remove
ib_logfile{1,2,3...} after successfull upgrade.

recv_sys_t::read(): open if needed and read from one
of several log files

recv_sys_t::files_size(): open if needed and return files count

redo_file_sizes_are_correct(): check that redo log files
sizes are equal. Just to log an error for a user.
Corresponding check was moved from srv0start.cc

namespace deprecated: put all deprecated variables here to
prevent usage of it by us, developers
2020-02-19 12:21:59 +03:00
Marko Mäkelä
444c83b2ac MDEV-12353: Test InnoDB upgrade from multi-file redo log 2020-02-14 14:46:44 +02:00
Marko Mäkelä
f8a9f90667 MDEV-12353: Remove support for crash-upgrade
We tighten some assertions regarding dict_index_t::is_dummy
and crash recovery, now that redo log processing will
no longer create dummy objects.
2020-02-13 19:13:45 +02:00
Marko Mäkelä
7ae21b18a6 MDEV-12353: Change the redo log encoding
log_t::FORMAT_10_5: physical redo log format tag

log_phys_t: Buffered records in the physical format.
The log record bytes will follow the last data field,
making use of alignment padding that would otherwise be wasted.
If there are multiple records for the same page, also those
may be appended to an existing log_phys_t object if the memory
is available.

In the physical format, the first byte of a record identifies the
record and its length (up to 15 bytes). For longer records, the
immediately following bytes will encode the remaining length
in a variable-length encoding. Usually, a variable-length-encoded
page identifier will follow, followed by optional payload, whose
length is included in the initially encoded total record length.

When a mini-transaction is updating multiple fields in a page,
it can avoid repeating the tablespace identifier and page number
by setting the same_page flag (most significant bit) in the first
byte of the log record. The byte offset of the record will be
relative to where the previous record for that page ended.

Until MDEV-14425 introduces a separate file-level log for
redo log checkpoints and file operations, we will write the
file-level records in the page-level redo log file.
The record FILE_CHECKPOINT (which replaces MLOG_CHECKPOINT)
will be removed in MDEV-14425, and one sequential scan of the
page recovery log will suffice.

Compared to MLOG_FILE_CREATE2, FILE_CREATE will not include any flags.
If the information is needed, it can be parsed from WRITE records that
modify FSP_SPACE_FLAGS.

MLOG_ZIP_WRITE_STRING: Remove. The record was only introduced temporarily
as part of this work, before being replaced with WRITE (along with
MLOG_WRITE_STRING, MLOG_1BYTE, MLOG_nBYTES).

mtr_buf_t::empty(): Check if the buffer is empty.

mtr_t::m_n_log_recs: Remove. It suffices to check if m_log is empty.

mtr_t::m_last, mtr_t::m_last_offset: End of the latest m_log record,
for the same_page encoding.

page_recv_t::last_offset: Reflects mtr_t::m_last_offset.

Valid values for last_offset during recovery should be 0 or above 8.
(The first 8 bytes of a page are the checksum and the page number,
and neither are ever updated directly by log records.)
Internally, the special value 1 indicates that the same_page form
will not be allowed for the subsequent record.

mtr_t::page_create(): Take the block descriptor as parameter,
so that it can be compared to mtr_t::m_last. The INIT_INDEX_PAGE
record will always followed by a subtype byte, because same_page
records must be longer than 1 byte.

trx_undo_page_init(): Combine the writes in WRITE record.

trx_undo_header_create(): Write 4 bytes using a special MEMSET
record that includes 1 bytes of length and 2 bytes of payload.

flst_write_addr(): Define as a static function. Combine the writes.

flst_zero_both(): Replaces two flst_zero_addr() calls.

flst_init(): Do not inline the function.

fsp_free_seg_inode(): Zerofill the whole inode.

fsp_apply_init_file_page(): Initialize FIL_PAGE_PREV,FIL_PAGE_NEXT
to FIL_NULL when using the physical format.

btr_create(): Assert !page_has_siblings() because fsp_apply_init_file_page()
must have been invoked.

fil_ibd_create(): Do not write FILE_MODIFY after FILE_CREATE.

fil_names_dirty_and_write(): Remove the parameter mtr.
Write the records using a separate mini-transaction object,
because any FILE_ records must be at the start of a mini-transaction log.

recv_recover_page(): Add a fil_space_t* parameter.
After applying log to the a ROW_FORMAT=COMPRESSED page,
invoke buf_zip_decompress() to restore the uncompressed page.

buf_page_io_complete(): Remove the temporary hack to discard the
uncompressed page of a ROW_FORMAT=COMPRESSED page.

page_zip_write_header(): Remove. Use mtr_t::write() or
mtr_t::memset() instead, and update the compressed page frame
separately.

trx_undo_header_add_space_for_xid(): Remove.

trx_undo_seg_create(): Perform the changes that were previously
made by trx_undo_header_add_space_for_xid().

btr_reset_instant(): New function: Reset the table to MariaDB 10.2
or 10.3 format when rolling back an instant ALTER TABLE operation.

page_rec_find_owner_rec(): Merge with the only callers.

page_cur_insert_rec_low(): Combine writes by using a local buffer.
MEMMOVE data from the preceding record whenever feasible
(copying at least 3 bytes).

page_cur_insert_rec_zip(): Combine writes to page header fields.

PageBulk::insertPage(): Issue MEMMOVE records to copy a matching
part from the preceding record.

PageBulk::finishPage(): Combine the writes to the page header
and to the sparse page directory slots.

mtr_t::write(): Only log the least significant (last) bytes
of multi-byte fields that actually differ.

For updating FSP_SIZE, we must always write all 4 bytes to the
redo log, so that the fil_space_set_recv_size() logic in
recv_sys_t::parse() will work.

mtr_t::memcpy(), mtr_t::zmemcpy(): Take a pointer argument
instead of a numeric offset to the page frame. Only log the
last bytes of multi-byte fields that actually differ.

In fil_space_crypt_t::write_page0(), we must log also any
unchanged bytes, so that recovery will recognize the record
and invoke fil_crypt_parse().

Future work:
MDEV-21724 Optimize page_cur_insert_rec_low() redo logging
MDEV-21725 Optimize btr_page_reorganize_low() redo logging
MDEV-21727 Optimize redo logging for ROW_FORMAT=COMPRESSED
2020-02-13 19:12:17 +02:00
Marko Mäkelä
613e9e7d4d MDEV-20907 Set innodb_log_files_in_group=1 by default
Historically, InnoDB split the redo log into at least 2 files.
MDEV-12061 allowed the minimum to be innodb_log_files_in_group=1,
but it kept the default at innodb_log_files_in_group=2.

Because performance seems to be slightly better with only one log file,
and because implementing an append-only variant of the log would require
a single file, let us define the default to be 1, and have
innodb_log_file_size=96M, to retain the same default total size.
2019-10-28 17:11:10 +02:00
Marko Mäkelä
df51dc28f5 Fix tests for innodb_checksum_algorithm=strict_crc32
In tests that directly write InnoDB data file pages,
compute the innodb_checksum_algorithm=crc32 checksums,
instead of writing the 0xdeadbeef value used by
innodb_checksum_algorithm=none. In this way, these tests
will not cause failures when executing
./mtr --mysqld=--loose-innodb-checksum-algorithm=strict_crc32
2019-02-16 12:06:52 +02:00
Marko Mäkelä
745e6498ea After-merge fix to a test 2018-01-18 09:28:37 +02:00
Marko Mäkelä
21239bb0fd After-merge fix to innodb.log_corruption 2018-01-11 22:54:22 +02:00
Marko Mäkelä
6dd302d164 Merge bb-10.2-ext into 10.3 2018-01-11 19:44:41 +02:00
Marko Mäkelä
d1cf9b167c MDEV-14909 MariaDB 10.2 refuses to start up after clean shutdown of MariaDB 10.3
recv_log_recover_10_3(): Determine if a log from MariaDB 10.3 is clean.

recv_find_max_checkpoint(): Allow startup with a clean 10.3 redo log.

srv_prepare_to_delete_redo_log_files(): When starting up with a 10.3 log,
display a "Downgrading redo log" message instead of "Upgrading".
2018-01-10 13:18:02 +02:00
Marko Mäkelä
acd2862e65 MDEV-14848 MariaDB 10.3 refuses InnoDB crash-upgrade from MariaDB 10.2
While the redo log format was changed in MariaDB 10.3.2 and 10.3.3
due to MDEV-12288 and MDEV-11369, it should be technically possible
to upgrade from a crashed MariaDB 10.2 instance.

On a related note, it should be possible for Mariabackup 10.3
to create a backup from a running MariaDB Server 10.2.

mlog_id_t: Put back the 10.2 specific redo log record types
MLOG_UNDO_INSERT, MLOG_UNDO_ERASE_END, MLOG_UNDO_INIT,
MLOG_UNDO_HDR_REUSE.

trx_undo_parse_add_undo_rec(): Parse or apply MLOG_UNDO_INSERT.

trx_undo_erase_page_end(): Apply MLOG_UNDO_ERASE_END.

trx_undo_parse_page_init(): Parse or apply MLOG_UNDO_INIT.

trx_undo_parse_page_header_reuse(): Parse or apply MLOG_UNDO_HDR_REUSE.

recv_log_recover_10_2(): Remove. Always parse the redo log from 10.2.

recv_find_max_checkpoint(), recv_recovery_from_checkpoint_start():
Always parse the redo log from MariaDB 10.2.

recv_parse_or_apply_log_rec_body(): Parse or apply
MLOG_UNDO_INSERT, MLOG_UNDO_ERASE_END, MLOG_UNDO_INIT.

srv_prepare_to_delete_redo_log_files(),
innobase_start_or_create_for_mysql(): Upgrade from a previous (supported)
redo log format.
2018-01-03 19:08:50 +02:00
Marko Mäkelä
bae0844f65 Introduce a new InnoDB redo log format for MariaDB 10.3.1
The redo log format will be changed by MDEV-12288, and it could
be changed further during MariaDB 10.3 development. We will
allow startup from a clean redo log from any earlier InnoDB
version (up to MySQL 5.7 or MariaDB 10.3), but we will refuse
to do crash recovery from older-format redo logs.

recv_log_format_0_recover(): Remove a reference to MySQL documentation,
which may be misleading when it comes to MariaDB.

recv_log_recover_10_2(): Check if a MariaDB 10.2.2/MySQL 5.7.9
redo log is clean.

recv_find_max_checkpoint(): Invoke recv_log_recover_10_2() if the
redo log is in the MariaDB 10.2.2 or MySQL 5.7.9 format.
2017-07-07 12:42:00 +03:00
Sergei Golubchik
b2865a437f search_pattern_in_file.inc changes
1. Special mode to search in error logs: if SEARCH_RANGE is not set,
   the file is considered an error log and the search is performed
   since the last CURRENT_TEST: line
2. Number of matches is printed too. "FOUND 5 /foo/ in bar".
   Use greedy .* at the end of the pattern if number of matches
   isn't stable. If nothing is found it's still "NOT FOUND",
   not "FOUND 0".
3. SEARCH_ABORT specifies the prefix of the output.
   Can be "NOT FOUND" or "FOUND" as before,
   but also "FOUND 5 " if needed.
2017-03-31 19:28:58 +02:00
Marko Mäkelä
23d72bf3aa Close a file handle in a Perl snippet.
This could fix a race condition between mysqld and perl on Windows.
2017-03-24 19:17:23 +02:00
Marko Mäkelä
5da6bd7b95 MDEV-11027 InnoDB log recovery is too noisy
Provide more useful progress reporting of crash recovery.

recv_sys_t::progress_time: The time of the last report.

recv_sys_t::report(ib_time_t): Determine whether progress should
be reported.

recv_scan_print_counter: Remove.

log_group_read_log_seg(): After after each I/O request, invoke
recv_sys_t::report() and report progress if needed.

recv_apply_hashed_log_recs(): Change the return type back to void
(DB_SUCCESS was always returned), and rename the parameter to last_batch.
At the start of each batch, if there are pages to be recovered,
issue a message.
2017-03-08 14:55:11 +02:00
Marko Mäkelä
ab8199f38e MDEV-12103 Reduce the time of looking for MLOG_CHECKPOINT during crash recovery
This fixes MySQL Bug#80788 in MariaDB 10.2.5.

When I made the InnoDB crash recovery more robust by implementing
WL#7142, I also introduced an extra redo log scan pass that can be
shortened.

This fix will slightly extend the InnoDB redo log format that I
introduced in MySQL 5.7.9 by writing the start LSN of the MLOG_CHECKPOINT
mini-transaction to the end of the log checkpoint page, so that recovery
can jump straight to it without scanning all the preceding redo log.

LOG_CHECKPOINT_END_LSN: At the end of the checkpoint page, the start LSN
of the MLOG_CHECKPOINT mini-transaction. Previously, these bytes were
written as 0.

log_write_checkpoint_info(), log_group_checkpoint(): Add the parameter
end_lsn for writing LOG_CHECKPOINT_END_LSN.

log_checkpoint(): Remember the LSN at which the MLOG_CHECKPOINT
mini-transaction is starting (or at which the redo log ends on
shutdown).

recv_init_crash_recovery(): Remove.

recv_group_scan_log_recs(): Add the parameter checkpoint_lsn.

recv_recovery_from_checkpoint_start(): Read LOG_CHECKPOINT_END_LSN
and if it is set, start the first scan from it instead of the
checkpoint LSN. Improve some messages and remove bogus assertions.

recv_parse_log_recs(): Do not skip DBUG_PRINT("ib_log") for some
file-level redo log records.

recv_parse_or_apply_log_rec_body(): If we have not parsed all redo
log between the checkpoint and the corresponding MLOG_CHECKPOINT
record, defer the check for MLOG_FILE_DELETE or MLOG_FILE_NAME records
to recv_init_crash_recovery_spaces().

recv_init_crash_recovery_spaces(): Refuse recovery if
MLOG_FILE_NAME or MLOG_FILE_DELETE records are missing.
2017-03-03 09:38:59 +02:00
Marko Mäkelä
2af28a363c MDEV-11782: Redefine the innodb_encrypt_log format
Write only one encryption key to the checkpoint page.
Use 4 bytes of nonce. Encrypt more of each redo log block,
only skipping the 4-byte field LOG_BLOCK_HDR_NO which the
initialization vector is derived from.

Issue notes, not warning messages for rewriting the redo log files.

recv_recovery_from_checkpoint_finish(): Do not generate any redo log,
because we must avoid that before rewriting the redo log files, or
otherwise a crash during a redo log rewrite (removing or adding
encryption) may end up making the database unrecoverable.
Instead, do these tasks in innobase_start_or_create_for_mysql().

Issue a firm "Missing MLOG_CHECKPOINT" error message. Remove some
unreachable code and duplicated error messages for log corruption.

LOG_HEADER_FORMAT_ENCRYPTED: A flag for identifying an encrypted redo
log format.

log_group_t::is_encrypted(), log_t::is_encrypted(): Determine
if the redo log is in encrypted format.

recv_find_max_checkpoint(): Interpret LOG_HEADER_FORMAT_ENCRYPTED.

srv_prepare_to_delete_redo_log_files(): Display NOTE messages about
adding or removing encryption. Do not issue warnings for redo log
resizing any more.

innobase_start_or_create_for_mysql(): Rebuild the redo logs also when
the encryption changes.

innodb_log_checksums_func_update(): Always use the CRC-32C checksum
if innodb_encrypt_log. If needed, issue a warning
that innodb_encrypt_log implies innodb_log_checksums.

log_group_write_buf(): Compute the checksum on the encrypted
block contents, so that transmission errors or incomplete blocks can be
detected without decrypting.

Rewrite most of the redo log encryption code. Only remember one
encryption key at a time (but remember up to 5 when upgrading from the
MariaDB 10.1 format.)
2017-02-15 08:07:20 +02:00
Marko Mäkelä
b40a1fbc93 MDEV-11782 WIP: Clean up the code, and add a test.
LOG_CHECKPOINT_ARRAY_END, LOG_CHECKPOINT_SIZE: Remove.

Change some error messages to refer to MariaDB 10.2.2 instead of
MySQL 5.7.9.

recv_find_max_checkpoint_0(): Do not abort when decrypting one of the
checkpoint pages fails.
2017-02-07 11:55:16 +02:00
Marko Mäkelä
3ebe08204a MDEV-11782 Work-in-progress (test only).
Test server startup with an empty encrypted redo log from 10.1.21.
FIXME: Pass the encryption parameters. Currently we only test startup
without properly set up encryption.
2017-02-03 12:52:36 +02:00
Marko Mäkelä
5285504857 innodb.log_corruption: Use the main error log. 2017-02-03 12:52:36 +02:00
Marko Mäkelä
650ffcd3a0 Extend the innodb.log_corruption test.
Remove the dependency on unzip. Instead, generate the InnoDB files
with perl.

log_block_checksum_is_ok(): Correct the error message.

recv_scan_log_recs(): Remove the duplicated error message for
log block checksum mismatch.

innobase_start_or_create_for_mysql(): If the server is in read-only
mode or if innodb_force_recovery>=3, do not try to modify the system
tablespace. (If the doublewrite buffer or the non-core system tables
do not exist, do not try to create them.)

innodb_shutdown(): Relax a debug assertion. If the system tablespace
did not contain a doublewrite buffer and if we started up in
innodb_read_only mode or with innodb_force_recovery>=3, it will not
be created.

dict_create_or_check_sys_tablespace(): Set the flag
srv_sys_tablespaces_open when the tables exist.
2017-02-02 10:20:22 +02:00
Sergei Golubchik
a7d6271cbf skip innodb.log_corruption test if no unzip executable is found
also move *.zip files from t/ to std_data/
2017-01-29 13:50:30 +01:00
Marko Mäkelä
2de0e42af5 Import and adjust the InnoDB redo log tests from MySQL 5.7. 2017-01-27 17:53:02 +02:00