1
0
mirror of https://github.com/MariaDB/server.git synced 2025-09-13 13:47:59 +03:00
Commit Graph

1771 Commits

Author SHA1 Message Date
Thirunarayanan Balathandayuthapani
c507678b20 MDEV-28699 Shrink temporary tablespaces without restart
- Introduced the variable "innodb_truncate_temporary_tablespace_now"
to shrink the temporary tablespace.

Steps for shrinking the temporary tablespace:

1) Find the last used extent in temporary tablespace
by iterating through the BITMAP in extent descriptor pages

2) If the last used extent is lesser than user specified size
then set desired target size to user specified size

3) Store the page contents of "to be modified" extent
descriptor pages, latches the "to be modified" extent
descriptor pages and check for buffer pool memory availability

4) Update the FSP_SIZE and FSP_FREE_LIMIT in header page

5) Remove the "to be truncated" pages from FSP_FREE and
FSP_FREE_FRAG list

6) Reset the bitmap in the last descriptor pages for the
"to be truncated" pages.

7) Clear the freed range in temporary tablespace which
are to be truncated.

8) Evict the "to be truncated" temporary tablespace pages
from LRU list.

9) In case of multiple files, calculate the truncated last
file size and do truncation in last file

10) Commit the mini-transaction for shrinking the tablespace
2023-10-27 10:51:37 +03:00
Marko Mäkelä
5b53342a6a MDEV-32588 InnoDB may hang when running out of buffer pool
buf_flush_LRU_list_batch(): Do not skip pages that are actually clean
but in buf_pool.flush_list due to the "lazy removal" optimization of
commit 22b62edaed, but try to evict them.
After acquiring buf_pool.flush_list_mutex, reread oldest_modification
to ensure that the block still remains in buf_pool.flush_list.

In addition to server hangs, this bug could also cause
InnoDB: Failing assertion: list.count > 0
in invocations of UT_LIST_REMOVE(flush_list, ...).

This fixes a regression that was caused by
commit a55b951e60
and possibly made more likely to hit due to
commit aa719b5010.
2023-10-26 15:10:53 +03:00
Marko Mäkelä
39e3ca8bd2 MDEV-31826 InnoDB may fail to recover after being killed in fil_delete_tablespace()
InnoDB was violating the write-ahead-logging protocol when a file
was being deleted, like this:

1. fil_delete_tablespace() set the fil_space_t::STOPPING flag
2. The buf_flush_page_cleaner() thread discards some changed pages for
this tablespace advances the log checkpoint a little.
3. The server process is killed before fil_delete_tablespace() wrote
a FILE_DELETE record.
4. Recovery will try to apply log to pages of the tablespace, because
there was no FILE_DELETE record. This will fail, because some pages
that had been modified since the latest checkpoint had not been written
by the page cleaner.

Page writes must not be stopped before a FILE_DELETE record has been
durably written.

fil_space_t::drop(): Replaces fil_space_t::check_pending_operations().
Add the parameter detached_handle, and return a tablespace pointer
if this thread was the first one to stop I/O on the tablespace.

mtr_t::commit_file(): Remove the parameter detached_handle, and
move some handling to fil_space_t::drop().

fil_space_t: STOPPING_READS, STOPPING_WRITES: Separate flags for STOPPING.
We want to stop reads (and encryption) before stopping page writes.

fil_space_t::is_stopping_writes(), fil_space_t::get_for_write():
Special accessors for the write path.

fil_space_t::flush_low(): Ignore the STOPPING_READS flag and only
stop if STOPPING_WRITES is set, to avoid an infinite loop in
fil_flush_file_spaces(), which was occasionally repeated by
running the test encryption.create_or_replace.

Reviewed by: Vladislav Lesin
Tested by: Matthias Leich
2023-10-26 15:07:59 +03:00
Marko Mäkelä
3036b36f9b Merge 10.10 into 10.11 2023-10-23 18:44:12 +03:00
Marko Mäkelä
5a8fca5a4f Merge 10.6 into 10.10 2023-10-23 18:43:36 +03:00
Marko Mäkelä
b21f52ee73 Merge 10.5 into 10.6 2023-10-23 16:43:48 +03:00
Marko Mäkelä
b5e43a1d35 MDEV-32552 Write-ahead logging is broken for freed pages
buf_page_free(): Flag the freed page as modified if it is found in
the buffer pool.

buf_flush_page(): If the page has been freed, ensure that the log
for it has been durably written, before removing the page
from buf_pool.flush_list.

FindBlockX: Find also MTR_MEMO_PAGE_X_MODIFY in order to avoid an
occasional failure of innodb.innodb_defrag_concurrent, which involves
freeing and reallocating pages in the same mini-transaction.

This fixes a regression that was introduced in
commit a35b4ae898 (MDEV-15528).

This logic was tested by commenting out the $shutdown_timeout line
from a test and running the following:

./mtr --rr innodb.scrub
rr replay var/log/mysqld.1.rr/mariadbd-0

A breakpoint in the modified buf_flush_page() was hit, and the
FIL_PAGE_LSN of that page had been last modified during the
mtr_t::commit() of a mini-transaction where buf_page_free()
had been executed on that page.
2023-10-23 16:13:16 +03:00
Marko Mäkelä
65700edb26 Merge 10.10 into 10.11 2023-10-19 14:50:42 +03:00
Marko Mäkelä
c92d06748a Merge 10.6 into 10.10 2023-10-19 14:35:31 +03:00
Marko Mäkelä
6991b1c47c Merge 10.5 into 10.6 2023-10-19 13:50:00 +03:00
Marko Mäkelä
be24e75229 Merge 10.11 into 11.0 2023-10-19 08:12:16 +03:00
Thirunarayanan Balathandayuthapani
3da5d047b8 MDEV-31851 After crash recovery, undo tablespace fails to open
Problem:
========
- InnoDB fails to open undo tablespace when page0 is corrupted
and fails to throw error.

Solution:
=========
- InnoDB throws DB_CORRUPTION error when InnoDB encounters
page0 corruption of undo tablespace.

- InnoDB restores the page0 of undo tablespace from
doublewrite buffer if it encounters page corruption

- Moved Datafile::restore_from_doublewrite() to
recv_dblwr_t::restore_first_page(). So that undo
tablespace and system tablespace can use this function
instead of duplicating the code

srv_undo_tablespace_open(): Returns 0 if file doesn't exist
or ULINT_UNDEFINED if page0 is corrupted.
2023-10-17 18:41:21 +05:30
Marko Mäkelä
2ecc0443ec Merge 10.10 into 10.11 2023-10-17 16:04:21 +03:00
Marko Mäkelä
d5e15424d8 Merge 10.6 into 10.10
The MDEV-29693 conflict resolution is from Monty, as well as is
a bug fix where ANALYZE TABLE wrongly built histograms for
single-column PRIMARY KEY.
Also includes a fix for safe_malloc error reporting.

Other things:
- Copied main.log_slow from 10.4 to avoid mtr issue

Disabled test:
- spider/bugfix.mdev_27239 because we started to get
  +Error	1429 Unable to connect to foreign data source: localhost
  -Error	1158 Got an error reading communication packets
- main.delayed
  - Bug#54332 Deadlock with two connections doing LOCK TABLE+INSERT DELAYED
    This part is disabled for now as it fails randomly with different
    warnings/errors (no corruption).
2023-10-14 13:36:11 +03:00
Vladislav Vaintroub
e33e2fa949 MDEV-31095 tpool - restrict threadpool concurrency during bufferpool load
Add threadpool functionality to restrict concurrency during "batch"
periods (where tasks are added in rapid succession).
This will throttle thread creation more agressively than usual, while
keeping performance at least on-par.

One of these cases is bufferpool load, where async read IOs are executed
without any throttling. There can be as much as 650K read IOs for
loading 10GB buffer pool.

Another one is recovery, where "fake read" IOs are executed.

Why there are more threads than we expect?
Worker threads are not be recognized as idle, until they return to the
standby list, and to return to that list, they need to acquire
mutex currently held in the submit_task(). In those cases, submit_task()
has no worker to wake, and would create threads until default concurrency
level (2*ncpus) is satisfied. Only after that throttling would happen.
2023-10-04 17:44:02 +02:00
Marko Mäkelä
0f9acce3f2 Merge 10.5 into 10.6 2023-09-14 09:01:15 +03:00
Marko Mäkelä
d20a4da23d MDEV-32150 InnoDB reports corruption on 32-bit platforms with ibd files sizes > 4GB
buf_read_page_low(): Use 64-bit arithmetics when computing the
file byte offset. In other calls to fil_space_t::io() the offset
was being computed correctly, for example by
buf_page_t::physical_offset().
2023-09-12 15:16:31 +03:00
Thirunarayanan Balathandayuthapani
a03b8cd0a2 MDEV-32145 Disable read-ahead for temporary tablespace
- Lifetime of temporary tables is expected to be short, it would
seem to make sense to assume that all temporary tablespace pages
will remain in the buffer pool. It doesn't make sense to have
read-ahead for pages of temporary tablespace
2023-09-11 18:02:53 +05:30
Marko Mäkelä
cdd2fa7fc5 MDEV-32134 InnoDB hang in buf_flush_wait_LRU_batch_end()
buf_flush_page_cleaner(): Before finishing a batch, wake up any threads
that are waiting for buf_pool.done_flush_LRU.

This should fix a hung shutdown that we observed
after SET GLOBAL innodb_buffer_pool_size started was executed
to shrink the InnoDB buffer pool.
2023-09-11 14:54:50 +03:00
Marko Mäkelä
9d1466522e MDEV-32029 Assertion failures in log_sort_flush_list upon crash recovery
In commit 0d175968d1 (MDEV-31354)
we only waited that no buf_pool.flush_list writes are in progress.
The buf_flush_page_cleaner() thread could still initiate page writes
from the buf_pool.LRU list while only holding buf_pool.mutex, not
buf_pool.flush_list_mutex. This is something that was changed in
commit a55b951e60 (MDEV-26827).

log_sort_flush_list(): Wait for the buf_flush_page_cleaner() thread to
be completely idle, including LRU flushing.

buf_flush_page_cleaner(): Always broadcast buf_pool.done_flush_list
when becoming idle, so that log_sort_flush_list() will be woken up.
Also, ensure that buf_pool.n_flush_inc() or
buf_pool.flush_list_set_active() has been invoked before any page
writes are initiated.

buf_flush_try_neighbors(): Release buf_pool.mutex here and not in the
callers, to avoid code duplication. Make innodb_flush_neighbors=ON
obey the innodb_io_capacity limit.
2023-08-30 14:40:13 +03:00
Marko Mäkelä
31ea201ecc MDEV-30986 Slow full index scan for I/O bound case
buf_page_init_for_read(): Test a condition before acquiring a latch,
not while holding it.

buf_read_ahead_linear(): Do not use a memory transaction, because it
could be too large, leading to frequent retries.
Release the hash_lock as early as possible.
2023-08-30 13:20:27 +03:00
Marko Mäkelä
08a549c33d Clean up buf_LRU_remove_hashed()
buf_LRU_block_remove_hashed(): Test for "not ROW_FORMAT=COMPRESSED" first,
because in that case we can assume that an uncompressed page exists.
This removes a condition from the likely code branch.
2023-08-25 13:44:59 +03:00
Marko Mäkelä
a60462d93e Remove bogus references to replaced Google contributions
In commit 03ca6495df and
commit ff5d306e29
we forgot to remove some Google copyright notices related to
a contribution of using atomic memory access in the old InnoDB
mutex_t and rw_lock_t implementation.

The copyright notices had been mostly added in
commit c6232c06fa
due to commit a1bb700fd2.

The following Google contributions remain:
* some logic related to the parameter innodb_io_capacity
* innodb_encrypt_tables, added in MariaDB Server 10.1
2023-08-21 15:51:16 +03:00
Marko Mäkelä
6cc88c3db1 Clean up buf0buf.inl
Let us move some #include directives from buf0buf.inl to
the compilation units where they are really used.
2023-08-21 15:51:10 +03:00
Marko Mäkelä
448c2077fb Merge 10.5 into 10.6 2023-08-21 15:50:31 +03:00
Marko Mäkelä
be5fd3ec35 Remove a stale comment
buf_LRU_block_remove_hashed(): Remove a comment that had been added
in mysql/mysql-server@aad1c7d0dd
and apparently referring to buf_LRU_invalidate_tablespace(),
which was later replaced with buf_LRU_flush_or_remove_pages() and
ultimately with buf_flush_remove_pages() and buf_flush_list_space().
All that code is covered by buf_pool.mutex. The note about releasing
the hash_lock for the buf_pool.page_hash slice would actually apply to
the last reference to hash_lock in buf_LRU_free_page(), for the
case zip=false (retaining a ROW_FORMAT=COMPRESSED page while
discarding the uncompressed one).
2023-08-21 13:28:12 +03:00
Oleksandr Byelkin
51f9d62005 Merge branch '10.11' into 11.0 2023-08-09 07:53:48 +02:00
Oleksandr Byelkin
036df5f970 Merge branch '10.10' into 10.11 2023-08-08 14:57:31 +02:00
Oleksandr Byelkin
34a8e78581 Merge branch '10.6' into 10.9 2023-08-04 08:01:06 +02:00
Marko Mäkelä
72928e640e MDEV-27593: Crashing on I/O error is unhelpful
buf_page_t::write_complete(), buf_page_write_complete(),
IORequest::write_complete(): Add a parameter for passing
an error code. If an error occurred, we will release the
io-fix, buffer-fix and page latch but not reset the
oldest_modification field. The block would remain in
buf_pool.LRU and possibly buf_pool.flush_list, to be written
again later, by buf_flush_page_cleaner(). If all page writes
start consistently failing, all write threads should eventually
hang in log_free_check() because the log checkpoint cannot
be advanced to make room in the circular write-ahead-log ib_logfile0.

IORequest::read_complete(): Add a parameter for passing
an error code. If a read operation fails, we report the error
and discard the page, just like we would do if the page checksum
was not validated or the page could not be decrypted.
This only affects asynchronous reads, due to linear or random read-ahead
or crash recovery. When buf_page_get_low() invokes buf_read_page(),
that will be a synchronous read, not involving this code.

This was tested by randomly injecting errors in
write_io_callback() and read_io_callback(), like this:

  if (!ut_rnd_interval(100))
    cb->m_err= 42;
2023-08-01 14:39:29 +03:00
Marko Mäkelä
96cfdb8710 MDEV-31816 fixup: Relax a debug assertion
buf_LRU_free_page(): The block may also be in the IBUF_EXIST state
when executing the test innodb.innodb_bulk_create_index_debug.
2023-08-01 13:22:16 +03:00
Marko Mäkelä
d794d3484b MDEV-31816 buf_LRU_free_page() does not preserve ROW_FORMAT=COMPRESSED block state
buf_LRU_free_page(): When we are discarding the uncompressed copy of a
ROW_FORMAT=COMPRESSED page, buf_page_t::can_relocate() must have ensured
that the block descriptor state is one of FREED, UNFIXED, REINIT.
Do not overwrite the state with UNFIXED. We do not want to write back
pages that were actually freed, and we want to avoid doublewrite for
pages that were (re)initialized by log records written since the latest
checkpoint. Last but not least, we do not want crashes like those that
commit dc1bd1802a (MDEV-31386)
was supposed to fix.

The test innodb_zip.wl5522_zip should typically cover all 3 states.

This bug is a regression due to
commit aaef2e1d8c (MDEV-27058).
2023-08-01 09:58:15 +03:00
Marko Mäkelä
f2b4972bd4 Merge 10.11 into 11.0 2023-07-26 15:13:06 +03:00
Marko Mäkelä
bce3ee704f Merge 10.10 into 10.11 2023-07-26 14:44:43 +03:00
Marko Mäkelä
7cde5c539b Merge 10.6 into 10.9 2023-07-10 11:22:21 +03:00
Monty
99bd226059 MDEV-31558 Add InnoDB engine information to the slow query log
The new statistics is enabled by adding the "engine", "innodb" or "full"
option to --log-slow-verbosity

Example output:

 # Pages_accessed: 184  Pages_read: 95  Pages_updated: 0  Old_rows_read: 1
 # Pages_read_time: 17.0204  Engine_time: 248.1297

Page_read_time is time doing physical reads inside a storage engine.
(Writes cannot be tracked as these are usually done in the background).
Engine_time is the time spent inside the storage engine for the full
duration of the read/write/update calls. It uses the same code as
'analyze statement' for calculating the time spent.

The engine statistics is done with a generic interface that should be
easy for any engine to use. It can also easily be extended to provide
even more statistics.

Currently only InnoDB has counters for Pages_% and Undo_% status.
Engine_time works for all engines.

Implementation details:

class ha_handler_stats holds all engine stats.  This class is included
in handler and THD classes.
While a query is running, all statistics is updated in the handler. In
close_thread_tables() the statistics is added to the THD.

handler::handler_stats is a pointer to where statistics should be
collected. This is set to point to handler::active_handler_stats if
stats are requested. If not, it is set to 0.
handler_stats has also an element, 'active' that is 1 if stats are
requested. This is to allow engines to avoid doing any 'if's while
updating the statistics.

Cloned or partition tables have the pointer set to the base table if
status are requested.

There is a small performance impact when using --log-slow-verbosity=engine:
- All engine calls in 'select' will be timed.
- IO calls for InnoDB reads will be timed.
- Incrementation of counters are done on local variables and accesses
  are inline, so these should have very little impact.
- Statistics has to be reset for each statement for the THD and each
  used handler. This is only 40 bytes, which should be neglectable.
- For partition tables we have to loop over all partitions to update
  the handler_status as part of table_init(). Can be optimized in the
  future to only do this is log-slow-verbosity changes. For this to work
  we have to update handler_status for all opened partitions and
  also for all partitions opened in the future.

Other things:
- Added options 'engine' and 'full' to log-slow-verbosity.
- Some of the new files in the test suite comes from Percona server, which
  has similar status information.
- buf_page_optimistic_get(): Do not increment any counter, since we are
  only validating a pointer, not performing any buf_pool.page_hash lookup.
- Added THD argument to save_explain_data_intern().
- Switched arguments for save_explain_.*_data() to have
  always THD first (generates better code as other functions also have THD
  first).
2023-07-07 12:53:18 +03:00
Marko Mäkelä
a906046f1f Merge 10.11 into 11.0 2023-07-04 08:20:20 +03:00
Marko Mäkelä
3430767e00 Merge 10.10 into 10.11 2023-07-04 08:19:48 +03:00
Marko Mäkelä
35de8326fb MDEV-31311: The test innodb.log_file_size_online occasionally hangs
log_t::write_checkpoint(), log_t::resize_start(): Invoke buf_flush_ahead()
with buf_pool.get_oldest_modification(0)+1 so that another checkpoint will
be invoked, to complete the log resizing.

Tested by: Oleg Smirnov (on Microsoft Windows, where this hung most often)
2023-07-03 16:50:01 +03:00
Marko Mäkelä
5fb2c031f7 Merge 10.11 into 11.0 2023-06-08 13:49:48 +03:00
Marko Mäkelä
c04284e747 Merge 10.10 into 10.11 2023-06-07 15:01:43 +03:00
Marko Mäkelä
878a86f276 Merge 10.6 into 10.9 2023-06-07 14:32:46 +03:00
Sergei Golubchik
0005f2f06c Merge branch 'bb-10.11-release' into bb-11.0-release 2023-06-05 19:27:00 +02:00
Sergei Golubchik
4e2b93dffe Merge branch 'bb-10.10-release' into bb-10.11-release 2023-06-05 19:04:58 +02:00
Sergei Golubchik
33fd519ca7 Merge branch 'github/bb-10.6-release' into bb-10.9-release 2023-06-05 18:55:26 +02:00
Marko Mäkelä
cf37e44eec MDEV-31350: Hang in innodb.recovery_memory
buf_flush_page_cleaner(): Whenever buf_pool.ran_out(), invoke
buf_pool.get_oldest_modification(0) so that all clean blocks
will be removed from buf_pool.flush_list and buf_flush_LRU_list_batch()
will be able to evict some pages.

This fixes a regression that was likely caused by
commit a55b951e60 (MDEV-26827).
2023-06-03 11:12:32 +02:00
Marko Mäkelä
f569e06e03 MDEV-31385 Change buffer stale entries leads to corruption while reusing page
buf_page_free(): If buffered changes existed for the page, drop them.

Co-developed with Thirunarayanan Balathandayuthapani
2023-06-02 11:06:09 +03:00
Marko Mäkelä
ce547cfc05 MDEV-31350: Hang in innodb.recovery_memory
buf_flush_page_cleaner(): Whenever buf_pool.ran_out(), invoke
buf_pool.get_oldest_modification(0) so that all clean blocks
will be removed from buf_pool.flush_list and buf_flush_LRU_list_batch()
will be able to evict some pages.

This fixes a regression that was likely caused by
commit a55b951e60 (MDEV-26827).
2023-05-26 16:40:07 +03:00
Marko Mäkelä
f2c17cc9d9 MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress
This is a 10.6 port of commit 2f9e264781
from MariaDB Server 10.9 that is missing some optimization due to a
more complex redo log format and recovery logic
(which was simplified in commit 685d958e38).

The progress reporting of InnoDB crash recovery was rather intermittent.
Nothing was reported during the single-threaded log record parsing, which
could consume minutes when parsing a large log. During log application,
there only was progress reporting in background threads that would be
invoked on data page read completion.

The progress reporting here will be detailed like this:

InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
InnoDB: Read redo log up to LSN=1963895808
InnoDB: Multi-batch recovery needed at LSN 2534560930
InnoDB: Read redo log up to LSN=3312233472
InnoDB: Read redo log up to LSN=1599646720
InnoDB: Read redo log up to LSN=2160831488
InnoDB: To recover: LSN 2806789376/2806819840; 195082 pages
InnoDB: To recover: LSN 2806789376/2806819840; 63507 pages
InnoDB: Read redo log up to LSN=3195776000
InnoDB: Read redo log up to LSN=3687099392
InnoDB: Read redo log up to LSN=4165315584
InnoDB: To recover: LSN 4374395699/4374440960; 241454 pages
InnoDB: To recover: LSN 4374395699/4374440960; 123701 pages
InnoDB: Read redo log up to LSN=4508724224
InnoDB: Read redo log up to LSN=5094550528
InnoDB: To recover: 205230 pages

The previous messages "Starting a batch to recover" or
"Starting a final batch to recover" will be replaced by
"To recover: ... pages" messages.

If a batch lasts longer than 15 seconds, then there will be
progress reports every 15 seconds, showing the number of remaining pages.
For the non-final batch, the "To recover:" message includes two end LSN:
that of the batch, and of the recovered log. This is the primary measure
of progress. The batch will end once the number of pages to recover
reaches 0.

If recovery is possible in a single batch, the output will look like this,
with a shorter "To recover:" message that counts only the remaining pages:

InnoDB: Starting crash recovery from checkpoint LSN=628599973,5653727799
InnoDB: Read redo log up to LSN=1984539648
InnoDB: Read redo log up to LSN=2710875136
InnoDB: Read redo log up to LSN=3358895104
InnoDB: Read redo log up to LSN=3965299712
InnoDB: Read redo log up to LSN=4557417472
InnoDB: Read redo log up to LSN=5219527680
InnoDB: To recover: 450915 pages

We will also speed up recovery by improving the memory management and
implementing multi-threaded recovery of data pages that will not need
to be read into the buffer pool ("fake read"). Log application in the
"fake read" threads will be protected by an atomic being_recovered field
and exclusive buf_page_t::lock.

Recovery will reserve for data pages two thirds of the buffer pool,
or 256 pages, whichever is smaller. Previously, we could only use at most
one third of the buffer pool for buffered log records. This would typically
mean that with large buffer pools, recovery unnecessary consisted of
multiple batches.

If recovery runs out of memory, it will "roll back" or "rewind" the current
mini-transaction. The recv_sys.recovered_lsn and recv_sys.pages
will correspond to the "out of memory LSN", at the end of the previous
complete mini-transaction.

If recovery runs out of memory while executing the final recovery batch,
we can simply invoke recv_sys.apply(false) to make room, and resume
parsing.

If recovery runs out of memory before the final batch, we will
scan the redo log to the end and check for any missing or inconsistent
files. In this version of the patch, we will throw away any previously
buffered recv_sys.pages and rescan the log from the checkpoint onwards.

recv_sys_t::pages_it: A cached iterator to recv_sys.pages.

recv_sys_t::is_memory_exhausted(): Remove. We will have out-of-memory
handling deep inside recv_sys_t::parse().

recv_sys_t::rewind(), page_recv_t::recs_t::rewind():
Remove all log starting with a specific LSN.

IORequest::write_complete(), IORequest::read_complete():
Replaces fil_aio_callback().

read_io_callback(), write_io_callback(): Replaces io_callback().

IORequest::fake_read_complete(), fake_io_callback(), os_fake_read():
Process a "fake read" request for concurrent recovery.

recv_sys_t::apply_batch(): Choose a number of successive pages
for a recovery batch.

recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a
page whose recovery is not in progress. Log application threads
will not invoke this; they will only set being_recovered=-1 to indicate
that the entry is no longer needed.

recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries.

recv_sys_t::wait_for_pool(): Wait for some space to become available
in the buffer pool.

mlog_init_t::mark_ibuf_exist(): Avoid calls to
recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low().
Such calls would lead to double locking of recv_sys.mutex, which
depending on implementation could cause a deadlock. We will use
lower-level calls to look up index pages.

buf_LRU_block_remove_hashed(): Disable consistency checks for freed
ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage.
This fixes an occasional failure of the test
innodb.innodb_bulk_create_index_debug.

Tested by: Matthias Leich
2023-05-19 15:20:07 +03:00
Marko Mäkelä
2f9e264781 MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress
The progress reporting of InnoDB crash recovery was rather intermittent.
Nothing was reported during the single-threaded log record parsing, which
could consume minutes when parsing a large log. During log application,
there only was progress reporting in background threads that would be
invoked on data page read completion.

The progress reporting here will be detailed like this:

InnoDB: Starting crash recovery from checkpoint LSN=503549688
InnoDB: Parsed redo log up to LSN=1990840177; to recover: 124806 pages
InnoDB: Parsed redo log up to LSN=2729777071; to recover: 186123 pages
InnoDB: Parsed redo log up to LSN=3488599173; to recover: 248397 pages
InnoDB: Parsed redo log up to LSN=4177856618; to recover: 306469 pages
InnoDB: Multi-batch recovery needed at LSN 4189599815
InnoDB: End of log at LSN=4483551634
InnoDB: To recover: LSN 4189599815/4483551634; 307490 pages
InnoDB: To recover: LSN 4189599815/4483551634; 197159 pages
InnoDB: To recover: LSN 4189599815/4483551634; 67623 pages
InnoDB: Parsed redo log up to LSN=4353924218; to recover: 102083 pages
...
InnoDB: log sequence number 4483551634 ...

The previous messages "Starting a batch to recover" or
"Starting a final batch to recover" will be replaced by
"To recover: ... pages" messages.

If a batch lasts longer than 15 seconds, then there will be
progress reports every 15 seconds, showing the number of remaining pages.
For the non-final batch, the "To recover:" message includes two end LSN:
that of the batch, and of the recovered log. This is the primary measure
of progress. The batch will end once the number of pages to recover
reaches 0.

If recovery is possible in a single batch, the output will look like this,
with a shorter "To recover:" message that counts only the remaining pages:

InnoDB: Starting crash recovery from checkpoint LSN=503549688
InnoDB: Parsed redo log up to LSN=1998701027; to recover: 125560 pages
InnoDB: Parsed redo log up to LSN=2734136874; to recover: 186446 pages
InnoDB: Parsed redo log up to LSN=3499505504; to recover: 249378 pages
InnoDB: Parsed redo log up to LSN=4183247844; to recover: 306964 pages
InnoDB: End of log at LSN=4483551634
...
InnoDB: To recover: 331797 pages
...
InnoDB: log sequence number 4483551634 ...

We will also speed up recovery by improving the memory management and
implementing multi-threaded recovery of data pages that will not need
to be read into the buffer pool ("fake read"). Log application in the
"fake read" threads will be protected by an atomic being_recovered field
and exclusive buf_page_t::latch.

Recovery will reserve for data pages two thirds of the buffer pool,
or 256 pages, whichever is smaller. Previously, we could only use at most
one third of the buffer pool for buffered log records. This would typically
mean that with large buffer pools, recovery unnecessary consisted of
multiple batches.

If recovery runs out of memory, it will "roll back" or "rewind" the current
mini-transaction. The recv_sys.lsn and recv_sys.pages will correspond
to the "out of memory LSN", at the end of the previous complete
mini-transaction.

If recovery runs out of memory while executing the final recovery batch,
we can simply invoke recv_sys.apply(false) to make room, and resume
parsing.

If recovery runs out of memory before the final batch, we will scan
the redo log to the end (recv_sys.scanned_lsn) and check for any missing
or inconsistent files. If recv_init_crash_recovery_spaces() does not
report any potentially missing tablespaces, we can make use of the
already stored recv_sys.pages and only rewind to the "out of memory LSN".
Else, we must keep parsing and invoking recv_validate_tablespace()
until an error has been found or everything has been resolved, and
ultimatily rewind to to the checkpoint LSN.

recv_sys_t::pages_it: A cached iterator to recv_sys.pages

recv_sys_t::parse_mtr(): Remove an ATTRIBUTE_NOINLINE that would
prevent tail call optimization in recv_sys_t::parse_pmem().

recv_sys_t::parse(), recv_sys_t::parse_mtr(), recv_sys_t::parse_pmem():
Add template<bool store> parameter. Redo log record parsing
(store=false) is better specialized from store=true
(with bool if_exists) so that we can avoid some conditional branches
in frequently invoked low-level code.

recv_sys_t::is_memory_exhausted(): Remove. The special parse() status
GOT_OOM will report out-of-memory situation at the low level.

recv_sys_t::rewind(), page_recv_t::recs_t::rewind():
Remove all log starting with a specific LSN.

recv_scan_log(): Separate some code for only parsing, not storing log.
In rewound_lsn, remember the LSN at which last_phase=false recovery
ran out of memory. This is where the next call to recv_scan_log()
will resume storing the log. This replaces recv_sys.last_stored_lsn.

recv_sys_t::parse(): Evaluate the template parameter store in a few more
cases, to allow dead code to be eliminated at compile time.

recv_sys_t::scanned_lsn: The end of the log found by recv_scan_log().
The special value 1 means that recv_sys has been initialized but
no log has been parsed.

IORequest::write_complete(), IORequest::read_complete():
Replaces fil_aio_callback().

read_io_callback(), write_io_callback(): Replaces io_callback().

IORequest::fake_read_complete(), fake_io_callback(), os_fake_read():
Process a "fake read" request for concurrent recovery.

recv_sys_t::apply_batch(): Choose a number of successive pages
for a recovery batch.

recv_sys_t::erase(recv_sys_t::map::iterator): Remove log records for a
page whose recovery is not in progress. Log application threads
will not invoke this; they will only set being_recovered=-1 to indicate
that the entry is no longer needed.

recv_sys_t::garbage_collect(): Remove all being_recovered=-1 entries.

recv_sys_t::wait_for_pool(): Wait for some space to become available
in the buffer pool.

mlog_init_t::mark_ibuf_exist(): Avoid calls to
recv_sys::recover_low() via ibuf_page_exists() and buf_page_get_low().
Such calls would lead to double locking of recv_sys.mutex, which
depending on implementation could cause a deadlock. We will use
lower-level calls to look up index pages.

buf_LRU_block_remove_hashed(): Disable consistency checks for freed
ROW_FORMAT=COMPRESSED pages. Their contents could be uninitialized garbage.
This fixes an occasional failure of the test
innodb.innodb_bulk_create_index_debug.

Tested by: Matthias Leich
2023-05-19 15:15:38 +03:00