1
0
mirror of https://github.com/MariaDB/server.git synced 2025-08-29 00:08:14 +03:00
Commit Graph

398 Commits

Author SHA1 Message Date
Marko Mäkelä
9bc874a594 MDEV-23497 Make ROW_FORMAT=COMPRESSED read-only by default
Let us introduce the parameter innodb_read_only_compressed
that is ON by default, making any ROW_FORMAT=COMPRESSED tables
read-only.

I developed the ROW_FORMAT=COMPRESSED format based on
Heikki Tuuri's rough design between 2005 and 2008. It might
have been a good idea back then, but no proper benchmarks were
ever run to validate the design or the implementation.

The format has been more or less obsolete for years.
It limits innodb_page_size to 16384 bytes (the default),
and instant ALTER TABLE is not supported.

This is the first step towards deprecating and removing
write support for ROW_FORMAT=COMPRESSED tables.
2020-11-11 11:15:11 +02:00
Marko Mäkelä
09a1f0075a Merge 10.5 into 10.6 2020-11-02 12:49:19 +02:00
Marko Mäkelä
898521e2dd Merge 10.4 into 10.5 2020-10-30 11:15:30 +02:00
Marko Mäkelä
7b2bb67113 Merge 10.3 into 10.4 2020-10-29 13:38:38 +02:00
Marko Mäkelä
a8de8f261d Merge 10.2 into 10.3 2020-10-28 10:01:50 +02:00
Marko Mäkelä
45ed9dd957 MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention
Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored
for system tablespace on SSD

When the maximum configured number of file is exceeded, InnoDB will
close data files. We used to maintain a fil_system.LRU list and
a counter fil_node_t::n_pending to achieve this, at the huge cost
of multiple fil_system.mutex operations per I/O operation.

fil_node_open_file_low(): Implement a FIFO replacement policy:
The last opened file will be moved to the end of fil_system.space_list,
and files will be closed from the start of the list. However, we will
not move tablespaces in fil_system.space_list while
i_s_tablespaces_encryption_fill_table() is executing
(producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION)
because it may cause information of some tablespaces to go missing.
We also avoid this in mariabackup --backup because datafiles_iter_next()
assumes that the ordering is not changed.

IORequest: Fold more parameters to IORequest::type.

fil_space_t::io(): Replaces fil_io().

fil_space_t::flush(): Replaces fil_flush().

OS_AIO_IBUF: Remove. We will always issue synchronous reads of the
change buffer pages in buf_read_page_low().

We will always ignore some errors for background reads.

This should reduce fil_system.mutex contention a little.

fil_node_t::complete_write(): Replaces fil_node_t::complete_io().
On both read and write completion, fil_space_t::release_for_io()
will have to be called.

fil_space_t::io(): Do not acquire fil_system.mutex in the normal
code path.

xb_delta_open_matching_space(): Do not try to open the system tablespace
which was already opened. This fixes a file sharing violation in
mariabackup --prepare --incremental.

Reviewed by: Vladislav Vaintroub
2020-10-26 17:09:01 +02:00
Varun Gupta
b94e8e4b25 MDEV-23867: insert... select crash in compute_window_func
There are 2 issues here:

Issue #1: memory allocation.
An IO_CACHE that uses encryption uses a larger buffer (it needs space for the encrypted data,
decrypted data, IO_CACHE_CRYPT struct to describe encryption parameters etc).

Issue #2: IO_CACHE::seek_not_done
When IO_CACHE objects are cloned, they still share the file descriptor.
This means, operation on one IO_CACHE may change the file read position
which will confuse other IO_CACHEs using it.

The fix of these issues would be:
Allocate the buffer to also include the extra size needed for encryption.
Perform seek again after one IO_CACHE reads the file.
2020-10-23 22:36:47 +05:30
Marko Mäkelä
1657b7a583 Merge 10.4 to 10.5 2020-10-22 17:08:49 +03:00
Marko Mäkelä
46957a6a77 Merge 10.3 into 10.4 2020-10-22 13:27:18 +03:00
Marko Mäkelä
e3d692aa09 Merge 10.2 into 10.3 2020-10-22 08:26:28 +03:00
Monty
f9c432c5a9 Disable from valgrind big innodb tests that doesn't run well in valgrind 2020-10-21 03:09:29 +03:00
Marko Mäkelä
db02c458c9 Clean up some encryption tests
Instead of pointlessly waiting for a page flush to occur, take
the matter into our own hands and request an explicit flush.
Also, test with the minimum necessary amount of data (0 or 1 rows)
so that both page encryption and decryption will be exercised.
2020-10-17 13:13:01 +03:00
Marko Mäkelä
7cffb5f6e8 MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.

The configuration parameters will be changed as follows:

innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)

Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows:
 * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
 * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
   the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().

The status variables will be changed as follows:

innodb_buffer_pool_pages_flushed: This includes also the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
updated one by one in buf_flush_page() to give more real-time
statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call

When moving to a single buffer pool instance in MDEV-15058, we missed
some opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.

Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.

The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.

The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message
will be removed, because by design, the buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.

We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.

To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.

For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.

mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().

buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.

buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.

buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.

buf_pool_t::done_flush_list: Condition variable for !n_flush_list.

buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.

buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.

buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().

flush_counters_t::unzip_LRU_evicted: Remove.

IORequest: Make more members const. FIXME: m_fil_node should be removed.

buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).

page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.

pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().

recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().

recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.

srv_started_redo: Replaces srv_start_state.

SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().

buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().

buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove 2 unnecessary
buf_pool.mutex release/re-acquisition that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.

buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.

buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.

buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().

buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.

buf_flush_check_neighbor(): Take id.fold() as a parameter.

buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.

buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.

buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.

buf_do_LRU_batch(): Return the number of pages flushed.

buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.

buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.

buf_LRU_check_size_of_non_data_objects(): Simplify the code.

buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.

buf_page_create(): Take a pre-allocated page as a parameter.

buf_pool_t::free_block(): Free a pre-allocated block.

recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.

BtrBulk::logFreeCheck(): Skip a redundant condition.

row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.

sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.

Reviewed by: Vladislav Vaintroub
2020-10-15 17:04:56 +03:00
Marko Mäkelä
6ce0a6f9ad Merge 10.5 into 10.6 2020-09-24 10:21:26 +03:00
Marko Mäkelä
882ce206db Merge 10.4 into 10.5 2020-09-23 11:32:43 +03:00
Marko Mäkelä
55e48b7722 Merge 10.3 into 10.4 2020-09-22 15:12:59 +03:00
Marko Mäkelä
fde3d895d9 Merge 10.2 into 10.3 2020-09-22 14:33:03 +03:00
Marko Mäkelä
2af8f712de MDEV-23776: Re-apply the fix and make the test more robust
The test that was added in commit e05650e686
would break a subsequent run of a test encryption.innodb-bad-key-change
because some pages in the system tablespace would be encrypted with
a different key.

The failure was repeatable with the following invocation:

./mtr --no-reorder \
encryption.create_or_replace,cbc \
encryption.innodb-bad-key-change,cbc

Because the crash was unrelated to the code changes that we reverted
in commit eb38b1f703
we can safely re-apply those fixes.
2020-09-22 13:08:09 +03:00
Marko Mäkelä
952a028a52 Merge 10.3 into 10.4
We omit commit a3bdce8f1e
and commit a0e2a293bc
because they would make the test galera_3nodes.galera_gtid_2_cluster
fail and disable it.
2020-09-21 17:42:02 +03:00
Marko Mäkelä
2cf489d430 Merge 10.2 into 10.3 2020-09-21 16:39:23 +03:00
Marko Mäkelä
e05650e686 MDEV-23776: Add the reduced encryption.create_or_replace test
This was forgotten in commit a9d8f5c1a5.
2020-09-21 16:35:28 +03:00
Marko Mäkelä
a9d8f5c1a5 MDEV-23776: Split encryption.create_or_replace
Let us shrink the test encryption.create_or_replace so that it can
run on the CI system, also on the embedded server.

encryption.create_or_replace_big: Renamed from the original test,
with the subset of encryption.create_or_replace omitted.
2020-09-21 16:14:35 +03:00
Andrei Elkin
a19cb3884f MDEV-23511 shutdown_server 10 times out, causing server kill at shutdown
Shutdown of mtr tests may be too impatient, esp on CI environment where
10 seconds of `arg` of `shutdown_server arg` may not be enough for the clean
shutdown to complete.

This is fixed to remove explicit non-zero timeout argument to
`shutdown_server` from all mtr tests. mysqltest computes 60 seconds default
value for the timeout for the argless `shutdown_server` command.
This policy is additionally ensured with a compile time assert.
2020-08-21 14:48:53 +03:00
Marko Mäkelä
2fa9f8c53a Merge 10.3 into 10.4 2020-08-20 11:01:47 +03:00
Marko Mäkelä
de0e7cd72a Merge 10.2 into 10.3 2020-08-20 09:12:16 +03:00
Marko Mäkelä
4c50120d14 MDEV-23474 InnoDB fails to restart after SET GLOBAL innodb_log_checksums=OFF
Regretfully, the parameter innodb_log_checksums was introduced
in MySQL 5.7.9 (the first GA release of that series) by
mysql/mysql-server@af0acedd88
which partly replaced a parameter that had been introduced in 5.7.8
mysql/mysql-server@22ba38218e
as innodb_log_checksum_algorithm.

Given that the CRC-32C operations are accelerated on many processor
implementations (AMD64 with SSE4.2; since MDEV-22669 also on IA-32
with SSE4.2, POWER 8 and later, ARMv8 with some extensions)
and by lookup tables when only generic SISD instructions are available,
there should be no valid reason to disable checksums.

In MariaDB 10.5.2, as a preparation for MDEV-12353, MDEV-19543 deprecated
and ignored the parameter innodb_log_checksums altogether. This should
imply that after a clean shutdown with innodb_log_checksums=OFF one
cannot upgrade to MariaDB Server 10.5 at all.

Due to these problems, let us deprecate the parameter innodb_log_checksums
and honor it only during server startup.
The command SET GLOBAL innodb_log_checksums will always set the
parameter to ON.
2020-08-18 16:46:07 +03:00
Marko Mäkelä
e685809a3b MDEV-23397 Remove deprecated InnoDB options in 10.6 2020-08-04 12:51:59 +03:00
Marko Mäkelä
50a11f396a Merge 10.4 into 10.5 2020-08-01 14:42:51 +03:00
Marko Mäkelä
9216114ce7 Merge 10.3 into 10.4 2020-07-31 18:09:08 +03:00
Marko Mäkelä
66ec3a770f Merge 10.2 into 10.3 2020-07-31 13:51:28 +03:00
Marko Mäkelä
6307b17aa1 MDEV-20142 encryption.innodb_encrypt_temporary_tables failed in buildbot with wrong result
Let us read both encrypted temporary tables to increase the changes of
page flushing and eviction.
2020-07-28 13:00:59 +03:00
Thirunarayanan Balathandayuthapani
c92f7e287f MDEV-8139 Fix Scrubbing
fil_space_t::freed_ranges: Store ranges of freed page numbers.

fil_space_t::last_freed_lsn: Store the most recent LSN of
freeing a page.

fil_space_t::freed_mutex: Protects freed_ranges, last_freed_lsn.

fil_space_create(): Initialize the freed_range mutex.

fil_space_free_low(): Frees the freed_range mutex.

range_set: Ranges of page numbers.

buf_page_create(): Removes the page from freed_ranges when page
is being reused.

btr_free_root(): Remove the PAGE_INDEX_ID invalidation. Because
btr_free_root() and dict_drop_index_tree() are executed in
the same atomic mini-transaction, there is no need to
invalidate the root page.

buf_release_freed_page(): Split from buf_flush_freed_page().
Skip any I/O

buf_flush_freed_pages(): Get the freed ranges from tablespace and
Write punch-hole or zeroes of the freed ranges.

buf_flush_try_neighbors(): Handles the flushing of freed ranges.

mtr_t::freed_pages: Variable to store the list of freed pages.

mtr_t::add_freed_pages(): To add freed pages.

mtr_t::clear_freed_pages(): To clear the freed pages.

mtr_t::m_freed_in_system_tablespace: Variable to indicate whether page has
been freed in system tablespace.

mtr_t::m_trim_pages: Variable to indicate whether the space has been trimmed.

mtr_t::commit(): Add the freed page and update the last freed lsn
in the tablespace and clear the tablespace freed range if space is
trimmed.

file_name_t::freed_pages: Store the freed pages during recovery.

file_name_t::add_freed_page(), file_name_t::remove_freed_page(): To
add and remove freed page during recovery.

store_freed_or_init_rec(): Store or remove the freed pages while
encountering FREE_PAGE or INIT_PAGE redo log record.

recv_init_crash_recovery_spaces(): Add the freed page encountered
during recovery to respective tablespace.
2020-06-12 09:17:51 +05:30
Marko Mäkelä
b1ab211dee MDEV-15053 Reduce buf_pool_t::mutex contention
User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
and will no longer report the PAGE_STATE value READY_FOR_USE.

We will remove some fields from buf_page_t and move much code to
member functions of buf_pool_t and buf_page_t, so that the access
rules of data members can be enforced consistently.

Evicting or adding pages in buf_pool.LRU will remain covered by
buf_pool.mutex.

Evicting or adding pages in buf_pool.page_hash will remain
covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.

After this fix, buf_pool.page_hash lookups can entirely
avoid acquiring buf_pool.mutex, only relying on
buf_pool.hash_lock_get() S-latch.

Similarly, buf_flush_check_neighbors() can will rely solely on
buf_pool.mutex, no buf_pool.page_hash latch at all.

The buf_pool.mutex is rather contended in I/O heavy benchmarks,
especially when the workload does not fit in the buffer pool.

The first attempt to alleviate the contention was the
buf_pool_t::mutex split in
commit 4ed7082eef
which introduced buf_block_t::mutex, which we are now removing.

Later, multiple instances of buf_pool_t were introduced
in commit c18084f71b
and recently removed by us in
commit 1a6f708ec5 (MDEV-15058).

UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
related debugging in otherwise non-debug builds has not been used
for years. Instead, we have been using UNIV_DEBUG, which is enabled
in CMAKE_BUILD_TYPE=Debug.

buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
std::atomic and the buf_pool.page_hash latches, and in some cases
depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
We must always release buf_block_t::lock before invoking
unfix() or io_unfix(), to prevent a glitch where a block that was
added to the buf_pool.free list would apper X-latched. See
commit c5883debd6 how this glitch
was finally caught in a debug environment.

We move some buf_pool_t::page_hash specific code from the
ha and hash modules to buf_pool, for improved readability.

buf_pool_t::close(): Assert that all blocks are clean, except
on aborted startup or crash-like shutdown.

buf_pool_t::validate(): No longer attempt to validate
n_flush[] against the number of BUF_IO_WRITE fixed blocks,
because buf_page_t::flush_type no longer exists.

buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
Reduce mutex contention by separating the buf_pool.watch[]
allocation and the insert into buf_pool.page_hash.

buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
buf_pool.page_hash latch.
Replaces and extends buf_page_hash_lock_s_confirm()
and buf_page_hash_lock_x_confirm().

buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.

buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
Use Atomic_counter.

buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().

buf_pool_t::LRU_remove(): Remove a block from the LRU list
and return its predecessor. Incorporates buf_LRU_adjust_hp(),
which was removed.

buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
BTR_DELETE_OP (purge), which is never invoked on temporary tables.

buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.

buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.

buf_LRU_free_page(): Clarify the function comment.

buf_flush_check_neighbor(), buf_flush_check_neighbors():
Rewrite the construction of the page hash range. We will hold
the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
consecutive lookups of buf_pool.page_hash.

buf_flush_page_and_try_neighbors(): Remove.
Merge to its only callers, and remove redundant operations in
buf_flush_LRU_list_batch().

buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
Do not acquire buf_pool.mutex, and iterate directly with page_id_t.

ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
and avoids any loops.

fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.

buf_flush_page(): Add a fil_space_t* parameter. Minimize the
buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
atomically with the io_fix, and we will protect most buf_block_t
fields with buf_block_t::lock. The function
buf_flush_write_block_low() is removed and merged here.

buf_page_init_for_read(): Use static linkage. Initialize the newly
allocated block and acquire the exclusive buf_block_t::lock while not
holding any mutex.

IORequest::IORequest(): Remove the body. We only need to invoke
set_punch_hole() in buf_flush_page() and nowhere else.

buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
This field is only used during a fil_io() call.
That function already takes IORequest as a parameter, so we had
better introduce  for the rarely changing field.

buf_block_t::init(): Replaces buf_page_init().

buf_page_t::init(): Replaces buf_page_init_low().

buf_block_t::initialise(): Initialise many fields, but
keep the buf_page_t::state(). Both buf_pool_t::validate() and
buf_page_optimistic_get() requires that buf_page_t::in_file()
be protected atomically with buf_page_t::in_page_hash
and buf_page_t::in_LRU_list.

buf_page_optimistic_get(): Now that buf_block_t::mutex
no longer exists, we must check buf_page_t::io_fix()
after acquiring the buf_pool.page_hash lock, to detect
whether buf_page_init_for_read() has been initiated.
We will also check the io_fix() before acquiring hash_lock
in order to avoid unnecessary computation.
The field buf_block_t::modify_clock (protected by buf_block_t::lock)
allows buf_page_optimistic_get() to validate the block.

buf_page_t::real_size: Remove. It was only used while flushing
pages of page_compressed tables.

buf_page_encrypt(): Add an output parameter that allows us ot eliminate
buf_page_t::real_size. Replace a condition with debug assertion.

buf_page_should_punch_hole(): Remove.

buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
Add the parameter size (to replace buf_page_t::real_size).

buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
Add the parameter size (to replace buf_page_t::real_size).

fil_system_t::detach(): Replaces fil_space_detach().
Ensure that fil_validate() will not be violated even if
fil_system.mutex is released and reacquired.

fil_node_t::complete_io(): Renamed from fil_node_complete_io().

fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
Avoid invoking fil_node_t::close() because fil_system.n_open
has already been decremented in fil_space_t::detach().

BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.

BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
and distinguish dirty pages by buf_page_t::oldest_modification().

BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
This state was only being used for buf_page_t that are in
buf_pool.watch.

buf_pool_t::watch[]: Remove pointer indirection.

buf_page_t::in_flush_list: Remove. It was set if and only if
buf_page_t::oldest_modification() is nonzero.

buf_page_decrypt_after_read(), buf_corrupt_page_release(),
buf_page_check_corrupt(): Change the const fil_space_t* parameter
to const fil_node_t& so that we can report the correct file name.

buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.

buf_page_io_complete(): Split to buf_page_read_complete() and
buf_page_write_complete().

buf_dblwr_t::in_use: Remove.

buf_dblwr_t::buf_block_array: Add IORequest::flush_t.

buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
os_aio_wait_until_no_pending_writes().

buf_flush_write_complete(): Declare static, not global.
Add the parameter IORequest::flush_t.

buf_flush_freed_page(): Simplify the code.

recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.

fil_read(), fil_write(): Replaced with direct use of fil_io().

fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.

fil_mutex_enter_and_prepare_for_io(): Return the resolved
fil_space_t* to avoid a duplicated lookup in the caller.

fil_report_invalid_page_access(): Clean up the parameters.

fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
Always invoke fil_space_t::acquire_for_io() and let either the
sync=true caller or fil_aio_callback() invoke
fil_space_t::release_for_io().

fil_aio_callback(): Rewrite to replace buf_page_io_complete().

fil_check_pending_operations(): Remove a parameter, and remove some
redundant lookups.

fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
do an extra lookup of the tablespace between fil_io() and the
completion of the operation, we must give fil_node_t::complete_io() a
chance to decrement the counter.

fil_close_tablespace(): Remove unused parameter trx, and document
that this is only invoked during the error handling of IMPORT TABLESPACE.

row_import_discard_changes(): Merged with the only caller,
row_import_cleanup(). Do not lock up the data dictionary while
invoking fil_close_tablespace().

logs_empty_and_mark_files_at_shutdown(): Do not invoke
fil_close_all_files(), to avoid a !needs_flush assertion failure
on fil_node_t::close().

innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().

fil_close_all_files(): Invoke fil_flush_file_spaces()
to ensure proper durability.

thread_pool::unbind(): Fix a crash that would occur on Windows
after srv_thread_pool->disable_aio() and os_file_close().
This fix was submitted by Vladislav Vaintroub.

Thanks to Matthias Leich and Axel Schwenke for extensive testing,
Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.
2020-06-05 12:35:46 +03:00
Marko Mäkelä
23047d3ed4 Merge 10.4 into 10.5 2020-05-18 17:30:02 +03:00
Varun Gupta
0a5668f512 MDEV-22556: Incorrect result for window function when using encrypt-tmp-files=ON
The issue here is that end_of_file for encrypted temporary IO_CACHE (used by filesort) is updated
using lseek.
Encryption adds storage overhead and hides it from the caller by recalculating offsets and lengths.
Two different IO_CACHE cannot possibly modify the same file
because the encryption key is randomly generated and stored in the IO_CACHE.
So when the tempfiles are encrypted DO NOT use lseek to change end_of_file.

Further observations about updating end_of_file using lseek
1) The end_of_file update is only used for binlog index files
2) The whole point is to update file length when the file was modified via a different file descriptor.
3) The temporary IO_CACHE files can never be modified via a different file descriptor.
4) For encrypted temporary IO_CACHE, end_of_file should not be updated with lseek
2020-05-17 15:26:18 +05:30
Marko Mäkelä
9e6e43551f Merge 10.3 into 10.4
We will expose some more std::atomic internals in Atomic_counter,
so that dict_index_t::lock will support the default assignment operator.
2020-05-16 07:39:15 +03:00
Marko Mäkelä
6a6bcc53b8 Merge 10.2 into 10.3 2020-05-15 17:55:01 +03:00
Monty
277aa85c9b Fixed bugs found by valgrind
Other things:
- Removed innodb_encryption_tables.test from valgrind as it
  takes a REALLY long time
2020-05-15 10:44:05 +03:00
Marko Mäkelä
2f7d91bb6c MDEV-22242 B-trees can become extremely skewed
The test innodb.innodb_wl6326 that had been disabled in 10.4 due to
MDEV-21535 is failing on 10.5 due to a different reason: the removal
of the MLOG_COMP_END_COPY_CREATED operations in MDEV-12353
commit 276f996af9 caused PAGE_LAST_INSERT
to be set to something nonzero by the function page_copy_rec_list_end().

This in turn would cause btr_page_get_split_rec_to_right() to behave
differently: we would not attempt to split the page at all, but simply
insert the new record into the new, empty, right leaf page.

Even though the change reduced the sizes of some tables, it is better
to aim for balanced trees.

page_copy_rec_list_end(), PageBulk::finishPage():
Preserve PAGE_LAST_INSERT, PAGE_N_DIRECTION, PAGE_DIRECTION.

PageBulk::finish(): Move some common code from PageBulk::finishPage().
2020-04-14 18:43:03 +03:00
Marko Mäkelä
0eab87cef2 MDEV-22010: Allow mariadbd in mtr suppressions 2020-04-07 17:20:38 +03:00
Daniel Black
d848fcad69 MDEV-22010: mtr, "mariadbd" exists in mysys error messages
Also executable can change for jemalloc.
2020-04-07 23:39:10 +10:00
Marko Mäkelä
64be4ab4a8 MDEV-21870 Deprecate and ignore innodb_scrub_log and innodb_scrub_log_speed
The configuration parameter innodb_scrub_log never really worked, as
reported in MDEV-13019 and MDEV-18370.

Because MDEV-14425 is changing the redo log format, the innodb_scrub_log
feature would have to be adjusted for it. Due to the known problems,
it is easier to remove the feature for now, and to ignore and deprecate
the parameters.

If old log contents should be kept secret, then enabling innodb_encrypt_log
or setting a smaller innodb_log_file_size could help.
2020-03-04 19:01:09 +02:00
Eugene Kosov
9ef2d29ff4 MDEV-14425 deprecate and ignore innodb_log_files_in_group
Now there can be only one log file instead of several which
logically work as a single file.

Possible names of redo log files: ib_logfile0,
ib_logfile101 (for just created one)

innodb_log_fiels_in_group: value of this variable is not used
by InnoDB. Possible values are still 1..100, to not break upgrade

LOG_FILE_NAME: add constant of value "ib_logfile0"
LOG_FILE_NAME_PREFIX: add constant of value "ib_logfile"

get_log_file_path(): convenience function that returns full
path of a redo log file

SRV_N_LOG_FILES_MAX: removed

srv_n_log_files: we can't remove this for compatibility reasons,
but now server doesn't use this variable

log_sys_t::file::fd: now just one, not std::vector

log_sys_t::log_capacity: removed word 'group'

find_and_check_log_file(): part of logic from huge srv_start()
moved here

recv_sys_t::files: file descriptors of redo log files.
There can be several of those in case we're upgrading
from older MariaDB version.

recv_sys_t::remove_extra_log_files: whether to remove
ib_logfile{1,2,3...} after successfull upgrade.

recv_sys_t::read(): open if needed and read from one
of several log files

recv_sys_t::files_size(): open if needed and return files count

redo_file_sizes_are_correct(): check that redo log files
sizes are equal. Just to log an error for a user.
Corresponding check was moved from srv0start.cc

namespace deprecated: put all deprecated variables here to
prevent usage of it by us, developers
2020-02-19 12:21:59 +03:00
Marko Mäkelä
444c83b2ac MDEV-12353: Test InnoDB upgrade from multi-file redo log 2020-02-14 14:46:44 +02:00
Marko Mäkelä
f8a9f90667 MDEV-12353: Remove support for crash-upgrade
We tighten some assertions regarding dict_index_t::is_dummy
and crash recovery, now that redo log processing will
no longer create dummy objects.
2020-02-13 19:13:45 +02:00
Marko Mäkelä
7ae21b18a6 MDEV-12353: Change the redo log encoding
log_t::FORMAT_10_5: physical redo log format tag

log_phys_t: Buffered records in the physical format.
The log record bytes will follow the last data field,
making use of alignment padding that would otherwise be wasted.
If there are multiple records for the same page, also those
may be appended to an existing log_phys_t object if the memory
is available.

In the physical format, the first byte of a record identifies the
record and its length (up to 15 bytes). For longer records, the
immediately following bytes will encode the remaining length
in a variable-length encoding. Usually, a variable-length-encoded
page identifier will follow, followed by optional payload, whose
length is included in the initially encoded total record length.

When a mini-transaction is updating multiple fields in a page,
it can avoid repeating the tablespace identifier and page number
by setting the same_page flag (most significant bit) in the first
byte of the log record. The byte offset of the record will be
relative to where the previous record for that page ended.

Until MDEV-14425 introduces a separate file-level log for
redo log checkpoints and file operations, we will write the
file-level records in the page-level redo log file.
The record FILE_CHECKPOINT (which replaces MLOG_CHECKPOINT)
will be removed in MDEV-14425, and one sequential scan of the
page recovery log will suffice.

Compared to MLOG_FILE_CREATE2, FILE_CREATE will not include any flags.
If the information is needed, it can be parsed from WRITE records that
modify FSP_SPACE_FLAGS.

MLOG_ZIP_WRITE_STRING: Remove. The record was only introduced temporarily
as part of this work, before being replaced with WRITE (along with
MLOG_WRITE_STRING, MLOG_1BYTE, MLOG_nBYTES).

mtr_buf_t::empty(): Check if the buffer is empty.

mtr_t::m_n_log_recs: Remove. It suffices to check if m_log is empty.

mtr_t::m_last, mtr_t::m_last_offset: End of the latest m_log record,
for the same_page encoding.

page_recv_t::last_offset: Reflects mtr_t::m_last_offset.

Valid values for last_offset during recovery should be 0 or above 8.
(The first 8 bytes of a page are the checksum and the page number,
and neither are ever updated directly by log records.)
Internally, the special value 1 indicates that the same_page form
will not be allowed for the subsequent record.

mtr_t::page_create(): Take the block descriptor as parameter,
so that it can be compared to mtr_t::m_last. The INIT_INDEX_PAGE
record will always followed by a subtype byte, because same_page
records must be longer than 1 byte.

trx_undo_page_init(): Combine the writes in WRITE record.

trx_undo_header_create(): Write 4 bytes using a special MEMSET
record that includes 1 bytes of length and 2 bytes of payload.

flst_write_addr(): Define as a static function. Combine the writes.

flst_zero_both(): Replaces two flst_zero_addr() calls.

flst_init(): Do not inline the function.

fsp_free_seg_inode(): Zerofill the whole inode.

fsp_apply_init_file_page(): Initialize FIL_PAGE_PREV,FIL_PAGE_NEXT
to FIL_NULL when using the physical format.

btr_create(): Assert !page_has_siblings() because fsp_apply_init_file_page()
must have been invoked.

fil_ibd_create(): Do not write FILE_MODIFY after FILE_CREATE.

fil_names_dirty_and_write(): Remove the parameter mtr.
Write the records using a separate mini-transaction object,
because any FILE_ records must be at the start of a mini-transaction log.

recv_recover_page(): Add a fil_space_t* parameter.
After applying log to the a ROW_FORMAT=COMPRESSED page,
invoke buf_zip_decompress() to restore the uncompressed page.

buf_page_io_complete(): Remove the temporary hack to discard the
uncompressed page of a ROW_FORMAT=COMPRESSED page.

page_zip_write_header(): Remove. Use mtr_t::write() or
mtr_t::memset() instead, and update the compressed page frame
separately.

trx_undo_header_add_space_for_xid(): Remove.

trx_undo_seg_create(): Perform the changes that were previously
made by trx_undo_header_add_space_for_xid().

btr_reset_instant(): New function: Reset the table to MariaDB 10.2
or 10.3 format when rolling back an instant ALTER TABLE operation.

page_rec_find_owner_rec(): Merge with the only callers.

page_cur_insert_rec_low(): Combine writes by using a local buffer.
MEMMOVE data from the preceding record whenever feasible
(copying at least 3 bytes).

page_cur_insert_rec_zip(): Combine writes to page header fields.

PageBulk::insertPage(): Issue MEMMOVE records to copy a matching
part from the preceding record.

PageBulk::finishPage(): Combine the writes to the page header
and to the sparse page directory slots.

mtr_t::write(): Only log the least significant (last) bytes
of multi-byte fields that actually differ.

For updating FSP_SIZE, we must always write all 4 bytes to the
redo log, so that the fil_space_set_recv_size() logic in
recv_sys_t::parse() will work.

mtr_t::memcpy(), mtr_t::zmemcpy(): Take a pointer argument
instead of a numeric offset to the page frame. Only log the
last bytes of multi-byte fields that actually differ.

In fil_space_crypt_t::write_page0(), we must log also any
unchanged bytes, so that recovery will recognize the record
and invoke fil_crypt_parse().

Future work:
MDEV-21724 Optimize page_cur_insert_rec_low() redo logging
MDEV-21725 Optimize btr_page_reorganize_low() redo logging
MDEV-21727 Optimize redo logging for ROW_FORMAT=COMPRESSED
2020-02-13 19:12:17 +02:00
Marko Mäkelä
276f996af9 MDEV-12353: Replace MLOG_*_END_COPY_CREATED
Instead of writing the high-level redo log records
MLOG_LIST_END_COPY_CREATED, MLOG_COMP_LIST_END_COPY_CREATED
write log for each individual insert of a record.

page_copy_rec_list_end_to_created_page(): Remove.

This will improve the fill factor of some pages.
Adjust some tests accordingly.

PageBulk::init(), PageBulk::finish(): Avoid setting bogus limits
to PAGE_HEAP_TOP and PAGE_N_DIR_SLOTS. Avoid accessor functions
that would enforce these limits before the correct ones are set
at the end of PageBulk::finish().
2020-02-13 18:19:14 +02:00
Marko Mäkelä
9cd6b230ac MDEV-20839: Re-enable the test
The merge commit 68fe5f534c missed this.
2020-01-09 09:03:30 +02:00
Marko Mäkelä
68fe5f534c Merge 10.4 into 10.5 2020-01-07 14:10:15 +02:00
Marko Mäkelä
d2697dfbc6 MDEV-20839 encryption.innodb-redo-badkey sporadically fails on buildbot with page dump
The test encryption.innodb-redo-badkey will by design cause access
to pages that appear corrupted (due to incorrect encryption key).
Let us disable the page dumps by requiring the test to be run on
a debug server. Page dumps on debug builds were already disabled
in MDEV-19766.
2020-01-07 13:35:32 +02:00