This reverts commit 21b2fada7a
and commit 81d71ee6b2.
The MDEV-18464 change introduces a few data race issues. Contrary to
the documentation, the field trx_t::victim is not always being protected
by lock_sys_t::mutex and trx_t::mutex. Most importantly, it seems
that KILL QUERY could wrongly avoid acquiring both mutexes when
invoking lock_trx_handle_wait_low(), in case another thread had
already set trx->victim=true.
We also revert MDEV-12009, because it should depend on the MDEV-18464
fix being present.
As noted on kill_one_thread SUPER should be able to kill even
system threads i.e. threads/query flagged as high priority or
wsrep applier thread. Normal user, should not able to kill
threads/query flagged as high priority (BF) or wsrep applier
thread.
now we can afford it. Fix -Werror errors. Note:
* old gcc is bad at detecting uninit variables, disable it.
* time_t is int or long, cast it for printf's
This patch contains a fix for the MDEV-17262/17243 issues and
new mtr test.
These issues (MDEV-17262/17243) have two reasons:
1) After an intermediate commit, a transaction loses its status
of "transaction that registered in the MySQL for 2pc coordinator"
(in the InnoDB) due to the fact that since version 10.2 the
write_row() function (which located in the ha_innodb.cc) does
not call trx_register_for_2pc(m_prebuilt->trx) during the processing
of split transactions. It is necessary to restore this call inside
the write_row() when an intermediate commit was made (for a split
transaction).
Similarly, we need to set the flag of the started transaction
(m_prebuilt->sql_stat_start) after intermediate commit.
The table->file->extra(HA_EXTRA_FAKE_START_STMT) called from the
wsrep_load_data_split() function (which located in sql_load.cc)
will also do this, but it will be too late. As a result, the call
to the wsrep_append_keys() function from the InnoDB engine may be
lost or function may be called with invalid transaction identifier.
2) If a transaction with the LOAD DATA statement is divided into
logical mini-transactions (of the 10K rows) and binlog is rotated,
then in rare cases due to the wsrep handler re-registration at the
boundary of the split, the last portion of data may be lost. Since
splitting of the LOAD DATA into mini-transactions is technical,
I believe that we should not allow these mini-transactions to fall
into separate binlogs. Therefore, it is necessary to prohibit the
rotation of binlog in the middle of processing LOAD DATA statement.
https://jira.mariadb.org/browse/MDEV-17262 and
https://jira.mariadb.org/browse/MDEV-17243
* MDEV-16509 Improve wsrep commit performance with binlog disabled
Release commit order critical section early after trx_commit_low() if
binlog is not transaction coordinator. In order to avoid two phase commit,
binlog_hton is not registered for THD during IO_CACHE population.
Implemented a test which verifies that the transactions release
commit order early.
This optimization will change behavior during recovery as the commit
is not two phase when binlog is off. Fixed and recorded wsrep-recover-v25
and wsrep-recover to match the behavior.
* MDEV-18730 Ordering for wsrep binlog group commit
Previously out of order execution was allowed for wsrep commits.
Established proper ordering by populating wait_for_commit
for every wsrep THD and making group commit leader to wait for
prior commits before proceeding to trx_group_commit_leader().
* MDEV-18730 Added a test case to verify correct commit ordering
* MDEV-16509, MDEV-18730 Review fixes
Use WSREP_EMULATE_BINLOG() macro to decide if the binlog_hton
should be registered. Whitespace/syntax fixes and cleanups.
* MDEV-16509 Require binlog for galera_var_innodb_disallow_writes test
If the commit to InnoDB is done in one phase, the native InnoDB behavior
is that the transaction is committed in memory before it is persisted to
disk. This means that the innodb_disallow_writes=ON may not prevent
transaction to become visible to other readers before commit is completely
over. On the other hand, if the commit is two phase (as it is with binlog),
the transaction will be blocked in prepare phase.
Fixed the test to use binlog, which enforces two phase commit, which
in turn makes commit to block before the changes become visible to
other connections. This guarantees that the test produces expected
result.
Problem was that we skipped background persistent statistics calculation
on applier nodes if thread is marked as high priority (a.k.a BF).
However, on applier nodes all DDL which is replicate will be executed
as high priority i.e BF.
Fixed by allowing background persistent statistics calculation on
applier nodes even when thread is marked as BF. This could lead
BF lock waits but for queries on that node needs that statistics.
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
Global variable wsrep_debug now can be used to filter wsrep-lib messages based on debug level provided.
Type of wsrep_debug is now set to be unsigned int, so tests and configuration files changed accordingly.
main.derived_cond_pushdown: Move all 10.3 tests to the end,
trim trailing white space, and add an "End of 10.3 tests" marker.
Add --sorted_result to tests where the ordering is not deterministic.
main.win_percentile: Add --sorted_result to tests where the
ordering is no longer deterministic.
Support SET PASSWORD for authentication plugins.
Authentication plugin API is extended with two optional methods:
* hash_password() is used to compute a password hash (or digest)
from the plain-text password. This digest will be stored in mysql.user
table
* preprocess_hash() is used to convert this digest into some memory
representation that can be later used to authenticate a user.
Build-in plugins convert the hash from hexadecimal or base64 to binary,
to avoid doing it on every authentication attempt.
Note a change in behavior: when loading privileges (on startup or on
FLUSH PRIVILEGES) an account with an unknown plugin was loaded with a
warning (e.g. "Plugin 'foo' is not loaded"). But such an account could
not be used for authentication until the plugin is installed. Now an
account like that will not be loaded at all (with a warning, still).
Indeed, without plugin's preprocess_hash() method the server cannot know
how to load an account. Thus, if a new authentication plugin is
installed run-time, one might need FLUSH PRIVILEGES to activate all
existing accounts that were using this new plugin.
- Backported the MYSQL_SYSVAR_SIZE_T to 10.0
- The parameter innodb_ft_result_cache_limit was only 32 bits wide
also on 64-bit systems. Make it size_t, so that it will be 64 bits
on 64-bit systems.
- Added a test case that show how innodb_ft_result_cache_limit variables
behaves in 32bit and 64 bit system.
Performance schema likely/unlikely assume that performance schema
is enabled by default, which causes a performance degradation for
default installations that doesn't have performance schema enabled.
Fixed by changing the likely/unlikely in PS to assume it's
not enabled. This can be changed by compiling with
-DPSI_ON_BY_DEFAULT
Other changes:
- Added psi_likely/psi_unlikely that is depending on
PSI_ON_BY_DEFAULT. psi_likely() is assumed to be true
if PS is enabled.
- Added likely/unlikely to some PS interface code.
- Moved pfs_enabled to mysys (was initialized but not used before)
- Added "if (pfs_likely(pfs_enabled))" around calls to PS to avoid
an extra call if PS is not enabled.
- Moved checking flag_global_instrumention before other flags
to speed up the case when PS is not enabled.
In MDEV-8743, the port/socket of mysqld was changed to set FD_CLOEXEC.
The existing mysql_socket_socket function already set that with
SOCK_CLOEXEC when the socket was created. So here we move the fcntl
functionality to the mysql_socket_socket as port/socket are the only
callers.
Preprocessor checks of SOCK_CLOEXEC cannot be done as its a 0 if not
there and SOCK_CLOEXEC (being the value of the enum in bits/socket_type.h)
Preprocesssor logic for arithmetic and non-arithmetic defines are
hard/nonportable/ugly to read. As such we just check in my_global.h
and define HAVE_SOCK_CLOEXEC if we have it.
There was a disparity in behaviour between defined(WITH_WSREP) and
not depending on the OS, so the WITH_WSREP condition was removed
from setting calling fcntl.
All sockets are now maked SOCK_CLOEXEC/FD_CLOEXEC.
strace of mysqld with SOCK_CLOEXEC:
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 10
write(2, "180419 14:52:40 [Note] Server socket created on IP: '127.0.0.1'.\n", 65180419 14:52:40 [Note] Server socket created on IP: '127.0.0.1'.
) = 65
setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(10, {sa_family=AF_INET, sin_port=htons(16020), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(10, 150) = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 11
write(2, "180419 14:52:40 [Note] Server socket created on IP: '127.0.0.1'.\n", 65180419 14:52:40 [Note] Server socket created on IP: '127.0.0.1'.
) = 65
setsockopt(11, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(11, {sa_family=AF_INET, sin_port=htons(16021), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(11, 150) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 12
unlink("/home/dan/repos/build-mariadb-server-10.0/mysql-test/var/tmp/mysqld.1.sock") = -1 ENOENT (No such file or directory)
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
umask(000) = 006
bind(12, {sa_family=AF_UNIX, sun_path="/home/dan/repos/build-mariadb-server-10.0/mysql-test/var/tmp/mysqld.1.sock"}, 110) = 0
umask(006) = 000
listen(12, 150) = 0
Removed including wsrep_api.h from service_wsrep.h. This caused
various kinds of collisions with definitions when wsrep is
not supposed to be built in. Defined functions wsrep_xid_seqno()
and wsrep_xid_uuid() in wsrep_dummy.cc. Replaced wsrep_seqno_t
with long long where wsrep_api.h is not included.
Removed wsrep_xid_seqno() macro from wsrep_mysqld.h and made
wsrep code using wsrep_xid_seqno() in handler.cc to be compiled
in only if WITH_WSREP is ON.
Included wsrep_api.h for mariabackup if WITH_WSREP is ON.
A new wsrep XID format was added to keep the XID implementation
backwards compatible. Original version always reads XID seqno
part in host byte order, the new version in little endian byte
order. Wsrep XID will always be written in the new format.
Included wsrep_api.h from service_wsrep.h for wsrep type definitions.
Removed redundant wsrep XID code from mariabackup and included
service_wsrep.h in order to use
The problem is that the seqno part of wsrep XID is always
stored in host byte order. This may cause issues when a physical
backup is restored on a host with different architecture, the
seqno part with XID may have incorrect value.
In order to fix this, wsrep XID seqno is always written into
XID data buffer in little endian byte order using int8store()
and read from data buffer using sint8korr(). For backwards
compatibility the seqno is read from TRX_SYS page in host
byte order during upgrade.
This patch implements byte ordering in wsrep_xid_init(),
wsrep_xid_seqno(), and exposes functions to read wsrep
XID uuid and seqno in wsrep_service_st. Backwards compatibility
for upgrade is provided in trx_rseg_init_wsrep_xid().
Handle string length as size_t, consistently (almost always:))
Change function prototypes to accept size_t, where in the past
ulong or uint were used. change local/member variables to size_t
when appropriate.
This fix excludes rocksdb, spider,spider, sphinx and connect for now.
This was done in, among other things:
- thd->db and thd->db_length
- TABLE_LIST tablename, db, alias and schema_name
- Audit plugin database name
- lex->db
- All db and table names in Alter_table_ctx
- st_select_lex db
Other things:
- Changed a lot of functions to take const LEX_CSTRING* as argument
for db, table_name and alias. See init_one_table() as an example.
- Changed some function arguments from LEX_CSTRING to const LEX_CSTRING
- Changed some lists from LEX_STRING to LEX_CSTRING
- threads_mysql.result changed because process list_db wasn't always
correctly updated
- New append_identifier() function that takes LEX_CSTRING* as arguments
- Added new element tmp_buff to Alter_table_ctx to separate temp name
handling from temporary space
- Ensure we store the length after my_casedn_str() of table/db names
- Removed not used version of rename_table_in_stat_tables()
- Changed Natural_join_column::table_name and db_name() to never return
NULL (used for print)
- thd->get_db() now returns db as a printable string (thd->db.str or "")