In general, the position at which the storage recovers after an SST
cannot be known until the recovery process is over. This in turn
means that the position cannot be known when the server_state
sst_received() method is called. Worked around the problem by
introducing a get_position() method into the server service, which
can be used to read the position from stable storage after the SST
has completed and the state has been recovered.
Instead of handling the error case at the beginning, execute the
middle of the method body only in the success case, leaving a single
call to provider().sst_received() at the end.
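A self-contained sketch of the resulting control flow, using
simplified stand-in types; the real wsrep-lib signatures differ in
detail, and get_position() here is modeled after the description
above:

    // Simplified stand-ins for the wsrep-lib types involved.
    struct gtid { static gtid undefined() { return gtid(); } };
    struct client_service {};
    struct server_service_t
    {
        // Reads the recovered position from stable storage; available
        // only after the SST has completed.
        gtid get_position(client_service&) { return gtid(); }
    };
    struct provider_t
    {
        int sst_received(const gtid&, int error) { return error; }
    };

    struct server_state_sketch
    {
        server_service_t server_service_;
        provider_t provider_;
        provider_t& provider() { return provider_; }

        int sst_received(client_service& cs, int error)
        {
            gtid position(gtid::undefined());
            if (error == 0)
            {
                // Success path in the middle: the position is queried
                // from stable storage instead of being passed in.
                position = server_service_.get_position(cs);
                // ... further success-path processing ...
            }
            // Single exit: notify the provider in both cases.
            return provider().sst_received(position, error);
        }
    };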
Introduced server_state::interrupt_state_waiters() to interrupt
all waiters inside server_state::wait_until_state(). This mechanism
is needed when an error is encountered during state change processing
and waiting threads may need to be interrupted to check and handle
the error condition.
Made server_state::wait_until_state() throw an exception if the
wait was interrupted and the new server state is either disconnecting
or disconnected, which usually indicates an error condition.
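A minimal sketch of the waiter/interrupt mechanism, assuming a plain
std::condition_variable; the real wsrep-lib implementation uses its
own lock and exception types:

    #include <condition_variable>
    #include <mutex>
    #include <stdexcept>

    enum state { s_connected, s_joined, s_disconnecting, s_disconnected };

    class server_state_sketch
    {
    public:
        // Wake all threads blocked in wait_until_state() so they can
        // re-check the server state and handle a possible error.
        void interrupt_state_waiters()
        {
            std::lock_guard<std::mutex> lock(mutex_);
            interrupted_ = true;
            cond_.notify_all();
        }

        void wait_until_state(state desired)
        {
            std::unique_lock<std::mutex> lock(mutex_);
            cond_.wait(lock, [this, desired]()
                       { return state_ == desired || interrupted_; });
            // An interrupted wait while disconnecting/disconnected
            // usually indicates an error, so throw.
            if (interrupted_ &&
                (state_ == s_disconnecting || state_ == s_disconnected))
            {
                throw std::runtime_error("state wait interrupted");
            }
        }
    private:
        std::mutex mutex_;
        std::condition_variable cond_;
        state state_{s_connected};
        bool interrupted_{false};
    };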
Init first join crashed with
server: s1 unallowed state transition: joined -> joined
This was due to a missing check of the current state in
on_primary_view() before changing to the joined state. Added the
appropriate check.
Implemented unit tests for simple IST scenarios.
Replaced all references to provider_ in server_state methods with
calls to provider(), which is virtual and can be overridden by test
classes. The provider pointer may not yet be initialized during unit
tests.
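A sketch of the override pattern under simplified, hypothetical names
(provider_t, mock_provider); tests substitute a mock even though the
real pointer is never set:

    #include <cassert>

    struct provider_t { virtual ~provider_t() {} };
    struct mock_provider : provider_t {};

    class server_state_sketch
    {
    public:
        virtual ~server_state_sketch() {}
        // Methods call provider() instead of dereferencing provider_
        // directly, so test classes can substitute their own instance.
        virtual provider_t& provider()
        {
            assert(provider_);
            return *provider_;
        }
    protected:
        provider_t* provider_{nullptr}; // may still be 0 in unit tests
    };

    class test_server_state : public server_state_sketch
    {
    public:
        provider_t& provider() override { return mock_; }
    private:
        mock_provider mock_;
    };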
The transition joiner -> disconnecting may happen when the joiner
failed to receive the SST successfully. Because the system is in an
undefined state at this point, skip most of the processing in
sst_received() and return control to the caller after notifying the
provider about the failure.
The transition from the server_state connected state to disconnecting
must be allowed in order to deal with errors during server startup.
Added SST first test cases for server_state transitions:
* Successful join via SST
* Error in connect state
* Error in joiner state
Provider desync may return an error if the provider cannot communicate
with the rest of the cluster. However, this is acceptable for example
if the node has dropped from the primary view. Instead of returning an
error immediately after a failed desync(), attempt to pause the
provider regardless of the error. If the pause operation fails, an
error is returned.
In order to avoid a resync in resume_and_resync() in the case where
desync failed in desync_and_pause(), a new member variable
desynced_on_pause_ was introduced to decide whether to resync in
resume_and_resync(). This variable is protected by the pause()/resume()
calls, since they do not allow concurrent pause/resume operations.
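A self-contained sketch of the pairing, with the provider calls
stubbed out; the real methods carry locking and richer error handling:

    class server_state_sketch
    {
    public:
        int desync_and_pause()
        {
            // Desync may legitimately fail, for example when the node
            // has dropped from the primary view; remember the outcome
            // instead of returning early.
            const bool desynced(desync() == 0);
            if (pause() != 0)
            {
                if (desynced) resync();
                return 1; // only a failed pause is treated as an error
            }
            // No extra locking needed here: pause()/resume() do not
            // allow concurrent pause/resume operations.
            desynced_on_pause_ = desynced;
            return 0;
        }

        void resume_and_resync()
        {
            const bool desynced(desynced_on_pause_);
            desynced_on_pause_ = false;
            resume();
            // Skip the resync if the preceding desync never succeeded.
            if (desynced) resync();
        }
    private:
        int desync() { return 0; }  // stub for provider().desync()
        int pause() { return 0; }   // stub for provider().pause()
        void resume() {}            // stub for provider().resume()
        void resync() {}            // stub for provider().resync()
        bool desynced_on_pause_{false};
    };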
Method wsrep::server_state::convert_streaming_client_to_applier() may
insert an entry into the streaming_appliers_ map which contains an
undefined server_id. This happens if the method is called while in
non-primary state, when server_state::id_ is undefined.
The fix is to use the server_id which is recorded in the client's
transaction object.
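An illustrative sketch of the fix with simplified stand-in types; the
real map key and applier types in wsrep-lib are richer:

    #include <map>
    #include <string>
    #include <utility>

    struct id
    {
        std::string value;
        bool operator<(const id& other) const { return value < other.value; }
    };
    struct transaction { id server_id; long long id_; };
    struct client_state { transaction trx; };
    struct high_priority_applier {};

    std::map<std::pair<id, long long>, high_priority_applier>
        streaming_appliers_;

    void convert_streaming_client_to_applier(client_state& client)
    {
        // Key by the server ID recorded in the client's transaction:
        // well defined even when this node is in non-primary state
        // and server_state::id_ is undefined.
        streaming_appliers_.insert(
            std::make_pair(std::make_pair(client.trx.server_id,
                                          client.trx.id_),
                           high_priority_applier()));
    }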
- fixed node ID assertion in on_connect() method, and fixed the
"sanity checks" to allow reconnection to the primary component
- fixed code duplication in on_view() method
Added a call to log_view() to do the internal initializations that
need to be done on receiving a new view. Note however that this is not
a view *event*. Here we only need to configure the application to
comply with a new state it has received, so that it can go on to apply
replication events and catch up with the cluster.
When a member joins the group and needs to receive an SST, it won't
receive the corresponding membership view event, because the SST
happens after the event and already includes the effects of all events
ordered before it. The view must therefore be recovered from the
received state.
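A sketch of that recovery step under assumed names (get_view(),
log_view()); the actual wsrep-lib hooks and signatures may differ:

    // Simplified stand-ins; names are assumptions, not the real API.
    struct view {};
    struct client_service {};
    struct server_service_t
    {
        // Reads the membership view back from the state delivered by
        // the SST.
        view get_view(client_service&) { return view(); }
        // Performs the internal initializations for the new view;
        // this is state synchronization, not a view event.
        void log_view(const view&) {}
    };

    void on_sst_complete(server_service_t& service, client_service& cs)
    {
        view recovered(service.get_view(cs));
        service.log_view(recovered);
    }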
Minor renames and cleanups.
References codership/wsrep-lib#18
it on disconnect.
- Don't rely on the node's own index from the view, because the view
may come from another member (IST/SST); instead, always determine the
own index from the own ID.
Refs codership/wsrep-lib#13
Moved SR fragment removal for total order BF-aborted SR transactions
into the after_rollback() call, to avoid deadlocking while trying to
access storage before rolling back the transaction.
* Release server lock temporarily when BF aborting local SR
transaction during view event processing
* Check transaction state for BF aborts in before_prepare() after
the lock has been acquired again after fragment removal
* Send rollback fragment only from streaming_rollback()
* Check fragment removal error code in prepare phase. It is possible
that the transaction gets BF aborted during fragment removal.
* Mark fragment certified in certify_fragment() even if the provider
returns cert failed error. With current wsrep-API error codes
it may not be possible to distinguish certification failure
from BF abort during fragment replication. This may also be a
provider bug. As a result a rollback fragment may sometimes be
replicated when it would not be necessary.
* Count fragments certified and fragments stored separately in
streaming context (see the sketch after this list). Storing the
fragment may ultimately fail due to BF abort even if the fragment
was successfully certified. Therefore we need a separate counter for
certified fragments to determine whether the transaction is streaming,
and the seqnos of fragments which have been successfully stored.
* Provider release is called only after successful fragment
certification and fragment store.
* Fixed handling of write sets with rollback flag set in apply_write_set()
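A sketch of the separate counters mentioned above, with simplified
stand-in types; the real streaming context in wsrep-lib tracks more
state:

    #include <cstddef>
    #include <vector>

    class streaming_context_sketch
    {
    public:
        // Called after a fragment has passed certification.
        void certified() { ++fragments_certified_; }
        // Called after a fragment has also been stored successfully.
        void stored(long long seqno) { fragments_.push_back(seqno); }

        // The transaction counts as streaming once at least one
        // fragment has been certified, even if none was stored yet.
        std::size_t fragments_certified() const
        { return fragments_certified_; }
        // Seqnos of successfully stored fragments, e.g. for removal.
        const std::vector<long long>& fragments() const
        { return fragments_; }
    private:
        std::size_t fragments_certified_{0};
        std::vector<long long> fragments_;
    };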
SR transactions are BF aborted or rolled back on primary view
changes according to the following rules:
* Ongoing local SR transactions are BF aborted if the processing
server is not found in the current view.
* All remote SR transactions whose origin server is not included in the
current view are rolled back.
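A sketch of applying the two rules on a primary view change; the
membership lookup and the abort/rollback calls are stubbed
assumptions:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct id
    {
        std::string value;
        bool operator==(const id& other) const
        { return value == other.value; }
    };
    struct sr_transaction { id origin; };

    static bool in_view(const std::vector<id>& members, const id& server)
    {
        return std::find(members.begin(), members.end(), server)
            != members.end();
    }

    static void bf_abort_local_sr(sr_transaction&) {}  // stub
    static void rollback_remote_sr(sr_transaction&) {} // stub

    void on_primary_view(const std::vector<id>& members, const id& own_id,
                         std::vector<sr_transaction>& local_sr,
                         std::vector<sr_transaction>& remote_sr)
    {
        // Rule 1: BF abort ongoing local SR transactions if this
        // server is not found in the current view.
        if (!in_view(members, own_id))
            for (sr_transaction& trx : local_sr)
                bf_abort_local_sr(trx);
        // Rule 2: roll back remote SR transactions whose origin
        // server is not included in the current view.
        for (sr_transaction& trx : remote_sr)
            if (!in_view(members, trx.origin))
                rollback_remote_sr(trx);
    }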
* Enable codepath to BF abort high priority SR applier
* Pass ws_handle, ws_meta to high priority service rollback
call to allow total ordering of rollback process
* Added server_id into transaction in order to be able to stop
streaming applier during high priority BF abort
* Added missing commit fragment applying
* Don't clear fragments for replaying SR transaction
* Handle BF rollback also in after_statement() call.
* Added missing after_apply() call when handling rollback fragment.
* Fixed state changes when rollback is started during preparing state.
The write set handle and meta data are needed for SR transactions
where the commit context is not known when the transaction starts.
The handle and meta data can be set through the client_state
prepare_for_ordering() call before performing the commit.
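A sketch of the intended call sequence with simplified stand-in types;
the exact prepare_for_ordering() signature in wsrep-lib may differ:

    struct ws_handle {};
    struct ws_meta {};

    class client_state_sketch
    {
    public:
        // Record the ordering context for a transaction whose commit
        // context was not known when the transaction started.
        void prepare_for_ordering(const ws_handle& handle,
                                  const ws_meta& meta,
                                  bool is_ordered)
        {
            ws_handle_ = handle;
            ws_meta_ = meta;
            is_ordered_ = is_ordered;
        }
        // ... before_commit() etc. then use the stored handle/meta ...
    private:
        ws_handle ws_handle_;
        ws_meta ws_meta_;
        bool is_ordered_{false};
    };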
The interface method can be used to notify the DBMS implementation
about state changes in a well defined order. The call is made under
server_state mutex protection.
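A sketch of the notification pattern; the callback name
log_state_change() and the surrounding types are simplified
assumptions:

    #include <mutex>

    enum state { s_connected, s_joined, s_synced };

    struct server_service_t
    {
        // Notified for every state change, always under the
        // server_state mutex, so changes arrive in a well defined
        // order.
        virtual void log_state_change(state prev, state next) = 0;
        virtual ~server_service_t() {}
    };

    class server_state_sketch
    {
    public:
        explicit server_state_sketch(server_service_t& service)
            : service_(service) { }
        // Caller holds mutex_ via `lock`; the notification happens
        // before the lock is released.
        void change_state(std::unique_lock<std::mutex>& lock, state next)
        {
            (void)lock; // assumed to hold mutex_
            state prev(state_);
            state_ = next;
            service_.log_state_change(prev, next);
        }
    private:
        server_service_t& service_;
        std::mutex mutex_;
        state state_{s_connected};
    };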