
fix(conn): conn to have state machine (#3559)

* wip

* wip, used and unusable states

* polish state machine

* correct handling of OnPut

* better errors for tests, hook should work now

* fix linter

* improve reauth state management. fix tests

* Update internal/pool/conn.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/pool/conn.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* better timeouts

* empty endpoint handoff case

* fix handoff state when queued for handoff

* try to detect the deadlock

* try to detect the deadlock x2

* delete should be called

* improve tests

* fix mark on uninitialized connection

* Update internal/pool/conn_state_test.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/pool/conn_state_test.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/pool/pool.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/pool/conn_state.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/pool/conn.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix error from copilot

* address copilot comment

* fix(pool): pool performance (#3565)

* perf(pool): replace hookManager RWMutex with atomic.Pointer and add predefined state slices

- Replace hookManager RWMutex with atomic.Pointer for lock-free reads in hot paths
- Add predefined state slices to avoid allocations (validFromInUse, validFromCreatedOrIdle, etc.)
- Add Clone() method to PoolHookManager for atomic updates
- Update AddPoolHook/RemovePoolHook to use copy-on-write pattern
- Update all hookManager access points to use atomic Load()

Performance improvements:
- Eliminates RWMutex contention in Get/Put/Remove hot paths
- Reduces allocations by reusing predefined state slices
- Lock-free reads allow better CPU cache utilization
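
A minimal sketch of the copy-on-write pattern described above. The names
hookManager, PoolHookManager, Clone and AddPoolHook come from this commit;
the surrounding types are simplified assumptions, not the actual go-redis
definitions.

package pool

import (
	"sync"
	"sync/atomic"
)

// Assumed, simplified shapes; the real types carry more fields.
type PoolHook interface{ OnPut(connID uint64) }

type PoolHookManager struct{ hooks []PoolHook }

// Clone copies the hook slice so writers never mutate a snapshot that
// readers may still be iterating.
func (m *PoolHookManager) Clone() *PoolHookManager {
	cp := &PoolHookManager{hooks: make([]PoolHook, len(m.hooks))}
	copy(cp.hooks, m.hooks)
	return cp
}

type ConnPool struct {
	hookManager atomic.Pointer[PoolHookManager]
	hookMu      sync.Mutex // serializes writers only; readers stay lock-free
}

// AddPoolHook is copy-on-write: build a new manager, then publish it atomically.
func (p *ConnPool) AddPoolHook(h PoolHook) {
	p.hookMu.Lock()
	defer p.hookMu.Unlock()
	next := &PoolHookManager{}
	if old := p.hookManager.Load(); old != nil {
		next = old.Clone()
	}
	next.hooks = append(next.hooks, h)
	p.hookManager.Store(next)
}

// Hot path (Get/Put/Remove): a single atomic Load, no RWMutex.
func (p *ConnPool) runOnPut(connID uint64) {
	if m := p.hookManager.Load(); m != nil {
		for _, h := range m.hooks {
			h.OnPut(connID)
		}
	}
}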

* perf(pool): eliminate mutex overhead in state machine hot path

The state machine was calling notifyWaiters() on EVERY Get/Put operation,
which acquired a mutex even when no waiters were present (the common case).

Fix: Use atomic waiterCount to check for waiters BEFORE acquiring mutex.
This eliminates mutex contention in the hot path (Get/Put operations).

Implementation:
- Added atomic.Int32 waiterCount field to ConnStateMachine
- Increment when adding waiter, decrement when removing
- Check waiterCount atomically before acquiring mutex in notifyWaiters()

Performance impact:
- Before: mutex lock/unlock on every Get/Put (even with no waiters)
- After: lock-free atomic check, only acquire mutex if waiters exist
- Expected improvement: ~30-50% for Get/Put operations
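
A sketch of that check, assuming a simplified ConnStateMachine; field names
follow the commit message, but the real struct in internal/pool is not
reproduced here.

package pool

import (
	"sync"
	"sync/atomic"
)

// Simplified sketch of the state machine's waiter bookkeeping.
type ConnStateMachine struct {
	state       atomic.Uint32 // current ConnState
	waiterCount atomic.Int32  // goroutines blocked in AwaitAndTransition
	mu          sync.Mutex    // protects waiters; only touched when waiters exist
	waiters     []chan struct{}
}

// addWaiter registers a blocked goroutine and keeps the atomic counter in sync.
func (sm *ConnStateMachine) addWaiter() chan struct{} {
	ch := make(chan struct{})
	sm.mu.Lock()
	sm.waiters = append(sm.waiters, ch)
	sm.mu.Unlock()
	sm.waiterCount.Add(1)
	return ch
}

// notifyWaiters runs after every successful transition, but in the common
// case (no waiters) it is a single atomic load and an early return.
func (sm *ConnStateMachine) notifyWaiters() {
	if sm.waiterCount.Load() == 0 {
		return // hot path: no mutex
	}
	sm.mu.Lock()
	woken := len(sm.waiters)
	for _, ch := range sm.waiters {
		close(ch)
	}
	sm.waiters = nil
	sm.mu.Unlock()
	sm.waiterCount.Add(int32(-woken))
}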

* perf(pool): use predefined state slices to eliminate allocations in hot path

The pool was creating new slice literals on EVERY Get/Put operation:
- popIdle(): []ConnState{StateCreated, StateIdle}
- putConn(): []ConnState{StateInUse}
- CompareAndSwapUsed(): []ConnState{StateIdle} and []ConnState{StateInUse}
- MarkUnusableForHandoff(): []ConnState{StateInUse, StateIdle, StateCreated}

These allocations were happening millions of times per second in the hot path.

Fix: Use predefined global slices defined in conn_state.go:
- validFromInUse
- validFromCreatedOrIdle
- validFromCreatedInUseOrIdle

Performance impact:
- Before: 4 slice allocations per Get/Put cycle
- After: 0 allocations (use predefined slices)
- Expected improvement: ~30-40% reduction in allocations and GC pressure
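
A sketch of those predefined slices; the variable names match the commit
message, while the ConnState constants mirror the states used elsewhere in
this diff (their numeric encoding here is illustrative).

package pool

// ConnState names as used throughout this diff; the encoding is illustrative.
type ConnState uint32

const (
	StateCreated ConnState = iota
	StateInitializing
	StateIdle
	StateInUse
	StateUnusable
	StateClosed
)

// Allocated once at package init and shared by every call site, instead of
// building a fresh []ConnState literal on each Get/Put.
var (
	validFromInUse              = []ConnState{StateInUse}
	validFromCreatedOrIdle      = []ConnState{StateCreated, StateIdle}
	validFromCreatedInUseOrIdle = []ConnState{StateCreated, StateInUse, StateIdle}
)

// Call sites then reuse the shared slices, e.g. (assumed signatures):
//   sm.TryTransition(validFromCreatedOrIdle, StateInUse)         // popIdle
//   sm.TryTransition(validFromInUse, StateIdle)                  // putConn
//   sm.TryTransition(validFromCreatedInUseOrIdle, StateUnusable) // MarkUnusableForHandoff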

* perf(pool): optimize TryTransition to reduce atomic operations

Further optimize the hot path by:
1. Remove redundant GetState() call in the loop
2. Only check waiterCount after successful CAS (not before loop)
3. Inline the waiterCount check to avoid notifyWaiters() call overhead

This reduces atomic operations from 4-5 per Get/Put to 2-3:
- Before: GetState() + CAS + waiterCount.Load() + notifyWaiters mutex check
- After: CAS + waiterCount.Load() (only if CAS succeeds)

Performance impact:
- Eliminates 1-2 atomic operations per Get/Put
- Expected improvement: ~10-15% for Get/Put operations

* perf(pool): add fast path for Get/Put to match master performance

Introduced TryTransitionFast() for the hot path (Get/Put operations):
- Single CAS operation (same as master's atomic bool)
- No waiter notification overhead
- No loop through valid states
- No error allocation

Hot path flow:
1. popIdle(): Try IDLE → IN_USE (fast), fallback to CREATED → IN_USE
2. putConn(): Try IN_USE → IDLE (fast)

This matches master's performance while preserving state machine for:
- Background operations (handoff/reauth use UNUSABLE state)
- State validation (TryTransition still available)
- Waiter notification (AwaitAndTransition for blocking)

Performance comparison per Get/Put cycle:
- Master: 2 atomic CAS operations
- State machine (before): 5 atomic operations (2.5x slower)
- State machine (after): 2 atomic CAS operations (same as master!)

Expected improvement: Restore to baseline ~11,373 ops/sec
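
Continuing the simplified ConnStateMachine sketch above, the fast path boils
down to a single compare-and-swap; the real TryTransitionFast may differ in
detail.

// TryTransitionFast: one CAS, no waiter notification, no state loop, no
// error allocation. Callers fall back to the full TryTransition (or to
// AwaitAndTransition) when it returns false.
func (sm *ConnStateMachine) TryTransitionFast(from, to ConnState) bool {
	return sm.state.CompareAndSwap(uint32(from), uint32(to))
}

// Assumed hot-path call sites, as described above:
//
//   popIdle: IDLE → IN_USE first, then CREATED → IN_USE
//     if !sm.TryTransitionFast(StateIdle, StateInUse) &&
//         !sm.TryTransitionFast(StateCreated, StateInUse) {
//         // slow path: full TryTransition with proper error reporting
//     }
//
//   putConn: IN_USE → IDLE
//     if !sm.TryTransitionFast(StateInUse, StateIdle) {
//         // conn is UNUSABLE (handoff/re-auth) or CLOSED; don't pool it as-is
//     }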

* combine cas

* fix linter

* try faster approach

* fast semaphore

* better inlining for hot path

* fix linter issues

* use new semaphore in auth as well

* linter should be happy now

* add comments

* Update internal/pool/conn_state.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* address comment

* slight reordering

* try to cache time for non-critical calculations

* fix wrong benchmark

* add concurrent test

* fix benchmark report

* add additional expect to check output

* comment and variable rename

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* initConn sets IDLE state

- Handle unexpected conn state changes
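
For reference, a condensed sketch of the lifecycle this bullet refers to,
using the same calls the new tests below exercise; the real initConn also
performs the handshake and authentication steps omitted here.

package sketch

import "github.com/redis/go-redis/v9/internal/pool"

// Assumed shape of the initialization path: CREATED → INITIALIZING → IDLE,
// with IDLE ⇄ IN_USE during normal use and UNUSABLE reserved for background
// work such as handoff and re-auth.
func initConnSketch(cn *pool.Conn) {
	sm := cn.GetStateMachine()
	sm.Transition(pool.StateInitializing) // connection setup has started
	// ... RESP handshake, AUTH/HELLO, push-notification setup would run here ...
	sm.Transition(pool.StateIdle) // on success, initConn leaves the conn IDLE
}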

* fix precision of time cache and usedAt

* allow e2e tests to run longer

* Fix broken initialization of idle connections

* optimize push notif

* 100ms -> 50ms

* use correct timer for last health check

* verify pass auth on conn creation

* fix assertion

* fix unsafe test

* fix benchmark test

* improve remove conn

* re doesn't support requirepass

* wait more in e2e test

* flaky test

* add missing method to interface

* fix test assertions

* silence logs and faster hooks manager

* address linter comment

* fix flaky test

* use read instead of control

* use pool size for semsize

* CAS instead of reading the state

* preallocate errors and states

* preallocate state slices

* fix flaky test

* fix fast semaphore that could have been starved

* try to fix the semaphore

* should properly notify the waiters

- this way a waiter that times out at the same time
a releaser is releasing won't discard the token; the releaser
will fail to notify it and will pick another waiter instead.

this hybrid approach should be faster than channels and maintains FIFO ordering

* waiter may double-release (if closed/times out)

* priority of operations

* use simple approach of fifo waiters

* use simple channel based semaphores
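
The shape the code finally settled on is a plain channel-backed semaphore; a
minimal sketch is below. NewFastSemaphore, AcquireBlocking and Release match
how the re-auth hook in this diff uses internal.FastSemaphore, while the
context-aware acquire is a hypothetical extra shown only to illustrate
timeout handling.

package internal

import "context"

// FastSemaphore sketch: a counting semaphore backed by a buffered channel.
// Holding a slot means having an element in the channel.
type FastSemaphore struct {
	slots chan struct{}
}

func NewFastSemaphore(n int32) *FastSemaphore {
	return &FastSemaphore{slots: make(chan struct{}, n)}
}

// AcquireBlocking waits until a slot is free.
func (s *FastSemaphore) AcquireBlocking() {
	s.slots <- struct{}{}
}

// AcquireContext is a hypothetical helper: give up when ctx is done.
func (s *FastSemaphore) AcquireContext(ctx context.Context) error {
	select {
	case s.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Release frees a slot; it must be called exactly once per successful acquire,
// which is why the waiter/releaser double-release cases above needed care.
func (s *FastSemaphore) Release() {
	<-s.slots
}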

* address linter and tests

* remove unused benchs

* change log message

* address pr comments

* address pr comments

* fix data race

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Nedyalko Dyakov
2025-11-11 17:38:29 +02:00
committed by GitHub
parent 0f83314750
commit 042610b79d
38 changed files with 3221 additions and 569 deletions


@@ -91,6 +91,7 @@ func (m *mockPooler) CloseConn(*pool.Conn) error { return n
func (m *mockPooler) Get(ctx context.Context) (*pool.Conn, error) { return nil, nil }
func (m *mockPooler) Put(ctx context.Context, conn *pool.Conn) {}
func (m *mockPooler) Remove(ctx context.Context, conn *pool.Conn, reason error) {}
func (m *mockPooler) RemoveWithoutTurn(ctx context.Context, conn *pool.Conn, reason error) {}
func (m *mockPooler) Len() int { return 0 }
func (m *mockPooler) IdleLen() int { return 0 }
func (m *mockPooler) Stats() *pool.Stats { return &pool.Stats{} }


@@ -34,9 +34,10 @@ type ReAuthPoolHook struct {
shouldReAuth map[uint64]func(error)
shouldReAuthLock sync.RWMutex
// workers is a semaphore channel limiting concurrent re-auth operations
// workers is a semaphore limiting concurrent re-auth operations
// Initialized with poolSize tokens to prevent pool exhaustion
workers chan struct{}
// Uses FastSemaphore for better performance with eventual fairness
workers *internal.FastSemaphore
// reAuthTimeout is the maximum time to wait for acquiring a connection for re-auth
reAuthTimeout time.Duration
@@ -59,16 +60,10 @@ type ReAuthPoolHook struct {
// The poolSize parameter is used to initialize the worker semaphore, ensuring that
// re-auth operations don't exhaust the connection pool.
func NewReAuthPoolHook(poolSize int, reAuthTimeout time.Duration) *ReAuthPoolHook {
workers := make(chan struct{}, poolSize)
// Initialize the workers channel with tokens (semaphore pattern)
for i := 0; i < poolSize; i++ {
workers <- struct{}{}
}
return &ReAuthPoolHook{
shouldReAuth: make(map[uint64]func(error)),
scheduledReAuth: make(map[uint64]bool),
workers: workers,
workers: internal.NewFastSemaphore(int32(poolSize)),
reAuthTimeout: reAuthTimeout,
}
}
@@ -162,10 +157,10 @@ func (r *ReAuthPoolHook) OnPut(_ context.Context, conn *pool.Conn) (bool, bool,
r.scheduledLock.Unlock()
r.shouldReAuthLock.Unlock()
go func() {
<-r.workers
r.workers.AcquireBlocking()
// safety first
if conn == nil || (conn != nil && conn.IsClosed()) {
r.workers <- struct{}{}
r.workers.Release()
return
}
defer func() {
@@ -176,44 +171,31 @@ func (r *ReAuthPoolHook) OnPut(_ context.Context, conn *pool.Conn) (bool, bool,
r.scheduledLock.Lock()
delete(r.scheduledReAuth, connID)
r.scheduledLock.Unlock()
r.workers <- struct{}{}
r.workers.Release()
}()
var err error
timeout := time.After(r.reAuthTimeout)
// Create timeout context for connection acquisition
// This prevents indefinite waiting if the connection is stuck
ctx, cancel := context.WithTimeout(context.Background(), r.reAuthTimeout)
defer cancel()
// Try to acquire the connection
// We need to ensure the connection is both Usable and not Used
// to prevent data races with concurrent operations
const baseDelay = 10 * time.Microsecond
acquired := false
attempt := 0
for !acquired {
select {
case <-timeout:
// Timeout occurred, cannot acquire connection
err = pool.ErrConnUnusableTimeout
reAuthFn(err)
return
default:
// Try to acquire: set Usable=false, then check Used
if conn.CompareAndSwapUsable(true, false) {
if !conn.IsUsed() {
acquired = true
} else {
// Release Usable and retry with exponential backoff
// todo(ndyakov): think of a better way to do this without the need
// to release the connection, but just wait till it is not used
conn.SetUsable(true)
}
}
if !acquired {
// Exponential backoff: 10, 20, 40, 80... up to 5120 microseconds
delay := baseDelay * time.Duration(1<<uint(attempt%10)) // Cap exponential growth
time.Sleep(delay)
attempt++
}
}
// Try to acquire the connection for re-authentication
// We need to ensure the connection is IDLE (not IN_USE) before transitioning to UNUSABLE
// This prevents re-authentication from interfering with active commands
// Use AwaitAndTransition to wait for the connection to become IDLE
stateMachine := conn.GetStateMachine()
if stateMachine == nil {
// No state machine - should not happen, but handle gracefully
reAuthFn(pool.ErrConnUnusableTimeout)
return
}
// Use predefined slice to avoid allocation
_, err := stateMachine.AwaitAndTransition(ctx, pool.ValidFromIdle(), pool.StateUnusable)
if err != nil {
// Timeout or other error occurred, cannot acquire connection
reAuthFn(err)
return
}
// safety first
@@ -222,8 +204,8 @@ func (r *ReAuthPoolHook) OnPut(_ context.Context, conn *pool.Conn) (bool, bool,
reAuthFn(nil)
}
// Release the connection
conn.SetUsable(true)
// Release the connection: transition from UNUSABLE back to IDLE
stateMachine.Transition(pool.StateIdle)
}()
}


@@ -0,0 +1,241 @@
package streaming
import (
"sync"
"sync/atomic"
"testing"
"time"
"github.com/redis/go-redis/v9/internal/pool"
)
// TestReAuthOnlyWhenIdle verifies that re-authentication only happens when
// a connection is in IDLE state, not when it's IN_USE.
func TestReAuthOnlyWhenIdle(t *testing.T) {
// Create a connection
cn := pool.NewConn(nil)
// Initialize to IDLE state
cn.GetStateMachine().Transition(pool.StateInitializing)
cn.GetStateMachine().Transition(pool.StateIdle)
// Simulate connection being acquired (IDLE → IN_USE)
if !cn.CompareAndSwapUsed(false, true) {
t.Fatal("Failed to acquire connection")
}
// Verify state is IN_USE
if state := cn.GetStateMachine().GetState(); state != pool.StateInUse {
t.Errorf("Expected state IN_USE, got %s", state)
}
// Try to transition to UNUSABLE (for reauth) - should fail
_, err := cn.GetStateMachine().TryTransition([]pool.ConnState{pool.StateIdle}, pool.StateUnusable)
if err == nil {
t.Error("Expected error when trying to transition IN_USE → UNUSABLE, but got none")
}
// Verify state is still IN_USE
if state := cn.GetStateMachine().GetState(); state != pool.StateInUse {
t.Errorf("Expected state to remain IN_USE, got %s", state)
}
// Release connection (IN_USE → IDLE)
if !cn.CompareAndSwapUsed(true, false) {
t.Fatal("Failed to release connection")
}
// Verify state is IDLE
if state := cn.GetStateMachine().GetState(); state != pool.StateIdle {
t.Errorf("Expected state IDLE, got %s", state)
}
// Now try to transition to UNUSABLE - should succeed
_, err = cn.GetStateMachine().TryTransition([]pool.ConnState{pool.StateIdle}, pool.StateUnusable)
if err != nil {
t.Errorf("Failed to transition IDLE → UNUSABLE: %v", err)
}
// Verify state is UNUSABLE
if state := cn.GetStateMachine().GetState(); state != pool.StateUnusable {
t.Errorf("Expected state UNUSABLE, got %s", state)
}
}
// TestReAuthWaitsForConnectionToBeIdle verifies that the re-auth worker
// waits for a connection to become IDLE before performing re-authentication.
func TestReAuthWaitsForConnectionToBeIdle(t *testing.T) {
// Create a connection
cn := pool.NewConn(nil)
// Initialize to IDLE state
cn.GetStateMachine().Transition(pool.StateInitializing)
cn.GetStateMachine().Transition(pool.StateIdle)
// Simulate connection being acquired (IDLE → IN_USE)
if !cn.CompareAndSwapUsed(false, true) {
t.Fatal("Failed to acquire connection")
}
// Track re-auth attempts
var reAuthAttempts atomic.Int32
var reAuthSucceeded atomic.Bool
// Start a goroutine that tries to acquire the connection for re-auth
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
// Try to acquire for re-auth with timeout
timeout := time.After(2 * time.Second)
acquired := false
for !acquired {
select {
case <-timeout:
t.Error("Timeout waiting to acquire connection for re-auth")
return
default:
reAuthAttempts.Add(1)
// Try to atomically transition from IDLE to UNUSABLE
_, err := cn.GetStateMachine().TryTransition([]pool.ConnState{pool.StateIdle}, pool.StateUnusable)
if err == nil {
// Successfully acquired
acquired = true
reAuthSucceeded.Store(true)
} else {
// Connection is still IN_USE, wait a bit
time.Sleep(10 * time.Millisecond)
}
}
}
// Release the connection
cn.GetStateMachine().Transition(pool.StateIdle)
}()
// Keep connection IN_USE for 500ms
time.Sleep(500 * time.Millisecond)
// Verify re-auth hasn't succeeded yet (connection is still IN_USE)
if reAuthSucceeded.Load() {
t.Error("Re-auth succeeded while connection was IN_USE")
}
// Verify there were multiple attempts
attempts := reAuthAttempts.Load()
if attempts < 2 {
t.Errorf("Expected multiple re-auth attempts, got %d", attempts)
}
// Release connection (IN_USE → IDLE)
if !cn.CompareAndSwapUsed(true, false) {
t.Fatal("Failed to release connection")
}
// Wait for re-auth to complete
wg.Wait()
// Verify re-auth succeeded after connection became IDLE
if !reAuthSucceeded.Load() {
t.Error("Re-auth did not succeed after connection became IDLE")
}
// Verify final state is IDLE
if state := cn.GetStateMachine().GetState(); state != pool.StateIdle {
t.Errorf("Expected final state IDLE, got %s", state)
}
}
// TestConcurrentReAuthAndUsage verifies that re-auth and normal usage
// don't interfere with each other.
func TestConcurrentReAuthAndUsage(t *testing.T) {
// Create a connection
cn := pool.NewConn(nil)
// Initialize to IDLE state
cn.GetStateMachine().Transition(pool.StateInitializing)
cn.GetStateMachine().Transition(pool.StateIdle)
var wg sync.WaitGroup
var usageCount atomic.Int32
var reAuthCount atomic.Int32
// Goroutine 1: Simulate normal usage (acquire/release)
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 100; i++ {
// Try to acquire
if cn.CompareAndSwapUsed(false, true) {
usageCount.Add(1)
// Simulate work
time.Sleep(1 * time.Millisecond)
// Release
cn.CompareAndSwapUsed(true, false)
}
time.Sleep(1 * time.Millisecond)
}
}()
// Goroutine 2: Simulate re-auth attempts
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 50; i++ {
// Try to acquire for re-auth
_, err := cn.GetStateMachine().TryTransition([]pool.ConnState{pool.StateIdle}, pool.StateUnusable)
if err == nil {
reAuthCount.Add(1)
// Simulate re-auth work
time.Sleep(2 * time.Millisecond)
// Release
cn.GetStateMachine().Transition(pool.StateIdle)
}
time.Sleep(2 * time.Millisecond)
}
}()
wg.Wait()
// Verify both operations happened
if usageCount.Load() == 0 {
t.Error("No successful usage operations")
}
if reAuthCount.Load() == 0 {
t.Error("No successful re-auth operations")
}
t.Logf("Usage operations: %d, Re-auth operations: %d", usageCount.Load(), reAuthCount.Load())
// Verify final state is IDLE
if state := cn.GetStateMachine().GetState(); state != pool.StateIdle {
t.Errorf("Expected final state IDLE, got %s", state)
}
}
// TestReAuthRespectsClosed verifies that re-auth doesn't happen on closed connections.
func TestReAuthRespectsClosed(t *testing.T) {
// Create a connection
cn := pool.NewConn(nil)
// Initialize to IDLE state
cn.GetStateMachine().Transition(pool.StateInitializing)
cn.GetStateMachine().Transition(pool.StateIdle)
// Close the connection
cn.GetStateMachine().Transition(pool.StateClosed)
// Try to transition to UNUSABLE - should fail
_, err := cn.GetStateMachine().TryTransition([]pool.ConnState{pool.StateIdle}, pool.StateUnusable)
if err == nil {
t.Error("Expected error when trying to transition CLOSED → UNUSABLE, but got none")
}
// Verify state is still CLOSED
if state := cn.GetStateMachine().GetState(); state != pool.StateClosed {
t.Errorf("Expected state to remain CLOSED, got %s", state)
}
}