Mirror of https://github.com/redis/go-redis.git (synced 2025-11-27 18:21:10 +03:00)
package streaming

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9/internal"
	"github.com/redis/go-redis/v9/internal/pool"
)

// ReAuthPoolHook is a pool hook that manages background re-authentication of connections
// when credentials change via a streaming credentials provider.
//
// The hook uses a semaphore-based worker pool to limit concurrent re-authentication
// operations and prevent pool exhaustion. When credentials change, connections are
// marked for re-authentication and processed asynchronously in the background.
//
// The re-authentication process:
// 1. OnPut: When a connection is returned to the pool, check if it needs re-auth
// 2. If yes, schedule it for background processing (move from shouldReAuth to scheduledReAuth)
// 3. A worker goroutine acquires the connection (waits until it's not in use)
// 4. Executes the re-auth function while holding the connection
// 5. Releases the connection back to the pool
//
// The hook ensures that:
// - Only one re-auth operation runs per connection at a time
// - Connections are not used for commands during re-authentication
// - Re-auth operations time out if they can't acquire the connection
// - Resources are properly cleaned up on connection removal
type ReAuthPoolHook struct {
	// shouldReAuth maps connection ID to re-auth function
	// Connections in this map need re-authentication but haven't been scheduled yet
	shouldReAuth     map[uint64]func(error)
	shouldReAuthLock sync.RWMutex

	// workers is a semaphore limiting concurrent re-auth operations
	// Initialized with poolSize tokens to prevent pool exhaustion
	// Uses FastSemaphore for better performance with eventual fairness
	workers *internal.FastSemaphore

	// reAuthTimeout is the maximum time to wait for acquiring a connection for re-auth
	reAuthTimeout time.Duration

	// scheduledReAuth maps connection ID to scheduled status
	// Connections in this map have a background worker attempting re-authentication
	scheduledReAuth map[uint64]bool
	scheduledLock   sync.RWMutex

	// manager is a back-reference for cleanup operations
	manager *Manager
}

// NewReAuthPoolHook creates a new re-authentication pool hook.
//
// Parameters:
// - poolSize: Maximum number of concurrent re-auth operations (typically matches pool size)
// - reAuthTimeout: Maximum time to wait for acquiring a connection for re-authentication
//
// The poolSize parameter is used to initialize the worker semaphore, ensuring that
// re-auth operations don't exhaust the connection pool.
func NewReAuthPoolHook(poolSize int, reAuthTimeout time.Duration) *ReAuthPoolHook {
	return &ReAuthPoolHook{
		shouldReAuth:    make(map[uint64]func(error)),
		scheduledReAuth: make(map[uint64]bool),
		workers:         internal.NewFastSemaphore(int32(poolSize)),
		reAuthTimeout:   reAuthTimeout,
	}
}

// MarkForReAuth marks a connection for re-authentication.
//
// This method is called when credentials change and a connection needs to be
// re-authenticated. The actual re-authentication happens asynchronously when
// the connection is returned to the pool (in OnPut).
//
// Parameters:
// - connID: The connection ID to mark for re-authentication
// - reAuthFn: Function to call for re-authentication, receives an error if acquisition fails
//
// Thread-safe: Can be called concurrently from multiple goroutines.
func (r *ReAuthPoolHook) MarkForReAuth(connID uint64, reAuthFn func(error)) {
	r.shouldReAuthLock.Lock()
	defer r.shouldReAuthLock.Unlock()
	r.shouldReAuth[connID] = reAuthFn
}
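
// The sketch below is an editorial illustration, not part of the original
// file: it shows how a caller (in practice the streaming credentials Manager)
// might construct the hook and mark a connection for re-authentication when
// new credentials arrive. The connID value and the body of reAuthFn are
// hypothetical; only NewReAuthPoolHook, MarkForReAuth, and the func(error)
// callback contract come from this file.
func exampleMarkForReAuth(connID uint64) *ReAuthPoolHook {
	// One worker token per pool connection keeps background re-auth from
	// exhausting the pool.
	hook := NewReAuthPoolHook(10, 5*time.Second)

	hook.MarkForReAuth(connID, func(err error) {
		if err != nil {
			// The worker could not acquire the connection within
			// reAuthTimeout (or another error occurred); the real callback
			// decides whether to retry or drop the connection.
			internal.Logger.Printf(context.Background(), "re-auth not performed for conn %d: %v", connID, err)
			return
		}
		// err == nil: the worker holds the connection in the UNUSABLE state;
		// this is where the real callback would send AUTH/HELLO.
	})
	return hook
}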

// OnGet is called when a connection is retrieved from the pool.
//
// This hook checks if the connection needs re-authentication or has a scheduled
// re-auth operation. If so, it rejects the connection (returns accept=false),
// causing the pool to try another connection.
//
// Returns:
// - accept: false if the connection needs re-auth, true otherwise
// - err: always nil (errors are not used in this hook)
//
// Thread-safe: Called concurrently by multiple goroutines getting connections.
func (r *ReAuthPoolHook) OnGet(_ context.Context, conn *pool.Conn, _ bool) (accept bool, err error) {
	connID := conn.GetID()
	r.shouldReAuthLock.RLock()
	_, shouldReAuth := r.shouldReAuth[connID]
	r.shouldReAuthLock.RUnlock()
	// This connection was marked for reauth while in the pool,
	// reject the connection
	if shouldReAuth {
		// simply reject the connection, it will be re-authenticated in OnPut
		return false, nil
	}
	r.scheduledLock.RLock()
	_, hasScheduled := r.scheduledReAuth[connID]
	r.scheduledLock.RUnlock()
	// has a scheduled reauth, reject the connection
	if hasScheduled {
		// simply reject the connection, it currently has a reauth scheduled
		// and the worker is waiting for a slot to execute the reauth
		return false, nil
	}
	return true, nil
}
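
// Editorial illustration (not go-redis pool code): what the OnGet contract
// implies for a pool's Get path. A rejected connection simply stays in the
// pool (the background worker will re-auth it) and the next candidate is
// tried. The idle slice stands in for the pool's real idle list.
func examplePickUsable(ctx context.Context, r *ReAuthPoolHook, idle []*pool.Conn) *pool.Conn {
	for _, cn := range idle {
		if accept, err := r.OnGet(ctx, cn, false); err == nil && accept {
			return cn
		}
		// Marked or scheduled for re-auth: skip it and leave it in the pool.
	}
	// No usable idle connection; a real pool would dial a new one.
	return nil
}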

// OnPut is called when a connection is returned to the pool.
//
// This hook checks if the connection needs re-authentication. If so, it schedules
// a background goroutine to perform the re-auth asynchronously. The goroutine:
// 1. Waits for a worker slot (semaphore)
// 2. Acquires the connection (waits until not in use)
// 3. Executes the re-auth function
// 4. Releases the connection and worker slot
//
// The connection is always pooled (not removed) since re-auth happens in the background.
//
// Returns:
// - shouldPool: always true (connection stays in pool during background re-auth)
// - shouldRemove: always false
// - err: always nil
//
// Thread-safe: Called concurrently by multiple goroutines returning connections.
func (r *ReAuthPoolHook) OnPut(_ context.Context, conn *pool.Conn) (bool, bool, error) {
	if conn == nil {
		// noop
		return true, false, nil
	}
	connID := conn.GetID()
	// Check if reauth is needed and get the function with proper locking
	r.shouldReAuthLock.RLock()
	reAuthFn, ok := r.shouldReAuth[connID]
	r.shouldReAuthLock.RUnlock()

	if ok {
		// Acquire both locks to atomically move from shouldReAuth to scheduledReAuth
		// This prevents race conditions where OnGet might miss the transition
		r.shouldReAuthLock.Lock()
		r.scheduledLock.Lock()
		r.scheduledReAuth[connID] = true
		delete(r.shouldReAuth, connID)
		r.scheduledLock.Unlock()
		r.shouldReAuthLock.Unlock()
		go func() {
			r.workers.AcquireBlocking()
			// safety first
			if conn == nil || conn.IsClosed() {
				r.workers.Release()
				return
			}
			defer func() {
				if rec := recover(); rec != nil {
					// once again - safety first
					internal.Logger.Printf(context.Background(), "panic in reauth worker: %v", rec)
				}
				r.scheduledLock.Lock()
				delete(r.scheduledReAuth, connID)
				r.scheduledLock.Unlock()
				r.workers.Release()
			}()

			// Create a timeout context for connection acquisition
			// This prevents indefinite waiting if the connection is stuck
			ctx, cancel := context.WithTimeout(context.Background(), r.reAuthTimeout)
			defer cancel()

			// Try to acquire the connection for re-authentication
			// We need to ensure the connection is IDLE (not IN_USE) before transitioning to UNUSABLE
			// This prevents re-authentication from interfering with active commands
			// Use AwaitAndTransition to wait for the connection to become IDLE
			stateMachine := conn.GetStateMachine()
			if stateMachine == nil {
				// No state machine - should not happen, but handle gracefully
				reAuthFn(pool.ErrConnUnusableTimeout)
				return
			}

			// Use predefined slice to avoid allocation
			_, err := stateMachine.AwaitAndTransition(ctx, pool.ValidFromIdle(), pool.StateUnusable)
			if err != nil {
				// Timeout or other error occurred, cannot acquire connection
				reAuthFn(err)
				return
			}

			// safety first
			if !conn.IsClosed() {
				// Successfully acquired the connection, perform reauth
				reAuthFn(nil)
			}

			// Release the connection: transition from UNUSABLE back to IDLE
			stateMachine.Transition(pool.StateIdle)
		}()
	}

	// the reauth will happen in the background; as far as the pool is concerned:
	// pool the connection, don't remove it, no error
	return true, false, nil
}
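
// Editorial illustration (not part of the original file): the
// acquire/execute/release pattern the background worker above follows. A
// caller that needs exclusive access to an otherwise idle connection waits
// for the IDLE -> UNUSABLE transition, does its work, and then returns the
// connection to IDLE so the pool can hand it out again. The work callback is
// hypothetical; the state-machine calls mirror the ones used in OnPut.
func exampleExclusiveUse(ctx context.Context, conn *pool.Conn, work func() error) error {
	sm := conn.GetStateMachine()
	if sm == nil {
		// Should not happen; mirrors the fallback used by the hook above.
		return pool.ErrConnUnusableTimeout
	}
	// Wait until the connection is IDLE, then mark it UNUSABLE so the pool
	// will not hand it out while work runs.
	if _, err := sm.AwaitAndTransition(ctx, pool.ValidFromIdle(), pool.StateUnusable); err != nil {
		return err
	}
	// Always release the connection back to IDLE, even if work fails.
	defer sm.Transition(pool.StateIdle)
	return work()
}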

// OnRemove is called when a connection is removed from the pool.
//
// This hook cleans up all state associated with the connection:
// - Removes from shouldReAuth map (pending re-auth)
// - Removes from scheduledReAuth map (active re-auth)
// - Removes credentials listener from manager
//
// This prevents memory leaks and ensures that removed connections don't have
// lingering re-auth operations or listeners.
//
// Thread-safe: Called when connections are removed due to errors, timeouts, or pool closure.
func (r *ReAuthPoolHook) OnRemove(_ context.Context, conn *pool.Conn, _ error) {
	connID := conn.GetID()
	r.shouldReAuthLock.Lock()
	r.scheduledLock.Lock()
	delete(r.scheduledReAuth, connID)
	delete(r.shouldReAuth, connID)
	r.scheduledLock.Unlock()
	r.shouldReAuthLock.Unlock()
	if r.manager != nil {
		r.manager.RemoveListener(connID)
	}
}

var _ pool.PoolHook = (*ReAuthPoolHook)(nil)