Mirror of https://github.com/redis/go-redis.git, synced 2025-11-26 06:23:09 +03:00
* wip
* wip, used and unusable states
* polish state machine
* correct handling of OnPut
* better errors for tests, hook should work now
* fix linter
* improve reauth state management, fix tests
* Update internal/pool/conn.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update internal/pool/conn.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* better timeouts
* empty endpoint handoff case
* fix handoff state when queued for handoff
* try to detect the deadlock
* try to detect the deadlock x2
* delete should be called
* improve tests
* fix mark on uninitialized connection
* Update internal/pool/conn_state_test.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update internal/pool/conn_state_test.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update internal/pool/pool.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update internal/pool/conn_state.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* Update internal/pool/conn.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* fix error from copilot
* address copilot comment
* fix(pool): pool performance (#3565)
* perf(pool): replace hookManager RWMutex with atomic.Pointer and add predefined state slices
  - Replace hookManager RWMutex with atomic.Pointer for lock-free reads in hot paths
  - Add predefined state slices to avoid allocations (validFromInUse, validFromCreatedOrIdle, etc.)
  - Add Clone() method to PoolHookManager for atomic updates
  - Update AddPoolHook/RemovePoolHook to use a copy-on-write pattern
  - Update all hookManager access points to use atomic Load()
  Performance improvements:
  - Eliminates RWMutex contention in Get/Put/Remove hot paths
  - Reduces allocations by reusing predefined state slices
  - Lock-free reads allow better CPU cache utilization
* perf(pool): eliminate mutex overhead in state machine hot path
  The state machine was calling notifyWaiters() on EVERY Get/Put operation, which acquired a mutex even when no waiters were present (the common case).
  Fix: use an atomic waiterCount to check for waiters BEFORE acquiring the mutex. This eliminates mutex contention in the hot path (Get/Put operations).
  Implementation:
  - Added an atomic.Int32 waiterCount field to ConnStateMachine
  - Increment when adding a waiter, decrement when removing one
  - Check waiterCount atomically before acquiring the mutex in notifyWaiters()
  Performance impact:
  - Before: mutex lock/unlock on every Get/Put (even with no waiters)
  - After: lock-free atomic check; only acquire the mutex if waiters exist
  - Expected improvement: ~30-50% for Get/Put operations
* perf(pool): use predefined state slices to eliminate allocations in hot path
  The pool was creating new slice literals on EVERY Get/Put operation:
  - popIdle(): []ConnState{StateCreated, StateIdle}
  - putConn(): []ConnState{StateInUse}
  - CompareAndSwapUsed(): []ConnState{StateIdle} and []ConnState{StateInUse}
  - MarkUnusableForHandoff(): []ConnState{StateInUse, StateIdle, StateCreated}
  These allocations were happening millions of times per second in the hot path.
  Fix: use predefined global slices defined in conn_state.go:
  - validFromInUse
  - validFromCreatedOrIdle
  - validFromCreatedInUseOrIdle
  Performance impact:
  - Before: 4 slice allocations per Get/Put cycle
  - After: 0 allocations (predefined slices)
  - Expected improvement: ~30-40% reduction in allocations and GC pressure
* perf(pool): optimize TryTransition to reduce atomic operations
  Further optimize the hot path by:
  1. Removing the redundant GetState() call in the loop
  2. Only checking waiterCount after a successful CAS (not before the loop)
  3. Inlining the waiterCount check to avoid notifyWaiters() call overhead
  This reduces atomic operations from 4-5 per Get/Put to 2-3:
  - Before: GetState() + CAS + waiterCount.Load() + notifyWaiters mutex check
  - After: CAS + waiterCount.Load() (only if the CAS succeeds)
  Performance impact:
  - Eliminates 1-2 atomic operations per Get/Put
  - Expected improvement: ~10-15% for Get/Put operations
* perf(pool): add fast path for Get/Put to match master performance
  Introduced TryTransitionFast() for the hot path (Get/Put operations):
  - Single CAS operation (same as master's atomic bool)
  - No waiter notification overhead
  - No loop through valid states
  - No error allocation
  Hot path flow:
  1. popIdle(): try IDLE → IN_USE (fast), fall back to CREATED → IN_USE
  2. putConn(): try IN_USE → IDLE (fast)
  This matches master's performance while preserving the state machine for:
  - Background operations (handoff/reauth use the UNUSABLE state)
  - State validation (TryTransition still available)
  - Waiter notification (AwaitAndTransition for blocking)
  Performance comparison per Get/Put cycle:
  - Master: 2 atomic CAS operations
  - State machine (before): 5 atomic operations (2.5x slower)
  - State machine (after): 2 atomic CAS operations (same as master!)
  Expected improvement: restore to baseline ~11,373 ops/sec
* combine CAS
* fix linter
* try faster approach
* fast semaphore
* better inlining for hot path
* fix linter issues
* use new semaphore in auth as well
* linter should be happy now
* add comments
* Update internal/pool/conn_state.go (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
* address comment
* slight reordering
* try to cache time for non-critical calculation
* fix wrong benchmark
* add concurrent test
* fix benchmark report
* add additional expect to check output
* comment and variable rename
---
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* initConn sets IDLE state; handle unexpected conn state changes
* fix precision of time cache and usedAt
* allow e2e tests to run longer
* fix broken initialization of idle connections
* optimize push notif
* 100ms -> 50ms
* use correct timer for last health check
* verify pass auth on conn creation
* fix assertion
* fix unsafe test
* fix benchmark test
* improve remove conn
* re doesn't support requirepass
* wait more in e2e test
* flaky test
* add missed method in interface
* fix test assertions
* silence logs and faster hooks manager
* address linter comment
* fix flaky test
* use read instead of control
* use pool size for semsize
* CAS instead of reading the state
* preallocate errors and states
* preallocate state slices
* fix flaky test
* fix fast semaphore that could have been starved
* try to fix the semaphore
* should properly notify the waiters: this way a waiter that times out at the same moment a releaser is releasing won't discard the token; the releaser will fail to notify it and will pick another waiter. This hybrid approach should be faster than channels and maintains FIFO order.
* waiter may double-release (if closed/times out)
* priority of operations
* use simple approach of FIFO waiters
* use simple channel-based semaphores
* address linter and tests
* remove unused benchmarks
* change log message
* address PR comments
* address PR comments
* fix data race
---
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
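The copy-on-write hookManager update described in the perf commits above can be sketched roughly as follows. All names here (`Hook`, `hookManager`, `poolLike`) are illustrative stand-ins, not the actual go-redis types; the point is the pattern: readers `Load()` the pointer lock-free, writers `Clone()`, modify the copy, and swap it in.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hook is a stand-in for pool.PoolHook.
type Hook interface {
	Name() string
}

type namedHook string

func (h namedHook) Name() string { return string(h) }

// hookManager holds an immutable slice of hooks.
type hookManager struct {
	hooks []Hook
}

// Clone returns a copy so the published slice is never mutated in place.
func (m *hookManager) Clone() *hookManager {
	c := &hookManager{hooks: make([]Hook, len(m.hooks))}
	copy(c.hooks, m.hooks)
	return c
}

// poolLike keeps the manager behind an atomic.Pointer instead of an RWMutex.
type poolLike struct {
	mgr atomic.Pointer[hookManager]
}

// AddHook is the slow path: clone, append, publish via CAS (copy-on-write).
func (p *poolLike) AddHook(h Hook) {
	for {
		old := p.mgr.Load()
		next := old.Clone()
		next.hooks = append(next.hooks, h)
		if p.mgr.CompareAndSwap(old, next) {
			return
		}
	}
}

// hookNames is the hot path: a single lock-free Load, no mutex.
func (p *poolLike) hookNames() []string {
	m := p.mgr.Load()
	names := make([]string, 0, len(m.hooks))
	for _, h := range m.hooks {
		names = append(names, h.Name())
	}
	return names
}

func main() {
	p := &poolLike{}
	p.mgr.Store(&hookManager{})
	p.AddHook(namedHook("maintnotifications"))
	p.AddHook(namedHook("reauth"))
	fmt.Println(p.hookNames()) // prints the two registered hook names
}
```

Writers pay for the clone and a CAS retry loop, which is acceptable because hooks are added or removed rarely; readers on the Get/Put path pay only one atomic load.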
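The fast-path state transition and the predefined state slices from the commits above can be combined into one sketch. The state names and method shapes below are illustrative (the real ones live in internal/pool/conn_state.go and may differ): one CAS for the common case, a mutex touched only when the atomic waiterCount says someone is waiting, and package-level slices instead of per-call slice literals.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Illustrative connection states.
type ConnState uint32

const (
	StateCreated ConnState = iota
	StateIdle
	StateInUse
	StateUnusable
)

// Predefined transition sets, allocated once at package init instead of
// as slice literals on every Get/Put (mirroring validFromInUse etc.).
var (
	validFromInUse         = []ConnState{StateInUse}
	validFromCreatedOrIdle = []ConnState{StateCreated, StateIdle}
)

// ConnStateMachine sketch: one atomic state word plus waiter
// bookkeeping that the hot path can skip entirely.
type ConnStateMachine struct {
	state       atomic.Uint32
	waiterCount atomic.Int32
	mu          sync.Mutex
	waiters     []chan struct{}
}

// TryTransitionFast is the hot-path transition: a single CAS, no loop,
// no error allocation - comparable in cost to a plain atomic bool.
func (m *ConnStateMachine) TryTransitionFast(from, to ConnState) bool {
	if !m.state.CompareAndSwap(uint32(from), uint32(to)) {
		return false
	}
	// Only touch the mutex when someone is actually waiting (the rare case).
	if m.waiterCount.Load() > 0 {
		m.notifyWaiters()
	}
	return true
}

// TryTransition is the validated path: it walks a predefined slice of
// allowed source states instead of allocating a fresh one per call.
func (m *ConnStateMachine) TryTransition(valid []ConnState, to ConnState) bool {
	for _, from := range valid {
		if m.TryTransitionFast(from, to) {
			return true
		}
	}
	return false
}

func (m *ConnStateMachine) notifyWaiters() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, ch := range m.waiters {
		close(ch)
	}
	m.waiters = nil
	m.waiterCount.Store(0)
}

func main() {
	var m ConnStateMachine
	m.state.Store(uint32(StateIdle))
	fmt.Println(m.TryTransitionFast(StateIdle, StateInUse))     // popIdle-style fast path
	fmt.Println(m.TryTransition(validFromInUse, StateUnusable)) // e.g. marking for handoff
}
```

A failed CAS means the connection was concurrently claimed or marked unusable, so the caller simply falls back (e.g. popIdle tries the next candidate) rather than spinning.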
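The PR ultimately settled on "simple channel based semaphores" over the custom FIFO-waiter experiments. A minimal sketch of that idea, assuming nothing about the actual go-redis implementation beyond the commit message: a buffered channel where a send acquires a slot and a receive releases it, with context-aware blocking for free via select.

```go
package main

import (
	"context"
	"fmt"
)

// semaphore is a counting semaphore backed by a buffered channel.
type semaphore chan struct{}

func newSemaphore(size int) semaphore {
	return make(semaphore, size)
}

// Acquire blocks until a slot is free or ctx is done.
func (s semaphore) Acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// TryAcquire never blocks.
func (s semaphore) TryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees one slot.
func (s semaphore) Release() { <-s }

func main() {
	s := newSemaphore(1)
	fmt.Println(s.TryAcquire()) // true: slot available
	fmt.Println(s.TryAcquire()) // false: full
	s.Release()
	fmt.Println(s.TryAcquire()) // true again
}
```

The channel runtime handles the waiter bookkeeping that the hand-rolled FIFO semaphore had to get right itself (the double-release and lost-token races mentioned above), at the cost of slightly more overhead per operation.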
183 lines · 6.2 KiB · Go
package maintnotifications

import (
	"context"
	"net"
	"sync"
	"time"

	"github.com/redis/go-redis/v9/internal"
	"github.com/redis/go-redis/v9/internal/maintnotifications/logs"
	"github.com/redis/go-redis/v9/internal/pool"
)

// OperationsManagerInterface defines the interface for completing handoff operations
type OperationsManagerInterface interface {
	TrackMovingOperationWithConnID(ctx context.Context, newEndpoint string, deadline time.Time, seqID int64, connID uint64) error
	UntrackOperationWithConnID(seqID int64, connID uint64)
}

// HandoffRequest represents a request to handoff a connection to a new endpoint
type HandoffRequest struct {
	Conn     *pool.Conn
	ConnID   uint64 // Unique connection identifier
	Endpoint string
	SeqID    int64
	Pool     pool.Pooler // Pool to remove connection from on failure
}

// PoolHook implements pool.PoolHook for Redis-specific connection handling
// with maintenance notifications support.
type PoolHook struct {
	// Base dialer for creating connections to new endpoints during handoffs;
	// args are network and address
	baseDialer func(context.Context, string, string) (net.Conn, error)

	// Network type (e.g., "tcp", "unix")
	network string

	// Worker manager for background handoff processing
	workerManager *handoffWorkerManager

	// Configuration for the maintenance notifications
	config *Config

	// Operations manager interface for operation completion tracking
	operationsManager OperationsManagerInterface

	// Pool interface for removing connections on handoff failure
	pool pool.Pooler
}

// NewPoolHook creates a new pool hook
func NewPoolHook(baseDialer func(context.Context, string, string) (net.Conn, error), network string, config *Config, operationsManager OperationsManagerInterface) *PoolHook {
	return NewPoolHookWithPoolSize(baseDialer, network, config, operationsManager, 0)
}

// NewPoolHookWithPoolSize creates a new pool hook with pool size for better worker defaults
func NewPoolHookWithPoolSize(baseDialer func(context.Context, string, string) (net.Conn, error), network string, config *Config, operationsManager OperationsManagerInterface, poolSize int) *PoolHook {
	// Apply defaults if config is nil or has zero values
	// (ApplyDefaultsWithPoolSize is called on a nil receiver here, so it must handle that case)
	if config == nil {
		config = config.ApplyDefaultsWithPoolSize(poolSize)
	}

	ph := &PoolHook{
		// baseDialer is used to create connections to new endpoints during handoffs
		baseDialer:        baseDialer,
		network:           network,
		config:            config,
		operationsManager: operationsManager,
	}

	// Create worker manager
	ph.workerManager = newHandoffWorkerManager(config, ph)

	return ph
}

// SetPool sets the pool interface for removing connections on handoff failure
func (ph *PoolHook) SetPool(pooler pool.Pooler) {
	ph.pool = pooler
}

// GetCurrentWorkers returns the current number of active workers (for testing)
func (ph *PoolHook) GetCurrentWorkers() int {
	return ph.workerManager.getCurrentWorkers()
}

// IsHandoffPending returns true if the given connection has a pending handoff
func (ph *PoolHook) IsHandoffPending(conn *pool.Conn) bool {
	return ph.workerManager.isHandoffPending(conn)
}

// GetPendingMap returns the pending map for testing purposes
func (ph *PoolHook) GetPendingMap() *sync.Map {
	return ph.workerManager.getPendingMap()
}

// GetMaxWorkers returns the max workers for testing purposes
func (ph *PoolHook) GetMaxWorkers() int {
	return ph.workerManager.getMaxWorkers()
}

// GetHandoffQueue returns the handoff queue for testing purposes
func (ph *PoolHook) GetHandoffQueue() chan HandoffRequest {
	return ph.workerManager.getHandoffQueue()
}

// GetCircuitBreakerStats returns circuit breaker statistics for monitoring
func (ph *PoolHook) GetCircuitBreakerStats() []CircuitBreakerStats {
	return ph.workerManager.getCircuitBreakerStats()
}

// ResetCircuitBreakers resets all circuit breakers (useful for testing)
func (ph *PoolHook) ResetCircuitBreakers() {
	ph.workerManager.resetCircuitBreakers()
}

// OnGet is called when a connection is retrieved from the pool
func (ph *PoolHook) OnGet(_ context.Context, conn *pool.Conn, _ bool) (accept bool, err error) {
	// Check if the connection is marked for handoff.
	// This prevents using connections that have received MOVING notifications.
	if conn.ShouldHandoff() {
		return false, ErrConnectionMarkedForHandoffWithState
	}

	// Check if the connection is usable (not in UNUSABLE or CLOSED state).
	// This ensures we don't return connections that are currently being handed off or re-authenticated.
	if !conn.IsUsable() {
		return false, ErrConnectionMarkedForHandoff
	}

	return true, nil
}

// OnPut is called when a connection is returned to the pool
func (ph *PoolHook) OnPut(ctx context.Context, conn *pool.Conn) (shouldPool bool, shouldRemove bool, err error) {
	// First check whether the connection needs a handoff at all, for fast rejection.
	if !conn.ShouldHandoff() {
		// Default behavior (no handoff): pool the connection
		return true, false, nil
	}

	// Check for a pending handoff so the same connection is not queued twice.
	if ph.workerManager.isHandoffPending(conn) {
		// Default behavior (handoff already pending): pool the connection
		return true, false, nil
	}

	if err := ph.workerManager.queueHandoff(conn); err != nil {
		// Failed to queue the handoff: remove the connection
		internal.Logger.Printf(ctx, logs.FailedToQueueHandoff(conn.GetID(), err))
		// Don't pool, remove the connection, no error to the caller
		return false, true, nil
	}

	// Check if the handoff was already processed by a worker before we could mark it as queued
	if !conn.ShouldHandoff() {
		// Handoff was already processed - this is normal and the connection should be pooled
		return true, false, nil
	}

	if err := conn.MarkQueuedForHandoff(); err != nil {
		// If marking fails, check if the handoff was processed in the meantime
		if !conn.ShouldHandoff() {
			// Handoff was processed - this is normal, pool the connection
			return true, false, nil
		}
		// Other error - remove the connection
		return false, true, nil
	}

	internal.Logger.Printf(ctx, logs.MarkedForHandoff(conn.GetID()))

	return true, false, nil
}

// OnRemove is called when a connection is removed from the pool.
func (ph *PoolHook) OnRemove(_ context.Context, _ *pool.Conn, _ error) {
	// Not used
}

// Shutdown gracefully shuts down the processor, waiting for workers to complete
func (ph *PoolHook) Shutdown(ctx context.Context) error {
	return ph.workerManager.shutdownWorkers(ctx)
}