
fix(pool): pool performance (#3565)

* perf(pool): replace hookManager RWMutex with atomic.Pointer and add predefined state slices

- Replace hookManager RWMutex with atomic.Pointer for lock-free reads in hot paths
- Add predefined state slices to avoid allocations (validFromInUse, validFromCreatedOrIdle, etc.)
- Add Clone() method to PoolHookManager for atomic updates
- Update AddPoolHook/RemovePoolHook to use a copy-on-write pattern (sketched below)
- Update all hookManager access points to use atomic Load()

Performance improvements:
- Eliminates RWMutex contention in Get/Put/Remove hot paths
- Reduces allocations by reusing predefined state slices
- Lock-free reads allow better CPU cache utilization
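
A minimal sketch of the copy-on-write pattern described above, assuming simplified stand-ins for PoolHook, PoolHookManager, and the pool struct (the real types live in internal/pool and carry more state):

package pool

import "sync/atomic"

// PoolHook is a hypothetical, reduced hook interface for illustration.
type PoolHook interface {
	OnGet(connID uint64)
}

type PoolHookManager struct {
	hooks []PoolHook
}

// Clone returns a copy so the currently published manager is never mutated in place.
func (m *PoolHookManager) Clone() *PoolHookManager {
	cp := &PoolHookManager{hooks: make([]PoolHook, len(m.hooks))}
	copy(cp.hooks, m.hooks)
	return cp
}

type pool struct {
	hookManager atomic.Pointer[PoolHookManager] // lock-free Load() on the hot path
}

// AddPoolHook publishes a new manager via copy-on-write: clone, modify the copy, Store.
// Writers are assumed to be serialized by the caller.
func (p *pool) AddPoolHook(h PoolHook) {
	next := &PoolHookManager{}
	if old := p.hookManager.Load(); old != nil {
		next = old.Clone()
	}
	next.hooks = append(next.hooks, h)
	p.hookManager.Store(next)
}

// Hot path: a single atomic Load, no RWMutex.
func (p *pool) notifyOnGet(connID uint64) {
	if m := p.hookManager.Load(); m != nil {
		for _, h := range m.hooks {
			h.OnGet(connID)
		}
	}
}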

* perf(pool): eliminate mutex overhead in state machine hot path

The state machine was calling notifyWaiters() on EVERY Get/Put operation,
which acquired a mutex even when no waiters were present (the common case).

Fix: Use atomic waiterCount to check for waiters BEFORE acquiring mutex.
This eliminates mutex contention in the hot path (Get/Put operations).

Implementation:
- Added atomic.Int32 waiterCount field to ConnStateMachine
- Increment when adding waiter, decrement when removing
- Check waiterCount atomically before acquiring the mutex in notifyWaiters() (see the sketch below)

Performance impact:
- Before: mutex lock/unlock on every Get/Put (even with no waiters)
- After: lock-free atomic check, only acquire mutex if waiters exist
- Expected improvement: ~30-50% for Get/Put operations
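
A reduced sketch of the waiter-count check, assuming a pared-down ConnStateMachine with only the fields involved (the real type tracks the connection state as well):

package pool

import (
	"sync"
	"sync/atomic"
)

type ConnStateMachine struct {
	mu          sync.Mutex
	waiterCount atomic.Int32 // goroutines currently blocked in AwaitAndTransition
	waiters     []chan struct{}
}

// notifyWaiters runs on every Get/Put; with no waiters it is a single atomic
// load and the mutex is never touched.
func (sm *ConnStateMachine) notifyWaiters() {
	if sm.waiterCount.Load() == 0 {
		return // common case: nobody is waiting
	}
	sm.mu.Lock()
	defer sm.mu.Unlock()
	for _, ch := range sm.waiters {
		close(ch)
	}
	sm.waiters = sm.waiters[:0]
	sm.waiterCount.Store(0)
}

// addWaiter registers a waiter; the count is bumped under the same lock so
// notifyWaiters cannot observe the slice and the counter out of sync.
// (The real implementation also re-checks the state after registering.)
func (sm *ConnStateMachine) addWaiter() chan struct{} {
	ch := make(chan struct{})
	sm.mu.Lock()
	sm.waiters = append(sm.waiters, ch)
	sm.waiterCount.Add(1)
	sm.mu.Unlock()
	return ch
}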

* perf(pool): use predefined state slices to eliminate allocations in hot path

The pool was creating new slice literals on EVERY Get/Put operation:
- popIdle(): []ConnState{StateCreated, StateIdle}
- putConn(): []ConnState{StateInUse}
- CompareAndSwapUsed(): []ConnState{StateIdle} and []ConnState{StateInUse}
- MarkUnusableForHandoff(): []ConnState{StateInUse, StateIdle, StateCreated}

These allocations were happening millions of times per second in the hot path.

Fix: Use the predefined global slices defined in conn_state.go (sketched below):
- validFromInUse
- validFromCreatedOrIdle
- validFromCreatedInUseOrIdle

Performance impact:
- Before: 4 slice allocations per Get/Put cycle
- After: 0 allocations (use predefined slices)
- Expected improvement: ~30-40% reduction in allocations and GC pressure
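
A sketch of what those predefined slices look like; the variable names follow the commit message, while the ConnState values themselves are illustrative:

package pool

type ConnState uint32

const (
	StateCreated ConnState = iota
	StateIdle
	StateInUse
	StateUnusable
)

// Allocated once at package init and reused on every Get/Put, instead of
// building a fresh []ConnState literal per call.
var (
	validFromInUse              = []ConnState{StateInUse}
	validFromCreatedOrIdle      = []ConnState{StateCreated, StateIdle}
	validFromCreatedInUseOrIdle = []ConnState{StateCreated, StateInUse, StateIdle}
)

// e.g. popIdle: sm.TryTransition(validFromCreatedOrIdle, StateInUse)
// instead of:  sm.TryTransition([]ConnState{StateCreated, StateIdle}, StateInUse)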

* perf(pool): optimize TryTransition to reduce atomic operations

Further optimize the hot path by:
1. Remove redundant GetState() call in the loop
2. Only check waiterCount after successful CAS (not before loop)
3. Inline the waiterCount check to avoid the notifyWaiters() call overhead (see the sketch below)

This reduces atomic operations from 4-5 per Get/Put to 2-3:
- Before: GetState() + CAS + waiterCount.Load() + notifyWaiters mutex check
- After: CAS + waiterCount.Load() (only if CAS succeeds)

Performance impact:
- Eliminates 1-2 atomic operations per Get/Put
- Expected improvement: ~10-15% for Get/Put operations
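
A sketch of the reordered TryTransition under these assumptions: signatures and the error value are simplified, and the per-state CAS doubles as the state read, so no separate GetState() load is needed:

package pool

import (
	"errors"
	"sync/atomic"
)

type ConnState uint32

type ConnStateMachine struct {
	state       atomic.Uint32
	waiterCount atomic.Int32
}

var errInvalidTransition = errors.New("conn state: invalid transition")

func (sm *ConnStateMachine) TryTransition(valid []ConnState, target ConnState) error {
	for _, from := range valid {
		if sm.state.CompareAndSwap(uint32(from), uint32(target)) {
			// Waiters are only consulted after the transition actually happened,
			// and the counter check is inlined so notifyWaiters is not even
			// called in the no-waiter case.
			if sm.waiterCount.Load() > 0 {
				sm.notifyWaiters()
			}
			return nil
		}
	}
	return errInvalidTransition
}

func (sm *ConnStateMachine) notifyWaiters() {
	// See the waiter-notification sketch above.
}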

* perf(pool): add fast path for Get/Put to match master performance

Introduced TryTransitionFast() for the hot path (Get/Put operations; sketched below):
- Single CAS operation (same as master's atomic bool)
- No waiter notification overhead
- No loop through valid states
- No error allocation

Hot path flow:
1. popIdle(): Try IDLE → IN_USE (fast), fall back to CREATED → IN_USE
2. putConn(): Try IN_USE → IDLE (fast)

This matches master's performance while preserving state machine for:
- Background operations (handoff/reauth use UNUSABLE state)
- State validation (TryTransition still available)
- Waiter notification (AwaitAndTransition for blocking)

Performance comparison per Get/Put cycle:
- Master: 2 atomic CAS operations
- State machine (before): 5 atomic operations (2.5x slower)
- State machine (after): 2 atomic CAS operations (same as master!)

Expected improvement: restore throughput to the ~11,373 ops/sec baseline
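
A sketch of that fast path; TryTransitionFast follows the commit message, while the two helper functions stand in for the popIdle()/putConn() call sites:

package pool

import "sync/atomic"

type ConnState uint32

const (
	StateCreated ConnState = iota
	StateIdle
	StateInUse
)

type ConnStateMachine struct {
	state atomic.Uint32
}

// TryTransitionFast is a single CAS: no valid-state loop, no waiter
// notification, no error allocation.
func (sm *ConnStateMachine) TryTransitionFast(from, to ConnState) bool {
	return sm.state.CompareAndSwap(uint32(from), uint32(to))
}

// popIdle hot path: try IDLE -> IN_USE first, fall back to CREATED -> IN_USE.
func acquireForUse(sm *ConnStateMachine) bool {
	return sm.TryTransitionFast(StateIdle, StateInUse) ||
		sm.TryTransitionFast(StateCreated, StateInUse)
}

// putConn hot path: IN_USE -> IDLE.
func releaseToIdle(sm *ConnStateMachine) bool {
	return sm.TryTransitionFast(StateInUse, StateIdle)
}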

* combine cas

* fix linter

* try faster approach

* fast semaphore

* better inlining for hot path

* fix linter issues

* use new semaphore in auth as well

* linter should be happy now

* add comments

* Update internal/pool/conn_state.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* address comment

* slight reordering

* try to cache time for non-critical calculations

* fix wrong benchmark

* add concurrent test

* fix benchmark report

* add additional expect to check output

* comment and variable rename

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Author: Nedyalko Dyakov
Committed by: GitHub
Date: 2025-10-27 15:06:30 +02:00
Parent: 07e665f7af
Commit: 080a33c3a8
12 changed files with 585 additions and 169 deletions

internal/semaphore.go (new file, 161 lines)
@@ -0,0 +1,161 @@
package internal

import (
	"context"
	"sync"
	"sync/atomic"
	"time"
)

var semTimers = sync.Pool{
	New: func() interface{} {
		t := time.NewTimer(time.Hour)
		t.Stop()
		return t
	},
}

// FastSemaphore is a counting semaphore implementation using atomic operations.
// It's optimized for the fast path (no blocking) while still supporting timeouts and context cancellation.
//
// Performance characteristics:
// - Fast path (no blocking): Single atomic CAS operation
// - Slow path (blocking): Falls back to channel-based waiting
// - Release: Single atomic decrement + optional channel notification
//
// This is significantly faster than a pure channel-based semaphore because:
// 1. The fast path avoids channel operations entirely (no scheduler involvement)
// 2. Atomic operations are much cheaper than channel send/receive
type FastSemaphore struct {
	// Current number of acquired tokens (atomic)
	count atomic.Int32
	// Maximum number of tokens (capacity)
	max int32
	// Channel for blocking waiters
	// Only used when fast path fails (semaphore is full)
	waitCh chan struct{}
}

// NewFastSemaphore creates a new fast semaphore with the given capacity.
func NewFastSemaphore(capacity int32) *FastSemaphore {
	return &FastSemaphore{
		max:    capacity,
		waitCh: make(chan struct{}, capacity),
	}
}

// TryAcquire attempts to acquire a token without blocking.
// Returns true if successful, false if the semaphore is full.
//
// This is the fast path - just a single CAS operation.
func (s *FastSemaphore) TryAcquire() bool {
	for {
		current := s.count.Load()
		if current >= s.max {
			return false // Semaphore is full
		}
		if s.count.CompareAndSwap(current, current+1) {
			return true // Successfully acquired
		}
		// CAS failed due to concurrent modification, retry
	}
}

// Acquire acquires a token, blocking if necessary until one is available or the context is cancelled.
// Returns an error if the context is cancelled or the timeout expires.
// Returns timeoutErr when the timeout expires.
//
// Performance optimization:
// 1. First try fast path (no blocking)
// 2. If that fails, fall back to channel-based waiting
func (s *FastSemaphore) Acquire(ctx context.Context, timeout time.Duration, timeoutErr error) error {
	// Fast path: try to acquire without blocking
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
	}

	// Try fast acquire first
	if s.TryAcquire() {
		return nil
	}

	// Fast path failed, need to wait
	// Use timer pool to avoid allocation
	timer := semTimers.Get().(*time.Timer)
	defer semTimers.Put(timer)
	timer.Reset(timeout)

	start := time.Now()
	for {
		select {
		case <-ctx.Done():
			if !timer.Stop() {
				<-timer.C
			}
			return ctx.Err()
		case <-s.waitCh:
			// Someone released a token, try to acquire it
			if s.TryAcquire() {
				if !timer.Stop() {
					<-timer.C
				}
				return nil
			}
			// Failed to acquire (race with another goroutine), continue waiting
		case <-timer.C:
			return timeoutErr
		}

		// Periodically check if we can acquire (handles race conditions)
		if time.Since(start) > timeout {
			return timeoutErr
		}
	}
}

// AcquireBlocking acquires a token, blocking indefinitely until one is available.
// This is useful for cases where you don't need timeout or context cancellation.
// Returns immediately if a token is available (fast path).
func (s *FastSemaphore) AcquireBlocking() {
	// Try fast path first
	if s.TryAcquire() {
		return
	}

	// Slow path: wait for a token
	for {
		<-s.waitCh
		if s.TryAcquire() {
			return
		}
		// Failed to acquire (race with another goroutine), continue waiting
	}
}

// Release releases a token back to the semaphore.
// This wakes up one waiting goroutine if any are blocked.
func (s *FastSemaphore) Release() {
	s.count.Add(-1)

	// Try to wake up a waiter (non-blocking)
	// If no one is waiting, this is a no-op
	select {
	case s.waitCh <- struct{}{}:
		// Successfully notified a waiter
	default:
		// No waiters, that's fine
	}
}

// Len returns the current number of acquired tokens.
// Used by tests to check semaphore state.
func (s *FastSemaphore) Len() int32 {
	return s.count.Load()
}
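
For context, a minimal usage sketch of FastSemaphore, written as if it lived next to the type in the internal package; the capacity, timeout, and errPoolTimeout below are assumptions for illustration, not pool defaults:

package internal

import (
	"context"
	"errors"
	"time"
)

// errPoolTimeout is a hypothetical timeout error for this example.
var errPoolTimeout = errors.New("pool timeout")

func exampleUsage(ctx context.Context) error {
	sem := NewFastSemaphore(10) // at most 10 concurrent holders

	// Fast path: a single CAS when a token is free.
	if sem.TryAcquire() {
		defer sem.Release()
		// ... do work while holding the token ...
		return nil
	}

	// Blocking path: waits up to 3s, honoring ctx cancellation.
	if err := sem.Acquire(ctx, 3*time.Second, errPoolTimeout); err != nil {
		return err
	}
	defer sem.Release()
	// ... do work while holding the token ...
	return nil
}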