Summary: After seeing more people hit issues with thrashing of small LRUCache shards, and with AutoHCC now having run fully in production for a while on a very large service, here I make these updates:

* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred (a usage sketch follows the test plan below). That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`), because it is no longer generally recommended, even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache, but that's just one cache shard for AutoHCC, so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern on 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case, but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes.
* Fixed some previously unreported leaks (perhaps thanks to better assertions in HCC; some subtle behavior changes).
* Added a new mode to cache_bench (detailed below).
* Skip a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds, so that unit testing, etc., isn't bogged down; it is kept in the ASAN build.

Planned follow-up:
* Update the HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain.

Possible follow-up:
* Re-engineer the table cache to also use AutoHCC, instead of LRUCache and a single mutex to ensure no duplication across threads: (a) pad the table cache key to 128 bits for AutoHCC; (b) stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964

Test Plan: existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.

## Performance
Although we've previously validated AutoHCC performance under high load, etc., we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the `-stress_cache_instances=n` parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed).
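For context on what the new cache_bench mode exercises, and on how `estimated_entry_charge = 0` selects AutoHCC in the public API, here is a minimal sketch of a create/destroy timing loop. This is not the cache_bench implementation; the instance count and output format are invented for illustration, and it assumes only the public `HyperClockCacheOptions::MakeSharedCache()` API:
```
#include <chrono>
#include <cstdio>
#include <memory>
#include <vector>

#include "rocksdb/cache.h"

int main() {
  // Hypothetical instance count for this sketch; cache_bench's
  // -stress_cache_instances flag controls the real benchmark.
  constexpr size_t kNumInstances = 1000;
  constexpr size_t kCapacity = 32 << 20;  // 32MB, matching the default block cache

  std::vector<std::shared_ptr<rocksdb::Cache>> caches;
  caches.reserve(kNumInstances);

  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < kNumInstances; ++i) {
    // estimated_entry_charge = 0 selects AutoHCC
    rocksdb::HyperClockCacheOptions opts(kCapacity,
                                         /*estimated_entry_charge=*/0);
    caches.push_back(opts.MakeSharedCache());
  }
  auto created = std::chrono::steady_clock::now();
  caches.clear();  // destroy all instances
  auto destroyed = std::chrono::steady_clock::now();

  auto avg_us = [&](auto d) {
    return std::chrono::duration_cast<std::chrono::microseconds>(d).count() /
           static_cast<double>(kNumInstances);
  };
  std::printf("avg creation: %.1f us, avg destruction: %.1f us\n",
              avg_us(created - start), avg_us(destroyed - created));
  return 0;
}
```
Passing a nonzero `estimated_entry_charge` to the same options struct would select FixedHCC instead.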
Beyond that debug-build fix, we can answer the question of whether it is feasible for a single process to host 1000 DBs, each with 1000 CFs, with default block cache instances, after moving LRUCache -> AutoHCC. For example:
```
/usr/bin/time ./cache_bench -stress_cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```
Release build: Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
-> Average 4.3 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)

Debug build: Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
-> Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)

Despite the anonymous mmaps, AutoHCC is apparently more efficient for default/small/empty structures. This is likely due to the dramatically lower number of cache shards at this size.

If we switch to `-stress_cache_instances=10000 -cache_size=1073741824`:

Release build: Average 10.6 us per 1GB LRUCache creation, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
-> Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)

Debug build: Average 11.2 us per 1GB LRUCache creation, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
-> Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)

Here it's clear that we are paying a price in time for setting up all those mmaps for the substantial number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to notably slow down anything such as unit tests. Before and after full test suite runs confirm:

3327.73user 5188.71system 3:38.88elapsed
-> 3312.07user 5704.77system 3:41.61elapsed

There is increased kernel time, but it is acceptable. With ASAN+UBSAN:

11618.70user 15671.30system 5:54.68elapsed
-> 12595.81user 16159.67system 6:32.77elapsed

Acceptable, given that our ASAN+UBSAN builds are not the slowest in CI.

Reviewed By: hx235

Differential Revision: D82661067

Pulled By: pdillinger

fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b