Currently only glibc, and not musl, supports the IFUNC mechanism.
This fixes the 11.8 branch build on Alpine Linux.
The build error was:
mariadb-11.8.2/sql/vector_mhnsw.cc: In static member function 'static const FVector* FVector::create(metric_type, void*, const void*, size_t)':
mariadb-11.8.2/sql/vector_mhnsw.cc:299:19: error: multiversioning needs 'ifunc' which is not supported on this target
  299 |   static FVector *align_ptr(void *ptr) { return (FVector*)ptr; }
      |                   ^~~~~~~~~
mariadb-11.8.2/sql/vector_mhnsw.cc:113:3: error: use of multiversioned function without a default
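The fix boils down to making GCC function multiversioning conditional
on the C library. A rough sketch of the idea (the macro name and clone
list here are illustrative, not the actual patch):

    #include <cstddef>

    /* Sketch only: emit target_clones (which needs IFUNC support in the
       C library's dynamic linker) only when building against glibc. */
    #if defined(__GNUC__) && defined(__GLIBC__) && defined(__x86_64__)
    #define MHNSW_TARGET_CLONES __attribute__((target_clones("avx2","default")))
    #else
    #define MHNSW_TARGET_CLONES            /* single-version build on musl etc. */
    #endif

    MHNSW_TARGET_CLONES
    static float dot_product(const float *a, const float *b, size_t len)
    {
      float sum= 0;
      for (size_t i= 0; i < len; i++)
        sum+= a[i] * b[i];
      return sum;
    }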
This patch optimises the dot_product function by vectorising it with
SIMD intrinsics, so that several vector elements are processed per
instruction, significantly improving dot product performance on
supported architectures.
The original dot_product function is already auto-vectorised when
compiled with -O3. However, performance analysis shows that the new
implementation performs better on Power10 and comparably on Power9
machines.
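As a minimal sketch of the approach (assuming float elements, a length
that is a multiple of 4, and a POWER target built with VSX enabled; the
actual patch may differ in element type and loop structure):

    #include <altivec.h>
    #include <cstddef>

    /* Sketch: accumulate 4 lanes at a time with fused multiply-add. */
    static float dot_product_vsx(const float *a, const float *b, size_t len)
    {
      __vector float acc= vec_splats(0.0f);
      for (size_t i= 0; i < len; i+= 4)
        acc= vec_madd(vec_xl(0, a + i), vec_xl(0, b + i), acc);
      return vec_extract(acc, 0) + vec_extract(acc, 1) +
             vec_extract(acc, 2) + vec_extract(acc, 3);
    }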
Benchmark tests were conducted on both Power9 and Power10 machines,
comparing the time taken by the original (auto-vectorised) code and
the new vectorised code, built with GCC 11.5.0 at -O3 on RHEL 9.5.
The benchmarks used a sample test program with a vector size of 4096
and 10⁷ loop iterations. Average execution times (in seconds) over
multiple runs:
Power9:
Before change: ~16.364 s
After change: ~16.180 s
Performance gain is modest but measurable.
Power10:
Before change: ~8.989 s
After change: ~6.446 s
Significant improvement, roughly 28–30% faster.
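The benchmark shape was roughly the following (a reconstruction for
illustration, not the exact test code; dot_product stands for whichever
implementation is linked in):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Implementation under test (original or vectorised), linked in separately.
    float dot_product(const float *a, const float *b, size_t len);

    int main()
    {
      const size_t N= 4096;
      const long ITER= 10000000;              // 10^7 iterations
      std::vector<float> a(N, 1.0f), b(N, 2.0f);
      float sink= 0;                          // consume results so calls aren't elided
      auto t0= std::chrono::steady_clock::now();
      for (long i= 0; i < ITER; i++)
        sink= sink + dot_product(a.data(), b.data(), N);
      std::chrono::duration<double> dt= std::chrono::steady_clock::now() - t0;
      std::printf("%.3f s (sink=%f)\n", dt.count(), (double)sink);
      return 0;
    }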
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com>
SIMD implementations of bloom filters and dot product calculation.
A microbenchmark shows a 1.7x dot product performance improvement over
regular -O2/-O3 builds and 2.4x over builds with auto-vectorization
disabled.
The microbenchmarked improvement for bloom filters is more modest,
in the 10-30% range depending on compiler options and load.
Misc implementation notes:
CalcHash: no _mm256_shuffle_epi8(), use explicit XOR/shift.
CalcHash: no 64-bit multiplication, do scalar multiplication.
ConstructMask/Query: no _mm256_i64gather_epi64, access array elements explicitly.
Query: no _mm256_movemask_epi8, accumulate bits manually (see the sketch below).
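For the last point, the manual bit accumulation is conceptually
equivalent to this scalar stand-in (illustrative only, not the
committed code):

    #include <cstdint>

    /* Sketch: build a 32-bit mask from the sign bit of each of 32 bytes,
       as _mm256_movemask_epi8 would, using plain shifts and ORs. */
    static uint32_t movemask_bytes(const uint8_t cmp[32])
    {
      uint32_t mask= 0;
      for (int i= 0; i < 32; i++)
        mask|= uint32_t(cmp[i] >> 7) << i;
      return mask;
    }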
Closes #3671
* preserve the graph in memory between statements
* keep it in a TABLE_SHARE, available for concurrent searches
* nodes are generally read-only, walking the graph doesn't change them
* distance to target is cached, calculated only once
* SIMD-optimized bloom filter detects visited nodes (see the first
  sketch after this list)
* nodes are stored in an array, not List, to better utilize bloom filter
* auto-adjusting heuristic to estimate the number of visited nodes
(to configure the bloom filter)
* many threads can concurrently walk the graph. MEM_ROOT and Hash_set
are protected with a mutex, but walking doesn't need them
* up to 8 threads can concurrently load nodes into the cache,
  nodes are partitioned across 8 mutexes (8 is chosen arbitrarily,
  might need tuning; see the second sketch after this list)
* concurrent editing is not supported though
* this is fine for MyISAM, TL_WRITE protects the TABLE_SHARE and the
graph (note that TL_WRITE_CONCURRENT_INSERT is not allowed, because an
INSERT into the main table means multiple UPDATEs in the graph)
* InnoDB uses secondary transaction-level caches linked in a list in
  thd->ha_data via a fake handlerton
* on rollback the secondary cache is discarded, on commit nodes
from the secondary cache are invalidated in the shared cache
while it is exclusively locked
* on savepoint rollback both caches are flushed. this can be improved
in the future with a row visibility callback
* graph size is controlled by @@mhnsw_cache_size, the cache is flushed
when it reaches the threshold
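The visited-node filter, with the SIMD and the multiple hash functions
stripped away, reduces conceptually to this single-hash sketch (the
struct name and sizing are illustrative; capacity comes from the
auto-adjusted estimate described above):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct VisitedFilter
    {
      std::vector<uint64_t> words;
      explicit VisitedFilter(size_t estimated_nodes)
        : words(estimated_nodes / 4 + 1) {}     // ~16 bits per expected node
      void mark(uint64_t h)
      { words[(h >> 6) % words.size()]|= uint64_t(1) << (h & 63); }
      bool maybe_visited(uint64_t h) const      // false positives possible, no false negatives
      { return words[(h >> 6) % words.size()] & (uint64_t(1) << (h & 63)); }
    };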
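And the partitioned node loading can be pictured as follows (a
simplified sketch with made-up names): the mutex guarding a load is
chosen by the node's offset, so unrelated loads rarely contend:

    #include <cstdint>
    #include <mutex>

    static constexpr unsigned N_PARTITIONS= 8;  // chosen arbitrarily, might need tuning
    static std::mutex node_load_locks[N_PARTITIONS];

    /* Sketch: up to N_PARTITIONS loads can proceed in parallel. */
    static std::mutex &lock_for(uint64_t node_offset)
    {
      return node_load_locks[node_offset % N_PARTITIONS];
    }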