* feat(PrimProc): MCOL-5950 Improve disk-based aggregation finalization
Iterate over the rows in the plain vector of RGData instead of iterating
over the hashmap. This reduces the complexity and speeds up finalization
(by up to 2x in certain cases), as sketched below.
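A minimal sketch of the difference, using hypothetical stand-in types
rather than the real RGData/RowGroup API:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical simplified stand-ins for the real row/RGData types.
struct Row { uint64_t key; int64_t agg; };
using RGData = std::vector<Row>;

// Before: walk the hashmap, chasing pointers into scattered rows and
// paying the bucket-traversal overhead.
void finalizeViaHashmap(std::unordered_map<uint64_t, Row*>& groups) {
  for (auto& [hash, row] : groups)
    row->agg += 1;  // stand-in for the per-row finalization work
}

// After: walk the plain vector of RGData sequentially; every stored row
// is visited exactly once with cache-friendly, in-order access.
void finalizeViaVector(std::vector<RGData>& rgDatas) {
  for (auto& rg : rgDatas)
    for (auto& row : rg)
      row.agg += 1;  // same per-row finalization work
}
```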
* replace a magic constant with a named constant
* move GROUP_CONCAT/JSON_ARRAYAGG storage from the RowAggregation*
classes to the RowGroup
* (de)serialize the internal data structures (sketched after this list)
* get rid of specialized classes for processing JSON_ARRAYAGG
* move the memory accounting to disk-based aggregation classes
* allow aggregation generations to be used for queries with
GROUP_CONCAT/JSON_ARRAYAGG
* Remove the thread id from the error message, as it interferes with MTR
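The (de)serialization mentioned above is roughly this shape; the buffer
type and the field layout here are assumptions for illustration, not the
engine's actual stream classes:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat byte buffer standing in for the engine's stream type.
struct Buf {
  std::vector<uint8_t> bytes;
  size_t rpos = 0;
  void put(const void* p, size_t n) {
    auto* b = static_cast<const uint8_t*>(p);
    bytes.insert(bytes.end(), b, b + n);
  }
  void get(void* p, size_t n) {
    std::memcpy(p, bytes.data() + rpos, n);
    rpos += n;
  }
};

// Length-prefix the hashmap's raw arrays (info bytes + row hashes) so a
// generation can be dumped to disk and restored without rebuilding it.
void save(Buf& out, const std::vector<uint8_t>& info,
          const std::vector<uint64_t>& hashes) {
  uint64_t n = info.size();
  out.put(&n, sizeof(n));
  out.put(info.data(), n);
  uint64_t m = hashes.size();
  out.put(&m, sizeof(m));
  out.put(hashes.data(), m * sizeof(uint64_t));
}

void load(Buf& in, std::vector<uint8_t>& info,
          std::vector<uint64_t>& hashes) {
  uint64_t n = 0; in.get(&n, sizeof(n));
  info.resize(n); in.get(info.data(), n);
  uint64_t m = 0; in.get(&m, sizeof(m));
  hashes.resize(m); in.get(hashes.data(), m * sizeof(uint64_t));
}
```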
The buffer can utilize more than 4 GB of RAM, which is necessary for the
PM-side join, but the RGData ctor uses a uint32_t when allocating the
data buffer, so larger sizes silently wrap around. This causes an
implicit heap overflow, as illustrated below.
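A self-contained illustration of the truncation (the row counts and sizes
are made up; only the uint32_t wraparound mirrors the bug):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // 5 GB worth of rows: fits in 64 bits, overflows a 32-bit size.
  uint64_t rowCount = 20'000'000;
  uint64_t rowSize  = 256;                           // bytes per row
  uint64_t need64   = rowCount * rowSize;            // 5'120'000'000 bytes
  uint32_t need32   = static_cast<uint32_t>(need64); // silently truncated

  std::cout << "64-bit size: " << need64 << "\n"   // 5120000000
            << "32-bit size: " << need32 << "\n";  // 825032704 -- too small
  // Allocating need32 bytes and then writing need64 bytes of row data
  // overruns the heap allocation.
}
```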
Given that idx is a Robin Hood hashmap bucket number and info is the
intra-bucket index, the root cause is a discrepancy between the idx/hash
pair calculation for a given GROUP BY generation and the one used when
merging generation aggregations in RowAggStorage::finalize. This patch
generalizes rowHashToIdx so that it can be used in both cases (see the
sketch below).
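A minimal sketch of what such a hash-to-idx/info split might look like;
the parameter names (mask, infoInc, infoHashShift) are assumptions, not
the actual RowAggStorage members:

```cpp
#include <cstddef>
#include <cstdint>

// Robin-Hood-style split of a 64-bit row hash into a bucket index and an
// intra-bucket "info" distance byte.
void rowHashToIdx(uint64_t hash, uint64_t mask, uint8_t infoInc,
                  unsigned infoHashShift, uint64_t& info, size_t& idx) {
  // The top bits of the hash become the intra-bucket info byte...
  info = infoInc + static_cast<uint8_t>(hash >> infoHashShift);
  // ...and the low bits select the bucket. Both the per-generation build
  // and the finalize-time merge must derive idx/info from the hash
  // identically, or the merge looks up the wrong slot.
  idx = static_cast<size_t>(hash & mask);
}
```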
In the aggregation code, the patch disables the padding that forces the
hasher to calculate over the whole 2 KB row buffer, and moves the hashing
code into the common place where it belongs. The common hasher keeps the
exact functionality but does not use the MDB hash function. This patch
also restores a detail from the Robin Hood hash map implementation that
had been left out, which reduces the hash function collision rate
(sketched below).
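A hedged sketch of hashing only the occupied bytes plus an extra
avalanche step; it uses FNV-1a and the MurmurHash3 finalizer purely for
illustration, not the engine's actual hasher:

```cpp
#include <cstddef>
#include <cstdint>

// Hash only the bytes a row actually occupies instead of the padded 2 KB
// buffer. The tail mix (borrowed here from the MurmurHash3 finalizer, in
// the spirit of what Robin Hood maps do) spreads entropy into the low
// bits that the bucket mask consumes, cutting the collision rate.
uint64_t hashRow(const uint8_t* data, size_t usedLen) {
  uint64_t h = 1469598103934665603ULL;  // FNV-1a offset basis
  for (size_t i = 0; i < usedLen; ++i) {
    h ^= data[i];
    h *= 1099511628211ULL;              // FNV-1a prime
  }
  h ^= h >> 33;                         // final avalanche mix
  h *= 0xff51afd7ed558ccdULL;
  h ^= h >> 33;
  return h;
}
```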
* Introduce multigeneration aggregation
* Do not save the unused parts of RGDatas to disk
* Add IO error explanation (strerror)
* Reduce memory usage while aggregating
* introduce in-memory generations for better memory utilization
* Try to keep the number of buckets low
* Refactor disk aggregation a bit
* pass the calculated hash into RowAggregation
* try to keep some RGDatas with free space in memory
* do not dump more than half of the rowgroups to disk if generations are
allowed; instead, start a new generation (see the sketch after this list)
* for each thread, shift the first processed bucket at each iteration so
the generations start more evenly
* Unify temp data location
* Explicitly create temp subdirectories
regardless of whether disk aggregation/join is enabled
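For the "do not dump more than half" rule above, the decision might look
like this sketch (all names are hypothetical):

```cpp
#include <cstddef>

// Per-generation bookkeeping; names are illustrative, not the real ones.
struct GenState {
  size_t inMemory = 0;  // rowgroups currently held in RAM
  size_t dumped   = 0;  // rowgroups already spilled in this generation
};

enum class Action { DumpToDisk, StartNewGeneration };

// Once half of the in-memory rowgroups have already been spilled, dumping
// more would mostly thrash the disk, so seal the current generation and
// start a fresh one instead.
Action onMemoryPressure(const GenState& s, bool generationsAllowed) {
  if (generationsAllowed && s.dumped >= s.inMemory / 2)
    return Action::StartNewGeneration;
  return Action::DumpToDisk;
}
```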