* fix(rowgroup): RGData now uses a uint64_t counter for the fixed-size columns data buffer.
The buffer can now exceed 4GB of RAM, which is necessary for PM side joins.
Previously the RGData ctor used uint32_t when allocating the data buffer,
which caused an implicit heap overflow (see the sketch below).
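A minimal sketch of the failure mode, with made-up numbers (the row count, row size, and names are illustrative, not taken from RGData itself): computing the byte count in 64 bits but storing it in a uint32_t wraps the value, so far too little memory is allocated and later writes run off the end of the heap buffer.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // Hypothetical sizes for a large small-side used in a PM join.
    uint64_t rowCount = 70'000'000;  // assumed row count
    uint64_t rowSize  = 64;          // assumed bytes per fixed-size row

    uint64_t correct  = rowCount * rowSize;                        // ~4.48 GB
    uint32_t narrowed = static_cast<uint32_t>(rowCount * rowSize); // wraps past 2^32

    std::cout << "uint64_t size: " << correct  << " bytes\n"; // 4480000000
    std::cout << "uint32_t size: " << narrowed << " bytes\n"; // ~185 MB after wrap-around
}
```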
* feat(bytestream,serdes): BS buffer size type is now uint64_t
This is necessary to handle 64-bit RGData, which comes as
a separate patch. Together, the pair of patches allows
PM joins when the SmallSide size exceeds 4GB.
* feat(bytestream,serdes): Propagate the BS buffer size data type change to avoid implicit data type narrowing
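A minimal sketch of the widened size type, using a simplified stand-in for ByteStream (the class and method names below are illustrative, not the actual serdes API): once both the append length and the reported length are uint64_t end to end, a 64-bit RGData size is carried through without implicit narrowing.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: if the length parameter were uint32_t, passing a 64-bit
// RGData size here would be silently truncated at the call site.
class ByteStreamSketch
{
 public:
  void append(const uint8_t* data, uint64_t len)  // was uint32_t before the change
  {
    buf_.insert(buf_.end(), data, data + len);
  }
  uint64_t length() const { return buf_.size(); }  // size type widened as well
 private:
  std::vector<uint8_t> buf_;
};
```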
* feat(rowgroup): this restores bits lost during a cherry-pick; the lost bits caused the first RGData::serialize call to crash the process
The large-side read errors mentioned there can be caused by a failure to
close the file stream properly. Some of the data may still reside in the
file stream buffers, and closing must flush it. The flush is an I/O
operation and can fail, leading to a partial write and a subsequent partial
read.
This patch tries to provide better diagnostics.
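An illustrative sketch of the diagnostic idea, assuming a plain std::ofstream rather than the actual file-stream wrapper used by the join code: the final flush happens at close time, so checking the stream state after close() and reporting strerror(errno) turns a silent partial write into a visible error.

```cpp
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: write a buffer and verify that the closing flush succeeded.
bool writeChunk(const std::string& path, const char* data, std::size_t len)
{
  std::ofstream out(path, std::ios::binary);
  out.write(data, static_cast<std::streamsize>(len));
  out.close();  // triggers the final flush of buffered data
  if (!out)     // flush or close failed -> the file on disk may be partial
  {
    std::cerr << "write to " << path << " failed: " << std::strerror(errno) << '\n';
    return false;
  }
  return true;
}
```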
This patch:
1. Handles the corner case where a bucket exceeds the memory limit but its data cannot be redistributed into new buckets by the hash algorithm because all of its rows have the same values (see the sketch after this list).
2. Adds a force option for the disk join step.
3. Adds an option to control the depth of the partition tree.
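A hypothetical sketch of the corner case in item 1 (names are illustrative, not the actual disk-join code): if every row in an over-limit bucket carries the same join-key hash, re-hashing puts them all into the same child bucket again, so redistribution cannot reduce the bucket and the code must stop splitting, which is also why bounding the partition-tree depth is useful.

```cpp
#include <cstdint>
#include <vector>

// Returns true only if re-bucketing by hash can actually spread the rows out.
bool canRedistribute(const std::vector<uint64_t>& rowHashes)
{
  for (std::size_t i = 1; i < rowHashes.size(); ++i)
    if (rowHashes[i] != rowHashes[0])
      return true;  // at least two distinct hashes -> splitting helps
  return false;     // identical hashes -> every row lands in the same child bucket
}
```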
* Adds CompressInterfaceLZ4, which uses the LZ4 API for compress/uncompress.
* Adds CMake machinery to search for LZ4 on the build host.
* All methods that use static data and do not modify any internal data become `static`,
so we can use them without creating a specific object. This is possible because
the header specification has not been modified: we still use 2 sections in the header, the first
one with file metadata, the second one with pointers to the compressed chunks.
* The methods `compress`, `uncompress`, `maxCompressedSize`, and `getUncompressedSize` become
pure virtual, so we can override them for other compression algos (see the sketch after this list).
* Adds the method `getChunkMagicNumber`, so we can verify the chunk magic number
for each compression algo.
* Renames IDBCompressInterface to CompressInterface ("s/IDBCompressInterface/CompressInterface/g") according to the requirement.
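A minimal sketch of the resulting interface shape, assuming simplified signatures (the actual ColumnStore declarations may differ); only the LZ4 C API calls are real, and the magic number is a placeholder:

```cpp
#include <lz4.h>
#include <cstddef>
#include <cstdint>

// Assumed shape of the refactored interface: the codec-specific methods are
// pure virtual so new compression algorithms can be plugged in.
class CompressInterface
{
 public:
  virtual ~CompressInterface() = default;
  virtual size_t maxCompressedSize(size_t srcLen) const = 0;
  virtual bool compress(const char* src, size_t srcLen, char* dst, size_t* dstLen) const = 0;
  virtual bool uncompress(const char* src, size_t srcLen, char* dst, size_t* dstLen) const = 0;
  virtual uint8_t getChunkMagicNumber() const = 0;
};

class CompressInterfaceLZ4 : public CompressInterface
{
 public:
  size_t maxCompressedSize(size_t srcLen) const override
  {
    return static_cast<size_t>(LZ4_compressBound(static_cast<int>(srcLen)));
  }
  bool compress(const char* src, size_t srcLen, char* dst, size_t* dstLen) const override
  {
    int n = LZ4_compress_default(src, dst, static_cast<int>(srcLen), static_cast<int>(*dstLen));
    if (n <= 0)
      return false;
    *dstLen = static_cast<size_t>(n);
    return true;
  }
  bool uncompress(const char* src, size_t srcLen, char* dst, size_t* dstLen) const override
  {
    int n = LZ4_decompress_safe(src, dst, static_cast<int>(srcLen), static_cast<int>(*dstLen));
    if (n < 0)
      return false;
    *dstLen = static_cast<size_t>(n);
    return true;
  }
  uint8_t getChunkMagicNumber() const override { return 0xFD; }  // placeholder value
};
```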
* Introduce multigeneration aggregation
* Do not save unused parts of RGDatas to disk
* Add an I/O error explanation (strerror)
* Reduce memory usage while aggregating
* Introduce in-memory generations for better memory utilization
* Try to keep the number of buckets low
* Refactor disk aggregation a bit
* Pass the calculated hash into RowAggregation
* Try to keep some RGDatas with free space in memory
* Do not dump more than half of the rowgroups to disk if generations are
allowed; instead, start a new generation (see the sketch after this list)
* For each thread, shift the first processed bucket at each iteration
so the generations start more evenly
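A rough sketch of the spill policy from the bullets above, with illustrative names (the real RowAggregation logic is more involved): when generations are allowed and a dump would cover more than half of the resident rowgroups, a new generation is started instead of writing to disk.

```cpp
#include <cstddef>

enum class SpillAction { DumpToDisk, StartNewGeneration };

// Hypothetical decision helper: avoid dumping more than half of the
// in-memory rowgroups when generations are available.
SpillAction chooseSpillAction(std::size_t rowGroupsInMemory,
                              std::size_t rowGroupsToDump,
                              bool generationsAllowed)
{
  if (generationsAllowed && rowGroupsToDump * 2 > rowGroupsInMemory)
    return SpillAction::StartNewGeneration;  // keep hot data, open a new generation
  return SpillAction::DumpToDisk;            // spill is small enough, write it out
}
```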
* Unify temp data location
* Explicitly create temp subdirectories,
whether or not disk aggregation/join is enabled
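An illustrative sketch only (the directory names and layout are assumptions, not the actual ones): creating the unified temp subdirectories up front means later code can rely on them existing, regardless of whether disk aggregation/join is enabled.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: create all temp subdirectories under the unified root.
void createTempSubdirs(const std::string& tmpRoot)
{
  namespace fs = std::filesystem;
  fs::create_directories(fs::path(tmpRoot) / "aggregates");  // disk aggregation spill area
  fs::create_directories(fs::path(tmpRoot) / "joins");       // disk join spill area
}
```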