This patch improves handling of NULLs in textual fields in ColumnStore.
Previously empty strings were considered NULLs and it could be a problem
if data scheme allows for empty strings. It was also one of major
reasons of behavior difference between ColumnStore and other engines in
MariaDB family.
Also, this patch fixes some other bugs and incorrect behavior, for
example, incorrect comparison for "column <= ''" which evaluates to
constant True for all purposes before this patch.
ARM platforms. This patch resolves this issue and unifies AUX column
processing at x86 and ARM using tempate class SimdProcessor.
The patch also replaces uint16_t mask previously used in column.cpp and
SimProcessor code with a native masks that platform uses, e.g. __m128i
or __m128 on x86 and variety of masks on ARM.
To unify the processing I introduced a new filtering Compare Operator - COMPARE_NULLEQ.
with a 'c1 IS NULL semantics'.
by default), which when enabled, indiscriminately invalidates all
column extents and performs the actual DELETE only on the AUX
column. The trade-off with this approach would now be that the
first SELECT for certain query patterns (those containing a WHERE
predicate) after the DELETE operation will slow down as the
invalidated column extent would need to be scanned again to set
the min/max values.
In the joblist code, in addition to sending the lbid of the SCAN
column, we also send the corresponding lbid of the AUX column to PrimProc.
In the primitives processor code in PrimProc, we load the AUX column
block (8192 rows since the AUX column is implemented as a 1-byte
UNSIGNED TINYINT) into memory and then pass it down to the low-level
scanning (vectorized scanning as applicable) routine to build a non-Empty
mask for the block being processed to filter out DELETED rows based on
comparison of the AUX block row to the empty magic value for the AUX column.
Short CHAR/VARCHAR column values contain integer-encoded strings.
After certain manipulations(orderSwap(strnxfrm(str))) the values
become integers that preserve original strings order relation
according to a certain translation rules(collation). Prepared
values are ready to be SIMD-processed.
The idea is relatively simple - encode prefixes of collated strings as
integers and use them to compute extents' ranges. Then we can eliminate
extents with strings.
The actual patch does have all the code there but miss one important
step: we do not keep collation index, we keep charset index. Because of
this, some of the tests in the bugfix suite fail and thus main
functionality is turned off.
The reason of this patch to be put into PR at all is that it contains
changes that made CHAR/VARCHAR columns unsigned. This change is needed in
vectorization work.
* Fix clang warnings
* Remove vim tab guides
* initialize variables
* 'strncpy' output truncated before terminating nul copying as many bytes from a string as its length
* Fix ISO C++17 does not allow 'register' storage class specifier for outdated bison
* chars are unsigned on ARM, having if (ival < 0) always false
* chars are unsigned by default on ARM and comparison with -1 if always true
data types TEXT, CHAR, VARCHAR, FLOAT and DOUBLE are not yet supported by vectorized path
This patch introduces an example for Google benchmarking suite to measure a perf diff
b/w legacy scan/filtering code and the templated version
- MCOL-4527 Simple query performace is degraded between 5.4 and 5.5
xxx_nopad_bin collations are now around 30% faster on simple queries like:
SELECT * FROM t1 WHERE short_char_column_nopad_bin = 'literal'
The gain is achieved by comparing two short CHAR values as uint64_t.
Note, this patch does not affect xxx_bin collations!
It wouldn't be correct to apply the same improvement for xxx_bin
collations (i.e. with PAD SPACE attribute), because it would change
the way how trailing spaces are compared.
- MCOL-4539 WHERE short_char_column='literal' ignores the collation on a huge table
Only the first thread used a correct collation when performing:
WHERE short_char_char='literal'
Other (15) threads used the server default collation, because
the charsetNumber attribute was not copyed during cloning.
- This patch also adds mtr/basic/suite.opt, so "mtr" can run without --extern.
1. In TupleAggregateStep::configDeliveredRowGroup(), use
jobInfo.projectionCols instead of jobInfo.nonConstCols
for setting scale and precision if the source column is
wide decimal.
2. Tighten rules for wide decimal processing. Specifically:
a. Replace (precision > INT64MAXPRECISION) checks with
(precision > INT64MAXPRECISION && precision <= INT128MAXPRECISION)
b. At places where (colWidth == MAXDECIMALWIDTH) is not enough to
determine if a column is wide decimal or not, also add a check on
type being DECIMAL/UDECIMAL.
Removed uint128 from joblist/lbidlist.*
Another toString() method for wide-decimal that is EMPTY/NULL aware
Unified decimal processing in WF functions
Fixed a potential issue in EqualCompData::operator() for
wide-decimal processing
Fixed some signedness warnings
This commit also adds support in TupleHashJoinStep::forwardCPData,
although we currently do not support wide decimals as join keys.
Row estimation to determine large-side of the join is also updated.
2. Set Decimal precision in SimpleColumn::evaluate().
3. Add support for int128_t in ConstantColumn.
4. Set IDB_Decimal::s128Value in buildDecimalColumn().
5. Use width 16 as first if predicate for branching based on decimal width.