mirror of https://github.com/mariadb-corporation/mariadb-columnstore-engine.git synced 2025-04-20 09:07:44 +03:00

88 Commits

Author SHA1 Message Date
Gagan Goel
55d4214429
MCOL-5429 Fix high memory consumption in GROUP_CONCAT() processing. (#2823)
1. Input and output RowGroups used in GROUP_CONCAT classes
are currently allocating a raw memory buffer of size equal
to the actual width of the string datatype. As an example,
for the following query:
  SELECT col1, GROUP_CONCAT(col2) FROM t GROUP BY col1;
If col2 is a TEXT field with default width, the input
RowGroup containing the target rows to be concatenated will
assign 64kb of memory for every input row in the RowGroup.
This is wasteful as actual field values in real workloads
would be much smaller. We fix this by enabling the
RowGroup to use the StringStore when the RowGroup contains
long strings.

2. RowAggregation::initialize() allocates a memory buffer
for a NULL row. The size of this buffer is equal to the
row size for the output RowGroup. For the above scenario,
using the default group_concat_max_len (which is a server
variable that sets the maximum length of the GROUP_CONCAT string)
value of 1mb, the buffer size would be
(1mb + 64kb + some additional metadata). If the user sets
group_concat_max_len to a higher value, say 3gb, this buffer
size would be ~3gb. Now if the runtime initiates several
instances of RowAggregation, total memory consumption by
PrimProc could exceed the hardware memory limits, causing the
OS OOM killer to terminate the process. We fix this problem by again
enabling the StringStore for the NULL row allocation.

3. In the plugin code in buildAggregateColumn(), there is
an integer overflow when the server group_concat_max_len
variable (which is a uint32_t) is set to a value > INT32_MAX
(such as 3gb) and is assigned to
CalpontSystemCatalog::ColType::colWidth (which is an int32_t).
As a short-term fix, we saturate the value assigned to colWidth
at INT32_MAX. The proper fix would be to upgrade
CalpontSystemCatalog::ColType::colWidth to a uint32_t.
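
The saturation described in item 3 can be sketched as follows. This is a minimal illustration, not the code from the patch: the helper name `saturateToInt32` is hypothetical, and only the types (`uint32_t` source, `int32_t` destination) come from the message above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the patch): clamp an unsigned 32-bit length
// so that it fits into a signed 32-bit width field without wrapping negative.
constexpr int32_t saturateToInt32(uint32_t value)
{
  constexpr uint32_t maxWidth =
      static_cast<uint32_t>(std::numeric_limits<int32_t>::max());
  return static_cast<int32_t>(value > maxWidth ? maxWidth : value);
}

// Values that already fit are passed through unchanged.
static_assert(saturateToInt32(1024u * 1024u) == 1024 * 1024,
              "in-range values are preserved");
```

With this helper, a 3gb setting (3221225472) would be stored as INT32_MAX rather than as a negative width.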
2023-04-22 00:43:29 +03:00
Leonid Fedorov
030144127e
Remove boost shared array [develop 23.02] (#2812)
* remove boost/shared_array include

* replace boost::shared_array<T> with std::shared_ptr<T[]>
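
A minimal sketch of what such a replacement looks like in C++17, where std::shared_ptr<T[]> provides operator[] and releases the array with delete[]. The helper name `makeZeroedBuffer` is an assumption made for illustration and does not come from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper (not from the commit): allocate a zero-initialized,
// reference-counted byte buffer. Requires C++17 for std::shared_ptr<T[]>.
inline std::shared_ptr<uint8_t[]> makeZeroedBuffer(std::size_t size)
{
  // The array form of shared_ptr uses delete[] as its default deleter.
  return std::shared_ptr<uint8_t[]>(new uint8_t[size]());
}
```

Copies of the returned pointer share ownership of the same array, as they did with boost::shared_array<T>.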
2023-04-17 20:56:09 +03:00
Leonid Fedorov
f1697c261e
MCOL-5385 set data extermination [develop-23.02] (#2813)
* Delete RowGroup::setData and make Pointer ctor explicit

* some push_backs replaced with emplace_backs

* Fixes of review notes
2023-04-16 15:57:39 +03:00
Leonid Fedorov
56f2346083 Remove windows ifdefs 2023-03-02 15:59:42 +00:00
mariadb-AndreyPiskunov
b57d2c30fe Minor fixes 2022-10-31 14:56:32 +02:00
mariadb-AndreyPiskunov
1714b75434 Non working attempt to do MCOL-5227 2022-10-31 14:56:32 +02:00
Leonid Fedorov
d2432f9bf6 get rid of pointers for 128 fields 2022-08-26 15:12:22 +00:00
mariadb-AndreyPiskunov
0863ecd279 Replace getBinaryField 2022-08-25 18:21:43 +03:00
David.Hall
272246e9fa
Merge branch 'develop' into MCOL-4841 2022-06-09 16:58:33 -05:00
david.hall
3b6449842f Merge branch 'develop' into MCOL-4841
# Conflicts:
#	exemgr/main.cpp
#	oam/etc/Columnstore.xml.singleserver
#	primitives/primproc/primproc.cpp
2022-06-09 10:07:26 -05:00
Andrey Piskunov
c7e67aedd9 Renamed variables + removed server tests 2022-06-03 15:30:25 +03:00
Andrey Piskunov
c5fa27475d Welford algorithm for STD and VAR
The naive algorithm for calculating STD and VAR is subject to catastrophic
cancellation. The well-known Welford's algorithm is used instead.
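
For reference, Welford's single-pass update can be sketched as below. This is a generic textbook form under assumed names (`WelfordAccumulator`, `add`, `variancePopulation`, `varianceSample`), not the code added by this commit.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative accumulator (names are assumptions): keeps a running mean and
// the running sum of squared deviations from that mean (m2), which avoids
// subtracting two large, nearly equal sums.
struct WelfordAccumulator
{
  uint64_t count = 0;
  double mean = 0.0;
  double m2 = 0.0;

  void add(double x)
  {
    ++count;
    const double delta = x - mean;               // deviation from the old mean
    mean += delta / static_cast<double>(count);  // update the mean
    m2 += delta * (x - mean);                    // uses old and new deviation
  }

  double variancePopulation() const
  {
    return count > 0 ? m2 / static_cast<double>(count) : 0.0;
  }

  double varianceSample() const
  {
    return count > 1 ? m2 / static_cast<double>(count - 1) : 0.0;
  }
};
```

The standard deviation is the square root of the chosen variance.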
2022-06-03 15:29:30 +03:00
Gagan Goel
973e5024d8 MCOL-4957 Fix performance slowdown for processing TIMESTAMP columns.
Part 1:
 As part of MCOL-3776 to address a synchronization issue while accessing
 the fTimeZone member of the Func class, mutex locks were added to the
 accessor and mutator methods. However, this slows down processing
 of TIMESTAMP columns in PrimProc significantly as all threads across
 all concurrently running queries would serialize on the mutex. This
 is because PrimProc only has a single global object for the functor
 class (class derived from Func in utils/funcexp/functor.h) for a given
 function name. To fix this problem:

   (1) We remove the fTimeZone as a member of the Func derived classes
   (hence removing the mutexes) and instead use the fOperationType
   member of the FunctionColumn class to propagate the timezone values
   down to the individual functor processing functions such as
   FunctionColumn::getStrVal(), FunctionColumn::getIntVal(), etc.

   (2) To achieve (1), a timezone member is added to the
   execplan::CalpontSystemCatalog::ColType class.

Part 2:
 Several functors in the Funcexp code call dataconvert::gmtSecToMySQLTime()
 and dataconvert::mySQLTimeToGmtSec() functions for conversion between seconds
 since unix epoch and broken-down representation. These functions in turn call
 the C library function localtime_r() which currently has a known bug of holding
 a global lock via a call to __tz_convert. This significantly reduces performance
 in multi-threaded applications where multiple threads concurrently call
 localtime_r(). More details on the bug:
   https://sourceware.org/bugzilla/show_bug.cgi?id=16145

 This bug in localtime_r() caused processing of the Functors in PrimProc to
 slow down significantly, since query execution causes the Functors code to be
 processed in a multi-threaded manner.

 As a fix, we remove the calls to localtime_r() from gmtSecToMySQLTime()
 and mySQLTimeToGmtSec() by performing the timezone-to-offset conversion
 (done in dataconvert::timeZoneToOffset()) during the execution plan
 creation in the plugin. Note that localtime_r() is only called when the
 time_zone system variable is set to "SYSTEM".

 This fix also required changing the timezone type from a std::string to
 a long across the system.
2022-02-14 14:12:27 -05:00
David Hall
27dea733c5 MCOL4841 dev port run large join without OOM 2022-02-09 17:33:55 -06:00
Leonid Fedorov
04752ec546 clang format apply 2022-01-21 16:43:49 +00:00
Alexey Antipovsky
6a4140394d [MCOL-4829] More accurate memory counting 2021-09-07 19:52:20 +03:00
Alexey Antipovsky
7fea3c988e [MCOL-4829] Compression for the temp disk-based aggregation files 2021-09-02 19:30:25 +03:00
Leonid Fedorov
51a8ffcb6a Fix sumavgoverflow.sql test 2021-07-09 22:41:28 +00:00
Leonid Fedorov
f81f743282 Replace underlying type for avg and sum for int types from long double to wide decimal 2021-07-08 17:04:43 +00:00
Alexander Barkov
9794f24369 MCOL-4801 Replace Row methods getStringLength() and getStringPointer() with getConstString() 2021-07-06 21:15:32 +04:00
Gagan Goel
8520f87237 MCOL-641 Cleanup. 2021-07-06 09:01:49 +00:00
Roman Nozdrin
bed0b7c6bc MCOL-4173 This patch adds support for wide-DECIMAL INNER, OUTER, SEMI, functional JOINs
based on top of TypelessData
2021-06-24 08:07:23 +00:00
Alexey Antipovsky
0dedb7e628 Fix compilation warnings 2021-06-09 16:51:00 +03:00
Alexey Antipovsky
475104e4d3 [MCOL-4709] Disk-based aggregation
* Introduce multigeneration aggregation

* Do not save unused part of RGDatas to disk
* Add IO error explanation (strerror)

* Reduce memory usage while aggregating
* introduce in-memory generations to better memory utilization

* Try to limit the qty of buckets at a low limit

* Refactor disk aggregation a bit
* pass calculated hash into RowAggregation
* try to keep some RGData with free space in memory

* do not dump more than half of rowgroups to disk if generations are
  allowed, instead start a new generation
* for each thread shift the first processed bucket at each iteration,
  so the generations start more evenly

* Unify temp data location

* Explicitly create temp subdirectories
  whether disk aggregation/join are enabled or not
2021-06-06 16:09:15 +03:00
Alexander Barkov
9608533d92 MCOL-4734 Compilation failure: MariaDB-10.6 + ColumnStore-develop
mcsconfig.h and my_config.h have the following
pre-processor definitions:

1. Conflicting definitions coming from the standard cmake definitions:
- PACKAGE
- PACKAGE_BUGREPORT
- PACKAGE_NAME
- PACKAGE_STRING
- PACKAGE_TARNAME
- PACKAGE_VERSION
- VERSION

2. Conflicting definitions of other kinds:
- HAVE_STRTOLL - this is a defect in the MariaDB headers that
  should be fixed in the server code. In some cases my_config.h erroneously
  performs "#define HAVE_STRTOLL" instead of "#define HAVE_STRTOLL 1".
  The former is not CMake-compatible style; the latter is.

3. Non-conflicting definitions:
  Otherwise, mcsconfig.h and my_config.h should be mutually compatible,
  because both are generated by cmake on the same host machine. So
  they should have exactly equal definitions like "HAVE_XXX", "SIZEOF_XXX", etc.

Observations:
- It's OK to include both mcsconfig.h and my_config.h provided that we
  suppress duplicate definitions of the above conflicting types #1 and #2.
- There is no need to suppress the duplicate definitions mentioned in #3,
  as they are compatible!
- my_sys.h and m_ctype.h must always follow a CMake configuration header,
  either my_config.h or mcsconfig.h (or both).
  They must never be included without any preceding configuration header.

This change makes sure that we resolve conflicts by:
- either disallowing inclusion of mcsconfig.h and my_config.h
  at the same time
- or by hiding conflicting definitions #1 and #2
  (and restoring them afterwards).
- also, by making sure that my_sys.h and m_ctype.h always follow
  a CMake configuration file.

Details:
- idb_mysql.h can now be included only after my_config.h.
  An attempt to use idb_mysql.h with mcsconfig.h instead of
  my_config.h is caught by the "#error" preprocessor directive.

- mariadb_my_sys.h can now be included only after mcsconfig.h.
  An attempt to use mariadb_my_sys.h without mcsconfig.h
  (e.g. with my_config.h) is also caught by "#error".

- collation.h can now be included in two ways.
  It now has the following effective structure:

    #if defined(PREFER_MY_CONFIG_H) && defined(MY_CONFIG_H)
    //  Remember current conflicting definitions on the preprocessor stack
    //  Undefine current conflicting definitions
    #endif
    #include "mcsconfig.h"
    #include "m_ctype.h"
    #if defined(PREFER_MY_CONFIG_H) && defined(MY_CONFIG_H)
    //  Restore conflicting definitions from the preprocessor stack
    #endif

  and can be included as follows:

  a. using only mcsconfig.h as a configuration header:

    // my_config.h must not be included so far
    #include "collation.h"

  b. using my_config.h as the first included configuration file:

    #define PREFER_MY_CONFIG_H // Force conflict resolution
    #include "my_config.h"     // can be included directly or indirectly
    ...
    #include "collation.h"
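
The remember/undefine/restore sequence can be sketched with the `#pragma push_macro` and `#pragma pop_macro` directives supported by GCC, Clang and MSVC. Whether the helper headers use exactly these pragmas is an assumption here; the macro values and the functions `engineName()` and `serverName()` are made up for illustration.

```cpp
#include <cassert>
#include <cstring>

// Stands in for a definition coming from the first configuration header.
#define PACKAGE_NAME "server"

#pragma push_macro("PACKAGE_NAME")  // remember the current definition
#undef PACKAGE_NAME                 // undefine it to avoid a redefinition clash

// Stands in for the conflicting definition from the second header.
#define PACKAGE_NAME "engine"
inline const char* engineName() { return PACKAGE_NAME; }

#undef PACKAGE_NAME
#pragma pop_macro("PACKAGE_NAME")   // restore the remembered definition

// Code after the restore sees the original value again.
inline const char* serverName() { return PACKAGE_NAME; }
```

Each conflicting macro needs its own push/undef/pop triple, which is presumably why the sequence is factored into separate helper headers.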

Other changes:

- Adding helper header files
     utils/common/mcsconfig_conflicting_defs_remember.h
     utils/common/mcsconfig_conflicting_defs_restore.h
     utils/common/mcsconfig_conflicting_defs_undef.h
  to perform conflict resolution easier.

- Removing `#include "collation.h"` from a number of files,
  as it's automatically included from rowgroup.h.

- Removing redundant `#include "utils_utf8.h"`.
  This change is not directly related to the problem being fixed,
  but it's nice to remove redundant directives for both collation.h
  and utils_utf8.h from all the files that do not really need them.
  (this change could probably have gone as a separate commit)

- Changing my_init() to MY_INIT(argv[0]) in the MCS services sources.
  After the fix of the compilation failure it appeared that ColumnStore
  services compiled with the debug build crash due to recent changes in
  safemalloc. The crash happened in strcmp() with `my_progname` as an argument
  (where my_progname is a mysys global variable). This problem should
  probably be fixed on the server side as well to avoid passing NULL.
  But, the majority of MariaDB executable programs also use MY_INIT(argv[0])
  rather than my_init(). So let's make MCS do like the other programs do.
2021-05-25 12:34:36 +04:00
David Hall
f4e6939139 MCOL-4643 dev 5 reset valOut after processing UDAF
After a UDAF result has been inserted in the output stream, the valOut object needs to be reset to empty in preparation for the next value. Failing to do so may cause what should be a NULL value to erroneously take the last value inserted.
2021-04-30 10:57:40 -05:00
Alexander Barkov
362bfcd15e MCOL-4361 Replace pow(10.0, (double)scale) expressions with a static dictionary lookup. 2021-04-09 12:41:04 +04:00
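
A static lookup of this kind might look like the sketch below. The function names `scaleDivisor` and `decimalToDouble` and the supported range of scales (0 to 18) are assumptions for illustration, not taken from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lookup (not from the commit): returns 10^scale from a static
// table instead of calling pow(10.0, (double)scale) for every value.
inline double scaleDivisor(std::size_t scale)
{
  static const double kPowersOf10[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,
      1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18};
  constexpr std::size_t tableSize =
      sizeof(kPowersOf10) / sizeof(kPowersOf10[0]);
  assert(scale < tableSize);  // this sketch only covers scales 0..18
  return kPowersOf10[scale];
}

// Convert an unscaled integer DECIMAL value to double using the table.
inline double decimalToDouble(int64_t unscaled, std::size_t scale)
{
  return static_cast<double>(unscaled) / scaleDivisor(scale);
}
```

The table entries are exact in double precision, so the lookup gives the same divisor as the pow() call while avoiding a libm call per row.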
David Hall
0eee6cfc62 MCOL-4643 reset valOut after UDAF evaluation 2021-03-26 16:09:15 -05:00
Roman Nozdrin
5b9689ce55 MCOL-4478 MCS now rounds the last digits of an avg() result for wide-DECIMAL argument 2020-12-30 15:02:12 +00:00
Roman Nozdrin
5815c5c526 MCOL-4452 RowAggregationUMP2::doUDAF() now calls setUserData() using a correct UDAF context 2020-12-22 15:43:51 +00:00
Roman Nozdrin
178be69bc4 MCOL-4394 __float128 related code had been moved into a separate file
Trim to double and to long double conversions for Decimal
2020-11-19 12:08:18 +00:00
Roman Nozdrin
c00daa93bd MCOL-4172 MultiDistinctRowAggregation didn't honor multiple UDAF in projection
::doUDAF() doesn't crash anymore trying to access fRGContextColl[] elements
that don't exist when running RowAggregationMultiDistinct::doAggregate()
2020-11-18 13:53:15 +00:00
Roman Nozdrin
3eb26c0d4a MCOL-4313 Introduced TSInt128 that is a storage class for int128
Removed uint128 from joblist/lbidlist.*

Another toString() method for wide-decimal that is EMPTY/NULL aware

Unified decimal processing in WF functions

Fixed a potential issue in EqualCompData::operator() for
    wide-decimal processing

Fixed some signedness warnings
2020-11-18 13:53:15 +00:00
Alexander Barkov
d5c6645ba1 Adding mcs_basic_types.h
For now it consists of only:

using int128_t = __int128;
using uint128_t = unsigned __int128;

All new primitive data types should go into this file in the future.
2020-11-18 13:53:15 +00:00
Alexander Barkov
129d5b5a0f MCOL-4174 Review/refactor frontend/connector code 2020-11-18 13:53:15 +00:00
Roman Nozdrin
8de9764f84 MCOL-4172 Add support for wide-DECIMAL into statistical aggregate and regr_* UDAF functions
The patch fixes wrong results returned when multiple UDAF exist in projection

aggregate over wide decimal literals now works
2020-11-18 13:52:20 +00:00
David Hall
638202417f MCOL-4171 2020-11-18 13:52:19 +00:00
Roman Nozdrin
bd0d5af123 Merge fixes. 2020-11-18 13:51:26 +00:00
Roman Nozdrin
eeebe83839 MCOL-641 Fixed the incorrect if-condition. 2020-11-18 13:51:26 +00:00
Roman Nozdrin
21a41738e1 MCOL-641 Simple aggregates work with GROUP BY column keys.
Fixed constant column copy for binary columns in TNS.
2020-11-18 13:51:26 +00:00
Roman Nozdrin
e88cbe9bc1 MCOL-641 Simple aggregates support: min, max, sum, avg for wide-DECIMALs. 2020-11-18 13:51:25 +00:00
Roman Nozdrin
97ee1609b2 MCOL-641 Replaced NULL binary constants.
DataConvert::decimalToString, toString, writeIntPart, writeFractionalPart are not templates anymore.
2020-11-18 13:47:44 +00:00
Roman Nozdrin
de85e21c38 MCOL-641 This commit cleans up Row methods and adds a couple of unit tests for Row. 2020-11-18 13:47:02 +00:00
Roman Nozdrin
f73de30427 MCOL-641 This commit introduces GTest Suite into CS.
Binary NULL magic now consists of a series of BINARYEMPTYROW-s + BINARYNULL
in the end.

ByteStream now has hexbyte alias.

Added ColumnCommand::getEmptyRowValue to support 16 byte EMPTY values.
2020-11-18 13:47:01 +00:00
drrtuy
84f9821720 MCOL-641 Switched to DataConvert static methods in joblist code.
Replaced BINARYEMPTYROW and BINARYNULL values. We need to have
separate magic values for numeric and non-numeric binary types
because numeric types can't tolerate losing 0, which was used for magics previously.

atoi128() now parses minus sign and produces negative values.

RowAggregation::isNull() now uses Row::isNull() for DECIMAL.
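
Sign-aware parsing of a 128-bit integer, as mentioned for atoi128() above, can be sketched as follows. The function `parseInt128` is a hypothetical stand-in, not the engine's atoi128(); it relies on the GCC/Clang `__int128` extension and does no overflow checking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using int128_t = __int128;  // GCC/Clang extension

// Hypothetical parser (not the engine's atoi128): accepts an optional sign
// followed by decimal digits and stops at the first non-digit character.
// No overflow checking is performed in this sketch.
inline int128_t parseInt128(const std::string& text)
{
  std::size_t pos = 0;
  bool negative = false;
  if (pos < text.size() && (text[pos] == '-' || text[pos] == '+'))
  {
    negative = (text[pos] == '-');
    ++pos;
  }

  int128_t value = 0;
  for (; pos < text.size() && text[pos] >= '0' && text[pos] <= '9'; ++pos)
    value = value * 10 + (text[pos] - '0');

  return negative ? -value : value;
}
```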
2020-11-18 13:47:01 +00:00
drrtuy
0ff0472842 MCOL-641 sum() now works with DECIMAL(38) columns.
TupleAggregateStep class method and buildAggregateColumn() now properly set result data type.

doSum() now handles DECIMAL(38) in an appropriate manner.

Low-level null-related methods for new binary-based datatypes now handle magic values for
binary-based DT.
2020-11-18 13:47:01 +00:00
Alexey Antipovsky
b25fee320a Remove variable-length arrays (-Wvla) 2020-11-17 15:03:10 +03:00
Gagan Goel
2ba9263df4 Silence -Werror=implicit-fallthrough compiler errors - Patch from Monty.
The patch also fixes some potential bugs due to missing break
statements.
2020-06-26 12:32:57 -04:00
David Hall
f9078efbc6 MCOL-3536 Collation 2020-06-08 17:57:37 -05:00
David Hall
78ac310e42 MCOL-3536 Collation 2020-06-01 15:08:15 -05:00