mirror of https://github.com/mariadb-corporation/mariadb-columnstore-engine.git synced 2025-11-18 10:40:57 +03:00

David Hall 5dff25e0a0 MCOL-2180 Update UDAF docs

2019-03-06 11:52:01 -06:00

Source file

Usually, each UDA(n)F function will have just one .cpp file. Be sure to write your header file first. It's much easier to implement the various parts if you have a template to work from.

The easiest way to create these files is to copy them from the example closest to the type of function you intend to create.

You need a data structure to hold your aggregate values. You can use either simpledatamodel or complexdatamodel.

You may only need a few accumulators and counters. These can be represented as a fixed-size data structure. For these needs, you may choose simpledatamodel. Here's a struct for a possible AVG function:

struct AVGdata
{
    uint64_t    total;
    uint64_t    count;
};

If you have a more complex data structure that may vary in size, you must use complexdatamodel. This should be defined in the header. Here's the struct for the MEDIAN example from median.h:

../../../median.h

In each of the functions that have a context parameter, you should type cast the data member of context's UserData member to your own struct:

struct AVGdata* data = (struct AVGdata*)context->getUserData()->data;

Or, if using the complexdatamodel, type cast the UserData to your UserData derived struct:

MedianData* data = static_cast<MedianData*>(context->getUserData());
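To see why the simple-model cast works, here is a minimal standalone sketch. FakeUserData and addValue are hypothetical stand-ins, not SDK types; the sketch assumes only that the simple model hands you a zero-filled byte buffer of the size you asked for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size aggregate state, as in the AVG example.
struct AVGdata
{
    uint64_t total;
    uint64_t count;
};

// Hypothetical stand-in for the SDK's UserData in the simple data model:
// an opaque, zero-filled byte buffer reachable through a `data` pointer.
struct FakeUserData
{
    explicit FakeUserData(std::size_t size) : storage(size, 0), data(storage.data()) {}
    FakeUserData(const FakeUserData&) = delete;
    FakeUserData& operator=(const FakeUserData&) = delete;

    std::vector<uint8_t> storage;  // owns the bytes
    uint8_t* data;                 // what the UDAF methods see
};

// Interpret the byte buffer as AVGdata and fold in one value.
void addValue(FakeUserData& userData, uint64_t value)
{
    // The buffer must be at least as large as the struct we cast it to.
    assert(userData.storage.size() >= sizeof(AVGdata));
    AVGdata* data = reinterpret_cast<AVGdata*>(userData.data);
    data->total += value;
    ++data->count;
}
```

Because the struct has a fixed size and no pointers, the whole state is just those bytes, which is what makes it cheap to hand around.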

init()

param context: The context object for this call.
param colTypes: A list of the ColumnDatum used to access column types of the parameters. In init(), the columnData member is invalid.

See ColumnDatum. In Columnstore 1.2, an arbitrary number of parameters is supported.

returns: ReturnCode::ERROR or ReturnCode::SUCCESS

The init() method is where you sanity check the input datatypes, set the output type and set any run flags for this instance. init() is called one time from the mysqld process. All settings you do here are propagated through the system.

init() is the exception to type casting the UserData member of context. UserData has not been created when init() is called, so you shouldn't use it here.

Set User Data Size

If you're using simpledatamodel, you need to set the size of the structure:

context->setUserDataSize(sizeof(allnull_data));

Check parameter count and type

Each function expects a certain number of columns to be entered as parameters in the SQL query. It is possible to create a UDAF that accepts a variable number of parameters. You can discover which ones were actually used in init(), and modify your function's behavior accordingly.

colTypes is an array of ColumnDatum from which the type and name of each parameter can be gleaned. The name is the column name from the SQL query. You can use this information to sanity check for compatible type(s) and also to modify your function's behavior based on type. To do this, add members to your data struct to be tested in the other methods. Set these members based on the column data types (see ColDataTypes).

The actual number of parameters passed can be obtained from context->getParameterCount():

if (context->getParameterCount() < 1)
{
    // The error message will be prepended with
    // "The storage engine for the table doesn't support "
    context->setErrorMessage("allnull() with 0 arguments");
    return mcsv1_UDAF::ERROR;
}
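The same check can be tried outside the server with a small mock. This is only a sketch: FakeContext, Rc and checkParameters are hypothetical stand-ins, and only the method names getParameterCount() and setErrorMessage() are taken from the text above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the SDK's return codes.
enum class Rc { Error, Success };

// Hypothetical stand-in for the context object passed to init().
struct FakeContext
{
    int paramCount = 0;
    std::string errorMessage;

    int getParameterCount() const { return paramCount; }
    void setErrorMessage(const std::string& msg) { errorMessage = msg; }
};

// Reject a call with no parameters, mirroring the allnull() check.
Rc checkParameters(FakeContext* context)
{
    assert(context != nullptr);
    if (context->getParameterCount() < 1)
    {
        context->setErrorMessage("allnull() with 0 arguments");
        return Rc::Error;
    }
    return Rc::Success;
}
```

A function that accepts a variable number of parameters would test against a range instead of a single lower bound.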

Set the ResultType

When you create your function using the SQL CREATE FUNCTION command, you must include a result type in the command. However, you're not completely limited by that decision. You may choose to return a different type based on any number of factors, including the colTypes. setResultType accepts any of the CalpontSystemCatalog::ColDataType enum values (see ColDataTypes).

context->setResultType(CalpontSystemCatalog::TINYINT);

Set width and scale

If you have special requirements, especially if you might be dealing with decimal types, you can also set the column width, scale and precision of the result:

context->setColWidth(8);
context->setScale(context->getScale()*2);
context->setPrecision(19);
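One plausible reason to double the scale, as the snippet above does, is an aggregate that multiplies decimal values together (a sum of squares, for instance). That motive is an assumption, not something stated here. The sketch below uses a hypothetical Decimal struct, unrelated to any SDK type, to show that the scales of fixed-point operands add when you multiply them.

```cpp
#include <cassert>
#include <cstdint>

// A fixed-point decimal: the real value is unscaled / 10^scale.
// Hypothetical illustration type, not part of the SDK.
struct Decimal
{
    int64_t unscaled;
    int scale;
};

// Multiply two fixed-point values. The unscaled parts multiply and the
// scales add, so squaring a scale-2 value yields a scale-4 result.
Decimal multiply(const Decimal& a, const Decimal& b)
{
    return Decimal{a.unscaled * b.unscaled, a.scale + b.scale};
}

// Convert to double, for display only.
double toDouble(const Decimal& d)
{
    assert(d.scale >= 0);
    double divisor = 1.0;
    for (int i = 0; i < d.scale; ++i)
    {
        divisor *= 10.0;
    }
    return static_cast<double>(d.unscaled) / divisor;
}
```

For example, 1.25 stored as 125 with scale 2, multiplied by itself, gives 15625 with scale 4, which is 1.5625.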

Set runflags

There are a number of run flags that you can set. Most are for use as an analytic function (Window Function), but a useful one for all functions is UDAF_IGNORE_NULLS. See Run Flags for a complete list:

context->setRunFlag(mcsv1sdk::UDAF_IGNORE_NULLS);

reset()

param context: The context object for this call.
returns: ReturnCode::ERROR or ReturnCode::SUCCESS

The reset() method initializes the context for a new aggregation or sub-aggregation.

Type cast the UserData as described above, then initialize the data in whatever way you need to:

data->mData.clear();

This function may be called multiple times from both the UM and the PM. Make no assumptions about useful data in UserData from call to call.

nextValue()

param context: The context object for this call
param valsIn: an array representing the values to be added for each parameter for this row.
returns: ReturnCode::ERROR or ReturnCode::SUCCESS

nextValue() is called from the PM for aggregate usage and the UM for Analytic usage.

valsIn contains a vector of all the parameters from the function call in the SQL query.

Depending on your function, you may wish to handle many different types of input. There's a helper template function, convertAnyTo(), which converts the input static_any::any value to the designated type. For example, if your internal accumulator is of type double, you might use:

static_any::any& valIn = valsIn[0].columnData;
struct AVGdata& data = *(struct AVGdata*)context->getUserData()->data;
double val = 0.0;

if (valIn.empty())
{
    return mcsv1_UDAF::SUCCESS; // Ought not happen when UDAF_IGNORE_NULLS is on.
}

val = convertAnyTo<double>(valIn);

Once you've gotten your data in a format you like, do your aggregation. For AVG, you might see:

data.total += val;
++data.count;
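The accumulate step can be exercised on its own with a small mock. In this sketch FakeAny stands in for static_any::any and toDouble for convertAnyTo<double>(). Both are hypothetical and far simpler than the real ones, and DoubleAvgData is an assumed variant of the AVG struct with a double accumulator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for static_any::any: empty, an int64_t or a double.
struct FakeAny
{
    enum Kind { EMPTY, INT64, DOUBLE };
    Kind kind;
    int64_t i;
    double d;

    FakeAny() : kind(EMPTY), i(0), d(0.0) {}
    static FakeAny fromInt(int64_t v) { FakeAny a; a.kind = INT64; a.i = v; return a; }
    static FakeAny fromDouble(double v) { FakeAny a; a.kind = DOUBLE; a.d = v; return a; }
    bool empty() const { return kind == EMPTY; }
};

// Stand-in for convertAnyTo<double>(): whatever was stored comes back as double.
double toDouble(const FakeAny& v)
{
    assert(!v.empty());
    return v.kind == FakeAny::INT64 ? static_cast<double>(v.i) : v.d;
}

// AVG state with a double accumulator (assumed variant of AVGdata).
struct DoubleAvgData
{
    double total;
    uint64_t count;
};

// Fold one input value into the running state, skipping NULLs.
void nextValue(DoubleAvgData& data, const FakeAny& valIn)
{
    if (valIn.empty())
    {
        return;  // NULL input: nothing to add
    }
    data.total += toDouble(valIn);
    ++data.count;
}
```

The point of converting first is that the accumulation code stays the same whatever column type the query passes in.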

subEvaluate

param context: The context object for this call
param userDataIn: A UserData struct representing the sub-aggregation
returns: ReturnCode::ERROR or ReturnCode::SUCCESS

subEvaluate() is called on the UM for the consolidation of the subaggregations from the PM. The sub-aggregate from the PM is in userDataIn and the result is to be placed into the UserData struct of context. In this case, you need to type cast userDataIn in a similar fashion as you do the context's UserData struct.

For AVG, you might see:

struct AVGdata* outData = (struct AVGdata*)context->getUserData()->data;
struct AVGdata* inData = (struct AVGdata*)userDataIn->data;
outData->total += inData->total;
outData->count += inData->count;
return mcsv1_UDAF::SUCCESS;
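The merge has to give the same answer as aggregating all the rows in one place. The standalone sketch below checks that for AVG. The helper names aggregate and merge are hypothetical; in a real UDAF that work is done by nextValue() and subEvaluate().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AVGdata
{
    uint64_t total;
    uint64_t count;
};

// What one PM does with its share of the rows (the nextValue loop).
AVGdata aggregate(const std::vector<uint64_t>& rows)
{
    AVGdata data = {0, 0};
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        data.total += rows[i];
        ++data.count;
    }
    return data;
}

// What the UM does with each partial result (the subEvaluate step).
void merge(AVGdata& out, const AVGdata& in)
{
    assert(&out != &in);  // merging a state into itself would double it
    out.total += in.total;
    out.count += in.count;
}
```

Note that the partial results carry total and count, not a partial average: averages of averages are not mergeable, sums and counts are.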

evaluate

param context: The context object for this call
param valOut [out]: The final value for this GROUP or WINDOW.
returns: ReturnCode::ERROR or ReturnCode::SUCCESS

evaluate() is where you do your final calculations. It's pretty straightforward and is seldom different for UDAF (aggregation) or UDAnF (analytic).

For AVG, you might see:

int64_t avg;
struct AVGdata* data = (struct AVGdata*)context->getUserData()->data;
avg = data->total / data->count;
valOut = avg;
return mcsv1_UDAF::SUCCESS;
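Two things to watch for in the final step are integer truncation and an empty group. Here is a minimal sketch. The name evaluateAvg and its bool return are hypothetical (in the SDK you would assign to valOut and return a ReturnCode), and leaving the output untouched for an empty group is just one possible choice.

```cpp
#include <cassert>
#include <cstdint>

struct AVGdata
{
    uint64_t total;
    uint64_t count;
};

// Compute the average as a double. Returns false and leaves valOut
// untouched when no rows were aggregated.
bool evaluateAvg(const AVGdata* data, double& valOut)
{
    assert(data != nullptr);
    if (data->count == 0)
    {
        return false;  // avoid dividing by zero
    }
    // Convert before dividing. total / count on two integers truncates.
    valOut = static_cast<double>(data->total) / static_cast<double>(data->count);
    return true;
}
```

With total 7 and count 2, the integer division yields 3, while the converted division yields 3.5.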

dropValue

param context: The context object for this call
param valsDropped: a vector representing the values to be dropped for each parameter for this row.

dropValue is an optional method for optimizing UDAnF (Analytic Functions). When used as an aggregate UDAF, dropValue isn't called.

As a window moves, some values come into scope and some values leave scope. When values leave scope, dropValue is called so that we don't have to recalculate the whole window. We just need to undo what was done in nextValue for the dropped entries.

Like nextValue, your function may be able to handle a whole range of data types. For AVG, you might have:

static_any::any& valIn = valsDropped[0].columnData;
struct AVGdata& data = *(struct AVGdata*)context->getUserData()->data;
int64_t val = 0;

if (valIn.empty())
{
    return mcsv1_UDAF::SUCCESS; // Ought not happen when UDAF_IGNORE_NULLS is on.
}

if (valIn.compatible(charTypeId))
{
    val = valIn.cast<char>();
}
else if (valIn.compatible(scharTypeId))
{
    val = valIn.cast<signed char>();
}
else if (valIn.compatible(shortTypeId))
{
    val = valIn.cast<short>();
}
.
.
.

data.total -= val;
--data.count;

return mcsv1_UDAF::SUCCESS;
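To see what dropValue saves, here is a standalone sketch of a moving average over a simple trailing window of fixed width. The window shape and the helper names are assumptions for illustration; the real window frame is whatever the SQL query specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AVGdata
{
    int64_t total;
    uint64_t count;
};

// A row enters the window.
void nextValue(AVGdata& data, int64_t val)
{
    data.total += val;
    ++data.count;
}

// A row leaves the window: undo exactly what nextValue did for it.
void dropValue(AVGdata& data, int64_t val)
{
    data.total -= val;
    --data.count;
}

// Average of the last `width` rows at each position, using one add and
// at most one drop per row instead of re-summing the whole window.
std::vector<double> slidingAverage(const std::vector<int64_t>& rows, std::size_t width)
{
    assert(width >= 1);
    std::vector<double> result;
    AVGdata data = {0, 0};
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        nextValue(data, rows[i]);
        if (i >= width)
        {
            dropValue(data, rows[i - width]);
        }
        result.push_back(static_cast<double>(data.total) / static_cast<double>(data.count));
    }
    return result;
}
```

This only works for aggregates whose effect can be undone. AVG and SUM qualify; something like MAX does not without keeping extra state.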

createUserData

param userData [out]: A pointer to be allocated by the function.
param length [out]: The length of the data allocated.
returns: ReturnCode::ERROR or ReturnCode::SUCCESS

See the chapter on complexdatamodel for more information on how to use this method.
