1
0
mirror of https://github.com/postgres/postgres.git synced 2025-07-31 22:04:40 +03:00

Support GROUPING SETS, CUBE and ROLLUP.

This SQL standard functionality allows to aggregate data by different
GROUP BY clauses at once. Each grouping set returns rows with columns
grouped by in other sets set to NULL.

This could previously be achieved by doing each grouping as a separate
query, conjoined by UNION ALLs. Besides being considerably more concise,
grouping sets will in many cases be faster, requiring only one scan over
the underlying data.

The current implementation of grouping sets only supports using sorting
for input. Individual sets that share a sort order are computed in one
pass. If there are sets that don't share a sort order, additional sort &
aggregation steps are performed. These additional passes are sourced by
the previous sort step; thus avoiding repeated scans of the source data.

The code is structured in a way that adding support for purely using
hash aggregation or a mix of hashing and sorting is possible. Sorting
was chosen to be supported first, as it is the most generic method of
implementation.

Instead of, as in an earlier versions of the patch, representing the
chain of sort and aggregation steps as full blown planner and executor
nodes, all but the first sort are performed inside the aggregation node
itself. This avoids the need to do some unusual gymnastics to handle
having to return aggregated and non-aggregated tuples from underlying
nodes, as well as having to shut down underlying nodes early to limit
memory usage.  The optimizer still builds Sort/Agg node to describe each
phase, but they're not part of the plan tree, but instead additional
data for the aggregation node. They're a convenient and preexisting way
to describe aggregation and sorting.  The first (and possibly only) sort
step is still performed as a separate execution step. That retains
similarity with existing group by plans, makes rescans fairly simple,
avoids very deep plans (leading to slow explains) and easily allows to
avoid the sorting step if the underlying data is sorted by other means.

A somewhat ugly side of this patch is having to deal with a grammar
ambiguity between the new CUBE keyword and the cube extension/functions
named cube (and rollup). To avoid breaking existing deployments of the
cube extension it has not been renamed, neither has cube been made a
reserved keyword. Instead precedence hacking is used to make GROUP BY
cube(..) refer to the CUBE grouping sets feature, and not the function
cube(). To actually group by a function cube(), unlikely as that might
be, the function name has to be quoted.

Needs a catversion bump because stored rules may change.

Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund
Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas
    Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule
Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
This commit is contained in:
Andres Freund
2015-05-16 03:40:59 +02:00
parent 6e4415c6aa
commit f3d3118532
63 changed files with 5255 additions and 618 deletions

View File

@ -12228,7 +12228,9 @@ NULL baz</literallayout>(3 rows)</entry>
<xref linkend="functions-aggregate-statistics-table">.
The built-in ordered-set aggregate functions
are listed in <xref linkend="functions-orderedset-table"> and
<xref linkend="functions-hypothetical-table">.
<xref linkend="functions-hypothetical-table">. Grouping operations,
which are closely related to aggregate functions, are listed in
<xref linkend="functions-grouping-table">.
The special syntax considerations for aggregate
functions are explained in <xref linkend="syntax-aggregates">.
Consult <xref linkend="tutorial-agg"> for additional introductory
@ -13326,6 +13328,72 @@ SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
to the rule specified in the <literal>ORDER BY</> clause.
</para>
<table id="functions-grouping-table">
<title>Grouping Operations</title>
<tgroup cols="3">
<thead>
<row>
<entry>Function</entry>
<entry>Return Type</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<indexterm>
<primary>GROUPING</primary>
</indexterm>
<function>GROUPING(<replaceable class="parameter">args...</replaceable>)</function>
</entry>
<entry>
<type>integer</type>
</entry>
<entry>
Integer bitmask indicating which arguments are not being included in the current
grouping set
</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
Grouping operations are used in conjunction with grouping sets (see
<xref linkend="queries-grouping-sets">) to distinguish result rows. The
arguments to the <literal>GROUPING</> operation are not actually evaluated,
but they must match exactly expressions given in the <literal>GROUP BY</>
clause of the associated query level. Bits are assigned with the rightmost
argument being the least-significant bit; each bit is 0 if the corresponding
expression is included in the grouping criteria of the grouping set generating
the result row, and 1 if it is not. For example:
<screen>
<prompt>=&gt;</> <userinput>SELECT * FROM items_sold;</>
make | model | sales
-------+-------+-------
Foo | GT | 10
Foo | Tour | 20
Bar | City | 15
Bar | Sport | 5
(4 rows)
<prompt>=&gt;</> <userinput>SELECT make, model, GROUPING(make,model), sum(sales) FROM items_sold GROUP BY ROLLUP(make,model);</>
make | model | grouping | sum
-------+-------+----------+-----
Foo | GT | 0 | 10
Foo | Tour | 0 | 20
Bar | City | 0 | 15
Bar | Sport | 0 | 5
Foo | | 1 | 30
Bar | | 1 | 20
| | 3 | 50
(7 rows)
</screen>
</para>
</sect1>
<sect1 id="functions-window">

View File

@ -1183,6 +1183,181 @@ SELECT product_id, p.name, (sum(s.units) * (p.price - p.cost)) AS profit
</para>
</sect2>
<sect2 id="queries-grouping-sets">
<title><literal>GROUPING SETS</>, <literal>CUBE</>, and <literal>ROLLUP</></title>
<indexterm zone="queries-grouping-sets">
<primary>GROUPING SETS</primary>
</indexterm>
<indexterm zone="queries-grouping-sets">
<primary>CUBE</primary>
</indexterm>
<indexterm zone="queries-grouping-sets">
<primary>ROLLUP</primary>
</indexterm>
<para>
More complex grouping operations than those described above are possible
using the concept of <firstterm>grouping sets</>. The data selected by
the <literal>FROM</> and <literal>WHERE</> clauses is grouped separately
by each specified grouping set, aggregates computed for each group just as
for simple <literal>GROUP BY</> clauses, and then the results returned.
For example:
<screen>
<prompt>=&gt;</> <userinput>SELECT * FROM items_sold;</>
brand | size | sales
-------+------+-------
Foo | L | 10
Foo | M | 20
Bar | M | 15
Bar | L | 5
(4 rows)
<prompt>=&gt;</> <userinput>SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());</>
brand | size | sum
-------+------+-----
Foo | | 30
Bar | | 20
| L | 15
| M | 35
| | 50
(5 rows)
</screen>
</para>
<para>
Each sublist of <literal>GROUPING SETS</> may specify zero or more columns
or expressions and is interpreted the same way as though it were directly
in the <literal>GROUP BY</> clause. An empty grouping set means that all
rows are aggregated down to a single group (which is output even if no
input rows were present), as described above for the case of aggregate
functions with no <literal>GROUP BY</> clause.
</para>
<para>
References to the grouping columns or expressions are replaced
by <literal>NULL</> values in result rows for grouping sets in which those
columns do not appear. To distinguish which grouping a particular output
row resulted from, see <xref linkend="functions-grouping-table">.
</para>
<para>
A shorthand notation is provided for specifying two common types of grouping set.
A clause of the form
<programlisting>
ROLLUP ( <replaceable>e1</>, <replaceable>e2</>, <replaceable>e3</>, ... )
</programlisting>
represents the given list of expressions and all prefixes of the list including
the empty list; thus it is equivalent to
<programlisting>
GROUPING SETS (
( <replaceable>e1</>, <replaceable>e2</>, <replaceable>e3</>, ... ),
...
( <replaceable>e1</>, <replaceable>e2</> )
( <replaceable>e1</> )
( )
)
</programlisting>
This is commonly used for analysis over hierarchical data; e.g. total
salary by department, division, and company-wide total.
</para>
<para>
A clause of the form
<programlisting>
CUBE ( <replaceable>e1</>, <replaceable>e2</>, ... )
</programlisting>
represents the given list and all of its possible subsets (i.e. the power
set). Thus
<programlisting>
CUBE ( a, b, c )
</programlisting>
is equivalent to
<programlisting>
GROUPING SETS (
( a, b, c ),
( a, b ),
( a, c ),
( a ),
( b, c ),
( b ),
( c ),
( ),
)
</programlisting>
</para>
<para>
The individual elements of a <literal>CUBE</> or <literal>ROLLUP</>
clause may be either individual expressions, or sub-lists of elements in
parentheses. In the latter case, the sub-lists are treated as single
units for the purposes of generating the individual grouping sets.
For example:
<programlisting>
CUBE ( (a,b), (c,d) )
</programlisting>
is equivalent to
<programlisting>
GROUPING SETS (
( a, b, c, d )
( a, b )
( c, d )
( )
)
</programlisting>
and
<programlisting>
ROLLUP ( a, (b,c), d )
</programlisting>
is equivalent to
<programlisting>
GROUPING SETS (
( a, b, c, d )
( a, b, c )
( a )
( )
)
</programlisting>
</para>
<para>
The <literal>CUBE</> and <literal>ROLLUP</> constructs can be used either
directly in the <literal>GROUP BY</> clause, or nested inside a
<literal>GROUPING SETS</> clause. If one <literal>GROUPING SETS</> clause
is nested inside another, the effect is the same as if all the elements of
the inner clause had been written directly in the outer clause.
</para>
<para>
If multiple grouping items are specified in a single <literal>GROUP BY</>
clause, then the final list of grouping sets is the cross product of the
individual items. For example:
<programlisting>
GROUP BY a, CUBE(b,c), GROUPING SETS ((d), (e))
</programlisting>
is equivalent to
<programlisting>
GROUP BY GROUPING SETS (
(a,b,c,d), (a,b,c,e),
(a,b,d), (a,b,e),
(a,c,d), (a,c,e),
(a,d), (a,e)
)
</programlisting>
</para>
<note>
<para>
The construct <literal>(a,b)</> is normally recognized in expressions as
a <link linkend="sql-syntax-row-constructors">row constructor</link>.
Within the <literal>GROUP BY</> clause, this does not apply at the top
levels of expressions, and <literal>(a,b)</> is parsed as a list of
expressions as described above. If for some reason you <emphasis>need</>
a row constructor in a grouping expression, use <literal>ROW(a,b)</>.
</para>
</note>
</sect2>
<sect2 id="queries-window">
<title>Window Function Processing</title>

View File

@ -37,7 +37,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
[ * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
[ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ]
[ WHERE <replaceable class="parameter">condition</replaceable> ]
[ GROUP BY <replaceable class="parameter">expression</replaceable> [, ...] ]
[ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
[ HAVING <replaceable class="parameter">condition</replaceable> [, ...] ]
[ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] <replaceable class="parameter">select</replaceable> ]
@ -60,6 +60,15 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
[ WITH ORDINALITY ] [ [ AS ] <replaceable class="parameter">alias</replaceable> [ ( <replaceable class="parameter">column_alias</replaceable> [, ...] ) ] ]
<replaceable class="parameter">from_item</replaceable> [ NATURAL ] <replaceable class="parameter">join_type</replaceable> <replaceable class="parameter">from_item</replaceable> [ ON <replaceable class="parameter">join_condition</replaceable> | USING ( <replaceable class="parameter">join_column</replaceable> [, ...] ) ]
<phrase>and <replaceable class="parameter">grouping_element</replaceable> can be one of:</phrase>
( )
<replaceable class="parameter">expression</replaceable>
( <replaceable class="parameter">expression</replaceable> [, ...] )
ROLLUP ( { <replaceable class="parameter">expression</replaceable> | ( <replaceable class="parameter">expression</replaceable> [, ...] ) } [, ...] )
CUBE ( { <replaceable class="parameter">expression</replaceable> | ( <replaceable class="parameter">expression</replaceable> [, ...] ) } [, ...] )
GROUPING SETS ( <replaceable class="parameter">grouping_element</replaceable> [, ...] )
<phrase>and <replaceable class="parameter">with_query</replaceable> is:</phrase>
<replaceable class="parameter">with_query_name</replaceable> [ ( <replaceable class="parameter">column_name</replaceable> [, ...] ) ] AS ( <replaceable class="parameter">select</replaceable> | <replaceable class="parameter">values</replaceable> | <replaceable class="parameter">insert</replaceable> | <replaceable class="parameter">update</replaceable> | <replaceable class="parameter">delete</replaceable> )
@ -665,22 +674,34 @@ WHERE <replaceable class="parameter">condition</replaceable>
<para>
The optional <literal>GROUP BY</literal> clause has the general form
<synopsis>
GROUP BY <replaceable class="parameter">expression</replaceable> [, ...]
GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
</synopsis>
</para>
<para>
<literal>GROUP BY</literal> will condense into a single row all
selected rows that share the same values for the grouped
expressions. <replaceable
class="parameter">expression</replaceable> can be an input column
name, or the name or ordinal number of an output column
(<command>SELECT</command> list item), or an arbitrary
expressions. An <replaceable
class="parameter">expression</replaceable> used inside a
<replaceable class="parameter">grouping_element</replaceable>
can be an input column name, or the name or ordinal number of an
output column (<command>SELECT</command> list item), or an arbitrary
expression formed from input-column values. In case of ambiguity,
a <literal>GROUP BY</literal> name will be interpreted as an
input-column name rather than an output column name.
</para>
<para>
If any of <literal>GROUPING SETS</>, <literal>ROLLUP</> or
<literal>CUBE</> are present as grouping elements, then the
<literal>GROUP BY</> clause as a whole defines some number of
independent <replaceable>grouping sets</>. The effect of this is
equivalent to constructing a <literal>UNION ALL</> between
subqueries with the individual grouping sets as their
<literal>GROUP BY</> clauses. For further details on the handling
of grouping sets see <xref linkend="queries-grouping-sets">.
</para>
<para>
Aggregate functions, if any are used, are computed across all rows
making up each group, producing a separate value for each group.