Support GROUPING SETS, CUBE and ROLLUP.

This SQL standard functionality allows to aggregate data by different GROUP BY clauses at once. Each grouping set returns rows with columns grouped by in other sets set to NULL. This could previously be achieved by doing each grouping as a separate query, conjoined by UNION ALLs. Besides being considerably more concise, grouping sets will in many cases be faster, requiring only one scan over the underlying data. The current implementation of grouping sets only supports using sorting for input. Individual sets that share a sort order are computed in one pass. If there are sets that don't share a sort order, additional sort & aggregation steps are performed. These additional passes are sourced by the previous sort step; thus avoiding repeated scans of the source data. The code is structured in a way that adding support for purely using hash aggregation or a mix of hashing and sorting is possible. Sorting was chosen to be supported first, as it is the most generic method of implementation. Instead of, as in an earlier versions of the patch, representing the chain of sort and aggregation steps as full blown planner and executor nodes, all but the first sort are performed inside the aggregation node itself. This avoids the need to do some unusual gymnastics to handle having to return aggregated and non-aggregated tuples from underlying nodes, as well as having to shut down underlying nodes early to limit memory usage. The optimizer still builds Sort/Agg node to describe each phase, but they're not part of the plan tree, but instead additional data for the aggregation node. They're a convenient and preexisting way to describe aggregation and sorting. The first (and possibly only) sort step is still performed as a separate execution step. That retains similarity with existing group by plans, makes rescans fairly simple, avoids very deep plans (leading to slow explains) and easily allows to avoid the sorting step if the underlying data is sorted by other means. A somewhat ugly side of this patch is having to deal with a grammar ambiguity between the new CUBE keyword and the cube extension/functions named cube (and rollup). To avoid breaking existing deployments of the cube extension it has not been renamed, neither has cube been made a reserved keyword. Instead precedence hacking is used to make GROUP BY cube(..) refer to the CUBE grouping sets feature, and not the function cube(). To actually group by a function cube(), unlikely as that might be, the function name has to be quoted. Needs a catversion bump because stored rules may change. Author: Andrew Gierth and Atri Sharma, with contributions from Andres Freund Reviewed-By: Andres Freund, Noah Misch, Tom Lane, Svenne Krap, Tomas Vondra, Erik Rijkers, Marti Raudsepp, Pavel Stehule Discussion: CAOeZVidmVRe2jU6aMk_5qkxnB7dfmPROzM7Ur8JPW5j8Y5X-Lw@mail.gmail.com
2025-07-31 22:04:40 +03:00 · 2015-05-16 03:40:59 +02:00
parent 6e4415c6aa
commit f3d3118532
63 changed files with 5255 additions and 618 deletions
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@ -12228,7 +12228,9 @@ NULL baz</literallayout>(3 rows)</entry>
   <xref linkend="functions-aggregate-statistics-table">.
   The built-in ordered-set aggregate functions
   are listed in <xref linkend="functions-orderedset-table"> and
-   <xref linkend="functions-hypothetical-table">.
+   <xref linkend="functions-hypothetical-table">.  Grouping operations,
+   which are closely related to aggregate functions, are listed in
+   <xref linkend="functions-grouping-table">.
   The special syntax considerations for aggregate
   functions are explained in <xref linkend="syntax-aggregates">.
   Consult <xref linkend="tutorial-agg"> for additional introductory
@ -13326,6 +13328,72 @@ SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
   to the rule specified in the <literal>ORDER BY</> clause.
  </para>

+  <table id="functions-grouping-table">
+   <title>Grouping Operations</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Function</entry>
+      <entry>Return Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+
+     <row>
+      <entry>
+       <indexterm>
+        <primary>GROUPING</primary>
+       </indexterm>
+       <function>GROUPING(<replaceable class="parameter">args...</replaceable>)</function>
+      </entry>
+      <entry>
+       <type>integer</type>
+      </entry>
+      <entry>
+       Integer bitmask indicating which arguments are not being included in the current
+       grouping set
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+   <para>
+    Grouping operations are used in conjunction with grouping sets (see
+    <xref linkend="queries-grouping-sets">) to distinguish result rows.  The
+    arguments to the <literal>GROUPING</> operation are not actually evaluated,
+    but they must match exactly expressions given in the <literal>GROUP BY</>
+    clause of the associated query level.  Bits are assigned with the rightmost
+    argument being the least-significant bit; each bit is 0 if the corresponding
+    expression is included in the grouping criteria of the grouping set generating
+    the result row, and 1 if it is not.  For example:
+<screen>
+<prompt>=&gt;</> <userinput>SELECT * FROM items_sold;</>
+ make  | model | sales
+-------+-------+-------
+ Foo   | GT    |  10
+ Foo   | Tour  |  20
+ Bar   | City  |  15
+ Bar   | Sport |  5
+(4 rows)
+
+<prompt>=&gt;</> <userinput>SELECT make, model, GROUPING(make,model), sum(sales) FROM items_sold GROUP BY ROLLUP(make,model);</>
+ make  | model | grouping | sum
+-------+-------+----------+-----
+ Foo   | GT    |        0 | 10
+ Foo   | Tour  |        0 | 20
+ Bar   | City  |        0 | 15
+ Bar   | Sport |        0 | 5
+ Foo   |       |        1 | 30
+ Bar   |       |        1 | 20
+       |       |        3 | 50
+(7 rows)
+</screen>
+   </para>
+
 </sect1>

 <sect1 id="functions-window">
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@ -1183,6 +1183,181 @@ SELECT product_id, p.name, (sum(s.units) * (p.price - p.cost)) AS profit
   </para>
  </sect2>

+  <sect2 id="queries-grouping-sets">
+   <title><literal>GROUPING SETS</>, <literal>CUBE</>, and <literal>ROLLUP</></title>
+
+   <indexterm zone="queries-grouping-sets">
+    <primary>GROUPING SETS</primary>
+   </indexterm>
+   <indexterm zone="queries-grouping-sets">
+    <primary>CUBE</primary>
+   </indexterm>
+   <indexterm zone="queries-grouping-sets">
+    <primary>ROLLUP</primary>
+   </indexterm>
+
+   <para>
+    More complex grouping operations than those described above are possible
+    using the concept of <firstterm>grouping sets</>.  The data selected by
+    the <literal>FROM</> and <literal>WHERE</> clauses is grouped separately
+    by each specified grouping set, aggregates computed for each group just as
+    for simple <literal>GROUP BY</> clauses, and then the results returned.
+    For example:
+<screen>
+<prompt>=&gt;</> <userinput>SELECT * FROM items_sold;</>
+ brand | size | sales
+-------+------+-------
+ Foo   | L    |  10
+ Foo   | M    |  20
+ Bar   | M    |  15
+ Bar   | L    |  5
+(4 rows)
+
+<prompt>=&gt;</> <userinput>SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());</>
+ brand | size | sum
+-------+------+-----
+ Foo   |      |  30
+ Bar   |      |  20
+       | L    |  15
+       | M    |  35
+       |      |  50
+(5 rows)
+</screen>
+   </para>
+
+   <para>
+    Each sublist of <literal>GROUPING SETS</> may specify zero or more columns
+    or expressions and is interpreted the same way as though it were directly
+    in the <literal>GROUP BY</> clause.  An empty grouping set means that all
+    rows are aggregated down to a single group (which is output even if no
+    input rows were present), as described above for the case of aggregate
+    functions with no <literal>GROUP BY</> clause.
+   </para>
+
+   <para>
+    References to the grouping columns or expressions are replaced
+    by <literal>NULL</> values in result rows for grouping sets in which those
+    columns do not appear.  To distinguish which grouping a particular output
+    row resulted from, see <xref linkend="functions-grouping-table">.
+   </para>
+
+   <para>
+    A shorthand notation is provided for specifying two common types of grouping set.
+    A clause of the form
+<programlisting>
+ROLLUP ( <replaceable>e1</>, <replaceable>e2</>, <replaceable>e3</>, ... )
+</programlisting>
+    represents the given list of expressions and all prefixes of the list including
+    the empty list; thus it is equivalent to
+<programlisting>
+GROUPING SETS (
+    ( <replaceable>e1</>, <replaceable>e2</>, <replaceable>e3</>, ... ),
+    ...
+    ( <replaceable>e1</>, <replaceable>e2</> )
+    ( <replaceable>e1</> )
+    ( )
+)
+</programlisting>
+    This is commonly used for analysis over hierarchical data; e.g. total
+    salary by department, division, and company-wide total.
+   </para>
+
+   <para>
+    A clause of the form
+<programlisting>
+CUBE ( <replaceable>e1</>, <replaceable>e2</>, ... )
+</programlisting>
+    represents the given list and all of its possible subsets (i.e. the power
+    set).  Thus
+<programlisting>
+CUBE ( a, b, c )
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUPING SETS (
+    ( a, b, c ),
+    ( a, b    ),
+    ( a,    c ),
+    ( a       ),
+    (    b, c ),
+    (    b    ),
+    (       c ),
+    (         ),
+)
+</programlisting>
+   </para>
+
+   <para>
+    The individual elements of a <literal>CUBE</> or <literal>ROLLUP</>
+    clause may be either individual expressions, or sub-lists of elements in
+    parentheses.  In the latter case, the sub-lists are treated as single
+    units for the purposes of generating the individual grouping sets.
+    For example:
+<programlisting>
+CUBE ( (a,b), (c,d) )
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUPING SETS (
+    ( a, b, c, d )
+    ( a, b       )
+    (       c, d )
+    (            )
+)
+</programlisting>
+    and
+<programlisting>
+ROLLUP ( a, (b,c), d )
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUPING SETS (
+    ( a, b, c, d )
+    ( a, b, c    )
+    ( a          )
+    (            )
+)
+</programlisting>
+   </para>
+
+   <para>
+    The <literal>CUBE</> and <literal>ROLLUP</> constructs can be used either
+    directly in the <literal>GROUP BY</> clause, or nested inside a
+    <literal>GROUPING SETS</> clause.  If one <literal>GROUPING SETS</> clause
+    is nested inside another, the effect is the same as if all the elements of
+    the inner clause had been written directly in the outer clause.
+   </para>
+
+   <para>
+    If multiple grouping items are specified in a single <literal>GROUP BY</>
+    clause, then the final list of grouping sets is the cross product of the
+    individual items.  For example:
+<programlisting>
+GROUP BY a, CUBE(b,c), GROUPING SETS ((d), (e))
+</programlisting>
+    is equivalent to
+<programlisting>
+GROUP BY GROUPING SETS (
+  (a,b,c,d), (a,b,c,e),
+  (a,b,d),   (a,b,e),
+  (a,c,d),   (a,c,e),
+  (a,d),     (a,e)
+)
+</programlisting>
+   </para>
+
+  <note>
+   <para>
+    The construct <literal>(a,b)</> is normally recognized in expressions as
+    a <link linkend="sql-syntax-row-constructors">row constructor</link>.
+    Within the <literal>GROUP BY</> clause, this does not apply at the top
+    levels of expressions, and <literal>(a,b)</> is parsed as a list of
+    expressions as described above.  If for some reason you <emphasis>need</>
+    a row constructor in a grouping expression, use <literal>ROW(a,b)</>.
+   </para>
+  </note>
+  </sect2>
+
  <sect2 id="queries-window">
   <title>Window Function Processing</title>

--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@ -37,7 +37,7 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
    [ * | <replaceable class="parameter">expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
    [ FROM <replaceable class="parameter">from_item</replaceable> [, ...] ]
    [ WHERE <replaceable class="parameter">condition</replaceable> ]
-    [ GROUP BY <replaceable class="parameter">expression</replaceable> [, ...] ]
+    [ GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...] ]
    [ HAVING <replaceable class="parameter">condition</replaceable> [, ...] ]
    [ WINDOW <replaceable class="parameter">window_name</replaceable> AS ( <replaceable class="parameter">window_definition</replaceable> ) [, ...] ]
    [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] <replaceable class="parameter">select</replaceable> ]
@ -60,6 +60,15 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
                [ WITH ORDINALITY ] [ [ AS ] <replaceable class="parameter">alias</replaceable> [ ( <replaceable class="parameter">column_alias</replaceable> [, ...] ) ] ]
    <replaceable class="parameter">from_item</replaceable> [ NATURAL ] <replaceable class="parameter">join_type</replaceable> <replaceable class="parameter">from_item</replaceable> [ ON <replaceable class="parameter">join_condition</replaceable> | USING ( <replaceable class="parameter">join_column</replaceable> [, ...] ) ]

+<phrase>and <replaceable class="parameter">grouping_element</replaceable> can be one of:</phrase>
+
+    ( )
+    <replaceable class="parameter">expression</replaceable>
+    ( <replaceable class="parameter">expression</replaceable> [, ...] )
+    ROLLUP ( { <replaceable class="parameter">expression</replaceable> | ( <replaceable class="parameter">expression</replaceable> [, ...] ) } [, ...] )
+    CUBE ( { <replaceable class="parameter">expression</replaceable> | ( <replaceable class="parameter">expression</replaceable> [, ...] ) } [, ...] )
+    GROUPING SETS ( <replaceable class="parameter">grouping_element</replaceable> [, ...] )
+
 <phrase>and <replaceable class="parameter">with_query</replaceable> is:</phrase>

    <replaceable class="parameter">with_query_name</replaceable> [ ( <replaceable class="parameter">column_name</replaceable> [, ...] ) ] AS ( <replaceable class="parameter">select</replaceable> | <replaceable class="parameter">values</replaceable> | <replaceable class="parameter">insert</replaceable> | <replaceable class="parameter">update</replaceable> | <replaceable class="parameter">delete</replaceable> )
@ -665,22 +674,34 @@ WHERE <replaceable class="parameter">condition</replaceable>
   <para>
    The optional <literal>GROUP BY</literal> clause has the general form
 <synopsis>
-GROUP BY <replaceable class="parameter">expression</replaceable> [, ...]
+GROUP BY <replaceable class="parameter">grouping_element</replaceable> [, ...]
 </synopsis>
   </para>

   <para>
    <literal>GROUP BY</literal> will condense into a single row all
    selected rows that share the same values for the grouped
-    expressions.  <replaceable
-    class="parameter">expression</replaceable> can be an input column
-    name, or the name or ordinal number of an output column
-    (<command>SELECT</command> list item), or an arbitrary
+    expressions.  An <replaceable
+    class="parameter">expression</replaceable> used inside a
+    <replaceable class="parameter">grouping_element</replaceable>
+    can be an input column name, or the name or ordinal number of an
+    output column (<command>SELECT</command> list item), or an arbitrary
    expression formed from input-column values.  In case of ambiguity,
    a <literal>GROUP BY</literal> name will be interpreted as an
    input-column name rather than an output column name.
   </para>

+   <para>
+    If any of <literal>GROUPING SETS</>, <literal>ROLLUP</> or
+    <literal>CUBE</> are present as grouping elements, then the
+    <literal>GROUP BY</> clause as a whole defines some number of
+    independent <replaceable>grouping sets</>.  The effect of this is
+    equivalent to constructing a <literal>UNION ALL</> between
+    subqueries with the individual grouping sets as their
+    <literal>GROUP BY</> clauses.  For further details on the handling
+    of grouping sets see <xref linkend="queries-grouping-sets">.
+   </para>
+
   <para>
    Aggregate functions, if any are used, are computed across all rows
    making up each group, producing a separate value for each group.