1
0
mirror of https://github.com/postgres/postgres.git synced 2025-07-30 11:03:19 +03:00

Support ordered-set (WITHIN GROUP) aggregates.

This patch introduces generic support for ordered-set and hypothetical-set
aggregate functions, as well as implementations of the instances defined in
SQL:2008 (percentile_cont(), percentile_disc(), rank(), dense_rank(),
percent_rank(), cume_dist()).  We also added mode() though it is not in the
spec, as well as versions of percentile_cont() and percentile_disc() that
can compute multiple percentile values in one pass over the data.

Unlike the original submission, this patch puts full control of the sorting
process in the hands of the aggregate's support functions.  To allow the
support functions to find out how they're supposed to sort, a new API
function AggGetAggref() is added to nodeAgg.c.  This allows retrieval of
the aggregate call's Aggref node, which may have other uses beyond the
immediate need.  There is also support for ordered-set aggregates to
install cleanup callback functions, so that they can be sure that
infrastructure such as tuplesort objects gets cleaned up.

In passing, make some fixes in the recently-added support for variadic
aggregates, and make some editorial adjustments in the recent FILTER
additions for aggregates.  Also, simplify use of IsBinaryCoercible() by
allowing it to succeed whenever the target type is ANY or ANYELEMENT.
It was inconsistent that it dealt with other polymorphic target types
but not these.

Atri Sharma and Andrew Gierth; reviewed by Pavel Stehule and Vik Fearing,
and rather heavily editorialized upon by Tom Lane
This commit is contained in:
Tom Lane
2013-12-23 16:11:35 -05:00
parent 37484ad2aa
commit 8d65da1f01
64 changed files with 4686 additions and 755 deletions

View File

@ -1555,7 +1555,15 @@ sqrt(2)
</indexterm>
<indexterm zone="syntax-aggregates">
<primary>filter</primary>
<primary>ordered-set aggregate</primary>
</indexterm>
<indexterm zone="syntax-aggregates">
<primary>WITHIN GROUP</primary>
</indexterm>
<indexterm zone="syntax-aggregates">
<primary>FILTER</primary>
</indexterm>
<para>
@ -1570,6 +1578,7 @@ sqrt(2)
<replaceable>aggregate_name</replaceable> (ALL <replaceable>expression</replaceable> [ , ... ] [ <replaceable>order_by_clause</replaceable> ] ) [ FILTER ( WHERE <replaceable>filter_clause</replaceable> ) ]
<replaceable>aggregate_name</replaceable> (DISTINCT <replaceable>expression</replaceable> [ , ... ] [ <replaceable>order_by_clause</replaceable> ] ) [ FILTER ( WHERE <replaceable>filter_clause</replaceable> ) ]
<replaceable>aggregate_name</replaceable> ( * ) [ FILTER ( WHERE <replaceable>filter_clause</replaceable> ) ]
<replaceable>aggregate_name</replaceable> ( [ <replaceable>expression</replaceable> [ , ... ] ] ) WITHIN GROUP ( <replaceable>order_by_clause</replaceable> ) [ FILTER ( WHERE <replaceable>filter_clause</replaceable> ) ]
</synopsis>
where <replaceable>aggregate_name</replaceable> is a previously
@ -1589,9 +1598,11 @@ sqrt(2)
The third form invokes the aggregate once for each distinct value
of the expression (or distinct set of values, for multiple expressions)
found in the input rows.
The last form invokes the aggregate once for each input row; since no
The fourth form invokes the aggregate once for each input row; since no
particular input value is specified, it is generally only useful
for the <function>count(*)</function> aggregate function.
The last form is used with <firstterm>ordered-set</> aggregate
functions, which are described below.
</para>
<para>
@ -1610,23 +1621,6 @@ sqrt(2)
distinct non-null values of <literal>f1</literal>.
</para>
<para>
If <literal>FILTER</literal> is specified, then only the input
rows for which the <replaceable>filter_clause</replaceable>
evaluates to true are fed to the aggregate function; other rows
are discarded. For example:
<programlisting>
SELECT
count(*) AS unfiltered,
count(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);
unfiltered | filtered
------------+----------
10 | 4
(1 row)
</programlisting>
</para>
<para>
Ordinarily, the input rows are fed to the aggregate function in an
unspecified order. In many cases this does not matter; for example,
@ -1676,6 +1670,71 @@ SELECT string_agg(a ORDER BY a, ',') FROM table; -- incorrect
</para>
</note>
<para>
Placing <literal>ORDER BY</> within the aggregate's regular argument
list, as described so far, is used when ordering the input rows for
a <quote>normal</> aggregate for which ordering is optional. There is a
subclass of aggregate functions called <firstterm>ordered-set
aggregates</> for which an <replaceable>order_by_clause</replaceable>
is <emphasis>required</>, usually because the aggregate's computation is
only sensible in terms of a specific ordering of its input rows.
Typical examples of ordered-set aggregates include rank and percentile
calculations. For an ordered-set aggregate,
the <replaceable>order_by_clause</replaceable> is written
inside <literal>WITHIN GROUP (...)</>, as shown in the final syntax
alternative above. The expressions in
the <replaceable>order_by_clause</replaceable> are evaluated once per
input row just like normal aggregate arguments, sorted as per
the <replaceable>order_by_clause</replaceable>'s requirements, and fed
to the aggregate function as input arguments. (This is unlike the case
for a non-<literal>WITHIN GROUP</> <replaceable>order_by_clause</>,
which is not treated as argument(s) to the aggregate function.) The
argument expressions preceding <literal>WITHIN GROUP</>, if any, are
called <firstterm>direct arguments</> to distinguish them from
the <firstterm>aggregated arguments</> listed in
the <replaceable>order_by_clause</replaceable>. Unlike normal aggregate
arguments, direct arguments are evaluated only once per aggregate call,
not once per input row. This means that they can contain variables only
if those variables are grouped by <literal>GROUP BY</>; this restriction
is the same as if the direct arguments were not inside an aggregate
expression at all. Direct arguments are typically used for things like
percentile fractions, which only make sense as a single value per
aggregation calculation. The direct argument list can be empty; in this
case, write just <literal>()</> not <literal>(*)</>.
(<productname>PostgreSQL</> will actually accept either spelling, but
only the first way conforms to the SQL standard.)
An example of an ordered-set aggregate call is:
<programlisting>
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY income) FROM households;
percentile_disc
-----------------
50489
</programlisting>
which obtains the 50th percentile, or median, value of
the <structfield>income</> column from table <structname>households</>.
Here, <literal>0.5</> is a direct argument; it would make no sense
for the percentile fraction to be a value varying across rows.
</para>
<para>
If <literal>FILTER</literal> is specified, then only the input
rows for which the <replaceable>filter_clause</replaceable>
evaluates to true are fed to the aggregate function; other rows
are discarded. For example:
<programlisting>
SELECT
count(*) AS unfiltered,
count(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);
unfiltered | filtered
------------+----------
10 | 4
(1 row)
</programlisting>
</para>
<para>
The predefined aggregate functions are described in <xref
linkend="functions-aggregate">. Other aggregate functions can be added
@ -1695,7 +1754,8 @@ SELECT string_agg(a ORDER BY a, ',') FROM table; -- incorrect
<xref linkend="sql-syntax-scalar-subqueries"> and
<xref linkend="functions-subquery">), the aggregate is normally
evaluated over the rows of the subquery. But an exception occurs
if the aggregate's arguments contain only outer-level variables:
if the aggregate's arguments (and <replaceable>filter_clause</replaceable>
if any) contain only outer-level variables:
the aggregate then belongs to the nearest such outer level, and is
evaluated over the rows of that query. The aggregate expression
as a whole is then an outer reference for the subquery it appears in,
@ -1856,15 +1916,16 @@ UNBOUNDED FOLLOWING
If <literal>FILTER</literal> is specified, then only the input
rows for which the <replaceable>filter_clause</replaceable>
evaluates to true are fed to the window function; other rows
are discarded. Only aggregate window functions accept
are discarded. Only window functions that are aggregates accept
a <literal>FILTER</literal> clause.
</para>
<para>
The built-in window functions are described in <xref
linkend="functions-window-table">. Other window functions can be added by
the user. Also, any built-in or user-defined aggregate function can be
used as a window function.
the user. Also, any built-in or user-defined normal aggregate function
can be used as a window function. Ordered-set aggregates presently
cannot be used as window functions, however.
</para>
<para>
@ -1885,7 +1946,7 @@ UNBOUNDED FOLLOWING
<para>
More information about window functions can be found in
<xref linkend="tutorial-window">,
<xref linkend="functions-window">,
<xref linkend="functions-window">, and
<xref linkend="queries-window">.
</para>
</sect2>