diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index ef36e87a720..ed74704b2ad 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -4320,6 +4320,17 @@ Owner of the statistic + + stxkeys + int2vector + pg_attribute.attnum + + This is an array of values that indicate which table columns this + statistic covers. For example a value of 1 3 would + mean that the first and the third table columns make up the statistic key. + + + stxkind char[] @@ -4332,17 +4343,6 @@ - - stxkeys - int2vector - pg_attribute.attnum - - This is an array of values that indicate which table columns this - statistic covers. For example a value of 1 3 would - mean that the first and the third table columns make up the statistic key. - - - stxndistinct pg_ndistinct diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index 8d30fd1384c..a8bebb14363 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -906,6 +906,8 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000 of the planner + + Single-Column Statistics As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a query in order to make good choices @@ -1043,7 +1045,200 @@ WHERE tablename = 'road'; Further details about the planner's use of statistics can be found in . + + + Extended Statistics + + + statistics + of the planner + + + + correlation + in the query planner + + + + pg_statistic_ext + + + + It is common to see slow queries running bad execution plans because + multiple columns used in the query clauses are correlated. + The planner normally assumes that multiple conditions + are independent of each other, + an assumption that does not hold when column values are correlated. 
+    Regular statistics, because of their per-individual-column nature,
+    do not capture any knowledge of cross-column correlation;
+    multivariate statistics can be used to instruct
+    the server to obtain statistics across such a set of columns,
+    which are later used by the query optimizer
+    to determine cardinality and selectivity
+    of clauses involving those columns.
+    Multivariate statistics are currently the only use of
+    extended statistics.
+
+    Extended statistics are created using the
+    CREATE STATISTICS command; see its reference
+    page for more details.
+    Data collection is deferred until the next ANALYZE
+    on the table, after which the stored values can be examined in the
+    pg_statistic_ext
+    catalog.
+
+    The following subsections describe the types of extended statistics
+    that are currently supported.
+
+    Functional Dependencies
+
+    The simplest type of extended statistics are functional dependencies,
+    a concept used in definitions of database normal forms.
+    Put simply, it is said that column b is functionally
+    dependent on column a if knowledge of the value of
+    a is sufficient to determine the value of b.
+    In normalized databases, functional dependencies are allowed only on
+    primary keys and superkeys.  However, many data sets are in practice not
+    fully normalized for various reasons; intentional denormalization for
+    performance is a common example.
+
+    The existence of functional dependencies directly affects the accuracy
+    of estimates in certain queries.
+    The reason is that conditions on the dependent columns do not further
+    restrict the result set, but the query planner (lacking functional
+    dependency knowledge) considers them independent, resulting in
+    underestimates.
+    To inform the planner about the functional dependencies, we collect
+    measurements of dependency during ANALYZE.
Assessing
+    the degree of dependency between all sets of columns would be
+    prohibitively expensive, so the search is limited to potential
+    dependencies defined using the dependencies option of
+    extended statistics.  It is advisable to create
+    dependencies statistics only if functional
+    dependencies actually exist, to avoid unnecessary overhead on both
+    ANALYZE and query planning.
+
+    To inspect the functional dependencies on a statistics object
+    stts, you may do this:
+
+CREATE STATISTICS stts WITH (dependencies)
+    ON (zip, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxname, stxkeys, stxdependencies
+  FROM pg_statistic_ext
+  WHERE stxname = 'stts';
+ stxname | stxkeys |              stxdependencies
+---------+---------+--------------------------------------------
+ stts    | 1 5     | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
+(1 row)
+
+    where it can be seen that column 1 (a zip code) fully determines column
+    5 (city), so the coefficient is 1.0, while city determines zip code
+    only about 42% of the time, meaning that many cities (the remaining
+    58%) are represented by more than a single ZIP code.
+
+    When computing the selectivity, the planner inspects all conditions and
+    attempts to identify which conditions are already implied by other
+    conditions.  Any conditions found to be redundant are then ignored when
+    computing the overall selectivity.  In the example above, the
+    selectivity estimate for either of the conditions could be eliminated,
+    thus improving the overall estimate.
+
+    Limitations of Functional Dependencies
+
+    Functional dependencies are a very simple type of statistics, and
+    as such have several limitations.  The first limitation is that they
+    only work with simple equality conditions comparing columns to constant
+    values.  They cannot be used to eliminate equality conditions comparing
+    two columns or comparing a column to an expression, nor range clauses,
+    LIKE, or any other type of condition.
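+    For instance, a range clause on one of the columns prevents use of the
+    dependency knowledge even when the statistics cover both columns.  A
+    hypothetical illustration using the zipcodes table
+    above (assuming, for the sake of the example, a numeric
+    zip column):
+
+SELECT * FROM zipcodes
+  WHERE zip BETWEEN 90000 AND 90099 AND city = 'Sacramento';
+
+    Here the condition on zip is a range clause rather
+    than a simple equality, so the dependencies
+    statistics defined earlier cannot be applied to this query.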
+
+    When eliminating the implied conditions, the planner assumes that the
+    conditions are compatible.  Consider the following example, where
+    this assumption does not hold:
+
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
+                                  QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
+   Filter: ((a = 1) AND (b = 10))
+   Rows Removed by Filter: 10000
+
+    While there are no rows with such a combination of values, the planner
+    is unable to verify whether the values match; it only knows that
+    the columns are functionally dependent.
+
+    This assumption concerns the queries executed against the database; in
+    many cases it actually holds (e.g. when the application only allows
+    selecting compatible values).  But if that's not the case, functional
+    dependencies may not be a viable option.
+
+    Multivariate N-Distinct Coefficients
+
+    Single-column statistics store the number of distinct values in each
+    column.  Estimates of the number of distinct values across more than
+    one column (for example, for GROUP BY a, b) are,
+    however, frequently wrong when the planner has only single-column
+    statistical data, causing it to select bad plans.
+    To improve n-distinct estimation when multiple columns are grouped
+    together, the ndistinct option of extended statistics
+    can be used, which instructs ANALYZE to collect n-distinct
+    estimates for all possible combinations of two or more of the columns
+    in the statistics object (the per-column estimates are already
+    available in pg_statistic).
+ + + + Continuing the above example, the n-distinct coefficients in a ZIP + code table may look like the following: + +CREATE STATISTICS stts2 WITH (ndistinct) + ON (zip, state, city) FROM zipcodes; +ANALYZE zipcodes; +SELECT stxkeys AS k, stxndistinct AS nd + FROM pg_statistic_ext + WHERE stxname = 'stts2'; +-[ RECORD 1 ]--------------------------------------------- +k | 1 2 5 +nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000}, + {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}] +(1 row) + + which indicates that there are three combinations of columns that + have 33178 distinct values: ZIP code and state; ZIP code and city; + and ZIP code, city and state (the fact that they are all equal is + expected given the nature of ZIP-code data). On the other hand, + the combination of city and state only has 27435 distinct values. + + + diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml index 124e7e20ced..f4430eb23cf 100644 --- a/doc/src/sgml/planstats.sgml +++ b/doc/src/sgml/planstats.sgml @@ -445,159 +445,141 @@ rows = (outer_cardinality * inner_cardinality) * selectivity operator-specific selectivity functions are mostly found in src/backend/utils/adt/selfuncs.c. + - - Functional Dependencies + + Multivariate Statistics Examples + + row estimation + multivariate + + + + Functional dependencies - The simplest type of extended statistics are functional dependencies, - used in definitions of database normal forms. When simplified, saying that - b is functionally dependent on a means that - knowledge of value of a is sufficient to determine value of - b. - - - - In normalized databases, only functional dependencies on primary keys - and superkeys are allowed. However, in practice, many data sets are not - fully normalized, for example, due to intentional denormalization for - performance reasons. 
- - - - Functional dependencies directly affect accuracy of the estimates, as - conditions on the dependent column(s) do not restrict the result set, - resulting in underestimates. - - - - To inform the planner about the functional dependencies, we collect - measurements of dependency during ANALYZE. Assessing - dependency between all sets of columns would be prohibitively - expensive, so we limit our search to potential dependencies defined - using the CREATE STATISTICS command. + Multivariate correlation can be seen with a very simple data set — a + table with two columns, both containing the same values: CREATE TABLE t (a INT, b INT); -INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i); -CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t; +INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i); ANALYZE t; -EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1; - QUERY PLAN -------------------------------------------------------------------------------------------------- - Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1) + + + As explained in , the planner can determine + cardinality of t using the number of pages and + rows obtained from pg_class: + + +SELECT relpages, reltuples FROM pg_class WHERE relname = 't'; + + relpages | reltuples +----------+----------- + 45 | 10000 + + + The data distribution is very simple; there are only 100 distinct values + in each column, uniformly distributed. + + + + The following example shows the result of estimating a WHERE + condition on the a column: + + +EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1; + QUERY PLAN +------------------------------------------------------------------------------- + Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1) + Filter: (a = 1) + Rows Removed by Filter: 9900 + + + The planner examines the condition and determines the selectivity + of this clause to be 1%. 
By comparing this estimate with the actual
+    number of rows, we see that the estimate is very accurate
+    (in fact exact, as the table is very small).  If the
+    WHERE condition is changed to use the b
+    column, an identical plan is generated.  Now observe what happens if
+    we apply the same condition on both columns, combining them with
+    AND:
+
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+                                 QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
-   Planning time: 0.367 ms
-   Execution time: 3.380 ms
-(5 rows)
-
-    The planner is now aware of the functional dependencies and considers
-    them when computing the selectivity of the second condition.  Running
-    the query without the statistics would lead to quite different estimates.
+    The planner estimates the selectivity for each condition individually,
+    arriving at the same 1% estimates as above, and then multiplies them,
+    getting the final 0.01% estimate.  The actual figures, however,
+    show that this results in a significant underestimate, as the actual
+    number of rows matching the conditions (100) is two orders of magnitude
+    higher than the estimated value.
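+    Spelled out, the arithmetic behind the misestimate is:
+
+selectivity = 0.01 * 0.01 = 0.0001
+rows = 10000 * 0.0001 = 1
+
+    whereas the true combined selectivity, because the two columns always
+    carry equal values, is simply 0.01, i.e. 100 rows.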
+ + + + This problem can be fixed by applying functional-dependency + multivariate statistics on the two columns: -DROP STATISTICS s1; -EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1; - QUERY PLAN ------------------------------------------------------------------------------------------------ - Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1) +CREATE STATISTICS stts WITH (dependencies) ON (a, b) FROM t; +ANALYZE t; +EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1; + QUERY PLAN +------------------------------------------------------------------------------- + Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1) Filter: ((a = 1) AND (b = 1)) Rows Removed by Filter: 9900 - Planning time: 0.000 ms - Execution time: 6.379 ms -(5 rows) - - - If no dependency exists, the collected statistics do not influence the - query plan. The only effect is to slow down ANALYZE. Should - partial dependencies exist these will also be stored and applied - during planning. - - - - Similarly to per-column statistics, extended statistics are stored in - a system catalog called pg_statistic_ext. - To inspect the statistics s1 defined above, - you may do this: - - -SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1'; - stxname | stxdependencies ----------+-------------------------------------------- - s1 | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}] -(1 row) - - - This shows that the statistics are defined on table t, - attnums lists attribute numbers of columns - (references pg_attribute). It also shows - the length in bytes of the functional dependencies, as found by - ANALYZE when serialized into a bytea column. - - - - When computing the selectivity, the planner inspects all conditions and - attempts to identify which conditions are already implied by other - conditions. The selectivity estimates from any redundant conditions are - ignored from a selectivity point of view. 
In the example query above, - the selectivity estimates for either of the conditions may be eliminated, - thus improving the overall estimate. - - - - Limitations of functional dependencies - - - Functional dependencies are a very simple type of statistics, and - as such have several limitations. The first limitation is that they - only work with simple equality conditions, comparing columns and constant - values. It's not possible to use them to eliminate equality conditions - comparing two columns or a column to an expression, range clauses, - LIKE or any other type of conditions. - - - - When eliminating the implied conditions, the planner assumes that the - conditions are compatible. Consider the following example, violating - this assumption: - - -EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10; - QUERY PLAN ------------------------------------------------------------------------------------------------ - Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1) - Filter: ((a = 1) AND (b = 10)) - Rows Removed by Filter: 10000 - Planning time: 0.232 ms - Execution time: 3.033 ms -(5 rows) - - - While there are no rows with such combination of values, the planner - is unable to verify whether the values match - it only knows that - the columns are functionally dependent. - - - - This assumption is more about queries executed on the database - in many - cases, it's actually satisfied (e.g. when the GUI only allows selecting - compatible values). But if that's not the case, functional dependencies - may not be a viable option. - - - - For additional information about functional dependencies, see - src/backend/statistics/README.dependencies. - - - - - + + Multivariate N-Distinct coefficients + + A similar problem occurs with estimation of the cardinality of distinct + elements, used to determine the number of groups that would be generated + by a GROUP BY clause. 
When GROUP BY
+    lists a single column, the n-distinct estimate (which can be seen as the
+    number of rows returned by the aggregate execution node) is very accurate:
+
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
+                                        QUERY PLAN
+-----------------------------------------------------------------------------------------
+ HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
+   Group Key: a
+   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
+
+    But without multivariate statistics, the estimate for the number of
+    groups in a query with two columns in GROUP BY, as
+    in the following example, is off by an order of magnitude:
+
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+                                         QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
+   Group Key: a, b
+   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+
+    By dropping the existing statistics object and re-creating it to include
+    the n-distinct calculation, the estimate is much improved:
+
+DROP STATISTICS stts;
+CREATE STATISTICS stts WITH (dependencies, ndistinct) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+                                         QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
+   Group Key: a, b
+   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
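+    As with functional dependencies, the collected coefficients can be
+    inspected in the pg_statistic_ext catalog.
+    A sketch (the output is omitted here, since the stored values depend
+    on the sample collected by ANALYZE):
+
+SELECT stxkeys, stxndistinct
+  FROM pg_statistic_ext
+  WHERE stxname = 'stts';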