mirror of
https://github.com/postgres/postgres.git
synced 2025-05-01 01:04:50 +03:00
Several mistakes have piled in the code comments over the time, including incorrect grammar, function names and simple typos. This commit takes care of a portion of these. No backpatch is done as this is only cosmetic. Author: Justin Pryzby Discussion: https://postgr.es/m/20210924215827.GS831@telsasoft.com
104 lines
4.2 KiB
Plaintext
104 lines
4.2 KiB
Plaintext
MCV lists
|
|
=========
|
|
|
|
Multivariate MCV (most-common values) lists are a straightforward extension of
|
|
regular MCV lists, tracking most frequent combinations of values for a group of
|
|
attributes.
|
|
|
|
This works particularly well for columns with a small number of distinct values,
|
|
as the list may include all the combinations and approximate the distribution
|
|
very accurately.
|
|
|
|
For columns with a large number of distinct values (e.g. those with continuous
|
|
domains), the list will only track the most frequent combinations. If the
|
|
distribution is mostly uniform (all combinations about equally frequent), the
|
|
MCV list will be empty.
|
|
|
|
Estimates of some clauses (e.g. equality) based on MCV lists are more accurate
|
|
than when using histograms.
|
|
|
|
Also, MCV lists don't necessarily require sorting of the values (the fact that
|
|
we use sorting when building them is an implementation detail), but even more
|
|
importantly the ordering is not built into the approximation (while histograms
|
|
are built on ordering). So MCV lists work well even for attributes where the
|
|
ordering of the data type is disconnected from the meaning of the data. For
|
|
example we know how to sort strings, but it's unlikely to make much sense for
|
|
city names (or other label-like attributes).
|
|
|
|
|
|
Selectivity estimation
|
|
----------------------
|
|
|
|
The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
|
|
in principle - we need to identify MCV items matching all the clauses and sum
|
|
frequencies of all those items.
|
|
|
|
Currently MCV lists support estimation of the following clause types:
|
|
|
|
(a) equality clauses WHERE (a = 1) AND (b = 2)
|
|
(b) inequality clauses WHERE (a < 1) AND (b >= 2)
|
|
(c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL)
|
|
(d) OR clauses WHERE (a < 1) OR (b >= 2)
|
|
|
|
It's possible to add support for additional clauses, for example:
|
|
|
|
(e) multi-var clauses WHERE (a > b)
|
|
|
|
and possibly others. These are tasks for the future, not yet implemented.
|
|
|
|
|
|
Hashed MCV (not yet implemented)
|
|
--------------------------------
|
|
|
|
Regular MCV lists have to include actual values for each item, so if those items
|
|
are large the list may be quite large. This is especially true for multivariate
|
|
MCV lists, although the current implementation partially mitigates this by
|
|
de-duplicating the values before storing them on disk.
|
|
|
|
It's possible to only store hashes (32-bit values) instead of the actual values,
|
|
significantly reducing the space requirements. Obviously, this would only make
|
|
the MCV lists useful for estimating equality conditions (assuming the 32-bit
|
|
hashes make the collisions rare enough).
|
|
|
|
This might also complicate matching the columns to available stats.
|
|
|
|
|
|
TODO Consider implementing hashed MCV list, storing just 32-bit hashes instead
|
|
of the actual values. This type of MCV list will be useful only for
|
|
estimating equality clauses, and will reduce space requirements for large
|
|
varlena types (in such cases we usually only want equality anyway).
|
|
|
|
|
|
Inspecting the MCV list
|
|
-----------------------
|
|
|
|
Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
|
|
to select the columns from pg_stats. The data is encoded as anyarrays, and
|
|
all the items have the same data type, so anyarray provides a simple way to
|
|
get a text representation.
|
|
|
|
With multivariate MCV lists, the columns may use different data types, making
|
|
it impossible to use anyarrays. It might be possible to produce a similar
|
|
array-like representation, but that would complicate further processing and
|
|
analysis of the MCV list.
|
|
|
|
So instead the MCV lists are stored in a custom data type (pg_mcv_list),
|
|
which however makes it more difficult to inspect the contents. To make that
|
|
easier, there's a SRF returning detailed information about the MCV lists.
|
|
|
|
SELECT m.* FROM pg_statistic_ext s,
|
|
pg_statistic_ext_data d,
|
|
pg_mcv_list_items(stxdmcv) m
|
|
WHERE s.stxname = 'stts2'
|
|
AND d.stxoid = s.oid;
|
|
|
|
It accepts one parameter - a pg_mcv_list value (which can only be obtained
|
|
from pg_statistic_ext_data catalog, to defend against malicious input), and
|
|
returns these columns:
|
|
|
|
- item index (0, ..., (nitems-1))
|
|
- values (string array)
|
|
- nulls only (boolean array)
|
|
- frequency (double precision)
|
|
- base_frequency (double precision)
|