Improve estimate of distinct values in estimate_num_groups().

When adjusting the estimate for the number of distinct values from a rel in a grouped query to take into account the selectivity of the rel's restrictions, use a formula that is less likely to produce under-estimates. The old formula simply multiplied the number of distinct values in the rel by the restriction selectivity, which would be correct if the restrictions were fully correlated with the grouping expressions, but can produce significant under-estimates in cases where they are not well correlated. The new formula is based on the random selection probability, and so assumes that the restrictions are not correlated with the grouping expressions. This is guaranteed to produce larger estimates, and of course risks over-estimating in cases where the restrictions are correlated, but that has less severe consequences than under-estimating, which might lead to a HashAgg that consumes an excessive amount of memory. This could possibly be improved upon in the future by identifying correlated restrictions and using a hybrid of the old and new formulae. Author: Tomas Vondra, with some hacking be me Reviewed-by: Mark Dilger, Alexander Korotkov, Dean Rasheed and Tom Lane Discussion: http://www.postgresql.org/message-id/flat/56CD0381.5060502@2ndquadrant.com
2025-07-28 23:42:10 +03:00 · 2016-04-04 12:41:56 +01:00
parent bf08f2292f
commit 84f9a35e39
2 changed files with 64 additions and 25 deletions
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
  (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                              QUERY PLAN                              
----------------------------------------------------------------------
- Hash Join
+                           QUERY PLAN                           
+----------------------------------------------------------------
+ Hash Semi Join
   Output: o.f1
   Hash Cond: (o.f1 = "ANY_subquery".f1)
   ->  Seq Scan on public.int4_tbl o
         Output: o.f1
   ->  Hash
         Output: "ANY_subquery".f1, "ANY_subquery".g
-         ->  HashAggregate
+         ->  Subquery Scan on "ANY_subquery"
               Output: "ANY_subquery".f1, "ANY_subquery".g
-               Group Key: "ANY_subquery".f1, "ANY_subquery".g
-               ->  Subquery Scan on "ANY_subquery"
-                     Output: "ANY_subquery".f1, "ANY_subquery".g
-                     Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
-                     ->  HashAggregate
-                           Output: i.f1, (generate_series(1, 2) / 10)
-                           Group Key: i.f1
-                           ->  Seq Scan on public.int4_tbl i
-                                 Output: i.f1
-(18 rows)
+               Filter: ("ANY_subquery".f1 = "ANY_subquery".g)
+               ->  HashAggregate
+                     Output: i.f1, (generate_series(1, 2) / 10)
+                     Group Key: i.f1
+                     ->  Seq Scan on public.int4_tbl i
+                           Output: i.f1
+(15 rows)

 select * from int4_tbl o where (f1, f1) in
  (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);