
MDEV-31067: selectivity_from_histogram >1.0 for a DOUBLE_PREC_HB histogram

Variant #2.

When Histogram::point_selectivity() sees that the point value of interest
falls into one bucket, it tries to guess whether the bucket has many
different (unpopular) values or a few popular values. (The number of
rows is fixed, as it's a Height-balanced histogram).
The basis for this guess is the "width" of the value range the bucket
covers. Buckets covering wider value ranges are assumed to contain
values with proportionally lower frequencies.

This is just [brave] guesswork. For a very narrow bucket, it may
produce an estimate that's larger than the total #rows in the bucket,
or even in the whole table.

Remove the guesswork and replace it with basic logic: return either
the per-table average selectivity of col=const or the selectivity of
one bucket, whichever is lower.
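
As an illustration of why the cap is needed, here is a small standalone
sketch (not the server code: the helper names and the concrete numbers
below are made up for this example, and MY_MIN is written as std::min).
It shows how the old width-proportional formula can return a selectivity
greater than 1.0 for a very narrow bucket, while the capped formula stays
within one bucket's share of the rows:

#include <algorithm>
#include <cstdio>

/* Old approach: scale the per-table average selectivity by how much
   narrower this bucket's value range is than the average bucket's. */
static double old_estimate(double avg_sel, double avg_bucket_width,
                           double current_bucket_width)
{
  return avg_sel * avg_bucket_width / current_bucket_width;
}

/* New approach: col=const cannot match more rows than one bucket holds,
   so cap the per-table average selectivity by one bucket's share. */
static double new_estimate(double avg_sel, double avg_bucket_width)
{
  return std::min(avg_bucket_width, avg_sel);
}

int main()
{
  double avg_bucket_width= 1.0 / 100;        /* each bucket holds ~1% of rows */
  double avg_sel= 1.0 / 200;                 /* avg. selectivity of col=const */
  double current_bucket_width= 1.0 / 100000; /* a very narrow bucket */

  printf("old: %g\n", old_estimate(avg_sel, avg_bucket_width,
                                   current_bucket_width));      /* prints 5 */
  printf("new: %g\n", new_estimate(avg_sel, avg_bucket_width)); /* prints 0.005 */
  return 0;
}

With these numbers the old formula yields 0.005 * 0.01 / 0.00001 = 5.0, an
impossible selectivity above 1.0 (the MDEV-31067 symptom), whereas the new
formula returns min(0.01, 0.005) = 0.005.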
Author: Sergei Petrunia
Date:   2023-04-19 15:15:27 +03:00
Parent: bc970573b3
Commit: 85cc831880

5 changed files with 234 additions and 53 deletions


@@ -3902,50 +3902,16 @@ double Histogram::point_selectivity(double pos, double avg_sel)
   }
   else
   {
-    /*
+    /*
       The value 'pos' fits within one single histogram bucket.
-
-      Histogram buckets have the same numbers of rows, but they cover
-      different ranges of values.
-
-      We assume that values are uniformly distributed across the [0..1] value
-      range.
-    */
-
-    /*
-      If all buckets covered value ranges of the same size, the width of
-      value range would be:
+
+      We also have avg_sel which is per-table average selectivity of col=const.
+      If there are popular values, this may be larger than one bucket, so
+      cap the returned number by the selectivity of one bucket.
     */
     double avg_bucket_width= 1.0 / (get_width() + 1);
-
-    /*
-      Let's see what is the width of value range that our bucket is covering.
-      (min==max currently. they are kept in the formula just in case we
-       will want to extend it to handle the multi-bucket case)
-    */
-    double inv_prec_factor= (double) 1.0 / prec_factor();
-    double current_bucket_width=
-      (max + 1 == get_width() ? 1.0 : (get_value(max) * inv_prec_factor)) -
-      (min == 0 ? 0.0 : (get_value(min-1) * inv_prec_factor));
-
-    DBUG_ASSERT(current_bucket_width); /* We shouldn't get a zero-width bucket */
-
-    /*
-      So:
-      - each bucket has the same #rows
-      - values are uniformly distributed across the [min_value,max_value] domain.
-
-      If a bucket has a value range that's N times bigger than average, then
-      each value will have N times fewer rows than average.
-    */
-    sel= avg_sel * avg_bucket_width / current_bucket_width;
-
-    /*
-      (Q: if we just follow this proportion, we may end up in a situation
-      where the number of different values we expect to find in this bucket
-      exceeds the number of rows that this histogram has in a bucket. Are
-      we ok with this, or would we want to have certain caps?)
-    */
+
+    sel= MY_MIN(avg_bucket_width, avg_sel);
   }
   return sel;
 }
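
A tiny self-contained check of the property the new formula establishes
(the table size, bucket count and avg_sel values below are arbitrary
assumptions, and MY_MIN is again written as std::min): the capped estimate
never predicts more rows for col=const than one bucket contains, which is
exactly what the removed proportional formula could violate.

#include <algorithm>
#include <cassert>
#include <initializer_list>

int main()
{
  const double total_rows= 1e6;
  const double avg_bucket_width= 1.0 / 100;   /* => ~10000 rows per bucket */
  const double rows_per_bucket= total_rows * avg_bucket_width;

  for (double avg_sel : {1.0 / 200, 1.0 / 50, 1.0 / 2})
  {
    double sel= std::min(avg_bucket_width, avg_sel);  /* the new cap */
    assert(sel * total_rows <= rows_per_bucket);      /* never exceeds one bucket */
    assert(sel <= 1.0);                               /* never exceeds the table */
  }
  return 0;
}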