From owner-pgsql-hackers@hub.org Mon Mar 22 18:43:41 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA23978
	for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 18:43:39 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id SAA06472 for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 18:36:44 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.2/8.9.1) with SMTP id SAA92604;
	Mon, 22 Mar 1999 18:34:23 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 18:33:50 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.2/8.9.1) id SAA92469
	for pgsql-hackers-outgoing; Mon, 22 Mar 1999 18:33:47 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from po8.andrew.cmu.edu (PO8.ANDREW.CMU.EDU [128.2.10.108])
	by hub.org (8.9.2/8.9.1) with ESMTP id SAA92456
	for <pgsql-hackers@postgresql.org>; Mon, 22 Mar 1999 18:33:41 -0500 (EST)
	(envelope-from er1p+@andrew.cmu.edu)
Received: (from postman@localhost) by po8.andrew.cmu.edu (8.8.5/8.8.2) id SAA12894 for pgsql-hackers@postgresql.org; Mon, 22 Mar 1999 18:33:38 -0500 (EST)
Received: via switchmail; Mon, 22 Mar 1999 18:33:38 -0500 (EST)
Received: from cloudy.me.cmu.edu via qmail
          ID </afs/andrew.cmu.edu/service/mailqs/q007/QF.Aqxh7Lu00gNtQ0TZE5>;
          Mon, 22 Mar 1999 18:27:20 -0500 (EST)
Received: from cloudy.me.cmu.edu via qmail
          ID </afs/andrew.cmu.edu/usr2/er1p/.Outgoing/QF.Uqxh7JS00gNtMmTJFk>;
          Mon, 22 Mar 1999 18:27:17 -0500 (EST)
Received: from mms.4.60.Jun.27.1996.03.05.56.sun4.41.EzMail.2.0.CUILIB.3.45.SNAP.NOT.LINKED.cloudy.me.cmu.edu.sun4m.412
          via MS.5.6.cloudy.me.cmu.edu.sun4_41;
          Mon, 22 Mar 1999 18:27:15 -0500 (EST)
Message-ID: <sqxh7H_00gNtAmTJ5Q@andrew.cmu.edu>
Date: Mon, 22 Mar 1999 18:27:15 -0500 (EST)
From: Erik Riedel <riedel+@CMU.EDU>
To: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] optimizer and type question
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: RO

[last week aggregation, this week, the optimizer]

I have a somewhat general optimizer question/problem that I would like
to get some input on - i.e. I'd like to know what is "supposed" to
work here and what I should be expecting.  Sadly, I think the patch
for this is more involved than my last message.

Using my favorite table these days:

Table    = lineitem
+------------------------+----------------------------------+-------+
|              Field     |              Type                | Length|
+------------------------+----------------------------------+-------+
| l_orderkey             | int4 not null                    |     4 |
| l_partkey              | int4 not null                    |     4 |
| l_suppkey              | int4 not null                    |     4 |
| l_linenumber           | int4 not null                    |     4 |
| l_quantity             | float4 not null                  |     4 |
| l_extendedprice        | float4 not null                  |     4 |
| l_discount             | float4 not null                  |     4 |
| l_tax                  | float4 not null                  |     4 |
| l_returnflag           | char() not null                  |     1 |
| l_linestatus           | char() not null                  |     1 |
| l_shipdate             | date                             |     4 |
| l_commitdate           | date                             |     4 |
| l_receiptdate          | date                             |     4 |
| l_shipinstruct         | char() not null                  |    25 |
| l_shipmode             | char() not null                  |    10 |
| l_comment              | char() not null                  |    44 |
+------------------------+----------------------------------+-------+
Index:    lineitem_index_

and the query:

--
-- Query 1
--
explain select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-02'::date
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;


note that I have eliminated the date calculation in my query of last
week and manually replaced it with a constant (since this wasn't
happening automatically - but let's not worry about that for now).
And this is only an explain, we care about the optimizer.  So we get:

Sort  (cost=34467.88 size=0 width=0)
 ->  Aggregate  (cost=34467.88 size=0 width=0)
   ->  Group  (cost=34467.88 size=0 width=0)
     ->  Sort  (cost=34467.88 size=0 width=0)
       ->  Seq Scan on lineitem  (cost=34467.88 size=200191 width=44)

so let's think about the selectivity that is being chosen for the
seq scan (the where l_shipdate <= '1998-09-02').

Turns out the optimizer is choosing "33%", even though the real answer
is somewhere in 90+% (that's how the query is designed).  So, why does
it do that?

Turns out that selectivity in this case is determined via
plancat::restriction_selectivity() which calls into functionOID = 103
(intltsel) for operatorOID = 1096 (date "<=") on relation OID = 18663
(my lineitem).

This all follows because of the description of 1096 (date "<=") in
pg_operator.  Looking at local1_template1.bki.source near line 1754
shows:

insert OID = 1096 ( "<=" PGUID 0 <...> date_le intltsel intltjoinsel )

where we see that indeed, it thinks "intltsel" is the right function
to use for "oprrest" in the case of dates.

Question 1 - is intltsel the right thing for selectivity on dates?

Hope someone is still with me.

So now we're running selfuncs::intltsel() where we make a further call
to selfuncs::gethilokey().  The job of gethilokey is to determine the
min and max values of a particular attribute in the table, which will
then be used with the constant in my where clause to estimate the
selectivity.  It is going to search the pg_statistic relation with
three key values:

Anum_pg_statistic_starelid     18663  (lineitem)
Anum_pg_statistic_staattnum       11  (l_shipdate)
Anum_pg_statistic_staop         1096  (date "<=")

this finds no tuples in pg_statistic.  Why is that?  The only nearby
tuple in pg_statistic is:

starelid|staattnum|staop|stalokey        |stahikey
--------+---------+-----+----------------+----------------
   18663|       11|    0|01-02-1992      |12-01-1998

and the reason the query doesn't match anything?  Because 1096 != 0.
But why is it 0 in pg_statistic?  Statistics are determined near line
1844 in vacuum.c (assuming a 'vacuum analyze' run at some point)

             i = 0;
             values[i++] = (Datum) relid;            /* 1 */
             values[i++] = (Datum) attp->attnum; /* 2 */
====>        values[i++] = (Datum) InvalidOid;       /* 3 */
             fmgr_info(stats->outfunc, &out_function);
             out_string = <...min...>
             values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
             pfree(out_string);
             out_string = <...max...>
             values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
             pfree(out_string);
             stup = heap_formtuple(sd->rd_att, values, nulls);

the "offending" line is setting the staop to InvalidOid (i.e. 0).

Question 2 - is this right?  Is the intent for 0 to serve as a
"wildcard", or should it be inserting an entry for each operation
individually?

In the case of "wildcard" then gethilokey() should allow a match for

Anum_pg_statistic_staop         0

instead of requiring the more restrictive 1096.  In the current code,
what happens next is gethilokey() returns "not found" and intltsel()
returns the default 1/3 which I see in the resultant query plan (size
= 200191 is 1/3 of the number of lineitem tuples).
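
(As a minimal sketch of that arithmetic, not code from the source tree: the
plan's size field is just the selectivity applied to the table's row count,
and 200191 implies a lineitem row count of roughly 600,572.)

    #include <stdio.h>

    int
    main(void)
    {
        double ntuples     = 600572.0;    /* lineitem row count implied by the plan (assumed) */
        double selectivity = 1.0 / 3.0;   /* intltsel() fallback when no statistic matches     */

        printf("estimated size = %.0f\n", ntuples * selectivity);   /* prints 200191 */
        return 0;
    }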

Question 3 - is there any inherent reason it couldn't get this right?
The statistic is in the table 1992 to 1998, so the '1998-09-02' date
should be 90-some% selectivity, a much better guess than 33%.

Doesn't make a difference for this particular query, of course,
because the seq scan must proceed anyhow, but it could easily affect
other queries where selectivities matter (and it affects the
modifications I am trying to test in the optimizer to be "smarter"
about selectivities - my overall context is to understand/improve the
behavior that the underlying storage system sees from queries like this).

OK, so let's say we treat 0 as a "wildcard" and stop checking for
1096.  Now we let gethilokey() return the two dates from the statistic
table.  The immediate next thing that intltsel() does, near lines 122
in selfuncs.c is call atol() on the strings from gethilokey().  And
guess what it comes up with?

low = 1
high = 12

because it calls atol() on '01-02-1992' and '12-01-1998'.  This
clearly isn't right, it should get some large integer that includes
the year and day in the result.  Then it should compare reasonably
with my constant from the where clause and give a decent selectivity
value.  This leads to a re-visit of Question 1.
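
(A tiny illustration of that atol() behavior, not from the source tree:
atol() stops at the first character that is not part of a leading integer,
so only the month number survives.)

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        long low  = atol("01-02-1992");   /* parsing stops at the first '-': low  = 1  */
        long high = atol("12-01-1998");   /* likewise:                       high = 12 */

        printf("low = %ld, high = %ld\n", low, high);
        return 0;
    }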

Question 4 - should date "<=" use a dateltsel() function instead of
intltsel() as oprrest?

If anyone is still with me, could you tell me if this makes sense, or
if there is some other location where the appropriate type conversion
could take place so that intltsel() gets something reasonable when it
does the atol() calls?

Could someone also give me a sense for how far out-of-whack the whole
current selectivity-handling structure is?  It seems that most of the
operators in pg_operator actually use intltsel() and would have
type-specific problems like that described.  Or is the problem in the
way attribute values are stored in pg_statistic by vacuum analyze?  Or
is there another layer where type conversion belongs?

Phew.  Enough typing, hope someone can follow this and address at
least some of the questions.

Thanks.

Erik Riedel
Carnegie Mellon University
www.cs.cmu.edu/~riedel


From owner-pgsql-hackers@hub.org Mon Mar 22 20:31:11 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA00802
	for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 20:31:09 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id UAA13231 for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 20:15:20 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.2/8.9.1) with SMTP id UAA01981;
	Mon, 22 Mar 1999 20:14:04 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 20:13:32 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.2/8.9.1) id UAA01835
	for pgsql-hackers-outgoing; Mon, 22 Mar 1999 20:13:28 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6])
	by hub.org (8.9.2/8.9.1) with ESMTP id UAA01822
	for <pgsql-hackers@postgreSQL.org>; Mon, 22 Mar 1999 20:13:21 -0500 (EST)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
	by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id UAA23294;
	Mon, 22 Mar 1999 20:12:43 -0500 (EST)
To: Erik Riedel <riedel+@CMU.EDU>
cc: pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] optimizer and type question
In-reply-to: Your message of Mon, 22 Mar 1999 18:27:15 -0500 (EST)
             <sqxh7H_00gNtAmTJ5Q@andrew.cmu.edu>
Date: Mon, 22 Mar 1999 20:12:43 -0500
Message-ID: <23292.922151563@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

Erik Riedel <riedel+@CMU.EDU> writes:
> [ optimizer doesn't find relevant pg_statistic entry ]

It's clearly a bug that the selectivity code is not finding this tuple.
If your analysis is correct, then selectivity estimation has *never*
worked properly, or at least not in recent memory :-(.  Yipes.
Bruce and I found a bunch of other problems in the optimizer recently,
so it doesn't faze me to assume that this is broken too.

> the "offending" line is setting the staop to InvalidOid (i.e. 0).
> Question 2 - is this right?  Is the intent for 0 to serve as a
> "wildcard",

My thought is that what the staop column ought to be is the OID of the
comparison function that was used to determine the sort order of the
column.  Without a sort op the lowest and highest keys in the column are
not well defined, so it makes no sense to assert "these are the lowest
and highest values" without providing the sort op that determined that.

(For sufficiently complex data types one could reasonably have multiple
ordering operators.  A crude example is sorting on "circumference" and
"area" for polygons.)  But typically the sort op will be the "<"
operator for the column data type.

So, the vacuum code is definitely broken --- it's not storing the sort
op that it used.  The code in gethilokey might be broken too, depending
on how it is producing the operator it's trying to match against the
tuple.  For example, if the actual operator in the query is any of
< <= > >= on int4, then int4lt ought to be used to probe the pg_statistic
table.  I'm not sure if we have adequate info in pg_operator or pg_type
to let the optimizer code determine the right thing to probe with :-(

> The immediate next thing that intltsel() does, near lines 122
> in selfuncs.c is call atol() on the strings from gethilokey().  And
> guess what it comes up with?
> low = 1
> high = 12
> because it calls atol() on '01-02-1992' and '12-01-1998'.  This
> clearly isn't right, it should get some large integer that includes
> the year and day in the result.  Then it should compare reasonably
> with my constant from the where clause and give a decent selectivity
> value.  This leads to a re-visit of Question 1.
> Question 4 - should date "<=" use a dateltsel() function instead of
> intltsel() as oprrest?

This is clearly busted as well.  I'm not sure that creating dateltsel()
is the right fix, however, because if you go down that path then every
single datatype needs its own selectivity function; that's more than we
need.

What we really want here is to be able to map datatype values into
some sort of numeric range so that we can compute what fraction of the
low-key-to-high-key range is on each side of the probe value (the
constant taken from the query).  This general concept will apply to
many scalar types, so what we want is a type-specific mapping function
and a less-specific fraction-computing-function.  Offhand I'd say that
we want intltsel() and floatltsel(), plus conversion routines that can
produce either int4 or float8 from a data type as seems appropriate.
Anything that couldn't map to one or the other would have to supply its
own selectivity function.
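
(A minimal sketch of that division of labor, with made-up names rather than
the actual PostgreSQL routines: one type-specific hook maps a value onto a
number line, and a single shared routine computes the fraction of the range
that lies at or below the probe.)

    /* sketch: type-specific mapping plus one shared fraction computation */
    typedef double (*map_to_double_fn) (const char *value);

    static double
    range_le_selectivity(map_to_double_fn map,
                         const char *low, const char *high, const char *probe)
    {
        double  lo = map(low);
        double  hi = map(high);
        double  p  = map(probe);

        if (hi <= lo)
            return 1.0 / 3.0;          /* no usable range; fall back to the old default */
        if (p <= lo)
            return 0.0;
        if (p >= hi)
            return 1.0;
        return (p - lo) / (hi - lo);   /* fraction of [low, high] at or below the probe */
    }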

> Or is the problem in the
> way attribute values are stored in pg_statistic by vacuum analyze?

Looks like it converts the low and high values to text and stores them
that way.  Ugly as can be :-( but I'm not sure there is a good
alternative.  We have no "wild card" column type AFAIK, which is what
these columns of pg_statistic would have to be to allow storage of
unconverted min and max values.

I think you've found a can of worms here.  Congratulations ;-)

			regards, tom lane


From owner-pgsql-hackers@hub.org Mon Mar 22 23:31:00 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id XAA03384
	for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 23:30:58 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id XAA25586 for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 23:18:25 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.2/8.9.1) with SMTP id XAA17955;
	Mon, 22 Mar 1999 23:17:24 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 23:16:49 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.2/8.9.1) id XAA17764
	for pgsql-hackers-outgoing; Mon, 22 Mar 1999 23:16:46 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from po8.andrew.cmu.edu (PO8.ANDREW.CMU.EDU [128.2.10.108])
	by hub.org (8.9.2/8.9.1) with ESMTP id XAA17745
	for <pgsql-hackers@postgreSQL.org>; Mon, 22 Mar 1999 23:16:39 -0500 (EST)
	(envelope-from er1p+@andrew.cmu.edu)
Received: (from postman@localhost) by po8.andrew.cmu.edu (8.8.5/8.8.2) id XAA04273; Mon, 22 Mar 1999 23:16:37 -0500 (EST)
Received: via switchmail; Mon, 22 Mar 1999 23:16:37 -0500 (EST)
Received: from hazy.adsl.net.cmu.edu via qmail
          ID </afs/andrew.cmu.edu/service/mailqs/q000/QF.kqxlJ:S00anI00p040>;
          Mon, 22 Mar 1999 23:15:09 -0500 (EST)
Received: from hazy.adsl.net.cmu.edu via qmail
          ID </afs/andrew.cmu.edu/usr2/er1p/.Outgoing/QF.MqxlJ3q00anI01hKE0>;
          Mon, 22 Mar 1999 23:15:00 -0500 (EST)
Received: from mms.4.60.Jun.27.1996.03.02.53.sun4.51.EzMail.2.0.CUILIB.3.45.SNAP.NOT.LINKED.hazy.adsl.net.cmu.edu.sun4m.54
          via MS.5.6.hazy.adsl.net.cmu.edu.sun4_51;
          Mon, 22 Mar 1999 23:14:55 -0500 (EST)
Message-ID: <4qxlJ0200anI01hK40@andrew.cmu.edu>
Date: Mon, 22 Mar 1999 23:14:55 -0500 (EST)
From: Erik Riedel <riedel+@CMU.EDU>
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [HACKERS] optimizer and type question
Cc: pgsql-hackers@postgreSQL.org
In-Reply-To: <23292.922151563@sss.pgh.pa.us>
References: <23292.922151563@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

OK, building on your high-level explanation, I am attaching a patch that
attempts to do something "better" than the current code.  Note that I
have only tested this with the date type and my particular query.  I
haven't run it through the regression, so consider it "proof of concept"
at best.  Although hopefully it will serve my purposes.

> My thought is that what the staop column ought to be is the OID of the
> comparison function that was used to determine the sort order of the
> column.  Without a sort op the lowest and highest keys in the column are
> not well defined, so it makes no sense to assert "these are the lowest
> and highest values" without providing the sort op that determined that.
>
> (For sufficiently complex data types one could reasonably have multiple
> ordering operators.  A crude example is sorting on "circumference" and
> "area" for polygons.)  But typically the sort op will be the "<"
> operator for the column data type.
>
I changed vacuum.c to do exactly that.  oid of the lt sort op.

> So, the vacuum code is definitely broken --- it's not storing the sort
> op that it used.  The code in gethilokey might be broken too, depending
> on how it is producing the operator it's trying to match against the
> tuple.  For example, if the actual operator in the query is any of
> < <= > >= on int4, then int4lt ought to be used to probe the pg_statistic
> table.  I'm not sure if we have adequate info in pg_operator or pg_type
> to let the optimizer code determine the right thing to probe with :-(
>
This indeed seems like a bigger problem.  I thought about somehow using
type-matching from the sort op and the actual operator in the query - if
both the left and right type match, then consider them the same for
purposes of this probe.  That seemed complicated, so I punted in my
example - it just does the search with relid and attnum and assumes that
only returns one tuple.  This works in my case (maybe in all cases,
because of the way vacuum is currently written - ?).

> What we really want here is to be able to map datatype values into
> some sort of numeric range so that we can compute what fraction of the
> low-key-to-high-key range is on each side of the probe value (the
> constant taken from the query).  This general concept will apply to
> many scalar types, so what we want is a type-specific mapping function
> and a less-specific fraction-computing-function.  Offhand I'd say that
> we want intltsel() and floatltsel(), plus conversion routines that can
> produce either int4 or float8 from a data type as seems appropriate.
> Anything that couldn't map to one or the other would have to supply its
> own selectivity function.
>
This is what my example then does.  Uses the stored sort op to get the
type and then uses typinput to convert from the string to an int4.

Then puts the int4 back into string format because that's what everyone
was expecting.

It seems to work for my particular query.  I now get:

(selfuncs) gethilokey() obj 18663 attr 11 opid 1096 (ignored)
(selfuncs) gethilokey() found op 1087 in pg_proc
(selfuncs) gethilokey() found type 1082 in pg_type
(selfuncs) gethilokey() going to use 1084 to convert type 1082
(selfuncs) gethilokey() have low -2921 high -396
(selfuncs) intltsel() high -396 low -2921 val -486
(plancat) restriction_selectivity() for func 103 op 1096 rel 18663 attr
11 const -486 flag 3 returns 0.964356
NOTICE:  QUERY PLAN:

Sort  (cost=34467.88 size=0 width=0)
 ->  Aggregate  (cost=34467.88 size=0 width=0)
  ->  Group  (cost=34467.88 size=0 width=0)
   ->  Sort  (cost=34467.88 size=0 width=0)
    ->  Seq Scan on lineitem  (cost=34467.88 size=579166 width=44)

including my printfs, which exist in the patch as well.

Selectivity is now the expected 96% and the size estimate for the seq
scan is much closer to correct.
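
(For reference, a sketch of the interpolation behind that 0.964356, using
the converted values from the trace above; the negative numbers are the
date type's internal day counts, so plain arithmetic now gives a sensible
answer.)

    #include <stdio.h>

    int
    main(void)
    {
        double low  = -2921.0;    /* '01-02-1992' after conversion   */
        double high =  -396.0;    /* '12-01-1998' after conversion   */
        double val  =  -486.0;    /* the '1998-09-02' probe constant */

        /* (val - low) / (high - low) = 2435 / 2525 */
        printf("selectivity = %f\n", (val - low) / (high - low));   /* 0.964356 */
        return 0;
    }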

Again, not tested with anything besides date, so caveat not-tested.

Hope this helps.

Erik

----------------------[optimizer_fix.sh]------------------------

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#	selfuncs.c.diff
#	vacuum.c.diff
# This archive created: Mon Mar 22 22:58:14 1999
export PATH; PATH=/bin:/usr/bin:$PATH
if test -f 'selfuncs.c.diff'
then
	echo shar: "will not over-write existing file 'selfuncs.c.diff'"
else
cat << \SHAR_EOF > 'selfuncs.c.diff'
***
/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/611/src/backend/utils/adt
/selfuncs.c	Thu Mar 11 23:59:35 1999
---
/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/615/src/backend/utils/adt
/selfuncs.c	Mon Mar 22 22:57:25 1999
***************
*** 32,37 ****
--- 32,40 ----
  #include "utils/lsyscache.h"	/* for get_oprrest() */
  #include "catalog/pg_statistic.h"
  
+ #include "catalog/pg_proc.h"    /* for Form_pg_proc */
+ #include "catalog/pg_type.h"    /* for Form_pg_type */
+ 
  /* N is not a valid var/constant or relation id */
  #define NONVALUE(N)		((N) == -1)
  
***************
*** 103,110 ****
  				bottom;
  
  	result = (float64) palloc(sizeof(float64data));
! 	if (NONVALUE(attno) || NONVALUE(relid))
  		*result = 1.0 / 3;
  	else
  	{
  		/* XXX			val = atol(value); */
--- 106,114 ----
  				bottom;
  
  	result = (float64) palloc(sizeof(float64data));
! 	if (NONVALUE(attno) || NONVALUE(relid)) {
  		*result = 1.0 / 3;
+ 	}
  	else
  	{
  		/* XXX			val = atol(value); */
***************
*** 117,130 ****
  		}
  		high = atol(highchar);
  		low = atol(lowchar);
  		if ((flag & SEL_RIGHT && val < low) ||
  			(!(flag & SEL_RIGHT) && val > high))
  		{
  			float32data nvals;
  
  			nvals = getattdispersion(relid, (int) attno);
! 			if (nvals == 0)
  				*result = 1.0 / 3.0;
  			else
  			{
  				*result = 3.0 * (float64data) nvals;
--- 121,136 ----
  		}
  		high = atol(highchar);
  		low = atol(lowchar);
+ 		printf("(selfuncs) intltsel() high %d low %d val %d\n",high,low,val);
  		if ((flag & SEL_RIGHT && val < low) ||
  			(!(flag & SEL_RIGHT) && val > high))
  		{
  			float32data nvals;
  
  			nvals = getattdispersion(relid, (int) attno);
! 			if (nvals == 0) {
  				*result = 1.0 / 3.0;
+ 			}
  			else
  			{
  				*result = 3.0 * (float64data) nvals;
***************
*** 336,341 ****
--- 342,353 ----
  {
  	Relation	rel;
  	HeapScanDesc scan;
+ 	/* this assumes there is only one row in the statistics table for any
particular */
+ 	/* relid, attnum pair - could be more complicated if staop is also
used.         */
+ 	/* at the moment, if there are multiple rows, this code ends up
picking the      */
+ 	/* "first" one                                                       
   - er1p  */
+ 	/* the actual "ignoring" is done in the call to heap_beginscan()
below, where    */
+ 	/* we only mention 2 of the 3 keys in this array                     
   - er1p  */
  	static ScanKeyData key[3] = {
  		{0, Anum_pg_statistic_starelid, F_OIDEQ, {0, 0, F_OIDEQ}},
  		{0, Anum_pg_statistic_staattnum, F_INT2EQ, {0, 0, F_INT2EQ}},
***************
*** 344,355 ****
  	bool		isnull;
  	HeapTuple	tuple;
  
  	rel = heap_openr(StatisticRelationName);
  
  	key[0].sk_argument = ObjectIdGetDatum(relid);
  	key[1].sk_argument = Int16GetDatum((int16) attnum);
  	key[2].sk_argument = ObjectIdGetDatum(opid);
! 	scan = heap_beginscan(rel, 0, SnapshotNow, 3, key);
  	tuple = heap_getnext(scan, 0);
  	if (!HeapTupleIsValid(tuple))
  	{
--- 356,377 ----
  	bool		isnull;
  	HeapTuple	tuple;
  
+ 	HeapTuple tup;
+ 	Form_pg_proc proc;
+ 	Form_pg_type typ;
+ 	Oid which_op;
+ 	Oid which_type;
+ 	int32 low_value;
+ 	int32 high_value;
+ 
  	rel = heap_openr(StatisticRelationName);
  
  	key[0].sk_argument = ObjectIdGetDatum(relid);
  	key[1].sk_argument = Int16GetDatum((int16) attnum);
  	key[2].sk_argument = ObjectIdGetDatum(opid);
! 	printf("(selfuncs) gethilokey() obj %d attr %d opid %d (ignored)\n",
! 	       key[0].sk_argument,key[1].sk_argument,key[2].sk_argument);
! 	scan = heap_beginscan(rel, 0, SnapshotNow, 2, key);
  	tuple = heap_getnext(scan, 0);
  	if (!HeapTupleIsValid(tuple))
  	{
***************
*** 376,383 ****
--- 398,461 ----
  								&isnull));
  	if (isnull)
  		elog(DEBUG, "gethilokey: low key is null");
+ 
  	heap_endscan(scan);
  	heap_close(rel);
+ 
+ 	/* now we deal with type conversion issues                           
         */
+ 	/* when intltsel() calls this routine (who knows what other callers
might do)  */
+ 	/* it assumes that it can call atol() on the strings and then use
integer      */
+ 	/* comparison from there.  what we are going to do here, then, is try
to use   */
+ 	/* the type information from Anum_pg_statistic_staop to convert the
high       */
+ 	/* and low values                                                   
- er1p    */
+ 
+ 	/* WARNING: this code has only been tested with the date type and has
NOT      */
+ 	/* been regression tested.  consider it "sample" code of what might
be the     */
+ 	/* right kind of thing to do                                        
- er1p    */
+ 
+ 	/* get the 'op' from pg_statistic and look it up in pg_proc */
+ 	which_op = heap_getattr(tuple,
+ 				Anum_pg_statistic_staop,
+ 				RelationGetDescr(rel),
+ 				&isnull);
+ 	if (InvalidOid == which_op) {
+ 	  /* ignore all this stuff, try conversion only if we have a valid staop */
+ 	  /* note that there is an accompanying change to 'vacuum analyze' that  */
+ 	  /* gets this set to something useful.                                  */
+ 	} else {
+ 	  /* staop looks valid, so let's see what we can do about conversion */
+ 	  tup = SearchSysCacheTuple(PROOID, ObjectIdGetDatum(which_op), 0, 0, 0);
+ 	  if (!HeapTupleIsValid(tup)) {
+ 	    elog(ERROR, "selfuncs: unable to find op in pg_proc %d", which_op);
+ 	  }
+ 	  printf("(selfuncs) gethilokey() found op %d in pg_proc\n",which_op);
+ 	  
+ 	  /* use that to determine the type of stahikey and stalokey via pg_type */
+ 	  proc = (Form_pg_proc) GETSTRUCT(tup);
+ 	  which_type = proc->proargtypes[0]; /* XXX - use left and right
separately? */
+ 	  tup = SearchSysCacheTuple(TYPOID, ObjectIdGetDatum(which_type), 0, 0, 0);
+ 	  if (!HeapTupleIsValid(tup)) {
+ 	    elog(ERROR, "selfuncs: unable to find type in pg_type %d", which_type);
+ 	  }
+ 	  printf("(selfuncs) gethilokey() found type %d in pg_type\n",which_type);
+ 	  
+ 	  /* and use that type to get the conversion function to int4 */
+ 	  typ = (Form_pg_type) GETSTRUCT(tup);
+ 	  printf("(selfuncs) gethilokey() going to use %d to convert type
%d\n",typ->typinput,which_type);
+ 	  
+ 	  /* and convert the low and high strings */
+ 	  low_value = (int32) fmgr(typ->typinput, *low, -1);
+ 	  high_value = (int32) fmgr(typ->typinput, *high, -1);
+ 	  printf("(selfuncs) gethilokey() have low %d high
%d\n",low_value,high_value);
+ 	  
+ 	  /* now we have int4's, which we put back into strings because
that's what out  */
+ 	  /* callers (intltsel() at least) expect                            
    - er1p */
+ 	  pfree(*low); pfree(*high); /* let's not leak the old strings */
+ 	  *low = int4out(low_value);
+ 	  *high = int4out(high_value);
+ 
+ 	  /* XXX - this probably leaks the two tups we got from
SearchSysCacheTuple() - er1p */
+ 	}
  }
  
  float64
SHAR_EOF
fi
if test -f 'vacuum.c.diff'
then
	echo shar: "will not over-write existing file 'vacuum.c.diff'"
else
cat << \SHAR_EOF > 'vacuum.c.diff'
***
/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/611/src/backend/commands/
vacuum.c	Thu Mar 11 23:59:09 1999
---
/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/615/src/backend/commands/
vacuum.c	Mon Mar 22 21:23:15 1999
***************
*** 1842,1848 ****
  					i = 0;
  					values[i++] = (Datum) relid;		/* 1 */
  					values[i++] = (Datum) attp->attnum; /* 2 */
! 					values[i++] = (Datum) InvalidOid;	/* 3 */
  					fmgr_info(stats->outfunc, &out_function);
  					out_string = (*fmgr_faddr(&out_function)) (stats->min,
stats->attr->atttypid);
  					values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
--- 1842,1848 ----
  					i = 0;
  					values[i++] = (Datum) relid;		/* 1 */
  					values[i++] = (Datum) attp->attnum; /* 2 */
! 					values[i++] = (Datum) stats->f_cmplt.fn_oid;	/* 3 */ /* get the
'<' oid, instead of 'invalid' - er1p */
  					fmgr_info(stats->outfunc, &out_function);
  					out_string = (*fmgr_faddr(&out_function)) (stats->min,
stats->attr->atttypid);
  					values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
SHAR_EOF
fi
exit 0
#	End of shell archive


From owner-pgsql-hackers@hub.org Tue Mar 23 12:31:05 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17491
	for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:31:04 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id MAA08839 for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:08:14 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.2/8.9.1) with SMTP id MAA93649;
	Tue, 23 Mar 1999 12:04:57 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 23 Mar 1999 12:03:00 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.2/8.9.1) id MAA93355
	for pgsql-hackers-outgoing; Tue, 23 Mar 1999 12:02:55 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6])
	by hub.org (8.9.2/8.9.1) with ESMTP id MAA93336
	for <pgsql-hackers@postgreSQL.org>; Tue, 23 Mar 1999 12:02:43 -0500 (EST)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
	by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24455;
	Tue, 23 Mar 1999 12:01:57 -0500 (EST)
To: Erik Riedel <riedel+@CMU.EDU>
cc: pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] optimizer and type question
In-reply-to: Your message of Mon, 22 Mar 1999 23:14:55 -0500 (EST)
             <4qxlJ0200anI01hK40@andrew.cmu.edu>
Date: Tue, 23 Mar 1999 12:01:57 -0500
Message-ID: <24453.922208517@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: RO

Erik Riedel <riedel+@CMU.EDU> writes:
> OK, building on your high-level explanation, I am attaching a patch that
> attempts to do something "better" than the current code.  Note that I
> have only tested this with the date type and my particular query.

Glad to see you working on this.  I don't like the details of your
patch too much though ;-).  Here are some suggestions for making it
better.

1. I think just removing staop from the lookup in gethilokey is OK for
now, though I'm dubious about Bruce's thought that we could delete that
field entirely.  As you observe, vacuum will not currently put more
than one tuple for a column into pg_statistic, so we can just do the
lookup with relid and attno and leave it at that.  But I think we ought
to leave the field there, with the idea that vacuum might someday
compute more than one statistic for a data column.  Fixing vacuum to
put its sort op into the field is a good idea in the meantime.

2. The type conversion you're doing in gethilokey is a mess; I think
what you ought to make it do is simply the inbound conversion of the
string from pg_statistic into the internal representation for the
column's datatype, and return that value as a Datum.  It also needs
a cleaner success/failure return convention --- this business with
"n" return is ridiculously type-specific.  Also, the best and easiest
way to find the type to convert to is to look up the column type in
the info for the given relid, not search pg_proc with the staop value.
(I'm not sure that will even work, since there are pg_proc entries
with wildcard argument types.)

3. The atol() calls currently found in intltsel are a type-specific
cheat on what is conceptually a two-step process:
  * Convert the string stored in pg_statistic back to the internal
    form for the column data type.
  * Generate a numeric representation of the data value that can be
    used as an estimate of the range of values in the table.
The second step is trivial for integers, which may obscure the fact
that there are two steps involved, but nonetheless there are.  If
you think about applying selectivity logic to strings, say, it
becomes clear that the second step is a necessary component of the
process.  Furthermore, the second step must also be applied to the
probe value that's being passed into the selectivity operator.
(The probe value is already in internal form, of course; but it is
not necessarily in a useful numeric form.)
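
(A self-contained sketch of those two steps for the date type as it is
stored here, "MM-DD-YYYY" text; the mapping in step two only needs to
preserve ordering, not be exact, so a crude linearization suffices.)

    #include <stdio.h>

    /* sketch: step 1 parses the stored string back into month/day/year,
     * step 2 maps that onto a number line that preserves date ordering */
    static double
    date_string_to_number(const char *stored)
    {
        int     month, day, year;

        if (sscanf(stored, "%d-%d-%d", &month, &day, &year) != 3)
            return 0.0;                                /* unparsable: give up gracefully */
        return year * 372.0 + (month - 1) * 31.0 + (day - 1);   /* order-preserving */
    }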

We can do the first of these steps by applying the appropriate "XXXin"
conversion function for the column data type, as you have done.  The
interesting question is how to do the second one.  A really clean
solution would require adding a column to pg_type that points to a
function that will do the appropriate conversion.  I'd be inclined to
make all of these functions return "double" (float8) and just have one
top-level selectivity routine for all data types that can use
range-based selectivity logic.

We could probably hack something together that would not use an explicit
conversion function for each data type, but instead would rely on
type-specific assumptions inside the selectivity routines.  We'd need many
more selectivity routines though (at least one for each of int, float4,
float8, and text data types) so I'm not sure we'd really save any work
compared to doing it right.

BTW, now that I look at this issue it's real clear that the selectivity
entries in pg_operator are horribly broken.  The intltsel/intgtsel
selectivity routines are currently applied to 32 distinct data types:

regression=> select distinct typname,oprleft from pg_operator, pg_type
regression-> where pg_type.oid = oprleft
regression-> and oprrest in (103,104);
typname  |oprleft
---------+-------
_aclitem |   1034
abstime  |    702
bool     |     16
box      |    603
bpchar   |   1042
char     |     18
cidr     |    650
circle   |    718
date     |   1082
datetime |   1184
float4   |    700
float8   |    701
inet     |    869
int2     |     21
int4     |     23
int8     |     20
line     |    628
lseg     |    601
macaddr  |    829
money    |    790
name     |     19
numeric  |   1700
oid      |     26
oid8     |     30
path     |    602
point    |    600
polygon  |    604
text     |     25
time     |   1083
timespan |   1186
timestamp|   1296
varchar  |   1043
(32 rows)

many of which are very obviously not compatible with integer for *any*
purpose.  It looks to me like a lot of data types were added to
pg_operator just by copy-and-paste, without paying attention to whether
the selectivity routines were actually correct for the data type.

As the code stands today, the bogus entries don't matter because
gethilokey always fails, so we always get 1/3 as the selectivity
estimate for any comparison operator (except = and != of course).
I had actually noticed that fact and assumed that it was supposed
to work that way :-(.  But, clearly, there is code in here that
is *trying* to be smarter.

As soon as we fix gethilokey so that it can succeed, we will start
getting essentially-random selectivity estimates for those data types
that aren't actually binary-compatible with integer.  That will not do;
we have to do something about the issue.

			regards, tom lane


From tgl@sss.pgh.pa.us Tue Mar 23 12:31:02 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17484
	for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:31:01 -0500 (EST)
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) by renoir.op.net (o1/$Revision: 1.18 $) with ESMTP id MAA09042 for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:10:55 -0500 (EST)
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
	by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24474;
	Tue, 23 Mar 1999 12:09:52 -0500 (EST)
To: Bruce Momjian <maillist@candle.pha.pa.us>
cc: riedel+@CMU.EDU, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] optimizer and type question
In-reply-to: Your message of Mon, 22 Mar 1999 21:25:45 -0500 (EST)
             <199903230225.VAA01641@candle.pha.pa.us>
Date: Tue, 23 Mar 1999 12:09:52 -0500
Message-ID: <24471.922208992@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Status: RO

Bruce Momjian <maillist@candle.pha.pa.us> writes:
> What we really need is some way to determine how far the requested value
> is from the min/max values.  With int, we just do (val-min)/(max-min).
> That works, but how do we do that for types that don't support division.
> Strings come to mind in this case.

What I'm envisioning is that we still apply the (val-min)/(max-min)
logic, but apply it to numeric values that are produced in a
type-dependent way.

For ints and floats the conversion is trivial, of course.

For strings, the first thing that comes to mind is to return 0 for a
null string and the value of the first byte for a non-null string.
This would give you one-part-in-256 selectivity which is plenty good
enough for what the selectivity code needs to do.  (Actually, it's
only that good if the strings' first bytes are pretty well spread out.
If you have a table containing English words, for example, you might
only get about one part in 26 this way, since the first bytes will
probably only run from A to Z.  Might be better to use the first two
characters of the string to compute the selectivity representation.)

In general, you can apply this logic as long as you can come up with
some numerical approximation to the data type's sorting order.  It
doesn't have to be exact.
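
(A minimal sketch of that string mapping, not an actual PostgreSQL function:
place a string on a number line using its first two bytes, which is as much
resolution as the selectivity estimate needs.)

    /* sketch: crude numeric stand-in for a string's position in sort order */
    static double
    string_to_number(const char *s)
    {
        if (s == NULL || s[0] == '\0')
            return 0.0;                    /* null/empty string sorts first */
        /* s[1] is safe to read here because s[0] is not the terminator */
        return (unsigned char) s[0] * 256.0 + (unsigned char) s[1];
    }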
 | 
						|
 | 
						|
			regards, tom lane
 | 
						|
 | 
						|
From owner-pgsql-hackers@hub.org Tue Mar 23 12:31:03 1999
 | 
						|
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17488
 | 
						|
	for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:31:02 -0500 (EST)
 | 
						|
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id MAA09987 for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:21:34 -0500 (EST)
 | 
						|
Received: from localhost (majordom@localhost)
 | 
						|
	by hub.org (8.9.2/8.9.1) with SMTP id MAA95155;
 | 
						|
	Tue, 23 Mar 1999 12:18:33 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers@hub.org)
 | 
						|
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 23 Mar 1999 12:17:00 +0000 (EST)
 | 
						|
Received: (from majordom@localhost)
 | 
						|
	by hub.org (8.9.2/8.9.1) id MAA94857
 | 
						|
	for pgsql-hackers-outgoing; Tue, 23 Mar 1999 12:16:56 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
 | 
						|
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6])
 | 
						|
	by hub.org (8.9.2/8.9.1) with ESMTP id MAA94469
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Tue, 23 Mar 1999 12:11:33 -0500 (EST)
 | 
						|
	(envelope-from tgl@sss.pgh.pa.us)
 | 
						|
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
 | 
						|
	by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24474;
 | 
						|
	Tue, 23 Mar 1999 12:09:52 -0500 (EST)
 | 
						|
To: Bruce Momjian <maillist@candle.pha.pa.us>
 | 
						|
cc: riedel+@CMU.EDU, pgsql-hackers@postgreSQL.org
 | 
						|
Subject: Re: [HACKERS] optimizer and type question 
 | 
						|
In-reply-to: Your message of Mon, 22 Mar 1999 21:25:45 -0500 (EST) 
 | 
						|
             <199903230225.VAA01641@candle.pha.pa.us> 
 | 
						|
Date: Tue, 23 Mar 1999 12:09:52 -0500
 | 
						|
Message-ID: <24471.922208992@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Sender: owner-pgsql-hackers@postgreSQL.org
 | 
						|
Precedence: bulk
 | 
						|
Status: RO
 | 
						|
 | 
						|
Bruce Momjian <maillist@candle.pha.pa.us> writes:
 | 
						|
> What we really need is some way to determine how far the requested value
 | 
						|
> is from the min/max values.  With int, we just do (val-min)/(max-min). 
 | 
						|
> That works, but how do we do that for types that don't support division.
 | 
						|
> Strings come to mind in this case.
 | 
						|
 | 
						|
What I'm envisioning is that we still apply the (val-min)/(max-min)
 | 
						|
logic, but apply it to numeric values that are produced in a
 | 
						|
type-dependent way.
 | 
						|
 | 
						|
For ints and floats the conversion is trivial, of course.
 | 
						|
 | 
						|
For strings, the first thing that comes to mind is to return 0 for a
 | 
						|
null string and the value of the first byte for a non-null string.
 | 
						|
This would give you one-part-in-256 selectivity which is plenty good
 | 
						|
enough for what the selectivity code needs to do.  (Actually, it's
 | 
						|
only that good if the strings' first bytes are pretty well spread out.
 | 
						|
If you have a table containing English words, for example, you might
 | 
						|
only get about one part in 26 this way, since the first bytes will
 | 
						|
probably only run from A to Z.  Might be better to use the first two
 | 
						|
characters of the string to compute the selectivity representation.)
 | 
						|
 | 
						|
In general, you can apply this logic as long as you can come up with
 | 
						|
some numerical approximation to the data type's sorting order.  It
 | 
						|
doesn't have to be exact.
 | 
						|
 | 
						|
			regards, tom lane
 | 
						|
 | 
						|
 | 
						|
From owner-pgsql-hackers@hub.org Thu Jul  1 20:39:19 1999
 | 
						|
Received: from hub.org (hub.org [209.167.229.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA15403
 | 
						|
	for <maillist@candle.pha.pa.us>; Thu, 1 Jul 1999 20:39:18 -0400 (EDT)
 | 
						|
Received: from hub.org (hub.org [209.167.229.1])
 | 
						|
	by hub.org (8.9.3/8.9.3) with ESMTP id UAA45018;
 | 
						|
	Thu, 1 Jul 1999 20:20:27 -0400 (EDT)
 | 
						|
	(envelope-from owner-pgsql-hackers@hub.org)
 | 
						|
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 01 Jul 1999 20:15:30 +0000 (EDT)
 | 
						|
Received: (from majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) id UAA44474
 | 
						|
	for pgsql-hackers-outgoing; Thu, 1 Jul 1999 20:15:28 -0400 (EDT)
 | 
						|
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
 | 
						|
X-Authentication-Warning: hub.org: majordom set sender to owner-pgsql-hackers@postgreSQL.org using -f
 | 
						|
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
 | 
						|
	by hub.org (8.9.3/8.9.3) with ESMTP id UAA44058
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 1 Jul 1999 20:12:10 -0400 (EDT)
 | 
						|
	(envelope-from Inoue@tpf.co.jp)
 | 
						|
Received: from cadzone ([126.0.1.40] (may be forged))
 | 
						|
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
 | 
						|
   id JAA00279 for <pgsql-hackers@postgreSQL.org>; Fri, 02 Jul 1999 09:11:58 +0900
 | 
						|
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
 | 
						|
To: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
 | 
						|
Subject: [HACKERS] Optimization FAQ ?
 | 
						|
Date: Fri, 2 Jul 1999 09:14:10 +0900
 | 
						|
Message-ID: <000401bec41f$ce81dcc0$2801007e@cadzone.tpf.co.jp>
 | 
						|
MIME-Version: 1.0
 | 
						|
Content-Type: text/plain;
 | 
						|
	charset="iso-2022-jp"
 | 
						|
Content-Transfer-Encoding: 7bit
 | 
						|
X-Priority: 3 (Normal)
 | 
						|
X-MSMail-Priority: Normal
 | 
						|
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
 | 
						|
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4
 | 
						|
Importance: Normal
 | 
						|
Sender: owner-pgsql-hackers@postgreSQL.org
 | 
						|
Precedence: bulk
 | 
						|
Status: RO
 | 
						|
 | 
						|
Hello all,

I got the following result.
Is this a FAQ?

drop table int2t;
create table int2t (id int2 primary key);

explain select * from int2t where id=1;
  NOTICE:  QUERY PLAN:

  Seq Scan on int2t  (cost=43.00 rows=2 width=2)

explain select * from int2t where id=1::int2;
  NOTICE:  QUERY PLAN:

  Index Scan using int2t_pkey on int2t  (cost=2.05 rows=2 width=2)

explain select * from int2t where id='1';
  NOTICE:  QUERY PLAN:

  Index Scan using int2t_pkey on int2t  (cost=2.05 rows=2 width=2)

Is this the right behavior?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
 | 
						|
 | 
						|
 | 
						|
From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000
 | 
						|
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:45:30 -0500 (EST)
 | 
						|
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.20 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
 | 
						|
Received: from localhost (majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) with SMTP id TAA00957;
 | 
						|
	Thu, 20 Jan 2000 19:35:19 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers)
 | 
						|
Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 19:33:34 -0500
 | 
						|
Received: (from majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) id TAA00581
 | 
						|
	for pgsql-hackers-outgoing; Thu, 20 Jan 2000 19:32:37 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
 | 
						|
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
 | 
						|
	by hub.org (8.9.3/8.9.3) with ESMTP id TAA98940
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 19:31:49 -0500 (EST)
 | 
						|
	(envelope-from tgl@sss.pgh.pa.us)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id TAA25390
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 19:31:32 -0500 (EST)
 | 
						|
To: pgsql-hackers@postgreSQL.org
 | 
						|
Subject: [HACKERS] Some notes on optimizer cost estimates
 | 
						|
Date: Thu, 20 Jan 2000 19:31:32 -0500
 | 
						|
Message-ID: <25387.948414692@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Sender: owner-pgsql-hackers@postgreSQL.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
I have been spending some time measuring actual runtimes for various
sequential-scan and index-scan query plans, and have learned that the
current Postgres optimizer's cost estimation equations are not very
close to reality at all.

Presently we estimate the cost of a sequential scan as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples

--- that is, the unit of cost is the time to read one disk page,
and we have a "fudge factor" that relates CPU time per tuple to
disk time per page.  (The default CPU_PAGE_WEIGHT is 0.033, which
is probably too high for modern hardware --- 0.01 seems like it
might be a better default, at least for simple queries.)  OK,
it's a simplistic model, but not too unreasonable so far.

The cost of an index scan is measured in these same terms as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples +
	  CPU_INDEX_PAGE_WEIGHT * Nindextuples

Here Ntuples is the number of tuples selected by the index qual
condition (typically, it's less than the total table size used in
sequential-scan estimation).  CPU_INDEX_PAGE_WEIGHT essentially
estimates the cost of scanning an index tuple; by default it's 0.017 or
half CPU_PAGE_WEIGHT.  Nblocks is estimated as the index size plus an
appropriate fraction of the main table size.
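
For reference, a minimal sketch of the two cost equations as just
described (plain C written for this note, not the actual optimizer
code; the names are only illustrative):

	static double cpu_page_weight = 0.033;
	static double cpu_index_page_weight = 0.017;

	static double
	seqscan_cost(double nblocks, double ntuples)
	{
		return nblocks + cpu_page_weight * ntuples;
	}

	static double
	indexscan_cost(double nblocks, double ntuples, double nindextuples)
	{
		/* nblocks here = index pages plus the estimated fraction of
		 * main-table pages visited, per the paragraph above */
		return nblocks + cpu_page_weight * ntuples
			+ cpu_index_page_weight * nindextuples;
	}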

There are two big problems with this:

1. Since main-table tuples are visited in index order, we'll be hopping
around from page to page in the table.  The current cost estimation
method essentially assumes that the buffer cache plus OS disk cache will
be 100% efficient --- we will never have to read the same page of the
main table twice in a scan, due to having discarded it between
references.  This of course is unreasonably optimistic.  Worst case
is that we'd fetch a main-table page for each selected tuple, but in
most cases that'd be unreasonably pessimistic.

2. The cost of a disk page fetch is estimated at 1.0 unit for both
sequential and index scans.  In reality, sequential access is *much*
cheaper than the quasi-random accesses performed by an index scan.
This is partly a matter of physical disk seeks, and partly a matter
of benefitting (or not) from any read-ahead logic the OS may employ.

As best I can measure on my hardware, the cost of a nonsequential
disk read should be estimated at 4 to 5 times the cost of a sequential
one --- I'm getting numbers like 2.2 msec per disk page for sequential
scans, and as much as 11 msec per page for index scans.  I don't
know, however, if this ratio is similar enough on other platforms
to be useful for cost estimating.  We could make it a parameter like
we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
ever bothers to adjust those numbers in the field ...

The other effect that needs to be modeled, and currently is not, is the
"hit rate" of buffer cache.  Presumably, this is 100% for tables smaller
than the cache and drops off as the table size increases --- but I have
no particular thoughts on the form of the dependency.  Does anyone have
ideas here?  The problem is complicated by the fact that we don't really
know how big the cache is; we know the number of buffers Postgres has,
but we have no idea how big a disk cache the kernel is keeping.  As near
as I can tell, finding a hit in the kernel disk cache is not a lot more
expensive than having the page sitting in Postgres' own buffers ---
certainly it's much much cheaper than a disk read.

BTW, if you want to do some measurements of your own, try turning on
PGOPTIONS="-d 2 -te".  This will dump a lot of interesting numbers
into the postmaster log, if your platform supports getrusage().

			regards, tom lane

************
 | 
						|
 | 
						|
From owner-pgsql-hackers@hub.org Thu Jan 20 20:26:33 2000
 | 
						|
Received: from hub.org (hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA06630
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 21:26:32 -0500 (EST)
 | 
						|
Received: from localhost (majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) with SMTP id VAA35022;
 | 
						|
	Thu, 20 Jan 2000 21:22:08 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers)
 | 
						|
Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:20:35 -0500
 | 
						|
Received: (from majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) id VAA34569
 | 
						|
	for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:19:38 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
 | 
						|
Received: from hercules.cs.ucsb.edu (hercules.cs.ucsb.edu [128.111.41.30])
 | 
						|
	by hub.org (8.9.3/8.9.3) with ESMTP id VAA34534
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 21:19:26 -0500 (EST)
 | 
						|
	(envelope-from xun@cs.ucsb.edu)
 | 
						|
Received: from xp10-06.dialup.commserv.ucsb.edu (root@xp10-06.dialup.commserv.ucsb.edu [128.111.253.249])
 | 
						|
	by hercules.cs.ucsb.edu (8.8.6/8.8.6) with ESMTP id SAA04655
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 18:19:22 -0800 (PST)
 | 
						|
Received: from xp10-06.dialup.commserv.ucsb.edu (xun@localhost)
 | 
						|
	by xp10-06.dialup.commserv.ucsb.edu (8.9.3/8.9.3) with ESMTP id SAA22377
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 18:19:40 -0800
 | 
						|
Message-Id: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu>
 | 
						|
To: pgsql-hackers@postgreSQL.org
 | 
						|
Reply-to: xun@cs.ucsb.edu
 | 
						|
Subject: Re. [HACKERS] Some notes on optimizer cost estimates
 | 
						|
Date: Thu, 20 Jan 2000 18:19:40 -0800
 | 
						|
From: Xun Cheng <xun@cs.ucsb.edu>
 | 
						|
Sender: owner-pgsql-hackers@postgreSQL.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
I'm very glad you bring up this cost estimate issue.
Recent work in database research has argued that a more
detailed disk access cost model should be used for
large queries, especially joins.
Traditional cost estimates only consider the number of
disk pages accessed.  However, a more detailed model
would consider three parameters: avg. seek, avg. latency
and avg. page transfer.  For an old disk, typical values are
SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6 ms.
A sequential continuous reading of a table (assuming
1000 continuous pages) would cost
(SEEK+LATENCY+1000*TRANSFER=2617.8ms); while quasi-randomly
reading 200 times with 2 continuous pages per read would
cost (SEEK+200*LATENCY+400*TRANSFER=2700ms).
Someone from an IBM lab re-studied the traditional
ad hoc join algorithms (nested, sort-merge, hash) using the detailed
cost model and found some interesting results.
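
For the record, a quick computation of those two figures under the
stated parameters (a throwaway C snippet written for this note, not
anything from the Postgres sources):

	#include <stdio.h>

	int
	main(void)
	{
		double seek = 9.5, latency = 8.3, transfer = 2.6;	/* ms */

		/* one long sequential read of 1000 pages */
		printf("sequential: %.1f ms\n", seek + latency + 1000 * transfer);
		/* 200 quasi-random reads of 2 pages each */
		printf("random:     %.1f ms\n", seek + 200 * latency + 400 * transfer);
		return 0;
	}

This prints 2617.8 ms and 2709.5 ms respectively; the 2700 ms figure
above is that second number rounded down.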

>I have been spending some time measuring actual runtimes for various
>sequential-scan and index-scan query plans, and have learned that the
>current Postgres optimizer's cost estimation equations are not very
>close to reality at all.

One interesting question I'd like to ask is whether this non-closeness
really affects the optimal choice of postgresql's query optimizer,
and to what degree the effects might be.  My point is that
if the optimizer estimates the cost for a sequential scan at 10 and
the cost for an index scan at 20 while the actual costs are 10 vs. 40,
it should be OK because the optimizer would still choose the sequential
scan, as it should.

>1. Since main-table tuples are visited in index order, we'll be hopping
>around from page to page in the table.

I'm not sure about the implementation in postgresql.  One thing you might
be able to do is to first collect all must-read page addresses from
the index scan and then order them before the actual ordered page fetching.
It would at least avoid the same page being read twice (not entirely
true depending on the context (like in a join) and the algorithm).
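
As a rough illustration of that ordering idea (a hypothetical sketch
written for this note, not how the Postgres executor actually works),
one could gather the heap block numbers produced by the index scan,
sort them, and skip duplicates before fetching:

	#include <stdlib.h>

	static int
	cmp_blkno(const void *a, const void *b)
	{
		unsigned	x = *(const unsigned *) a;
		unsigned	y = *(const unsigned *) b;

		return (x > y) - (x < y);
	}

	/* Sort the must-read block numbers and drop duplicates in place;
	 * returns the number of distinct blocks to fetch, now in order. */
	static int
	order_blocks(unsigned *blocks, int n)
	{
		int		i, m = 0;

		qsort(blocks, n, sizeof(unsigned), cmp_blkno);
		for (i = 0; i < n; i++)
			if (m == 0 || blocks[i] != blocks[m - 1])
				blocks[m++] = blocks[i];
		return m;
	}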

>The current cost estimation
>method essentially assumes that the buffer cache plus OS disk cache will
>be 100% efficient --- we will never have to read the same page of the
>main table twice in a scan, due to having discarded it between
>references.  This of course is unreasonably optimistic.  Worst case
>is that we'd fetch a main-table page for each selected tuple, but in
>most cases that'd be unreasonably pessimistic.

This is actually the motivation for my earlier question about whether
postgresql has a raw disk facility.  That way we would have much more
control over this cache issue.  Of course, a raw disk facility will be
worthwhile (besides the recoverability) only if we can provide some
algorithm better than the OS cache algorithm (depending on the context,
like large joins).

Actually I have another question for you guys which is somehow related
to this cost estimation issue.  You know the difference between OLTP
and OLAP.  My question is how you target postgresql: at both kinds
of applications, or just OLTP?  From what I know, OLTP and OLAP have
a big difference in query characteristics and thus in optimization.
If postgresql is only targeted at OLTP, the above cost estimation
issue might not be that important.  However, for OLAP, large tables
and large queries are common and optimization would be difficult.

xun
 | 
						|
 | 
						|
 | 
						|
************
 | 
						|
 | 
						|
From owner-pgsql-hackers@hub.org Thu Jan 20 20:41:44 2000
 | 
						|
Received: from hub.org (hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA07020
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 21:41:43 -0500 (EST)
 | 
						|
Received: from localhost (majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) with SMTP id VAA40222;
 | 
						|
	Thu, 20 Jan 2000 21:34:08 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers)
 | 
						|
Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:32:35 -0500
 | 
						|
Received: (from majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) id VAA38388
 | 
						|
	for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:31:38 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
 | 
						|
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
 | 
						|
	by hub.org (8.9.3/8.9.3) with ESMTP id VAA37422
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 21:31:02 -0500 (EST)
 | 
						|
	(envelope-from tgl@sss.pgh.pa.us)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id VAA26761;
 | 
						|
	Thu, 20 Jan 2000 21:30:41 -0500 (EST)
 | 
						|
To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
 | 
						|
cc: pgsql-hackers@postgreSQL.org
 | 
						|
Subject: Re: [HACKERS] Some notes on optimizer cost estimates 
 | 
						|
In-reply-to: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp> 
 | 
						|
References: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp>
 | 
						|
Comments: In-reply-to "Hiroshi Inoue" <Inoue@tpf.co.jp>
 | 
						|
	message dated "Fri, 21 Jan 2000 10:44:20 +0900"
 | 
						|
Date: Thu, 20 Jan 2000 21:30:41 -0500
 | 
						|
Message-ID: <26758.948421841@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Sender: owner-pgsql-hackers@postgreSQL.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
 | 
						|
> I've wondered why we cound't analyze database without vacuum.
> We couldn't run vacuum light-heartedly because it acquires an
> exclusive lock for the target table. 

There is probably no real good reason, except backwards compatibility,
why the ANALYZE function (obtaining pg_statistic data) is part of
VACUUM at all --- it could just as easily be a separate command that
would only use read access on the database.  Bruce is thinking about
restructuring VACUUM, so maybe now is a good time to think about
splitting out the ANALYZE code too.

> In addition,vacuum error occurs with analyze option in most
> cases AFAIK. 

Still, with current sources?  What's the error message?  I fixed
a problem with pg_statistic tuples getting too big...

			regards, tom lane

************
 | 
						|
 | 
						|
From tgl@sss.pgh.pa.us Thu Jan 20 21:10:28 2000
 | 
						|
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08412
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 22:10:26 -0500 (EST)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id WAA27080;
 | 
						|
	Thu, 20 Jan 2000 22:10:28 -0500 (EST)
 | 
						|
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
cc: Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Some notes on optimizer cost estimates 
 | 
						|
In-reply-to: <200001210248.VAA07186@candle.pha.pa.us> 
 | 
						|
References: <200001210248.VAA07186@candle.pha.pa.us>
 | 
						|
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
	message dated "Thu, 20 Jan 2000 21:48:57 -0500"
 | 
						|
Date: Thu, 20 Jan 2000 22:10:28 -0500
 | 
						|
Message-ID: <27077.948424228@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Status: OR
 | 
						|
 | 
						|
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.

Well, the main bad thing about ANALYZE being part of VACUUM is that
it adds to the length of time that VACUUM is holding an exclusive
lock on the table.  I think it'd make more sense for it to be a
separate command.

I have also been thinking about how to make ANALYZE produce a more
reliable estimate of the most common value.  The three-element list
that it keeps now is a good low-cost hack, but it really doesn't
produce a trustworthy answer unless the MCV is pretty darn common (since
it will never pick up on the MCV at all until there are at least
two occurrences in three adjacent tuples).  The only idea I've come
up with is to use a larger list, which would be slower and take
more memory.  I think that'd be OK in a separate command, but I
hesitate to do it inside VACUUM --- VACUUM has its own considerable
memory requirements, and there's still the issue of not holding down
an exclusive lock longer than you have to.
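
For what it's worth, a minimal sketch of the kind of k-element candidate
list being described (a Misra-Gries style counter list, written for this
note; it is not the actual analyze.c code and the names are illustrative):

	#include <stdbool.h>

	#define NCANDIDATES 10

	typedef struct
	{
		int		value;		/* candidate value (int just for the sketch) */
		int		count;
		bool	used;
	} Candidate;

	static void
	mcv_consider(Candidate *cand, int value)
	{
		int		i;

		for (i = 0; i < NCANDIDATES; i++)	/* already a candidate: bump it */
			if (cand[i].used && cand[i].value == value)
			{
				cand[i].count++;
				return;
			}
		for (i = 0; i < NCANDIDATES; i++)	/* free slot: claim it */
			if (!cand[i].used)
			{
				cand[i].used = true;
				cand[i].value = value;
				cand[i].count = 1;
				return;
			}
		for (i = 0; i < NCANDIDATES; i++)	/* full: decay every counter */
			if (--cand[i].count == 0)
				cand[i].used = false;
	}

The larger NCANDIDATES is, the less common the true MCV can be and still
survive the scan, at the cost of the extra memory noted above.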

			regards, tom lane
 | 
						|
 | 
						|
From Inoue@tpf.co.jp Thu Jan 20 21:08:32 2000
 | 
						|
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08225
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 22:08:29 -0500 (EST)
 | 
						|
Received: from cadzone ([126.0.1.40] (may be forged))
 | 
						|
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
 | 
						|
   id MAA04148; Fri, 21 Jan 2000 12:08:30 +0900
 | 
						|
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
 | 
						|
To: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane" <tgl@sss.pgh.pa.us>
 | 
						|
Cc: <pgsql-hackers@postgreSQL.org>
 | 
						|
Subject: RE: [HACKERS] Some notes on optimizer cost estimates
 | 
						|
Date: Fri, 21 Jan 2000 12:14:10 +0900
 | 
						|
Message-ID: <001301bf63bd$95cbe680$2801007e@tpf.co.jp>
 | 
						|
MIME-Version: 1.0
 | 
						|
Content-Type: text/plain;
 | 
						|
	charset="iso-8859-1"
 | 
						|
Content-Transfer-Encoding: 7bit
 | 
						|
X-Priority: 3 (Normal)
 | 
						|
X-MSMail-Priority: Normal
 | 
						|
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
 | 
						|
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
 | 
						|
In-Reply-To: <200001210248.VAA07186@candle.pha.pa.us>
 | 
						|
Importance: Normal
 | 
						|
Status: OR
 | 
						|
 | 
						|
> -----Original Message-----
 | 
						|
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
 | 
						|
> 
 | 
						|
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
 | 
						|
> > > I've wondered why we cound't analyze database without vacuum.
 | 
						|
> > > We couldn't run vacuum light-heartedly because it acquires an
 | 
						|
> > > exclusive lock for the target table. 
 | 
						|
> > 
 | 
						|
> > There is probably no real good reason, except backwards compatibility,
 | 
						|
> > why the ANALYZE function (obtaining pg_statistic data) is part of
 | 
						|
> > VACUUM at all --- it could just as easily be a separate command that
 | 
						|
> > would only use read access on the database.  Bruce is thinking about
 | 
						|
> > restructuring VACUUM, so maybe now is a good time to think about
 | 
						|
> > splitting out the ANALYZE code too.
 | 
						|
> 
 | 
						|
> I put it in vacuum because at the time I didn't know how to do such
 | 
						|
> things and vacuum already scanned the table.  I just linked on the the
 | 
						|
> scan.  Seemed like a good idea at the time.
 | 
						|
> 
 | 
						|
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
 | 
						|
> would want to do an analyze without adding a vacuum to it.  I guess
 | 
						|
> that's why I made them the same command.
 | 
						|
> 
 | 
						|
> If I made them separate commands, both would have to scan the table,
 | 
						|
> though the analyze could do it without the exclusive lock, which would
 | 
						|
> be good.
 | 
						|
>
 | 
						|
 | 
						|
The functionality of VACUUM and ANALYZE is quite different.
I'd prefer not to charge VACUUM with more analysis work than it
does now.  Probably a looong lock, more aborts ....
Various kinds of analysis would be possible by splitting out ANALYZE.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
 | 
						|
 | 
						|
From owner-pgsql-hackers@hub.org Fri Jan 21 11:01:59 2000
 | 
						|
Received: from hub.org (hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA07821
 | 
						|
	for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 12:01:57 -0500 (EST)
 | 
						|
Received: from localhost (majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) with SMTP id LAA77357;
 | 
						|
	Fri, 21 Jan 2000 11:52:25 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers)
 | 
						|
Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 11:50:46 -0500
 | 
						|
Received: (from majordom@localhost)
 | 
						|
	by hub.org (8.9.3/8.9.3) id LAA76756
 | 
						|
	for pgsql-hackers-outgoing; Fri, 21 Jan 2000 11:49:50 -0500 (EST)
 | 
						|
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
 | 
						|
Received: from eclipse.pacifier.com (eclipse.pacifier.com [199.2.117.78])
 | 
						|
	by hub.org (8.9.3/8.9.3) with ESMTP id LAA76594
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 11:49:01 -0500 (EST)
 | 
						|
	(envelope-from dhogaza@pacifier.com)
 | 
						|
Received: from desktop (dsl-dhogaza.pacifier.net [216.65.147.68])
 | 
						|
	by eclipse.pacifier.com (8.9.3/8.9.3pop) with SMTP id IAA00225;
 | 
						|
	Fri, 21 Jan 2000 08:47:26 -0800 (PST)
 | 
						|
Message-Id: <3.0.1.32.20000121081044.01036290@mail.pacifier.com>
 | 
						|
X-Sender: dhogaza@mail.pacifier.com
 | 
						|
X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
 | 
						|
Date: Fri, 21 Jan 2000 08:10:44 -0800
 | 
						|
To: xun@cs.ucsb.edu, pgsql-hackers@postgreSQL.org
 | 
						|
From: Don Baccus <dhogaza@pacifier.com>
 | 
						|
Subject: Re: Re. [HACKERS] Some notes on optimizer cost estimates
 | 
						|
In-Reply-To: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu>
 | 
						|
Mime-Version: 1.0
 | 
						|
Content-Type: text/plain; charset="us-ascii"
 | 
						|
Sender: owner-pgsql-hackers@postgreSQL.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
At 06:19 PM 1/20/00 -0800, Xun Cheng wrote:
 | 
						|
>I'm very glad you bring up this cost estimate issue.
 | 
						|
>Recent work in database research have argued a more
 | 
						|
>detailed disk access cost model should be used for
 | 
						|
>large queries especially joins.
 | 
						|
>Traditional cost estimate only considers the number of
 | 
						|
>disk pages accessed. However a more detailed model
 | 
						|
>would consider three parameters: avg. seek, avg. latency
 | 
						|
>and avg. page transfer. For old disk, typical values are
 | 
						|
>SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms.
 | 
						|
>A sequential continuous reading of a table (assuming
 | 
						|
>1000 continuous pages) would cost
 | 
						|
>(SEEK+LATENCY+1000*TRANFER=2617.8ms); while quasi-randomly
 | 
						|
>reading 200 times with 2 continuous pages/time would
 | 
						|
>cost (SEEK+200*LATENCY+400*TRANSFER=2700ms).
 | 
						|
>Someone from IBM lab re-studied the traditional
 | 
						|
>ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost
>model
>and found some interesting results.
 | 
						|
 | 
						|
One complication when doing an index scan is that you are
accessing two separate files (table and index), which can frequently
be expected to cause a considerable increase in average seek time.
 | 
						|
 | 
						|
Oracle and other commercial databases recommend spreading indices and
 | 
						|
tables over several spindles if at all possible in order to minimize
 | 
						|
this effect.
 | 
						|
 | 
						|
I suspect it also helps their optimizer make decisions that are
 | 
						|
more consistently good for customers with the largest and most
 | 
						|
complex databases and queries, by making cost estimates more predictably
 | 
						|
reasonable.
 | 
						|
 | 
						|
Still...this doesn't help with the question about the effect of the
filesystem cache.  I wandered around the web for a little bit
last night, and found one summary of a paper by Ousterhout on the
effect of the Solaris cache on a fileserver serving diskless workstations.
 | 
						|
There was reference to the hierarchy involved (i.e. the local workstation
 | 
						|
cache is faster than the fileserver's cache which has to be read via
 | 
						|
the network which in turn is faster than reading from the fileserver's
 | 
						|
disk).  It appears the rule-of-thumb for the cache-hit ratio on reads,
 | 
						|
presumably based on measuring some internal Sun systems, used in their
 | 
						|
calculations was 80%.
 | 
						|
 | 
						|
Just a datapoint to think about.
 | 
						|
 | 
						|
There's also considerable operating system theory on paging systems
 | 
						|
that might be useful for thinking about trying to estimate the
 | 
						|
Postgres cache/hit ratio.  Then again, maybe Postgres could just
 | 
						|
keep count of how many pages of a given table are in the cache at
 | 
						|
any given time?  Or simply keep track of the current ratio of hits
 | 
						|
and misses?
 | 
						|
 | 
						|
>>I have been spending some time measuring actual runtimes for various
 | 
						|
>>sequential-scan and index-scan query plans, and have learned that the
 | 
						|
>>current Postgres optimizer's cost estimation equations are not very
 | 
						|
>>close to reality at all.
 | 
						|
 | 
						|
>One interesting question I'd like to ask is if this non-closeness
 | 
						|
>really affects the optimal choice of postgresql's query optimizer.
 | 
						|
>And to what degree the effects might be? My point is that
 | 
						|
>if the optimizer estimated the cost for sequential-scan is 10 and
 | 
						|
>the cost for index-scan is 20 while the actual costs are 10 vs. 40,
 | 
						|
>it should be ok because the optimizer would still choose sequential-scan
 | 
						|
>as it should.
 | 
						|
 | 
						|
This is crucial, of course - if there are only two types of scans
available, whatever heuristic is used only has to be accurate enough
to pick the right one.  Once the choice is made, it doesn't really
matter (from the optimizer's POV) just how long it will actually take;
the time will be spent and presumably it will be shorter than the
alternative.
 | 
						|
 | 
						|
How frequently will the optimizer choose wrongly if:
 | 
						|
 | 
						|
1. All of the tables and indices were in PG buffer cache or filesystem
 | 
						|
   cache? (i.e. fixed access times for both types of scans)
 | 
						|
 | 
						|
or
 | 
						|
 | 
						|
2. The table's so big that only a small fraction can reside in RAM
 | 
						|
   during the scan and join, which means that the non-sequential
 | 
						|
   disk access pattern of the indexed scan is much more expensive.
 | 
						|
 | 
						|
Also, if you pick sequential scans more frequently based on a presumption
 | 
						|
that index scans are expensive due to increased average seek time, how
 | 
						|
often will this penalize the heavy-duty user that invests in extra
 | 
						|
drives and lots of RAM?
 | 
						|
 | 
						|
...
 | 
						|
 | 
						|
>>The current cost estimation
 | 
						|
>>method essentially assumes that the buffer cache plus OS disk cache will
 | 
						|
>>be 100% efficient --- we will never have to read the same page of the
 | 
						|
>>main table twice in a scan, due to having discarded it between
 | 
						|
>>references.  This of course is unreasonably optimistic.  Worst case
 | 
						|
>>is that we'd fetch a main-table page for each selected tuple, but in
 | 
						|
>>most cases that'd be unreasonably pessimistic.
 | 
						|
>
 | 
						|
>This is actually the motivation that I asked before if postgresql
 | 
						|
>has a raw disk facility. That way we have much control on this cache
 | 
						|
>issue. Of course only if we can provide some algo. better than OS
 | 
						|
>cache algo. (depending on the context, like large joins), a raw disk
 | 
						|
>facility will be worthwhile (besides the recoverability).
 | 
						|
 | 
						|
Postgres does have control over its buffer cache.  The one thing that
 | 
						|
raw disk I/O would give you is control over where blocks are placed,
 | 
						|
meaning you could more accurately model the cost of retrieving them.
 | 
						|
So presumably the cache could be tuned to the allocation algorithm
 | 
						|
used to place various structures on the disk.
 | 
						|
 | 
						|
I still wonder just how much gain you get by this approach.  Compared,
 | 
						|
to, say simply spending $2,000 on a gigabyte of RAM.  Heck, PCs even
 | 
						|
support a couple gigs of RAM now.
 | 
						|
 | 
						|
>Actually I have another question for you guys which is somehow related
 | 
						|
>to this cost estimation issue. You know the difference between OLTP
 | 
						|
>and OLAP. My question is how you target postgresql on both kinds
 | 
						|
>of applications or just OLTP. From what I know OLTP and OLAP would
 | 
						|
>have a big difference in query characteristics and thus 
 | 
						|
>optimization difference. If postgresql is only targeted on
 | 
						|
>OLTP, the above cost estimation issue might not be that
 | 
						|
>important. However for OLAP, large tables and large queries are
 | 
						|
>common and optimization would be difficult.
 | 
						|
 | 
						|
 | 
						|
 | 
						|
- Don Baccus, Portland OR <dhogaza@pacifier.com>
 | 
						|
  Nature photos, on-line guides, Pacific Northwest
 | 
						|
  Rare Bird Alert Service and other goodies at
 | 
						|
  http://donb.photo.net.
 | 
						|
 | 
						|
************
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6019@hub.org Mon Aug 21 11:47:56 2000
 | 
						|
Received: from hub.org (root@hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA07289
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 21 Aug 2000 11:47:55 -0400 (EDT)
 | 
						|
Received: from hub.org (majordom@localhost [127.0.0.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7LFlpT03383;
 | 
						|
	Mon, 21 Aug 2000 11:47:51 -0400 (EDT)
 | 
						|
Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7LFlaT03243
 | 
						|
	for <pgsql-hackers@postgresql.org>; Mon, 21 Aug 2000 11:47:37 -0400 (EDT)
 | 
						|
Received: (qmail 7416 invoked by alias); 21 Aug 2000 15:54:33 -0000
 | 
						|
Received: (qmail 7410 invoked from network); 21 Aug 2000 15:54:32 -0000
 | 
						|
Received: from eros.si.fct.unl.pt (193.136.120.112)
 | 
						|
  by fct1.si.fct.unl.pt with SMTP; 21 Aug 2000 15:54:32 -0000
 | 
						|
Date: Mon, 21 Aug 2000 16:48:08 +0100 (WEST)
 | 
						|
From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
 | 
						|
X-Sender: tiago@eros.si.fct.unl.pt
 | 
						|
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
cc: pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,
 | 
						|
	constant-->index scan 
 | 
						|
In-Reply-To: <1731.966868649@sss.pgh.pa.us>
 | 
						|
Message-ID: <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt>
 | 
						|
MIME-Version: 1.0
 | 
						|
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						|
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@hub.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
On Mon, 21 Aug 2000, Tom Lane wrote:

> >   One thing it might be interesting (please tell me if you think
> > otherwise) would be to improve pg with better statistical information, by
> > using, for example, histograms.
> 
> Yes, that's been on the todo list for a while.

  If it's OK and nobody is working on that, I'll look into that subject.
  I'll start by looking at the analyze portion of vacuum.  I'm thinking of
using arrays for the histogram (I've never used the array data type of
postgres).
  Should I use 7.0.2 or the cvs version?


> Interesting article.  We do most of what she talks about, but we don't
> have anything like the ClusterRatio statistic.  We need it --- that was
> just being discussed a few days ago in another thread.  Do you have any
> reference on exactly how DB2 defines that stat?


  I don't remember seeing that information specifically.  From what I've
read I can speculate:

  1. They have clusterratios for both indexes and the relation itself.
  2. They might use an index even if there is no "order by" if the table
has a low clusterratio: just to get the RIDs, then sort the RIDs and
fetch.
  3. One possible way to calculate this ratio:
     a) for tables
         SeqScan
            if a tuple points to a next tuple on the same page then it's
            "good"
         ratio = # good tuples / # all tuples
     b) for indexes (high speculation ratio here)
          foreach pointed RID in index
             if the RID is in the same page as the next RID in the index
             then mark it as "good"

  I suspect that if the tuple size is big (relative to the page size) then
the cluster ratio is always low.

  A tuple might also be "good" if it pointed to the next page.
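
To make 3a concrete, here is a tiny sketch of that ratio computation
(written for this note, with made-up names; pages[] holds the heap page
number of each tuple in scan or index order):

	/* Count a tuple as "good" when the next tuple lives on the same page
	 * (or, per the remark above, optionally on the very next page). */
	static double
	cluster_ratio(const unsigned *pages, int ntuples)
	{
		int		good = 0;
		int		i;

		if (ntuples < 2)
			return 1.0;
		for (i = 0; i < ntuples - 1; i++)
			if (pages[i + 1] == pages[i])
				good++;
		return (double) good / ntuples;
	}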

Tiago
 | 
						|
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6152@hub.org Wed Aug 23 13:00:33 2000
 | 
						|
Received: from hub.org (root@hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA10259
 | 
						|
	for <pgman@candle.pha.pa.us>; Wed, 23 Aug 2000 13:00:33 -0400 (EDT)
 | 
						|
Received: from hub.org (majordom@localhost [127.0.0.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7NGsPN83008;
 | 
						|
	Wed, 23 Aug 2000 12:54:25 -0400 (EDT)
 | 
						|
Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7NGniN81749
 | 
						|
	for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 12:49:44 -0400 (EDT)
 | 
						|
Received: (qmail 9869 invoked by alias); 23 Aug 2000 15:10:04 -0000
 | 
						|
Received: (qmail 9860 invoked from network); 23 Aug 2000 15:10:04 -0000
 | 
						|
Received: from eros.si.fct.unl.pt (193.136.120.112)
 | 
						|
  by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 15:10:04 -0000
 | 
						|
Date: Wed, 23 Aug 2000 16:03:42 +0100 (WEST)
 | 
						|
From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
 | 
						|
X-Sender: tiago@eros.si.fct.unl.pt
 | 
						|
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,
 | 
						|
	constant-->index scan 
 | 
						|
In-Reply-To: <27971.967041030@sss.pgh.pa.us>
 | 
						|
Message-ID: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
 | 
						|
MIME-Version: 1.0
 | 
						|
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						|
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@hub.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
Hi!

On Wed, 23 Aug 2000, Tom Lane wrote:

> Yes, we know about that one.  We have stats about the most common value
> in a column, but no information about how the less-common values are
> distributed.  We definitely need stats about several top values not just
> one, because this phenomenon of a badly skewed distribution is pretty
> common.


  An end-biased histogram has stats on the top values and also on the least
frequent values.  So if there is a selection on a value that is well
below average, the selectivity estimation will be more accurate.  In some
research papers I've read, it's stated that this is a better approach
than equi-width histograms (which are said to be the "industry" standard).

  I'm not sure whether to use a table or an array attribute on pg_stat for
the histogram; the problem is what could be expected from the size of the
attribute (being a text).  I'm very afraid of the cost of going through
several tuples in a table (pg_histogram?) during the optimization phase.

  One other idea would be to only have better statistics for special
attributes requested by the user... something like "analyze special
table(column)".

Best Regards,
Tiago
 | 
						|
 | 
						|
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6160@hub.org Thu Aug 24 00:21:39 2000
 | 
						|
Received: from hub.org (root@hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA27662
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 00:21:38 -0400 (EDT)
 | 
						|
Received: from hub.org (majordom@localhost [127.0.0.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7O46w585951;
 | 
						|
	Thu, 24 Aug 2000 00:06:58 -0400 (EDT)
 | 
						|
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
 | 
						|
	by hub.org (8.10.1/8.10.1) with ESMTP id e7O3uv583775
 | 
						|
	for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 23:56:57 -0400 (EDT)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA20973;
 | 
						|
	Wed, 23 Aug 2000 23:56:35 -0400 (EDT)
 | 
						|
To: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
 | 
						|
cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan 
 | 
						|
In-reply-to: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt> 
 | 
						|
References: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
 | 
						|
Comments: In-reply-to =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
 | 
						|
	message dated "Wed, 23 Aug 2000 16:03:42 +0100"
 | 
						|
Date: Wed, 23 Aug 2000 23:56:35 -0400
 | 
						|
Message-ID: <20970.967089395@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@hub.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
Tiago Antão <tra@fct.unl.pt> writes:
>   One other idea would be to only have better statistics for special
> attributes requested by the user... something like "analyze special
> table(column)".

This might actually fall out "for free" from the cheapest way of
implementing the stats.  We've talked before about scanning btree
indexes directly to obtain data values in sorted order, which makes
it very easy to find the most common values.  If you do that, you
get good stats for exactly those columns that the user has created
indexes on.  A tad indirect but I bet it'd be effective...
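
As an aside, a minimal sketch of why sorted input makes this easy (a
single pass over the values; written for this note, not taken from the
Postgres sources):

	/* Given n >= 1 values already in sorted order, return the most common
	 * one and its count: it is simply the longest run of equal neighbors. */
	static int
	most_common(const int *vals, int n, int *mcv_count)
	{
		int		best = vals[0], best_len = 1, run_len = 1;
		int		i;

		for (i = 1; i < n; i++)
		{
			run_len = (vals[i] == vals[i - 1]) ? run_len + 1 : 1;
			if (run_len > best_len)
			{
				best_len = run_len;
				best = vals[i];
			}
		}
		*mcv_count = best_len;
		return best;
	}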

			regards, tom lane
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6165@hub.org Thu Aug 24 05:33:02 2000
 | 
						|
Received: from hub.org (root@hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id FAA14309
 | 
						|
	for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 05:33:01 -0400 (EDT)
 | 
						|
Received: from hub.org (majordom@localhost [127.0.0.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7O9X0584670;
 | 
						|
	Thu, 24 Aug 2000 05:33:00 -0400 (EDT)
 | 
						|
Received: from athena.office.vi.net (office-gwb.fulham.vi.net [194.88.77.158])
 | 
						|
	by hub.org (8.10.1/8.10.1) with ESMTP id e7O9Ix581216
 | 
						|
	for <pgsql-hackers@postgresql.org>; Thu, 24 Aug 2000 05:19:03 -0400 (EDT)
 | 
						|
Received: from grommit.office.vi.net [192.168.1.200] (mail)
 | 
						|
	by athena.office.vi.net with esmtp (Exim 3.12 #1 (Debian))
 | 
						|
	id 13Rt2Y-00073I-00; Thu, 24 Aug 2000 10:11:14 +0100
 | 
						|
Received: from jules by grommit.office.vi.net with local (Exim 3.12 #1 (Debian))
 | 
						|
	id 13Rt2Y-0005GV-00; Thu, 24 Aug 2000 10:11:14 +0100
 | 
						|
Date: Thu, 24 Aug 2000 10:11:14 +0100
 | 
						|
From: Jules Bean <jules@jellybean.co.uk>
 | 
						|
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Cc: Tiago Ant?o <tra@fct.unl.pt>, pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan
 | 
						|
Message-ID: <20000824101113.N17510@grommit.office.vi.net>
 | 
						|
References: <1731.966868649@sss.pgh.pa.us> <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt> <20000823133418.F17510@grommit.office.vi.net> <27971.967041030@sss.pgh.pa.us>
 | 
						|
Mime-Version: 1.0
 | 
						|
Content-Type: text/plain; charset=us-ascii
 | 
						|
Content-Disposition: inline
 | 
						|
User-Agent: Mutt/1.2i
 | 
						|
In-Reply-To: <27971.967041030@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Wed, Aug 23, 2000 at 10:30:30AM -0400
 | 
						|
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@hub.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
On Wed, Aug 23, 2000 at 10:30:30AM -0400, Tom Lane wrote:
> Jules Bean <jules@jellybean.co.uk> writes:
> > I have in a table a 'category' column which takes a small number of
> > (basically fixed) values.  Here by 'small', I mean ~1000, while the
> > table itself has ~10 000 000 rows. Some categories have many, many
> > more rows than others.  In particular, there's one category which hits
> > over half the rows.  Because of this (AIUI) postgresql assumes
> > that the query
> >	select ... from thistable where category='something'
> > is best served by a seqscan, even though there is an index on
> > category.
> 
> Yes, we know about that one.  We have stats about the most common value
> in a column, but no information about how the less-common values are
> distributed.  We definitely need stats about several top values not just
> one, because this phenomenon of a badly skewed distribution is pretty
> common.

ISTM that that might be enough, in fact.

If you have stats telling you that the most popular value is 'xyz',
and that it constitutes 50% of the rows (i.e. 5 000 000), then you can
conclude that, on average, the other entries constitute a mere
5 000 000/999 ~~ 5000 entries each, and that would definitely be enough.
(That's assuming you store the number of distinct values somewhere).
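
In code form, the estimate being described (a throwaway sketch written
for this note, not an existing planner function):

	/* Expected rows for a value other than the most common one, given the
	 * table size, the MCV's fraction of the rows, and ndistinct. */
	static double
	non_mcv_rows(double total_rows, double mcv_fraction, double ndistinct)
	{
		return total_rows * (1.0 - mcv_fraction) / (ndistinct - 1.0);
	}

	/* non_mcv_rows(10000000, 0.5, 1000) = 5000000/999, about 5005 rows,
	 * easily selective enough to justify the index scan. */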

> BTW, if your highly-popular value is actually a dummy value ('UNKNOWN'
> or something like that), a fairly effective workaround is to replace the
> dummy entries with NULL.  The system does account for NULLs separately
> from real values, so you'd then get stats based on the most common
> non-dummy value.

I can't really do that.  Even if I could, the distribution is very
skewed -- so the next most common makes up a very high proportion of
what's left.  I forget the figures exactly.

Jules
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6154@hub.org Wed Aug 23 14:36:41 2000
 | 
						|
Received: from hub.org (root@hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA11076
 | 
						|
	for <pgman@candle.pha.pa.us>; Wed, 23 Aug 2000 13:36:41 -0400 (EDT)
 | 
						|
Received: from hub.org (majordom@localhost [127.0.0.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7NHTqN92431;
 | 
						|
	Wed, 23 Aug 2000 13:29:52 -0400 (EDT)
 | 
						|
Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e7NHM1N90883
 | 
						|
	for <pgsql-hackers@hub.org>; Wed, 23 Aug 2000 13:22:01 -0400 (EDT)
 | 
						|
Received: (qmail 13816 invoked by alias); 23 Aug 2000 17:29:02 -0000
 | 
						|
Received: (qmail 13807 invoked from network); 23 Aug 2000 17:29:02 -0000
 | 
						|
Received: from eros.si.fct.unl.pt (193.136.120.112)
 | 
						|
  by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 17:29:02 -0000
 | 
						|
Date: Wed, 23 Aug 2000 18:22:40 +0100 (WEST)
 | 
						|
From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
 | 
						|
X-Sender: tiago@eros.si.fct.unl.pt
 | 
						|
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
cc: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>,
 | 
						|
        PostgreSQL Hackers list <pgsql-hackers@hub.org>
 | 
						|
Subject: Re: [HACKERS] analyze.c 
 | 
						|
In-Reply-To: <28154.967041988@sss.pgh.pa.us>
 | 
						|
Message-ID: <Pine.LNX.4.21.0008231742420.5111-100000@eros.si.fct.unl.pt>
 | 
						|
MIME-Version: 1.0
 | 
						|
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						|
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@hub.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
 | 
						|
 | 
						|
On Wed, 23 Aug 2000, Tom Lane wrote:

> > What's the big reason not to do that? I know that
> > there is some code in analyze.c (like comparing) that uses other parts of
> > pg, but that seems to be easily fixed.
> 
> Are you proposing not to do any comparisons?  It will be interesting to
> see how you can compute a histogram without any idea of equality or
> ordering.  But if you want that, then you still need the function-call
> manager as well as the type-specific comparison routines for every
> datatype that you might be asked to operate on (don't forget
> user-defined types here).

   I forgot user-defined data types :-(, but regarding histograms I think
the code can be made external (at least for testing purposes):
   1. I was not suggesting not to do any comparisons, but I think the only
comparison I need is equality; I don't need ordering, as I don't need to
calculate mins or maxes (I just need mins and maxes on frequencies, NOT on
the data itself) to make a histogram.
   2. The mapping to text guarantees that I have a way of knowing about
equality regardless of type (PQgetvalue always returns char* and
pg_statistics keeps a "text" anyway).

   But at least anything relating to ordering has to stay in.

> >   I'm leaning toward the implementation of end-biased histograms. There is
> > an introductory reference in the IEEE Data Engineering Bulletin, september
> > 1995 (available on microsoft research site).
> 
> Sounds interesting.  Can you give us an exact URL?

http://www.research.microsoft.com/research/db/debull/default.htm

BTW, you can get access to SIGMOD CDs with lots of goodies for a very low
price (at least in 1999 it was a bargain); check out ACM membership for
SIGMOD.

I've been reading something about the implementation of histograms, and,
AFAIK, in practice a histogram is just a cool name for no more than the
following (a small sketch follows the list):
   1. the top ten values, with a frequency for each
   2. the same for the ten least frequent values
   3. an average frequency for the rest
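
Here is that small sketch (plain C written for this note, outside pg, with
made-up names; it assumes the input has more than 2*k distinct values):

	#include <stdlib.h>

	typedef struct { int value; int freq; } ValueFreq;

	static int
	cmp_freq_desc(const void *a, const void *b)
	{
		return ((const ValueFreq *) b)->freq - ((const ValueFreq *) a)->freq;
	}

	/* End-biased histogram: fill top[0..k-1] and bottom[0..k-1] with the
	 * most and least frequent values; return the average frequency of the
	 * remaining values. */
	static double
	end_biased(ValueFreq *vf, int n, int k, ValueFreq *top, ValueFreq *bottom)
	{
		long	rest = 0;
		int		i;

		qsort(vf, n, sizeof(ValueFreq), cmp_freq_desc);
		for (i = 0; i < k; i++)
		{
			top[i] = vf[i];
			bottom[i] = vf[n - 1 - i];
		}
		for (i = k; i < n - k; i++)
			rest += vf[i].freq;
		return (double) rest / (n - 2 * k);
	}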

I'm writing code to get this info (outside pg for now, for testing
purposes).

Best Regards,
Tiago

PS - again: I'm starting out, so some of my comments may be completely dumb.
 | 
						|
 | 
						|
From pgsql-hackers-owner+M7514@hub.org Sun Oct 15 20:38:12 2000
 | 
						|
Received: from hub.org (hub.org [216.126.84.1])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA18459
 | 
						|
	for <pgman@candle.pha.pa.us>; Sun, 15 Oct 2000 19:38:12 -0400 (EDT)
 | 
						|
Received: from hub.org.org (localhost [127.0.0.1])
 | 
						|
	by hub.org (8.10.1/8.10.1) with SMTP id e9FNaUR59496;
 | 
						|
	Sun, 15 Oct 2000 19:36:30 -0400 (EDT)
 | 
						|
Received: from sss.pgh.pa.us (sss.pgh.pa.us [209.114.132.154])
 | 
						|
	by hub.org (8.10.1/8.10.1) with ESMTP id e9FNYuR58276
 | 
						|
	for <pgsql-hackers@postgresql.org>; Sun, 15 Oct 2000 19:34:56 -0400 (EDT)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss.pgh.pa.us (8.11.0/8.11.0) with ESMTP id e9FNXaB06046;
 | 
						|
	Sun, 15 Oct 2000 19:33:36 -0400 (EDT)
 | 
						|
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
cc: Jules Bean <jules@jellybean.co.uk>,
 | 
						|
        Alfred Perlstein <bright@wintelcom.net>, pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Performance on inserts 
 | 
						|
In-reply-to: <200010152320.TAA17944@candle.pha.pa.us> 
 | 
						|
References: <200010152320.TAA17944@candle.pha.pa.us>
 | 
						|
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
	message dated "Sun, 15 Oct 2000 19:20:35 -0400"
 | 
						|
Date: Sun, 15 Oct 2000 19:33:36 -0400
 | 
						|
Message-ID: <6043.971652816@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@hub.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> However, assume tab2.col2 equals 3.  I assume this would cause an index
> scan because the executor doesn't know about the most common value,
> right? Is it worth trying to improve that?

Oh, I see: you are assuming that a nestloop join is being done, and
wondering if it's worthwhile to switch dynamically between seqscan
and indexscan for each scan of the inner relation, depending on exactly
what value is being supplied from the outer relation for that scan.
Hmm.

Not sure if it's worth the trouble or not.  Nestloop is usually a
last-resort join strategy anyway, and is unlikely to be picked when the
tables are large enough to make performance be a big issue.

			regards, tom lane
 | 
						|
 | 
						|
From tgl@sss.pgh.pa.us Mon Oct 16 01:48:27 2000
 | 
						|
Received: from sss.pgh.pa.us (sss.pgh.pa.us [209.114.132.154])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA01602
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 16 Oct 2000 00:48:26 -0400 (EDT)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss.pgh.pa.us (8.11.1/8.11.1) with ESMTP id e9G4mu521809;
 | 
						|
	Mon, 16 Oct 2000 00:48:56 -0400 (EDT)
 | 
						|
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
cc: Jules Bean <jules@jellybean.co.uk>,
 | 
						|
        Alfred Perlstein <bright@wintelcom.net>, pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Performance on inserts 
 | 
						|
In-reply-to: <200010160441.AAA01374@candle.pha.pa.us> 
 | 
						|
References: <200010160441.AAA01374@candle.pha.pa.us>
 | 
						|
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
	message dated "Mon, 16 Oct 2000 00:41:49 -0400"
 | 
						|
Date: Mon, 16 Oct 2000 00:48:56 -0400
 | 
						|
Message-ID: <21806.971671736@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Status: OR
 | 
						|
 | 
						|
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> So an inner indexscan for tab1 is definitely a possible plan.

> Yes, that was my point, that a nested loop could easily be involved if
> the joined table has a restriction.  Is there a TODO item here?

More like a "to investigate" --- I'm not sold on the idea that a
dynamic switch in plan types would be a win.  Maybe it would be,
but...

One thing to think about is that it'd be critically dependent on having
accurate statistics.  Currently, the planner only places bets on the
average behavior over a whole join.  If you make a separate bet on each
scan, then you open up the risk of betting wrong every time, should
your stats be out-of-date or otherwise misleading.

			regards, tom lane

From pgsql-hackers-owner+M29943@postgresql.org Thu Oct  3 18:18:27 2002
Return-path: <pgsql-hackers-owner+M29943@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MIOU23771
	for <pgman@candle.pha.pa.us>; Thu, 3 Oct 2002 18:18:25 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP
	id B9F51476570; Thu,  3 Oct 2002 18:18:21 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id E083B4761B0; Thu,  3 Oct 2002 18:18:19 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP id 13ADC476063
	for <pgsql-hackers@postgresql.org>; Thu,  3 Oct 2002 18:18:17 -0400 (EDT)
Received: from acorn.he.net (acorn.he.net [64.71.137.130])
	by postgresql.org (Postfix) with ESMTP id 3AEC8475FFF
	for <pgsql-hackers@postgresql.org>; Thu,  3 Oct 2002 18:18:16 -0400 (EDT)
Received: from CurtisVaio ([63.164.0.47] (may be forged)) by acorn.he.net (8.8.6/8.8.2) with SMTP id PAA19215; Thu, 3 Oct 2002 15:18:14 -0700
From: "Curtis Faith" <curtis@galtair.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Advice: Where could I be of help?
Date: Thu, 3 Oct 2002 18:17:55 -0400
Message-ID: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
In-Reply-To: <13379.1033675158@sss.pgh.pa.us>
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
Importance: Normal
X-Virus-Scanned: by AMaViS new-20020517
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Virus-Scanned: by AMaViS new-20020517
Status: OR

tom lane wrote:
> But more globally, I think that our worst problems these days have to do
> with planner misestimations leading to bad plans.  The planner is
> usually *capable* of generating a good plan, but all too often it picks
> the wrong one.  We need work on improving the cost modeling equations
> to be closer to reality.  If that's at all close to your sphere of
> interest then I think it should be #1 priority --- it's both localized,
> which I think is important for a first project, and potentially a
> considerable win.

This seems like a very interesting problem. One of the ways that I thought
would be interesting and would solve the problem of trying to figure out the
right numbers is to have certain guesses for the actual values based on
statistics gathered during vacuum and general running and then have the
planner run the "best" plan.

Then during execution if the planner turned out to be VERY wrong about
certain assumptions the execution system could update the stats that led to
those wrong assumptions. That way the system would seek the correct values
automatically. We could also gather the stats that the system produces for
certain actual databases and then use those to make smarter initial guesses.

I've found that I can never predict costs. I always end up testing
empirically and find myself surprised at the results.

We should be able to make the executor smart enough to keep count of actual
costs (or a statistical approximation) without introducing any significant
overhead.
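
A minimal sketch of what that execution-time feedback could look like,
using made-up structures and a made-up 10x threshold rather than anything
in the real planner or executor:

#include <stdio.h>

/* Hypothetical sketch of execution-time feedback; the structure and
 * the factor-of-ten threshold are illustrative assumptions only. */
typedef struct
{
    const char *relname;
    double      est_rows;       /* what the planner assumed */
    double      actual_rows;    /* what the executor counted */
} NodeRunStats;

/* After a plan node finishes, flag its source statistics for refresh
 * if the estimate was off by more than a factor of ten either way. */
static int
estimate_was_very_wrong(const NodeRunStats *s)
{
    double ratio = (s->actual_rows + 1.0) / (s->est_rows + 1.0);

    return ratio > 10.0 || ratio < 0.1;
}

int main(void)
{
    NodeRunStats s = { "lineitem", 500.0, 74000.0 };

    if (estimate_was_very_wrong(&s))
        printf("flag %s: estimated %.0f rows, saw %.0f -- refresh stats\n",
               s.relname, s.est_rows, s.actual_rows);
    return 0;
}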

tom lane also wrote:
> There is no "cache flushing".  We have a shared buffer cache management
> algorithm that's straight LRU across all buffers.  There's been some
> interest in smarter cache-replacement code; I believe Neil Conway is
> messing around with an LRU-2 implementation right now.  If you've got
> better ideas we're all ears.

Hmmm, this is the area that I think could lead to huge performance gains.

Consider a simple system with a table tbl_master that gets read by each
process many times but with very infrequent inserts and that contains about
3,000 rows. The single but heavily used index for this table is contained in
a btree with a depth of three, with twenty 8K pages in the first two levels
of the btree.

Another table, tbl_detail, has 10 indices and gets very frequent inserts.
There are over 300,000 rows. Some queries result in index scans over the
approximately 5,000 8K pages in the index.

There is a 40M shared cache for this system.

Every time a query which requires the index scan runs it will blow out the
entire cache since the scan will load more blocks than the cache holds. Only
blocks that are accessed while the scan is going will survive. LRU is bad,
bad, bad!
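
In round numbers, a 40M cache at 8K per block holds about 5,120 buffers, so
a single scan that touches roughly 5,000 index pages displaces nearly every
block that was resident before it started.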

LRU-2 might be better but it seems like it still won't give enough priority
to the most frequently used blocks. I don't see how it would do better for
the above case.

I once implemented a modified cache algorithm that was based on the clock
algorithm for VM page caches. VM paging is similar to databases in that
there is definite locality of reference and certain pages are MUCH more
likely to be requested.

The basic idea was to have a flag in each block that represented the access
time in clock intervals. Imagine a clock hand sweeping across a clock; every
access is like a tiny movement of the clock hand. Blocks that are not
accessed during a sweep are candidates for removal.

My modification was to use access counts to increase the durability of the
more accessed blocks. Each time a block is accessed, its flag is shifted
left (up to a maximum number of shifts, ShiftN) and 1 is added to it.
Every so many cache accesses (and synchronously when the cache is full) a
pass is made over each block, right-shifting the flags (a clock sweep). This
can also be done one block at a time on each access, so the clock is directly
linked to the cache access rate. Any blocks with a flag of 0 are placed into
a doubly linked list of candidates for removal. New cache blocks are
allocated from the list of candidates. An access of a block on the candidate
list just removes it from the list.

An index root node page would likely be accessed frequently enough that all
its bits would be set, so it would take ShiftN clock sweeps to evict it.
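
A compact sketch of that flag-shifting scheme, with a plain array standing
in for the doubly linked candidate list; the buffer count and ShiftN value
are illustrative only:

#include <stdio.h>

#define NBUFFERS 8
#define SHIFT_N  5                      /* maximum number of 1-bits a flag holds */
#define FLAG_MAX ((1u << SHIFT_N) - 1)

static unsigned int flags[NBUFFERS];    /* per-buffer access flags, 0 = candidate */

/* Each access shifts the flag left (bounded) and sets the low bit, so
 * hot blocks accumulate up to SHIFT_N bits of "durability". */
static void note_access(int buf)
{
    flags[buf] = ((flags[buf] << 1) | 1u) & FLAG_MAX;
}

/* A clock sweep right-shifts every flag; blocks that reach zero become
 * eviction candidates.  Returns one candidate, or -1 if none. */
static int clock_sweep(void)
{
    int candidate = -1;

    for (int buf = 0; buf < NBUFFERS; buf++)
    {
        flags[buf] >>= 1;
        if (flags[buf] == 0 && candidate < 0)
            candidate = buf;
    }
    return candidate;
}

int main(void)
{
    /* Buffer 0 is hot (think index root); buffer 1 was touched once. */
    for (int i = 0; i < 10; i++)
        note_access(0);
    note_access(1);

    /* After one sweep the single-touch buffer is already a candidate;
     * the hot buffer needs SHIFT_N sweeps before it can be evicted. */
    int victim = clock_sweep();

    printf("first victim: buffer %d (flags now 0=%u, 1=%u)\n",
           victim, flags[0], flags[1]);
    return 0;
}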

This algorithm increased the cache hit ratio from 40% to about 90% for the
cases I tested when compared to a simple LRU mechanism. The paging ratio is
greatly dependent on the ratio of the actual database size to the cache
size.

The bottom line is that it is very important to keep frequently accessed
blocks in the cache. The top levels of large btrees are accessed many
hundreds (actually a power of the number of keys in each page) of times more
frequently than the leaf pages. LRU can be the worst possible algorithm for
something like an index or table scan of large tables, since it flushes a
large number of potentially frequently accessed blocks in favor of ones that
are very unlikely to be retrieved again.
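
To put a rough number on it: with, say, 200 keys per page, a three-level
btree fans out by a factor of about 200 at each level, so the root is
touched on every descent while any particular leaf page is touched only
about once in 200 x 200 = 40,000 descents.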

tom lane also wrote:
> This is an interesting area.  Keep in mind though that Postgres is a
> portable DB that tries to be agnostic about what kernel and filesystem
> it's sitting on top of --- and in any case it does not run as root, so
> has very limited ability to affect what the kernel/filesystem do.
> I'm not sure how much can be done without losing those portability
> advantages.

The kinds of things I was thinking about should be very portable. I found
that simply writing the cache in order of the file system offset results in
very greatly improved performance since it lets the head seek in smaller
increments and much more smoothly, especially with modern disks. Most of the
time the file system will lay files out as large sequential runs of bytes on
the physical disk, in order. A file might be in a few chunks, but those
chunks will be sequential and fairly large.
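
A sketch of that write-ordering idea: gather the dirty blocks, sort them by
file and block number, then issue the writes in that order (the structures
here are illustrative, not the actual buffer manager):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch only: order dirty buffers by file and block
 * number before writing, so the writes walk the disk in one direction. */
typedef struct
{
    int  fileno;    /* which relation file the block belongs to */
    long blockno;   /* block offset within that file */
    /* ... buffer contents would live here ... */
} DirtyBuf;

static int cmp_file_offset(const void *a, const void *b)
{
    const DirtyBuf *x = a, *y = b;

    if (x->fileno != y->fileno)
        return (x->fileno > y->fileno) - (x->fileno < y->fileno);
    return (x->blockno > y->blockno) - (x->blockno < y->blockno);
}

int main(void)
{
    DirtyBuf dirty[] = { {2, 900}, {1, 17}, {2, 3}, {1, 512} };
    size_t   n = sizeof(dirty) / sizeof(dirty[0]);

    qsort(dirty, n, sizeof(DirtyBuf), cmp_file_offset);

    /* In the real system this loop would call the file write routine;
     * here it just shows the resulting order. */
    for (size_t i = 0; i < n; i++)
        printf("write file %d, block %ld\n", dirty[i].fileno, dirty[i].blockno);
    return 0;
}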

tom lane also wrote:
> Well, not really all that isolated.  The bottom-level index code doesn't
> know whether you're doing INSERT or UPDATE, and would have no easy
> access to the original tuple if it did know.  The original theory about
> this was that the planner could detect the situation where the index(es)
> don't overlap the set of columns being changed by the UPDATE, which
> would be nice since there'd be zero runtime overhead.  Unfortunately
> that breaks down if any BEFORE UPDATE triggers are fired that modify the
> tuple being stored.  So all in all it turns out to be a tad messy to fit
> this in :-(.  I am unconvinced that the impact would be huge anyway,
> especially as of 7.3 which has a shortcut path for dead index entries.

Well, this probably is not the right place to start then.

- Curtis


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

From pgsql-hackers-owner+M29945@postgresql.org Thu Oct  3 18:47:34 2002
Return-path: <pgsql-hackers-owner+M29945@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MlWU26068
	for <pgman@candle.pha.pa.us>; Thu, 3 Oct 2002 18:47:32 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP
	id F2AAE476306; Thu,  3 Oct 2002 18:47:27 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id E7B5247604F; Thu,  3 Oct 2002 18:47:24 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP id 9ADCC4761A1
	for <pgsql-hackers@postgresql.org>; Thu,  3 Oct 2002 18:47:18 -0400 (EDT)
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
	by postgresql.org (Postfix) with ESMTP id DDB0B476187
	for <pgsql-hackers@postgresql.org>; Thu,  3 Oct 2002 18:47:17 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.12.5/8.12.5) with ESMTP id g93MlIhR015091;
	Thu, 3 Oct 2002 18:47:18 -0400 (EDT)
To: "Curtis Faith" <curtis@galtair.com>
cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Advice: Where could I be of help?
In-Reply-To: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
References: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
Comments: In-reply-to "Curtis Faith" <curtis@galtair.com>
	message dated "Thu, 03 Oct 2002 18:17:55 -0400"
Date: Thu, 03 Oct 2002 18:47:17 -0400
Message-ID: <15090.1033685237@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Virus-Scanned: by AMaViS new-20020517
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Virus-Scanned: by AMaViS new-20020517
Status: OR

"Curtis Faith" <curtis@galtair.com> writes:
 | 
						|
> Then during execution if the planner turned out to be VERY wrong about
 | 
						|
> certain assumptions the execution system could update the stats that led to
 | 
						|
> those wrong assumptions. That way the system would seek the correct values
 | 
						|
> automatically.
 | 
						|
 | 
						|
That has been suggested before, but I'm unsure how to make it work.
 | 
						|
There are a lot of parameters involved in any planning decision and it's
 | 
						|
not obvious which ones to tweak, or in which direction, if the plan
 | 
						|
turns out to be bad.  But if you can come up with some ideas, go to
 | 
						|
it!
 | 
						|
 | 
						|
> Everytime a query which requires the index scan runs it will blow out the
 | 
						|
> entire cache since the scan will load more blocks than the cache
 | 
						|
> holds.
 | 
						|
 | 
						|
Right, that's the scenario that kills simple LRU ...
 | 
						|
 | 
						|
> LRU-2 might be better but it seems like it still won't give enough priority
 | 
						|
> to the most frequently used blocks.
 | 
						|
 | 
						|
Blocks touched more than once per query (like the upper-level index
 | 
						|
blocks) will survive under LRU-2.  Blocks touched once per query won't.
 | 
						|
Seems to me that it should be a win.
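
For reference, the LRU-2 rule being discussed amounts to evicting the buffer
whose second-most-recent access is oldest, with buffers touched only once
counting as infinitely old; a toy sketch (not the actual patch):

#include <stdio.h>

/* Sketch of LRU-2 eviction: a buffer touched only once has prev == 0
 * (infinitely old), so single-touch scan blocks go first and
 * twice-touched index blocks survive. */
#define NBUF 4

static long last[NBUF];     /* time of most recent access */
static long prev[NBUF];     /* time of the access before that, 0 = never */

static void touch(int buf, long now)
{
    prev[buf] = last[buf];
    last[buf] = now;
}

static int pick_victim(void)
{
    int victim = 0;

    for (int buf = 1; buf < NBUF; buf++)
        if (prev[buf] < prev[victim])
            victim = buf;
    return victim;
}

int main(void)
{
    long t = 0;

    touch(0, ++t); touch(0, ++t);    /* index block: touched twice */
    touch(1, ++t);                   /* scan blocks: touched once each */
    touch(2, ++t);
    touch(3, ++t);

    printf("victim: buffer %d\n", pick_victim());   /* a single-touch block */
    return 0;
}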

> My modification was to use access counts to increase the durability of the
> more accessed blocks.

You could do it that way too, but I'm unsure whether the extra
complexity will buy anything.  Ultimately, I think an LRU-anything
algorithm is equivalent to a clock sweep for those pages that only get
touched once per some-long-interval: the single-touch guys get recycled
in order of last use, which seems just like a clock sweep around the
cache.  The guys with some amount of preference get excluded from the
once-around sweep.  To determine whether LRU-2 is better or worse than
some other preference algorithm requires a finer grain of analysis than
this.  I'm not a fan of "more complex must be better", so I'd want to see
why it's better before buying into it ...

> The kinds of things I was thinking about should be very portable. I found
> that simply writing the cache in order of the file system offset results in
> very greatly improved performance since it lets the head seek in smaller
> increments and much more smoothly, especially with modern disks.

Shouldn't the OS be responsible for scheduling those writes
appropriately?  Ye good olde elevator algorithm ought to handle this;
and it's at least one layer closer to the actual disk layout than we
are, thus more likely to issue the writes in a good order.  It's worth
experimenting with, perhaps, but I'm pretty dubious about it.

BTW, one other thing that Vadim kept saying we should do is alter the
cache management strategy to retain dirty blocks in memory (ie, give
some amount of preference to as-yet-unwritten dirty pages compared to
clean pages).  There is no reliability cost here since the WAL will let
us reconstruct any dirty pages if we crash before they get written; and
the periodic checkpoints will ensure that we eventually write a dirty
block and thus it will become available for recycling.  This seems like
a promising line of thought that's orthogonal to the basic
LRU-vs-whatever issue.  Nobody's got round to looking at it yet though.
I've got no idea how much preference should be given to a dirty block
--- not infinite, probably, but some.
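
One simple way such a finite preference could be expressed is to make dirty
pages look somewhat younger than they really are when scoring replacement
victims; the penalty weight below is a made-up knob, not anything in the code:

#include <stdio.h>

/* Sketch: give dirty pages some (finite) preference when choosing a
 * replacement victim.  DIRTY_PENALTY is an illustrative knob only. */
#define NBUF 4
#define DIRTY_PENALTY 100   /* pretend a dirty page was used this much more recently */

static long last_used[NBUF] = { 30, 40, 10, 70 };
static int  is_dirty[NBUF]  = {  0,  0,  1,  0 };

static int pick_victim(long now)
{
    int  victim = -1;
    long best_score = -1;

    for (int buf = 0; buf < NBUF; buf++)
    {
        /* Older pages score higher; dirty pages look newer than they are. */
        long score = (now - last_used[buf]) - (is_dirty[buf] ? DIRTY_PENALTY : 0);

        if (score > best_score)
        {
            best_score = score;
            victim = buf;
        }
    }
    return victim;
}

int main(void)
{
    /* Buffer 2 is the oldest but dirty, so clean buffer 0 is recycled first. */
    printf("victim: buffer %d\n", pick_victim(100));
    return 0;
}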

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

From pgsql-hackers-owner+M29974@postgresql.org Fri Oct  4 01:28:54 2002
Return-path: <pgsql-hackers-owner+M29974@postgresql.org>
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g945SpU13476
	for <pgman@candle.pha.pa.us>; Fri, 4 Oct 2002 01:28:52 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP
	id 63999476BB2; Fri,  4 Oct 2002 01:26:56 -0400 (EDT)
Received: from postgresql.org (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with SMTP
	id BB7CA476B85; Fri,  4 Oct 2002 01:26:54 -0400 (EDT)
Received: from localhost (postgresql.org [64.49.215.8])
	by postgresql.org (Postfix) with ESMTP id 5FD7E476759
	for <pgsql-hackers@postgresql.org>; Fri,  4 Oct 2002 01:26:52 -0400 (EDT)
Received: from mclean.mail.mindspring.net (mclean.mail.mindspring.net [207.69.200.57])
	by postgresql.org (Postfix) with ESMTP id 1F4A14766D8
	for <pgsql-hackers@postgresql.org>; Fri,  4 Oct 2002 01:26:51 -0400 (EDT)
Received: from 1cust163.tnt1.st-thomas.vi.da.uu.net ([200.58.4.163] helo=CurtisVaio)
	by mclean.mail.mindspring.net with smtp (Exim 3.33 #1)
	id 17xKzB-0000yK-00; Fri, 04 Oct 2002 01:26:49 -0400
From: "Curtis Faith" <curtis@galtair.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Advice: Where could I be of help?
Date: Fri, 4 Oct 2002 01:26:36 -0400
Message-ID: <DMEEJMCDOJAKPPFACMPMIECECEAA.curtis@galtair.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
In-Reply-To: <15090.1033685237@sss.pgh.pa.us>
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
Importance: Normal
X-Virus-Scanned: by AMaViS new-20020517
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Virus-Scanned: by AMaViS new-20020517
Status: OR

I wrote:

> > My modification was to use access counts to increase the
> > durability of the
> > more accessed blocks.
>

tom lane replies:
> You could do it that way too, but I'm unsure whether the extra
> complexity will buy anything.  Ultimately, I think an LRU-anything
> algorithm is equivalent to a clock sweep for those pages that only get
> touched once per some-long-interval: the single-touch guys get recycled
> in order of last use, which seems just like a clock sweep around the
> cache.  The guys with some amount of preference get excluded from the
> once-around sweep.  To determine whether LRU-2 is better or worse than
> some other preference algorithm requires a finer grain of analysis than
> this.  I'm not a fan of "more complex must be better", so I'd want to see
> why it's better before buying into it ...

I'm definitely not a fan of "more complex must be better" either. In fact,
it's surprising how often the real performance problems are easy to fix
and simple, while many person-years are spent solving the issue everyone
"knows" must be causing the performance problems, only to find little gain.

The key here is empirical testing. If the cache hit ratio for LRU-2 is
much better then there may be no need here. OTOH, it took less than
30 lines or so of code to do what I described, so I don't consider
it too, too "more complex" :=} We should run a test which includes
scanning indexes (or is indices the PostgreSQL convention?) that are three
or more times the size of the cache to see how well LRU-2 works. Is there
any cache performance reporting built into pgsql?

tom lane wrote:
> Shouldn't the OS be responsible for scheduling those writes
> appropriately?  Ye good olde elevator algorithm ought to handle this;
> and it's at least one layer closer to the actual disk layout than we
> are, thus more likely to issue the writes in a good order.  It's worth
> experimenting with, perhaps, but I'm pretty dubious about it.

I wasn't proposing anything other than changing the order of the writes,
not actually ensuring that they get written that way at the level you
describe above. This will help a lot on brain-dead file systems that
can't do this ordering, and probably also in cases where the number
of blocks in the cache is very large.

On a related note, while looking at the code, it seems to me that we
are writing out the buffer cache synchronously, so there won't be
any possibility of the file system reordering anyway. This appears to be
a huge performance problem. I've read claims in the archives that
the buffers are written asynchronously, but my read of the
code says otherwise. Can someone point out my error?

I only see calls that ultimately call FileWrite or write(2), which will
block without an O_NONBLOCK open. I thought one of the main reasons
for having a WAL is so that you can write out the buffers asynchronously.

What am I missing?

I wrote:
> > Then during execution if the planner turned out to be VERY wrong about
> > certain assumptions the execution system could update the stats
> > that led to
> > those wrong assumptions. That way the system would seek the
> > correct values
> > automatically.

tom lane replied:
> That has been suggested before, but I'm unsure how to make it work.
> There are a lot of parameters involved in any planning decision and it's
> not obvious which ones to tweak, or in which direction, if the plan
> turns out to be bad.  But if you can come up with some ideas, go to
> it!

I'll have to look at the current planner before I can suggest
anything concrete.

- Curtis


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org