mirror of
https://github.com/postgres/postgres.git
synced 2025-06-13 07:41:39 +03:00
Code review for HeapTupleHeader changes. Add version number to page headers
(overlaying low byte of page size) and add HEAP_HASOID bit to t_infomask, per earlier discussion. Simplify scheme for overlaying fields in tuple header (no need for cmax to live in more than one place). Don't try to clear infomask status bits in tqual.c --- not safe to do it there. Don't try to force output table of a SELECT INTO to have OIDs, either. Get rid of unnecessarily complex three-state scheme for TupleDesc.tdhasoids, which has already caused one recent failure. Improve documentation.
@@ -4,13 +4,17 @@
|
||||
|
||||
<abstract>
|
||||
<para>
|
||||
A description of the database file default page format.
|
||||
A description of the database file page format.
|
||||
</para>
|
||||
</abstract>
|
||||
|
||||
<para>
|
||||
This section provides an overview of the page format used by <productname>PostgreSQL</productname>
|
||||
tables. User-defined access methods need not use this page format.
|
||||
This section provides an overview of the page format used by
|
||||
<productname>PostgreSQL</productname> tables and indexes. (Index
|
||||
access methods need not use this page format. At present, all index
|
||||
methods do use this basic format, but the data kept on index metapages
|
||||
usually doesn't follow the item layout rules exactly.) TOAST tables
|
||||
and sequences are formatted just like a regular table.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
@@ -18,15 +22,13 @@ In the following explanation, a
|
||||
<firstterm>byte</firstterm>
|
||||
is assumed to contain 8 bits. In addition, the term
|
||||
<firstterm>item</firstterm>
|
||||
refers to data that is stored in <productname>PostgreSQL</productname> tables.
|
||||
refers to an individual data value that is stored on a page. In a table,
|
||||
an item is a tuple (row); in an index, an item is an index entry.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
||||
<xref linkend="page-table"> shows how pages in both normal
|
||||
<productname>PostgreSQL</productname> tables and
|
||||
<productname>PostgreSQL</productname> indexes (e.g., a B-tree index)
|
||||
are structured. This structure is also used for toast tables and sequences.
|
||||
<xref linkend="page-table"> shows the basic layout of a page.
|
||||
There are five parts to each page.
|
||||
|
||||
</para>
|
||||
@@ -48,12 +50,13 @@ Item
|
||||
|
||||
<row>
|
||||
<entry>PageHeaderData</entry>
|
||||
<entry>20 bytes long. Contains general information about the page to allow to access it.</entry>
|
||||
<entry>20 bytes long. Contains general information about the page, including
|
||||
free space pointers.</entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry>itemPointerData</entry>
|
||||
<entry>List of (offset,length) pairs pointing to the actual item.</entry>
|
||||
<entry>ItemPointerData</entry>
|
||||
<entry>Array of (offset,length) pairs pointing to the actual items.</entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
@@ -62,13 +65,14 @@ Item
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry>items</entry>
|
||||
<entry>The actual items themselves. Different access method have different data here.</entry>
|
||||
<entry>Items</entry>
|
||||
<entry>The actual items themselves.</entry>
|
||||
</row>
|
||||
|
||||
<row>
|
||||
<entry>Special Space</entry>
|
||||
<entry>Access method specific data. Different method store different data. Unused by normal tables.</entry>
|
||||
<entry>Index access method specific data. Different methods store different
|
||||
data. Empty in ordinary tables.</entry>
|
||||
</row>
|
||||
|
||||
</tbody>
|
||||
@@ -78,11 +82,12 @@ Item
|
||||
<para>
|
||||
|
||||
The first 20 bytes of each page consists of a page header
|
||||
(PageHeaderData). It's format is detailed in <xref
|
||||
(PageHeaderData). Its format is detailed in <xref
|
||||
linkend="pageheaderdata-table">. The first two fields deal with WAL
|
||||
related stuff. This is followed by three 2-byte integer fields
|
||||
(<firstterm>lower</firstterm>, <firstterm>upper</firstterm>, and
|
||||
<firstterm>special</firstterm>). These represent byte offsets to the start
|
||||
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
|
||||
and <structfield>pd_special</structfield>). These represent byte offsets to
|
||||
the start
|
||||
of unallocated space, to the end of unallocated space, and to the start of
|
||||
the special space.
|
||||
|
||||
@@ -104,7 +109,7 @@ Item
|
||||
<row>
|
||||
<entry>pd_lsn</entry>
|
||||
<entry>XLogRecPtr</entry>
|
||||
<entry>6 bytes</entry>
|
||||
<entry>8 bytes</entry>
|
||||
<entry>LSN: next byte after last byte of xlog</entry>
|
||||
</row>
|
||||
<row>
|
||||
@@ -132,38 +137,51 @@ Item
|
||||
<entry>Offset to start of special space.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>pd_opaque</entry>
|
||||
<entry>OpaqueData</entry>
|
||||
<entry>pd_pagesize_version</entry>
|
||||
<entry>uint16</entry>
|
||||
<entry>2 bytes</entry>
|
||||
<entry>AM-generic information. Currently just stores the page size.</entry>
|
||||
<entry>Page size and layout version number information.</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
<para>
|
||||
All the details may be found in src/include/storage/bufpage.h.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Special space is a region at the end of the page that is allocated at page
|
||||
initialization time and contains information specific to an access method.
|
||||
The last 2 bytes of the page header, <firstterm>opaque</firstterm>,
|
||||
currently only stores the page size. Page size is stored in each page
|
||||
because frames in the buffer pool may be subdivided into equal sized pages
|
||||
on a frame by frame basis within a table (is this true? - mvo).
|
||||
|
||||
The last 2 bytes of the page header,
|
||||
<structfield>pd_pagesize_version</structfield>, store both the page size
|
||||
and a version indicator. Beginning with
|
||||
<productname>PostgreSQL</productname> 7.3 the version number is 1; prior
|
||||
releases used version number 0. (The basic page layout and header format
|
||||
has not changed, but the layout of heap tuple headers has.) The page size
|
||||
is basically only present as a cross-check; there is no support for having
|
||||
more than one page size in an installation.
|
||||
</para>
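The packing of <structfield>pd_pagesize_version</structfield> described above can be sketched roughly as follows. This is only an illustration, not the server's code: the function names are hypothetical, and it assumes the page size is always a multiple of 256, so that its low byte is free to hold the version number (the commit message describes the version as "overlaying low byte of page size").

```python
# Hypothetical helpers illustrating a 16-bit field that carries both the
# page size (upper byte) and the layout version (lower byte).
# Assumption: page sizes are multiples of 256, so their low byte is zero.

def pack_pagesize_version(page_size, version):
    """Combine a page size and a layout version into one 16-bit value."""
    if page_size <= 0 or page_size > 0xFF00 or page_size % 256 != 0:
        raise ValueError("page size must be a multiple of 256 that fits in 16 bits")
    if not 0 <= version <= 0xFF:
        raise ValueError("version must fit in one byte")
    return page_size | version


def split_pagesize_version(pd_pagesize_version):
    """Return (page_size, layout_version) from the packed 16-bit value."""
    page_size = pd_pagesize_version & 0xFF00  # upper byte: page size
    version = pd_pagesize_version & 0x00FF    # lower byte: layout version
    return page_size, version


if __name__ == "__main__":
    field = pack_pagesize_version(8192, 1)
    print(split_pagesize_version(field))
```

Under this scheme a version-0 page stores the bare page size, which is consistent with the statement that prior releases used version number 0.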
|
||||
|
||||
<para>
|
||||
|
||||
Following the page header are item identifiers
|
||||
(<firstterm>ItemIdData</firstterm>). New item identifiers are allocated
|
||||
from the first four bytes of unallocated space. Because an item
|
||||
identifier is never moved until it is freed, its index may be used to
|
||||
indicate the location of an item on a page. In fact, every pointer to an
|
||||
item (<firstterm>ItemPointer</firstterm>, also know as
|
||||
<firstterm>CTID</firstterm>) created by
|
||||
<productname>PostgreSQL</productname> consists of a frame number and an
|
||||
index of an item identifier. An item identifier contains a byte-offset to
|
||||
(<type>ItemIdData</type>), each requiring four bytes.
|
||||
An item identifier contains a byte-offset to
|
||||
the start of an item, its length in bytes, and a set of attribute bits
|
||||
which affect its interpretation.
|
||||
New item identifiers are allocated
|
||||
as needed from the beginning of the unallocated space.
|
||||
The number of item identifiers present can be determined by looking at
|
||||
<structfield>pd_lower</>, which is increased to allocate a new identifier.
|
||||
Because an item
|
||||
identifier is never moved until it is freed, its index may be used on a
|
||||
long-term basis to reference an item, even when the item itself is moved
|
||||
around on the page to compact free space. In fact, every pointer to an
|
||||
item (<type>ItemPointer</type>, also known as
|
||||
<type>CTID</type>) created by
|
||||
<productname>PostgreSQL</productname> consists of a page number and the
|
||||
index of an item identifier.
|
||||
|
||||
</para>
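As a rough sketch of the arithmetic implied above, the number of item identifiers and the remaining free space can be derived from <structfield>pd_lower</structfield> and <structfield>pd_upper</structfield>. The function names are hypothetical; the constants are the 20-byte header and 4-byte item identifier sizes quoted in the text, and the real values live in the server headers.

```python
# Sizes as stated in the documentation text; treat them as assumptions
# for this sketch rather than as authoritative constants.
PAGE_HEADER_SIZE = 20  # bytes occupied by PageHeaderData
ITEM_ID_SIZE = 4       # bytes per item identifier (ItemIdData)


def item_id_count(pd_lower):
    """Number of item identifiers between the header and pd_lower."""
    return (pd_lower - PAGE_HEADER_SIZE) // ITEM_ID_SIZE


def free_space(pd_lower, pd_upper):
    """Bytes of unallocated space between the identifiers and the items."""
    return pd_upper - pd_lower


def allocate_item_id(pd_lower, pd_upper):
    """Return the new pd_lower after allocating one more item identifier."""
    if free_space(pd_lower, pd_upper) < ITEM_ID_SIZE:
        raise ValueError("no room for another item identifier")
    return pd_lower + ITEM_ID_SIZE


if __name__ == "__main__":
    lower = allocate_item_id(PAGE_HEADER_SIZE, 8192)
    print(item_id_count(lower), free_space(lower, 8192))
```

An empty page has <structfield>pd_lower</structfield> equal to the header size, so the count is zero; each allocation advances <structfield>pd_lower</structfield> by four bytes.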
|
||||
|
||||
@@ -171,8 +189,8 @@ Item
|
||||
|
||||
The items themselves are stored in space allocated backwards from the end
|
||||
of unallocated space. The exact structure varies depending on what the
|
||||
table is to contain. Sequences and tables both use a structure named
|
||||
<firstterm>HeapTupleHeaderData</firstterm>, describe below.
|
||||
table is to contain. Tables and sequences both use a structure named
|
||||
<type>HeapTupleHeaderData</type>, described below.
|
||||
|
||||
</para>
|
||||
|
||||
@@ -180,20 +198,33 @@ Item
|
||||
|
||||
The final section is the "special section" which may contain anything the
|
||||
access method wishes to store. Ordinary tables do not use this at all
|
||||
(indicated by setting the offset to the pagesize).
|
||||
(indicated by setting <structfield>pd_special</> to equal the pagesize).
|
||||
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
||||
All tuples are structured the same way. A header of around 31 bytes
|
||||
followed by an optional null bitmask and the data. The header is detailed
|
||||
below in <xref linkend="heaptupleheaderdata-table">. The null bitmask is
|
||||
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in the
|
||||
<firstterm>t_infomask</firstterm>. If it is present it takes up the space
|
||||
between the end of the header and the beginning of the data, as indicated
|
||||
by the <firstterm>t_hoff</firstterm> field. In this list of bits, a 1 bit
|
||||
indicates not-null, a 0 bit is a null.
|
||||
All table tuples are structured the same way. There is a fixed-size
|
||||
header (occupying 23 bytes on most machines), followed by an optional null
|
||||
bitmap, an optional object ID field, and the user data. The header is
|
||||
detailed
|
||||
in <xref linkend="heaptupleheaderdata-table">. The actual user data
|
||||
(fields of the tuple) begins at the offset indicated by
|
||||
<structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
|
||||
distance for the platform.
|
||||
The null bitmap is
|
||||
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
|
||||
<structfield>t_infomask</structfield>. If it is present it begins just after
|
||||
the fixed header and occupies enough bytes to have one bit per data column
|
||||
(that is, <structfield>t_natts</> bits altogether). In this list of bits, a
|
||||
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
|
||||
present, all columns are assumed not-null.
|
||||
The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
|
||||
is set in <structfield>t_infomask</structfield>. If present, it appears just
|
||||
before the <structfield>t_hoff</> boundary. Any padding needed to make
|
||||
<structfield>t_hoff</> a MAXALIGN multiple will appear between the null
|
||||
bitmap and the object ID. (This in turn ensures that the object ID is
|
||||
suitably aligned.)
|
||||
|
||||
</para>
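The layout rules in the paragraph above can be illustrated with a small sketch that computes <structfield>t_hoff</structfield> and tests a null-bitmap bit. The names are hypothetical and several details are assumptions: a 23-byte fixed header (the "most machines" figure from the text), a 4-byte object ID, a MAXALIGN distance of 8, and least-significant-bit-first ordering within each bitmap byte, which the text does not specify.

```python
FIXED_HEADER_SIZE = 23  # bytes; "on most machines" per the text
OID_SIZE = 4            # bytes for the optional object ID


def maxalign(length, alignment=8):
    """Round length up to a multiple of alignment (a power of two)."""
    return (length + alignment - 1) & ~(alignment - 1)


def bitmap_bytes(natts):
    """Bytes needed for one bit per column."""
    return (natts + 7) // 8


def compute_t_hoff(natts, has_nulls, has_oid, alignment=8):
    """Offset of user data: header + optional bitmap + padding + optional OID."""
    size = FIXED_HEADER_SIZE
    if has_nulls:
        size += bitmap_bytes(natts)
    if has_oid:
        size += OID_SIZE
    # Padding falls between the bitmap and the OID, so the OID ends at t_hoff.
    return maxalign(size, alignment)


def oid_offset(t_hoff):
    """The OID, when present, sits just before the t_hoff boundary."""
    return t_hoff - OID_SIZE


def column_is_null(bitmap, column):
    """column is 0-based; a 1 bit means not-null, a 0 bit means null.
    Bit order within a byte (LSB first) is an assumption of this sketch."""
    return not (bitmap[column >> 3] & (1 << (column & 7)))


if __name__ == "__main__":
    hoff = compute_t_hoff(natts=3, has_nulls=True, has_oid=True)
    print(hoff, oid_offset(hoff))
```

With these assumptions, a three-column tuple with a null bitmap and an object ID needs 23 + 1 + 4 = 28 bytes, which rounds up to a <structfield>t_hoff</structfield> of 32, leaving the object ID at offset 28.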
|
||||
|
||||
@@ -210,36 +241,36 @@ Item
|
||||
</row>
|
||||
</thead>
|
||||
<tbody>
|
||||
<row>
|
||||
<entry>t_oid</entry>
|
||||
<entry>Oid</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>OID of this tuple</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_cmin</entry>
|
||||
<entry>CommandId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>insert CID stamp</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_cmax</entry>
|
||||
<entry>CommandId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>delete CID stamp</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_xmin</entry>
|
||||
<entry>TransactionId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>insert XID stamp</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_cmin</entry>
|
||||
<entry>CommandId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>insert CID stamp (overlays with t_xmax)</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_xmax</entry>
|
||||
<entry>TransactionId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>delete XID stamp</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_cmax</entry>
|
||||
<entry>CommandId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>delete CID stamp (overlays with t_xvac)</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_xvac</entry>
|
||||
<entry>TransactionId</entry>
|
||||
<entry>4 bytes</entry>
|
||||
<entry>XID for VACUUM operation moving tuple</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_ctid</entry>
|
||||
<entry>ItemPointerData</entry>
|
||||
@@ -256,30 +287,28 @@ Item
|
||||
<entry>t_infomask</entry>
|
||||
<entry>uint16</entry>
|
||||
<entry>2 bytes</entry>
|
||||
<entry>Various flags</entry>
|
||||
<entry>various flags</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>t_hoff</entry>
|
||||
<entry>uint8</entry>
|
||||
<entry>1 byte</entry>
|
||||
<entry>length of tuple header. Also offset of data.</entry>
|
||||
<entry>offset to user data</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
<para>
|
||||
|
||||
All the details may be found in src/include/storage/bufpage.h.
|
||||
|
||||
All the details may be found in src/include/access/htup.h.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
||||
Interpreting the actual data can only be done with information obtained
|
||||
from other tables, mostly <firstterm>pg_attribute</firstterm>. The
|
||||
particular fields are <firstterm>attlen</firstterm> and
|
||||
<firstterm>attalign</firstterm>. There is no way to directly get a
|
||||
particular fields are <structfield>attlen</structfield> and
|
||||
<structfield>attalign</structfield>. There is no way to directly get a
|
||||
particular attribute, except when there are only fixed width fields and no
|
||||
NULLs. All this trickery is wrapped up in the functions
|
||||
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
|
||||
@@ -293,7 +322,7 @@ Item
|
||||
the next. Then make sure you have the right alignment. If the field is a
|
||||
fixed width field, then all the bytes are simply placed. If it's a
|
||||
variable length field (attlen == -1) then it's a bit more complicated,
|
||||
using the variable length structure <firstterm>varattrib</firstterm>.
|
||||
using the variable length structure <type>varattrib</type>.
|
||||
Depending on the flags, the data may be either inline, compressed or in
|
||||
another table (TOAST).
|
||||
|
||||
|