postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-21 00:42:43 +03:00

Author	SHA1	Message	Date
Andres Freund	0fd38e1370	Don't skip SQL backends in logical decoding for visibility computation. The logical decoding patchset introduced the PROC_IN_LOGICAL_DECODING PGXACT flag, which allows such backends to be skipped when computing the xmin horizon/snapshots. That's fine and sensible for walsenders streaming out logical changes, but not at all fine for SQL backends doing logical decoding. If the latter set that flag, any change they have performed outside of logical decoding will not be regarded as visible - which e.g. can lead to that change being vacuumed away. Note that not setting the flag for SQL backends isn't particularly bothersome - the SQL backend doesn't do streaming, so it only runs for a limited amount of time. Per buildfarm member 'tick' and Alvaro. Backpatch to 9.4, where logical decoding was introduced.	2014-12-02 23:47:08 +01:00
Heikki Linnakangas	0bd624d63b	Distinguish XLOG_FPI records generated for hint-bit updates. Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images generated for hint bit updates, when checksums are enabled. The new record type is replayed exactly the same as XLOG_FPI, but allows them to be tallied separately e.g. in pg_xlogdump.	2014-11-24 11:09:08 +02:00
Heikki Linnakangas	2c03216d83	Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now dissects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-dissected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compensates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.	2014-11-20 18:46:41 +02:00
Peter Eisentraut	a15d387c22	Improve logical decoding log messages suggestions from Robert Haas	2014-11-13 20:44:34 -05:00
Andres Freund	89fd41b390	Fix and improve cache invalidation logic for logical decoding. There are basically three situations in which logical decoding needs to perform cache invalidation. During/After replaying a transaction with catalog changes, when skipping an uninteresting transaction that performed catalog changes and when erroring out while replaying a transaction. Unfortunately these three cases were all done slightly differently - partially because `8de3e410fa`, which greatly simplifies matters, got committed in the midst of the development of logical decoding. The actually problematic case was when logical decoding skipped transaction commits (and thus processed invalidations). When used via the SQL interface cache invalidation could access the catalog - bad, because we didn't set up enough state to allow that correctly. It'd not be hard to set up sufficient state, but the simpler solution is to always perform cache invalidation outside a valid transaction. Also make the different cache invalidation cases look as similar as possible, to ease code review. This fixes the assertion failure reported by Antonin Houska in 53EE02D9.7040702@gmail.com. The presented testcase has been expanded into a regression test. Backpatch to 9.4, where logical decoding was introduced.	2014-11-13 20:34:31 +01:00
Andres Freund	5a2c184058	Fix xmin/xmax horizon computation during logical decoding initialization. When building the initial historic catalog snapshot there were scenarios where snapbuild.c would use incorrect xmin/xmax values when starting from an xl_running_xacts record. The values used were always a bit suspect, but happened to be correct in the easy to test cases. Notably the values used when the initial snapshot was computed while no other transactions were running were correct. This is likely to be the cause of the occasional buildfarm failures on animals markhor and tick; but it's quite possible to reproduce problems without CLOBBER_CACHE_ALWAYS. Backpatch to 9.4, where logical decoding was introduced.	2014-11-13 20:34:30 +01:00
Andres Freund	ec5896aed3	Fix several weaknesses in slot and logical replication on-disk serialization. Heikki noticed in 544E23C0.8090605@vmware.com that slot.c and snapbuild.c were missing the FIN_CRC32 call when computing/checking checksums of on disk files. That doesn't lower the error detection capabilities of the checksum, but is inconsistent with other usages. In a followup mail Heikki also noticed that, contrary to a comment, the 'version' and 'length' struct fields of replication slot's on disk data were not covered by the checksum. That's not likely to lead to actually missed corruption as those fields are cross checked with the expected version and the actual file length. But it's wrong nonetheless. As fixing these issues makes existing on disk files unreadable, bump the expected versions of on disk files for both slots and logical decoding historic catalog snapshots. This means that loading old files will fail with ERROR: "replication slot file ... has unsupported version 1" and ERROR: "snapbuild state file ... has unsupported version 1 instead of 2" respectively. Given the low likelihood of anybody already using these new features in a production setup that seems acceptable. Fixing these issues made me notice that there's no regression test covering the loading of historic snapshot from disk - so add one. Backpatch to 9.4 where these features were introduced.	2014-11-12 18:52:49 +01:00
Peter Eisentraut	8339f33d68	Message improvements	2014-11-11 20:02:30 -05:00
Alvaro Herrera	7516f52594	BRIN: Block Range Indexes BRIN is a new index access method intended to accelerate scans of very large tables, without the maintenance overhead of btrees or other traditional indexes. They work by maintaining "summary" data about block ranges. Bitmap index scans work by reading each summary tuple and comparing them with the query quals; all pages in the range are returned in a lossy TID bitmap if the quals are consistent with the values in the summary tuple, otherwise not. Normal index scans are not supported because these indexes do not store TIDs. As new tuples are added into the index, the summary information is updated (if the block range in which the tuple is added is already summarized) or not; in the latter case, a subsequent pass of VACUUM or the brin_summarize_new_values() function will create the summary information. For data types with natural 1-D sort orders, the summary info consists of the maximum and the minimum values of each indexed column within each page range. This type of operator class we call "Minmax", and we supply a bunch of them for most data types with B-tree opclasses. Since the BRIN code is generalized, other approaches are possible for things such as arrays, geometric types, ranges, etc; even for things such as enum types we could do something different than minmax with better results. In this commit I only include minmax. Catalog version bumped due to new builtin catalog entries. There's more that could be done here, but this is a good step forwards. Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera, with contribution by Heikki Linnakangas. Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas. Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo. PS: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.	2014-11-07 16:38:14 -03:00
Heikki Linnakangas	5028f22f6e	Switch to CRC-32C in WAL and other places. The old algorithm was found to not be the usual CRC-32 algorithm, used by Ethernet et al. We were using a non-reflected lookup table with code meant for a reflected lookup table. That's a strange combination that AFAICS does not correspond to any bit-wise CRC calculation, which makes it difficult to reason about its properties. Although it has worked well in practice, seems safer to use a well-known algorithm. Since we're changing the algorithm anyway, we might as well choose a different polynomial. The Castagnoli polynomial has better error-correcting properties than the traditional CRC-32 polynomial, even if we had implemented it correctly. Another reason for picking that is that some new CPUs have hardware support for calculating CRC-32C, but not CRC-32, let alone our strange variant of it. This patch doesn't add any support for such hardware, but a future patch could now do that. The old algorithm is kept around for tsquery and pg_trgm, which use the values in indexes that need to remain compatible so that pg_upgrade works. While we're at it, share the old lookup table for CRC-32 calculation between hstore, ltree and core. They all use the same table, so might as well.	2014-11-04 11:39:48 +02:00
Andres Freund	0ef3c29a4b	Improve documentation about binary/textual output mode for output plugins. Also improve related error message as it contributed to the confusion. Discussion: CAB7nPqQrqFzjqCjxu4GZzTrD9kpj6HMn9G5aOOMwt1WZ8NfqeA@mail.gmail.com, CAB7nPqQXc_+g95zWnqaa=mVQ4d3BVRs6T41frcEYi2ocUrR3+A@mail.gmail.com Per discussion between Michael Paquier, Robert Haas and Andres Freund Backpatch to 9.4 where logical decoding was introduced.	2014-10-01 13:22:17 +02:00
Peter Eisentraut	303f4d1012	Assorted message fixes and improvements	2014-09-05 01:25:27 -04:00
Andres Freund	5a64cb740d	Fix s/pluggins/plugins/ typo in two comments. Michael Paquier	2014-09-01 12:01:29 +02:00
Andres Freund	8fff977e29	Declare two variables in snapbuild.c as static. Neither is accessed externally, I just seem to have missed the static when writing the code.	2014-08-31 23:53:12 +02:00
Heikki Linnakangas	54685338e3	Move log_newpage and log_newpage_buffer to xlog.c. log_newpage is used by many indexams, in addition to heap, but for historical reasons it's always been part of the heapam rmgr. Starting with 9.3, we have another WAL record type for logging an image of a page, XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to xlog.c, and switch to using the XLOG_FPI record type. Bump the WAL version number because the code to replay the old HEAP_NEWPAGE records is removed.	2014-07-31 16:48:55 +03:00
Andres Freund	626bfad6cc	Fix decoding of consecutive MULTI_INSERTs emitted by one heap_multi_insert(). Commit `1b86c81d2d` fixed the decoding of toasted columns for the rows contained in one xl_heap_multi_insert record. But that's not actually enough, because heap_multi_insert() will actually first toast all passed in rows and then emit several *_multi_insert records; one for each page it fills with tuples. Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in xl_heap_multi_insert->flag denoting that this multi_insert record is the last emitted by one heap_multi_insert() call. Then use that flag in decode.c to only set clear_toast_afterwards in the right situation. Expand the number of rows inserted via COPY in the corresponding regression test to make sure that more than one heap page is filled with tuples by one heap_multi_insert() call. Backpatch to 9.4 like the previous commit.	2014-07-12 14:28:19 +02:00
Andres Freund	1b86c81d2d	Fix decoding of MULTI_INSERTs when rows other than the last are toasted. When decoding the results of a HEAP2_MULTI_INSERT (currently only generated by COPY FROM) toast columns for all but the last tuple weren't replaced by their actual contents before being handed to the output plugin. The reassembled toast datums were disregarded after every REORDER_BUFFER_CHANGE_(INSERT|UPDATE|DELETE) which is correct for plain inserts, updates, deletes, but not multi inserts - there we generate several REORDER_BUFFER_CHANGE_INSERTs for a single xl_heap_multi_insert record. To solve the problem add a clear_toast_afterwards boolean to ReorderBufferChange's union member that's used by modifications. All row changes but multi_inserts always set that to true, but multi_insert sets it only for the last change generated. Add a regression test covering decoding of multi_inserts - there was none at all before. Backpatch to 9.4 where logical decoding was introduced. Bug found by Petr Jelinek.	2014-07-06 15:58:01 +02:00
Andres Freund	a36a8fa376	Rename logical decoding's pg_llog directory to pg_logical. The old name wasn't very descriptive as of actual contents of the directory, which are historical snapshots in the snapshots/ subdirectory and mappingdata for rewritten tuples in mappings/. There's been a fair amount of discussion what would be a good name. I'm settling for pg_logical because it's likely that further data around logical decoding and replication will need saving in the future. Also add the missing entry for the directory into storage.sgml's list of PGDATA contents. Bumps catversion as the data directories won't be compatible.	2014-07-02 21:07:47 +02:00
Andres Freund	1cbc948010	Check interrupts during logical decoding more frequently. When reading large amounts of preexisting WAL during logical decoding using the SQL interface we possibly could fail to check interrupts in due time. Similarly the same could happen on systems with a very high WAL volume while creating a new logical replication slot, independent of the used interface. Previously these checks were only performed in xlogreader's read_page callbacks, while waiting for new WAL to be produced. That's not sufficient though, if there's never a need to wait. Walsender's send loop already contains an interrupt check. Backpatch to 9.4 where the logical decoding feature was introduced.	2014-06-30 10:49:39 +02:00
Andres Freund	e04a9ccd2c	Consistency improvements for slot and decoding code. Change the order of checks in similar functions to be the same; remove a parameter that's not needed anymore; rename a memory context and expand a couple of comments. Per review comments from Amit Kapila	2014-06-12 13:33:27 +02:00
Andres Freund	fe7337f2dc	Fix off-by-one in decoding causing one-record events to be skipped. A ReorderBufferTransaction's end_lsn, the sentPtr advocated by walsender keepalive messages, and the end location remembered by the decoding get_changes SQL functions all use the location of the last read record + 1. I.e. the LSN points to the beginning of the next record. That cannot realistically be changed without changing the replication protocol because that's how keepalive messages have worked since 9.0. The bug is that the logic inside the snapshot builder, which decides whether a transaction's contents should be decoded, assumed the start location would point towards the last byte of the last record. The reason this didn't actually cause visible problems is that currently that decision is only made for commit records. Since interesting transactions always have at least one additional record - containing actual data - we'd never skip a transaction. But if there ever were transactions, or other events, with just one record containing important information, we'd skip them after stopping and restarting logical decoding.	2014-06-05 18:27:11 +02:00
Heikki Linnakangas	57b7e83b0d	Fix misc typos in comments.	2014-05-23 08:16:21 -04:00
Tom Lane	c1907f0cc4	Fix a bunch of functions that were declared static then defined not-static. Per testing with a compiler that whines about this.	2014-05-17 17:57:53 -04:00
Tom Lane	6c42b2b10a	Fix unaligned accesses in DecodeUpdate(). The xl_heap_header_len structures in an XLOG_HEAP_UPDATE record aren't necessarily aligned adequately. The regular replay function for these records is aware of that, but decode.c didn't get the memo. I'm not sure why the buildfarm failed to catch this; the test_decoding test certainly blows up real good on my old HPPA box. Also, I'm pretty sure that the address arithmetic was wrong for the case of XLOG_HEAP_CONTAINS_OLD and not XLOG_HEAP_CONTAINS_NEW_TUPLE, though this apparently can't happen when logical decoding is active.	2014-05-17 15:53:21 -04:00
Heikki Linnakangas	03e2b1017c	Fix thinko in logical decoding of commit-prepared records. The decoding of prepared transaction commits accidentally used the XID of the transaction performing the COMMIT PREPARED, not the XID of the prepared transaction. Before `bb38fb0d43` that led to those transactions not being decoded, afterwards to an assertion failure.	2014-05-16 10:53:10 +03:00
Robert Haas	f1d8dd3647	Code review for logical decoding patch. Post-commit review identified a number of places where addition was used instead of multiplication or memory wasn't zeroed where it should have been. This commit also fixes one case where a structure member was mis-initialized, and moves another memory allocation closer to the place where the allocated storage is used for clarity. Andres Freund	2014-05-09 10:44:04 -04:00
Bruce Momjian	0a78320057	pgindent run for 9.4 This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not affect backpatching.	2014-05-06 12:12:18 -04:00
Heikki Linnakangas	377790fbd7	Pass sensible value to memset() when randomizing reorderbuffer's tuple slab. This is entirely harmless, but still wrong. Noticed by coverity. Andres Freund	2014-05-05 16:22:15 +03:00
Heikki Linnakangas	c834576839	Use Size instead of uint32 to store result of sizeof() Silences coverity and is more consistent with other functions in the same file. Andres Freund	2014-05-05 16:17:16 +03:00
Tom Lane	203b0d132f	Improve error messages in reorderbuffer.c. Be more clear about failure cases in relfilenode->relation lookup, and fix some other places that were inconsistent or not per our message style guidelines. Andres Freund and Tom Lane	2014-04-30 18:16:53 -04:00
Tom Lane	2d00190495	Rationalize common/relpath.[hc]. Commit `a730183926` created rather a mess by putting dependencies on backend-only include files into include/common. We really shouldn't do that. To clean it up: * Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in catalog/catalog.h. We won't consider this symbol part of the FE/BE API. * Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider relpath.h as the source of truth for fork numbers, since relpath.c was already partially serving that function, and anyway relfilenode.h was kind of a random place for that enum. * So, relfilenode.h now includes relpath.h rather than vice-versa. This direction of dependency is fine. (That allows most, but not quite all, of the existing explicit #includes of relpath.h to go away again.) * Push forkname_to_number from catalog.c to relpath.c, just to centralize fork number stuff a bit better. * Push GetDatabasePath from catalog.c to relpath.c; it was rather odd that the previous commit didn't keep this together with relpath(). * To avoid needing relfilenode.h in common/, redefine the underlying function (now called GetRelationPath) as taking separate OID arguments, and make the APIs using RelFileNode or RelFileNodeBackend into macro wrappers. (The macros have a potential multiple-eval risk, but none of the existing call sites have an issue with that; one of them had such a risk already anyway.) * Fix failure to follow the directions when "init" fork type was added; specifically, the errhint in forkname_to_number wasn't updated, and neither was the SGML documentation for pg_relation_size(). * Fix tablespace-path-too-long check in CreateTableSpace() to account for fork-name component of maximum-length pathnames. This requires putting FORKNAMECHARS into a header file, but it was rather useless (and actually unreferenced) where it was. The last couple of items are potentially back-patchable bug fixes, if anyone is sufficiently excited about them; but personally I'm not. Per a gripe from Christoph Berg about how include/common wasn't self-contained.	2014-04-30 17:30:50 -04:00
Heikki Linnakangas	150a9df528	Fix a few more misc typos in comments.	2014-04-10 00:53:55 +03:00
Heikki Linnakangas	5b075ae893	Fix misc typos in comments.	2014-04-09 23:16:35 +03:00
Robert Haas	3f0e4be453	Fix thinko in logical decoding code. Andres Freund	2014-03-31 13:03:18 -04:00
Alvaro Herrera	f88d4cfc9d	Setup error context callback for transaction lock waits With this in place, a session blocking behind another one because of tuple locks will get a context line mentioning the relation name, tuple TID, and operation being done on tuple. For example: LOG: process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms DETAIL: Process holding the lock: 11366. Wait queue: 11367. CONTEXT: while updating tuple (0,2) in relation "foo" STATEMENT: UPDATE foo SET value = 3; Most usefully, the new line is displayed by log entries due to log_lock_waits, although of course it will be printed by any other log message as well. Author: Christian Kruse, some tweaks by Álvaro Herrera Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas	2014-03-19 15:10:36 -03:00
Fujii Masao	2bccced110	Fix typos in comments. Thom Brown	2014-03-17 20:47:28 +09:00
Robert Haas	890194f14d	Comment fixes related to logical decoding. Andres Freund, per complaints by Peter Eisentraut.	2014-03-12 14:03:09 -04:00
Tom Lane	ea177a3ba7	Remove unportable use of anonymous unions from reorderbuffer.h. In `b89e151054` I had assumed it was ok to use anonymous unions as struct members, but while a longstanding extension in many compilers, it's only been standardized in C11. To fix, remove one of the anonymous unions which tried to hide some implementation specific enum values and give the other a name. The latter unfortunately requires changes in output plugins, but since the feature has only been added a few days ago... Andres Freund	2014-03-07 17:03:26 -05:00
Robert Haas	406a1a9ef0	Fix some typos introduced by the logical decoding patch. Erik Rijkers	2014-03-05 13:00:22 -05:00
Robert Haas	7e8db2dc42	Minor corrections to logical decoding patch.	2014-03-04 11:07:54 -05:00
Robert Haas	b89e151054	Introduce logical decoding. This feature, building on previous commits, allows the write-ahead log stream to be decoded into a series of logical changes; that is, inserts, updates, and deletes and the transactions which contain them. It is capable of handling decoding even across changes to the schema of the affected tables. The output format is controlled by a so-called "output plugin"; an example is included. To make use of this in a real replication system, the output plugin will need to be modified to produce output in the format appropriate to that system, and to perform filtering. Currently, information can be extracted from the logical decoding system only via SQL; future commits will add the ability to stream changes via walsender. Andres Freund, with review and other contributions from many other people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Geoghegan, Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.	2014-03-03 16:32:18 -05:00

91 Commits