Update README-SSI. Add a section to describe the "dangerous structure" that
SSI is based on, as well as the optimizations about relative commit times and
read-only transactions. Plus a bunch of other misc fixes and improvements.

Dan Ports
		| @@ -51,13 +51,13 @@ if a transaction can be shown to always do the right thing when it is | ||||
| run alone (before or after any other transaction), it will always do | ||||
| the right thing in any mix of concurrent serializable transactions. | ||||
| Where conflicts with other transactions would result in an | ||||
| inconsistent state within the database, or an inconsistent view of | ||||
| inconsistent state within the database or an inconsistent view of | ||||
| the data, a serializable transaction will block or roll back to | ||||
| prevent the anomaly. The SQL standard provides a specific SQLSTATE | ||||
| for errors generated when a transaction rolls back for this reason, | ||||
| so that transactions can be retried automatically. | ||||
|  | ||||
| Before version 9.1 PostgreSQL did not support a full serializable | ||||
| Before version 9.1, PostgreSQL did not support a full serializable | ||||
| isolation level. A request for serializable transaction isolation | ||||
| actually provided snapshot isolation. This has well known anomalies | ||||
| which can allow data corruption or inconsistent views of the data | ||||
| @@ -77,7 +77,7 @@ Serializable Isolation Implementation Strategies | ||||
|  | ||||
| Techniques for implementing full serializable isolation have been | ||||
| published and in use in many database products for decades. The | ||||
| primary technique which has been used is Strict 2 Phase Locking | ||||
| primary technique which has been used is Strict Two-Phase Locking | ||||
| (S2PL), which operates by blocking writes against data which has been | ||||
| read by concurrent transactions and blocking any access (read or | ||||
| write) against data which has been written by concurrent | ||||
| @@ -112,54 +112,90 @@ visualize the difference between the serializable implementations | ||||
| described above, is to consider that among transactions executing at | ||||
| the serializable transaction isolation level, the results are | ||||
| required to be consistent with some serial (one-at-a-time) execution | ||||
| of the transactions[1]. How is that order determined in each? | ||||
| of the transactions [1]. How is that order determined in each? | ||||
|  | ||||
| S2PL locks rows used by the transaction in a way which blocks | ||||
| conflicting access, so that at the moment of a successful commit it | ||||
| is certain that no conflicting access has occurred. Some transactions | ||||
| may have blocked, essentially being partially serialized with the | ||||
| committing transaction, to allow this. Some transactions may have | ||||
| been rolled back, due to cycles in the blocking. But with S2PL, | ||||
| transactions can always be viewed as having occurred serially, in the | ||||
| order of successful commit. | ||||
| In S2PL, each transaction locks any data it accesses. It holds the | ||||
| locks until committing, preventing other transactions from making | ||||
| conflicting accesses to the same data in the interim. Some | ||||
| transactions may have to be rolled back to prevent deadlock. But | ||||
| successful transactions can always be viewed as having occurred | ||||
| sequentially, in the order they committed. | ||||
|  | ||||
| With snapshot isolation, reads never block writes, nor vice versa, so | ||||
| there is much less actual serialization. The order in which | ||||
| transactions appear to have executed is determined by something more | ||||
| subtle than in S2PL: read/write dependencies. If a transaction | ||||
| attempts to read data which is not visible to it because the | ||||
| transaction which wrote it (or will later write it) is concurrent | ||||
| (one of them was running when the other acquired its snapshot), then | ||||
| the reading transaction appears to have executed first, regardless of | ||||
| the actual sequence of transaction starts or commits (since it sees a | ||||
| database state prior to that in which the other transaction leaves | ||||
| it). If one transaction has both rw-dependencies in (meaning that a | ||||
| concurrent transaction attempts to read data it writes) and out | ||||
| (meaning it attempts to read data a concurrent transaction writes), | ||||
| and a couple other conditions are met, there can appear to be a cycle | ||||
| in execution order of the transactions. This is when the anomalies | ||||
| occur. | ||||
| more concurrency is possible. The order in which transactions appear | ||||
| to have executed is determined by something more subtle than in S2PL: | ||||
| read/write dependencies. If a transaction reads data, it appears to | ||||
| execute after the transaction that wrote the data it is reading. | ||||
| Similarly, if it updates data, it appears to execute after the | ||||
| transaction that wrote the previous version. These dependencies, which | ||||
| we call "wr-dependencies" and "ww-dependencies", are consistent with | ||||
| the commit order, because the first transaction must have committed | ||||
| before the second starts. However, there can also be dependencies | ||||
| between two *concurrent* transactions, i.e. where one was running when | ||||
| the other acquired its snapshot.  These "rw-conflicts" occur when one | ||||
| transaction attempts to read data which is not visible to it because | ||||
| the transaction which wrote it (or will later write it) is | ||||
| concurrent. The reading transaction appears to have executed first, | ||||
| regardless of the actual sequence of transaction starts or commits, | ||||
| because it sees a database state prior to that in which the other | ||||
| transaction leaves it. | ||||
|  | ||||
| SSI works by watching for the conditions mentioned above, and rolling | ||||
| back a transaction when needed to prevent any anomaly. The apparent | ||||
| order of execution will always be consistent with any actual | ||||
| serialization (i.e., a transaction which run by itself can always be | ||||
| considered to have run after any transactions committed before it | ||||
| started and before any transacton which starts after it commits); but | ||||
| among concurrent transactions it will appear that the transaction on | ||||
| the read side of a rw-dependency executed before the transaction on | ||||
| the write side. | ||||
| Anomalies occur when a cycle is created in the graph of dependencies: | ||||
| when a dependency or series of dependencies causes transaction A to | ||||
| appear to have executed before transaction B, but another series of | ||||
| dependencies causes B to appear before A. If that's the case, then | ||||
| the results can't be consistent with any serial execution of the | ||||
| transactions. | ||||
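
The cycle condition described above can be illustrated with a small
sketch. This is only an illustration, not part of the README or of
PostgreSQL: the function name has_cycle and the (before, after) edge
representation are assumptions made for the example. Each edge says
that one transaction appears to have executed before another; if the
edges form a cycle, no serial order is consistent with them.

```python
def has_cycle(edges):
    """Return True if the (before, after) dependency edges contain a cycle."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, []).append(after)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node already on the current path
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# Write skew: T1 reads x and writes y, T2 reads y and writes x, concurrently.
# Each appears to have run before the other, so there is a cycle.
print(has_cycle([("T1", "T2"), ("T2", "T1")]))  # True
# A plain chain of dependencies is consistent with the order T1, T2, T3.
print(has_cycle([("T1", "T2"), ("T2", "T3")]))  # False
```

As the README goes on to explain, SSI does not search for full cycles
like this; it watches for a cheaper pattern that every such cycle
must contain.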
|  | ||||
|  | ||||
| SSI Algorithm | ||||
| ------------- | ||||
|  | ||||
| Serializable transactions in PostgreSQL are implemented using | ||||
| Serializable Snapshot Isolation (SSI), based on the work of Cahill | ||||
| et al. Fundamentally, this allows snapshot isolation to run as it | ||||
| has, while monitoring for conditions which could create a serialization | ||||
| anomaly. | ||||
|  | ||||
| SSI is based on the observation [2] that each snapshot isolation | ||||
| anomaly corresponds to a cycle that contains a "dangerous structure" | ||||
| of two adjacent rw-conflict edges: | ||||
|  | ||||
|       Tin ------> Tpivot ------> Tout | ||||
|             rw             rw | ||||
|  | ||||
| SSI works by watching for this dangerous structure, and rolling | ||||
| back a transaction when needed to prevent any anomaly. This means it | ||||
| only needs to track rw-conflicts between concurrent transactions, not | ||||
| wr- and ww-dependencies. It also means there is a risk of false | ||||
| positives, because not every dangerous structure corresponds to an | ||||
| actual serialization failure. | ||||
|  | ||||
| The PostgreSQL implementation uses two additional optimizations: | ||||
|  | ||||
| * Tout must commit before any other transaction in the cycle | ||||
|   (see proof of Theorem 2.1 of [2]). We only roll back a transaction | ||||
|   if Tout commits before Tpivot and Tin. | ||||
|  | ||||
| * if Tin is read-only, there can only be an anomaly if Tout committed | ||||
|   before Tin takes its snapshot. This optimization is an original | ||||
|   one. Proof: | ||||
|  | ||||
|   - Because there is a cycle, there must be some transaction T0 that | ||||
|     precedes Tin in the serial order. (T0 might be the same as Tout). | ||||
|  | ||||
|   - The dependency between T0 and Tin can't be a rw-conflict, | ||||
|     because Tin was read-only, so it must be a wr-dependency. | ||||
|     Those can only occur if T0 committed before Tin started. | ||||
|  | ||||
|   - Because Tout must commit before any other transaction in the | ||||
|     cycle, it must commit before T0 commits -- and thus before Tin | ||||
|     starts. | ||||
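
The dangerous-structure check and the two commit-order rules above
can be sketched as follows. This is a simplified illustration under
stated assumptions, not PostgreSQL's actual code: the function names,
the (reader, writer) pair representation of rw-conflicts, and the use
of a single integer timeline for commits and snapshots are all
hypothetical.

```python
from collections import defaultdict


def find_dangerous_structures(rw_conflicts):
    """Return every (tin, tpivot, tout) with rw-conflicts tin -> tpivot -> tout.

    rw_conflicts is an iterable of (reader, writer) pairs; the reader
    appears to have executed before the writer.
    """
    readers_of = defaultdict(set)   # writer -> transactions with an edge into it
    writers_for = defaultdict(set)  # reader -> transactions it has an edge out to
    for reader, writer in rw_conflicts:
        writers_for[reader].add(writer)
        readers_of[writer].add(reader)

    found = []
    # A pivot has both an incoming and an outgoing rw-conflict.
    for pivot in sorted(set(readers_of) & set(writers_for)):
        for tin in sorted(readers_of[pivot]):
            for tout in sorted(writers_for[pivot]):
                found.append((tin, pivot, tout))
    return found


def needs_rollback(tin, pivot, tout, commit_seq, snapshot_seq=None, read_only=()):
    """Apply the two commit-order rules to one dangerous structure.

    commit_seq maps committed transactions to a commit sequence number;
    snapshot_seq maps transactions to the point on the same timeline at
    which they took their snapshot (both are assumptions of this sketch).
    """
    tout_commit = commit_seq.get(tout)
    if tout_commit is None:
        return False  # Tout has not committed, so it cannot have committed first
    for other in (tin, pivot):
        other_commit = commit_seq.get(other)
        if other != tout and other_commit is not None and other_commit < tout_commit:
            return False  # Tin or Tpivot committed before Tout
    if tin in read_only:
        snapshot = (snapshot_seq or {}).get(tin)
        if snapshot is not None:
            # Read-only rule: Tout must have committed before Tin's snapshot.
            return tout_commit < snapshot
    return True


structures = find_dangerous_structures([("T1", "T2"), ("T2", "T3")])
print(structures)  # [('T1', 'T2', 'T3')]
print(needs_rollback("T1", "T2", "T3", {"T3": 1, "T2": 2}))  # True
print(needs_rollback("T1", "T2", "T3", {"T2": 1, "T3": 2}))  # False
```

Note that Tin and Tout may be the same transaction, as in two-transaction
write skew. A True result here may still be a false positive, since not
every dangerous structure is part of an actual cycle.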
|  | ||||
|  | ||||
| PostgreSQL Implementation | ||||
| ------------------------- | ||||
|  | ||||
| The implementation of serializable transactions for PostgreSQL is | ||||
| accomplished through Serializable Snapshot Isolation (SSI), based on | ||||
| the work of Cahill, et al.  Fundamentally, this allows snapshot | ||||
| isolation to run as it has, while monitoring for conditions which | ||||
| could create a serialization anomaly. | ||||
|  | ||||
|     * Since this technique is based on Snapshot Isolation (SI), those | ||||
| areas in PostgreSQL which don't use SI can't be brought under SSI. | ||||
| This includes system tables, temporary tables, sequences, hint bit | ||||
| @@ -180,7 +216,7 @@ lock or to use SELECT FOR SHARE or SELECT FOR UPDATE. | ||||
|     * Those who want to continue to use snapshot isolation without | ||||
| the additional protections of SSI (and the associated costs of | ||||
| enforcing those protections), can use the REPEATABLE READ transaction | ||||
| isolation level.  This level will retain its legacy behavior, which | ||||
| isolation level.  This level retains its legacy behavior, which | ||||
| is identical to the old SERIALIZABLE implementation and fully | ||||
| consistent with the standard's requirements for the REPEATABLE READ | ||||
| transaction isolation level. | ||||
| @@ -236,7 +272,7 @@ in PostgreSQL, but tailored to the needs of SIREAD predicate locking, | ||||
| are used.  These refer to physical objects actually accessed in the | ||||
| course of executing the query, to model the predicates through | ||||
| inference.  Anyone interested in this subject should review the | ||||
| Hellerstein, Stonebraker and Hamilton paper[2], along with the | ||||
| Hellerstein, Stonebraker and Hamilton paper [3], along with the | ||||
| locking papers referenced from that and the Cahill papers. | ||||
|  | ||||
| Because the SIREAD locks don't block, traditional locking techniques | ||||
| @@ -273,6 +309,15 @@ transaction already holds a write lock on any tuple representing the | ||||
| row, since a rw-dependency would also create a ww-dependency which | ||||
| has more aggressive enforcement and will thus prevent any anomaly. | ||||
|  | ||||
|     * Modifying a heap tuple creates a rw-conflict with any transaction | ||||
| that holds a SIREAD lock on that tuple, or on the page or relation | ||||
| that contains it. | ||||
|  | ||||
|     * Inserting a new tuple creates a rw-conflict with any transaction | ||||
| holding a SIREAD lock on the entire relation. It doesn't conflict with | ||||
| page-level locks, because page-level locks are only used to aggregate | ||||
| tuple locks. Unlike index page locks, they don't lock "gaps" on the page. | ||||
|  | ||||
|  | ||||
| Index AM implementations | ||||
| ------------------------ | ||||
| @@ -296,13 +341,13 @@ need not generate a conflict, although an update which "moves" a row | ||||
| into the scan must generate a conflict.  While correctness allows | ||||
| false positives, they should be minimized for performance reasons. | ||||
|  | ||||
| Several optimizations are possible: | ||||
| Several optimizations are possible, though not all implemented yet: | ||||
|  | ||||
|     * An index scan which is just finding the right position for an | ||||
| index insertion or deletion need not acquire a predicate lock. | ||||
| index insertion or deletion needs not acquire a predicate lock. | ||||
|  | ||||
|     * An index scan which is comparing for equality on the entire key | ||||
| for a unique index need not acquire a predicate lock as long as a key | ||||
| for a unique index needs not acquire a predicate lock as long as a key | ||||
| is found corresponding to a visible tuple which has not been modified | ||||
| by another transaction -- there are no "between or around" gaps to | ||||
| cover. | ||||
| @@ -317,10 +362,10 @@ x = 1 AND x = 2), then no predicate lock is needed. | ||||
|  | ||||
| Other index AM implementation considerations: | ||||
|  | ||||
|     * If a btree search discovers that no root page has yet been | ||||
| created, a predicate lock on the index relation is required; | ||||
| otherwise btree searches must get to the leaf level to determine | ||||
| which tuples match, so predicate locks go there. | ||||
|     * B-tree index searches acquire predicate locks only on the | ||||
| index *leaf* pages needed to lock the appropriate index range. If, | ||||
| however, a search discovers that no root page has yet been created, a | ||||
| predicate lock on the index relation is required. | ||||
|  | ||||
|     * GiST searches can determine that there are no matches at any | ||||
| level of the index, so there must be a predicate lock at each index | ||||
| @@ -346,11 +391,6 @@ to be added from scratch. | ||||
|  | ||||
|    2. The existing in-memory lock structures were not suitable for | ||||
| tracking SIREAD locks. | ||||
|           * The database products used for the prototype | ||||
| implementations for the papers used update-in-place with a rollback | ||||
| log for their MVCC implementations, while PostgreSQL leaves the old | ||||
| version of a row in place and adds a new tuple to represent the row | ||||
| at a new location. | ||||
|           * In PostgreSQL, tuple level locks are not held in RAM for | ||||
| any length of time; lock information is written to the tuples | ||||
| involved in the transactions. | ||||
| @@ -450,18 +490,19 @@ there can't be a rw-conflict from T3 to T0. | ||||
|  | ||||
|           o In both cases, we didn't need the T1 -> T3 edge. | ||||
|  | ||||
|     * Predicate locking in PostgreSQL will start at the tuple level | ||||
| when possible, with automatic conversion of multiple fine-grained | ||||
| locks to coarser granularity as need to avoid resource exhaustion. | ||||
| The amount of memory used for these structures will be configurable, | ||||
| to balance RAM usage against SIREAD lock granularity. | ||||
|     * Predicate locking in PostgreSQL starts at the tuple level | ||||
| when possible. Multiple fine-grained locks are promoted to a single | ||||
| coarser-granularity lock as needed to avoid resource exhaustion.  The | ||||
| amount of memory used for these structures is configurable, to balance | ||||
| RAM usage against SIREAD lock granularity. | ||||
|  | ||||
|     * A process-local copy of locks held by a process and the coarser | ||||
| covering locks with counts, are kept to support granularity promotion | ||||
| decisions with low CPU and locking overhead. | ||||
|     * Each backend keeps a process-local table of the locks it holds. | ||||
| To support granularity promotion decisions with low CPU and locking | ||||
| overhead, this table also includes the coarser covering locks and the | ||||
| number of finer-granularity locks they cover. | ||||
|  | ||||
|     * Conflicts will be identified by looking for predicate locks | ||||
| when tuples are written and looking at the MVCC information when | ||||
|     * Conflicts are identified by looking for predicate locks | ||||
| when tuples are written, and by looking at the MVCC information when | ||||
| tuples are read. There is no matching between two RAM-based locks. | ||||
|  | ||||
|     * Because write locks are stored in the heap tuples rather than a | ||||
| @@ -493,12 +534,12 @@ to be READ ONLY.) | ||||
|           o We can more aggressively clean up conflicts, predicate | ||||
| locks, and SSI transaction information. | ||||
|  | ||||
|     * Allow a READ ONLY transaction to "opt out" of SSI if there are | ||||
|     * We allow a READ ONLY transaction to "opt out" of SSI if there are | ||||
| no READ WRITE transactions which could cause the READ ONLY | ||||
| transaction to ever become part of a "dangerous structure" of | ||||
| overlapping transaction dependencies. | ||||
|  | ||||
|     * Allow the user to request that a READ ONLY transaction wait | ||||
|     * We allow the user to request that a READ ONLY transaction wait | ||||
| until the conditions are right for it to start in the "opt out" state | ||||
| described above. We add a DEFERRABLE state to transactions, which is | ||||
| specified and maintained in a way similar to READ ONLY. It is | ||||
| @@ -538,12 +579,6 @@ address it? | ||||
| replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc. | ||||
| This is related to the "WAL file replay" issue. | ||||
|  | ||||
|     * Weak-memory-ordering machines. Make sure that shared memory | ||||
| access which involves visibility across multiple transactions uses | ||||
| locks as needed to avoid problems. On the other hand, ensure that we | ||||
| really need volatile where we're using it. | ||||
| http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php | ||||
|  | ||||
|     * UNIQUE btree search for equality on all columns. Since a search | ||||
| of a UNIQUE index using equality tests on all columns will lock the | ||||
| heap tuple if an entry is found, it appears that there is no need to | ||||
| @@ -551,15 +586,6 @@ get a predicate lock on the index in that case. A predicate lock is | ||||
| still needed for such a search if a matching index entry which points | ||||
| to a visible tuple is not found. | ||||
|  | ||||
|     * Planner index probes. To avoid problems with data skew at the | ||||
| ends of an index which have historically caused bad plans, the | ||||
| planner now probes the end of an index to see what the maximum or | ||||
| minimum value is when a query appears to be requesting a range of | ||||
| data outside what statistics shows is present. These planner checks | ||||
| don't require predicate locking, but there's currently no easy way to | ||||
| avoid it. What can we do to avoid predicate locking for such planner | ||||
| activity? | ||||
|  | ||||
|     * Minimize touching of shared memory. Should lists in shared | ||||
| memory push entries which have just been returned to the front of the | ||||
| available list, so they will be popped back off soon and some memory | ||||
| @@ -573,13 +599,17 @@ Footnotes | ||||
| [1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt | ||||
| Search for serial execution to find the relevant section. | ||||
|  | ||||
| [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf | ||||
| Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007. | ||||
| [2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM | ||||
| Transactions on Database Systems 30:2, Jun. 2005. | ||||
| http://dx.doi.org/10.1145/1071610.1071615 | ||||
|  | ||||
| [3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007. | ||||
| Architecture of a Database System. Foundations and Trends(R) in | ||||
| Databases Vol. 1, No. 2 (2007) 141-259. | ||||
| http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf | ||||
|   Of particular interest: | ||||
|     * 6.1 A Note on ACID | ||||
|     * 6.2 A Brief Review of Serializability | ||||
|     * 6.3 Locking and Latching | ||||
|     * 6.3.1 Transaction Isolation Levels | ||||
|     * 6.5.3 Next-Key Locking: Physical Surrogates for Logical | ||||
|     * 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties | ||||
|   | ||||