Implement comparison of generic records (composite types), and invent a pseudo-type record[] to represent arrays of possibly-anonymous composite types. Since composite datums carry their own type identification, no extra knowledge is needed at the array level. The main reason for doing this right now is that it is necessary to support the general case of detection of cycles in recursive queries: if you need to compare more than one column to detect a cycle, you need to compare a ROW() to an array built from ROW()s, at least if you want to do it as the spec suggests. Add some documentation and regression tests concerning the cycle detection issue.
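
For readers who want to trace the multi-column cycle check by hand, the following is a rough Python model of the evaluation, not PostgreSQL code and not part of this commit. It assumes a hypothetical `graph` table with columns `id`, `link`, `f1`, `f2`; the names `Edge` and `search_graph` are illustrative only. Each output row carries a `path` (a tuple of `(f1, f2)` pairs, standing in for an array of ROW()s) and a `cycle` flag computed by comparing the new pair against that path. Python's tuple equality treats `None == None` as true, which happens to match the composite-type comparison rule this commit documents (two NULL fields compare equal), but that is a coincidence of the model rather than a statement about the server implementation.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    """One row of the hypothetical graph table."""
    id: int
    link: int
    f1: Optional[str]
    f2: Optional[int]


def search_graph(graph):
    """Model a recursive query that tracks a path of (f1, f2) pairs.

    Returns a list of rows (id, link, depth, path, cycle).
    """
    # Non-recursive term: every graph row starts its own path.
    working = [(g.id, g.link, 1, ((g.f1, g.f2),), False) for g in graph]
    result = list(working)

    # Recursive term: repeat until an iteration produces no new rows.
    while working:
        next_working = []
        for _id, link, depth, path, cycle in working:
            if cycle:
                continue  # corresponds to "AND NOT cycle"
            for g in graph:
                if g.id == link:  # corresponds to "g.id = sg.link"
                    key = (g.f1, g.f2)
                    # "key in path" plays the role of ROW(...) = ANY(path).
                    next_working.append(
                        (g.id, g.link, depth + 1, path + (key,), key in path)
                    )
        result.extend(next_working)
        working = next_working
    return result


# A three-node loop: 1 -> 2 -> 3 -> 1.
example = [Edge(1, 2, "a", 1), Edge(2, 3, "b", 2), Edge(3, 1, "c", 3)]
rows = search_graph(example)
```

Without the `cycle` flag the `while` loop above would never terminate on this input. With it, each starting row is followed once around the loop and stops on the row that revisits an already-seen `(f1, f2)` pair.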
2025-11-22 12:22:45 +03:00 · 2008-10-13 16:25:20 +00:00
parent d6dfa1e6c6
commit e3b0117459
18 changed files with 809 additions and 22 deletions
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.448 2008/10/03 07:33:08 heikki Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/func.sgml,v 1.449 2008/10/13 16:25:19 tgl Exp $ -->

 <chapter id="functions">
  <title>Functions and Operators</title>
@@ -10667,6 +10667,20 @@ AND
   be either true or false, never null.
  </para>

+  <note>
+   <para>
+    The SQL specification requires row-wise comparison to return NULL if the
+    result depends on comparing two NULL values or a NULL and a non-NULL.
+    <productname>PostgreSQL</productname> does this only when comparing the
+    results of two row constructors or comparing a row constructor to the
+    output of a subquery (as in <xref linkend="functions-subquery">).
+    In other contexts where two composite-type values are compared, two
+    NULL field values are considered equal, and a NULL is considered larger
+    than a non-NULL.  This is necessary in order to have consistent sorting
+    and indexing behavior for composite types.
+   </para>
+  </note>
+
  </sect2>
 </sect1>

--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/queries.sgml,v 1.47 2008/10/07 19:27:03 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/queries.sgml,v 1.48 2008/10/13 16:25:19 tgl Exp $ -->

 <chapter id="queries">
 <title>Queries</title>
@@ -1604,8 +1604,85 @@ GROUP BY sub_part
   the recursive part of the query will eventually return no tuples,
   or else the query will loop indefinitely.  Sometimes, using
   <literal>UNION</> instead of <literal>UNION ALL</> can accomplish this
-   by discarding rows that duplicate previous output rows; this catches
-   cycles that would otherwise repeat.  A useful trick for testing queries
+   by discarding rows that duplicate previous output rows.  However, often a
+   cycle does not involve output rows that are completely duplicate: it may be
+   necessary to check just one or a few fields to see if the same point has
+   been reached before.  The standard method for handling such situations is
+   to compute an array of the already-visited values.  For example, consider
+   the following query that searches a table <structname>graph</> using a
+   <structfield>link</> field:
+
+<programlisting>
+WITH RECURSIVE search_graph(id, link, data, depth) AS (
+        SELECT g.id, g.link, g.data, 1
+        FROM graph g
+      UNION ALL
+        SELECT g.id, g.link, g.data, sg.depth + 1
+        FROM graph g, search_graph sg
+        WHERE g.id = sg.link
+)
+SELECT * FROM search_graph;
+</programlisting>
+
+   This query will loop if the <structfield>link</> relationships contain
+   cycles.  Because we require a <quote>depth</> output, just changing
+   <literal>UNION ALL</> to <literal>UNION</> would not eliminate the looping.
+   Instead we need to recognize whether we have reached the same row again
+   while following a particular path of links.  We add two columns
+   <structfield>path</> and <structfield>cycle</> to the loop-prone query:
+
+<programlisting>
+WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
+        SELECT g.id, g.link, g.data, 1,
+          ARRAY[g.id],
+          false
+        FROM graph g
+      UNION ALL
+        SELECT g.id, g.link, g.data, sg.depth + 1,
+          path || ARRAY[g.id],
+          g.id = ANY(path)
+        FROM graph g, search_graph sg
+        WHERE g.id = sg.link AND NOT cycle
+)
+SELECT * FROM search_graph;
+</programlisting>
+
+   Aside from preventing cycles, the array value is often useful in its own
+   right as representing the <quote>path</> taken to reach any particular row.
+  </para>
+
+  <para>
+   In the general case where more than one field needs to be checked to
+   recognize a cycle, use an array of rows.  For example, if we needed to
+   compare fields <structfield>f1</> and <structfield>f2</>:
+
+<programlisting>
+WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
+        SELECT g.id, g.link, g.data, 1,
+          ARRAY[ROW(g.f1, g.f2)],
+          false
+        FROM graph g
+      UNION ALL
+        SELECT g.id, g.link, g.data, sg.depth + 1,
+          path || ARRAY[ROW(g.f1, g.f2)],
+          ROW(g.f1, g.f2) = ANY(path)
+        FROM graph g, search_graph sg
+        WHERE g.id = sg.link AND NOT cycle
+)
+SELECT * FROM search_graph;
+</programlisting>
+  </para>
+
+  <tip>
+   <para>
+    Omit the <literal>ROW()</> syntax in the common case where only one field
+    needs to be checked to recognize a cycle.  This allows a simple array
+    rather than a composite-type array to be used, gaining efficiency.
+   </para>
+  </tip>
+
+  <para>
+   A helpful trick for testing queries
   when you are not certain if they might loop is to place a <literal>LIMIT</>
   in the parent query.  For example, this query would loop forever without
   the <literal>LIMIT</>: