Despite the best efforts of commit 0e5c82380, we're still seeing
occasional failures of postgres_fdw's query_cancel test in the
buildfarm. Investigation suggests that its 100ms timeout is
still not enough to reliably ensure that the remote side starts
the query before receiving the cancel request --- and if it
hasn't, it will just discard the request because it's idle.
We discussed allowing a cancel request to kill the next-received
query, but that would have wide and perhaps unpleasant side-effects.
What seems safer is to make postgres_fdw do what a human user would
likely do, which is issue another cancel request if the first one
didn't seem to do anything. We'll keep the same overall 30 second
grace period before concluding things are broken, but issue additional
cancel requests after 1 second, then 2 more seconds, then 4, then 8.
(The next one in series is 16 seconds, but we'll hit the 30 second
timeout before that.)
Having done that, revert the timeout in query_cancel.sql to 10 ms.
That will still be enough on most machines, most of the time, for
the remote query to start; but now we're intentionally risking the
race condition occurring sometimes in the buildfarm, so that the
repeat-cancel code path will get some testing.
As before, back-patch to v17. We might eventually contemplate
back-patching this further, and/or adding similar logic to dblink.
But given the lack of field complaints to date, this feels like
mostly an exercise in test case stabilization, so v17 is enough.
Discussion: https://postgr.es/m/colnv3lzzmc53iu5qoawynr6qq7etn47lmggqr65ddtpjliq5d@glkveb4m6nop
This test occasionally shows
+WARNING: could not get result of cancel request due to timeout
which appears to be because the cancel request is sometimes unluckily
sent to the remote session between queries, and then it's ignored.
This patch tries to make that less probable in three ways:
1. Use a test query that does not involve remote estimates, so that
no EXPLAINs are sent.
2. Make sure that the remote session is ready-to-go (transaction
started, SET commands sent) before we start the timer.
3. Increase the statement_timeout to 100ms, to give the local
session enough time to plan and issue the query.
We might have to go higher than 100ms to make this adequately
stable in the buildfarm, but let's see how it goes.
Back-patch to v17 where this test was introduced.
Jelte Fennema-Nio and Tom Lane
Discussion: https://postgr.es/m/578934.1725045685@sss.pgh.pa.us
This allows us to skip it in Cygwin, where it's reportedly flaky because
of platform bugs or something.
Backpatch to 17, where the test was introduced by commit 2466d6654f85.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/e4d0cb33-6be5-e4d5-ae49-9eac3ff2b005@gmail.com