mirror of https://github.com/postgres/postgres.git synced 2025-07-28 23:42:10 +03:00

Fix default text search parser's ts_headline code for phrase queries.

This code could produce very poor results when asked to highlight a
string based on a query using phrase-match operators.  The root cause
is that hlCover(), which is supposed to find a minimal substring that
matches the query, was written assuming that word position is not
significant.  I'm only 95% convinced that its algorithm was correct even
for plain AND/OR queries; but it definitely fails for phrase matches,
to the point that it may not identify a cover string at all.

Hence, rewrite hlCover() with a less-tense algorithm that just tries
all the possible substrings, earlier and shorter ones first.  (This is
not as bad as it sounds performance-wise, because all of the string
matching has been done already: the repeated tsquery match checks
boil down to pointer comparisons.)
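
As a rough illustration of the brute-force idea (not the actual hlCover()
code), the sketch below hard-codes the document and the query
lorem & (ullamcorper <-> urna) from the new regression test. The names
doc, has_word, has_phrase, query_matches and find_cover are hypothetical,
and string comparison stands in for the real tsquery match check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parsed document: one lowercased word per position. */
static const char *const doc[] = {
    "lorem", "ipsum", "urna", "nullam", "nullam", "ullamcorper", "urna"
};
#define DOC_LEN (sizeof(doc) / sizeof(doc[0]))

/* True if doc[lo..hi] (inclusive) contains the given word. */
static bool
has_word(size_t lo, size_t hi, const char *word)
{
    for (size_t i = lo; i <= hi; i++)
        if (strcmp(doc[i], word) == 0)
            return true;
    return false;
}

/* True if doc[lo..hi] contains "first" immediately followed by "second". */
static bool
has_phrase(size_t lo, size_t hi, const char *first, const char *second)
{
    for (size_t i = lo; i < hi; i++)
        if (strcmp(doc[i], first) == 0 && strcmp(doc[i + 1], second) == 0)
            return true;
    return false;
}

/* Assumed query, hard-coded: lorem & (ullamcorper <-> urna). */
static bool
query_matches(size_t lo, size_t hi)
{
    return has_word(lo, hi, "lorem") &&
           has_phrase(lo, hi, "ullamcorper", "urna");
}

/*
 * Try every substring: earlier start positions first, and for each start
 * the shorter substrings first.  The first match is reported as the cover.
 */
static bool
find_cover(size_t *lo_out, size_t *hi_out)
{
    for (size_t lo = 0; lo < DOC_LEN; lo++)
        for (size_t hi = lo; hi < DOC_LEN; hi++)
            if (query_matches(lo, hi))
            {
                *lo_out = lo;
                *hi_out = hi;
                return true;
            }
    return false;
}
```

With this input, the substring ending at "ullamcorper" contains all three
query words but does not satisfy the phrase operator, so a check that
ignores word position would stop one word too early; the position-aware
check extends the cover to the final "urna".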

Unfortunately, since that approach produces more candidate cover
strings than before, it also exposes that there were bugs in the
heuristics in mark_hl_words() for selecting a best cover string.
Fixes there include:
* Do not apply the ShortWord filter to words that appear in the query.
* Remove a misguided optimization for quickly rejecting a cover.
* Fix order-of-operation bug that could cause computation of a
  wrong figure of merit (poslen) when shortening a cover.
* Change the preference rule so that candidate headlines that do not
  include their whole cover string (after MaxWords trimming) are lowest
  priority, since they may not actually satisfy the user's query.
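
The last rule can be pictured as a two-level comparison. The sketch below
is only an illustration, not the code in mark_hl_words(): the Candidate
struct and candidate_is_better() are hypothetical, and it assumes that
poslen counts the query words in a headline and that a larger poslen
breaks ties.

```c
#include <stdbool.h>

/* Hypothetical summary of one candidate headline. */
typedef struct
{
    bool includes_whole_cover;  /* false if MaxWords trimming cut into the cover */
    int  poslen;                /* assumed: number of query words in the headline */
} Candidate;

/*
 * Returns true if candidate a should be preferred over candidate b.
 * A headline that kept its whole cover always beats one that did not;
 * only among equals on that point does poslen decide.
 */
static bool
candidate_is_better(const Candidate *a, const Candidate *b)
{
    if (a->includes_whole_cover != b->includes_whole_cover)
        return a->includes_whole_cover;
    return a->poslen > b->poslen;
}
```

Under these assumptions, a trimmed headline with many query words still
loses to an untrimmed one with fewer, because only the untrimmed one is
known to satisfy the query.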

This results in some changes in existing regression test cases,
but they all seem reasonable.  Note in particular that the tests
involving strings like "1 2 3" were previously being affected by
the ShortWord filter, masking the normal matching behavior.

Per bug #16345 from Augustinas Jokubauskas; the new test cases are
based on that example.  Back-patch to 9.6 where phrase search was
added to tsquery.

Discussion: https://postgr.es/m/16345-2e0cf5cddbdcd3b4@postgresql.org
Tom Lane
2020-04-09 13:19:23 -04:00
parent 1306edeae3
commit 8413789477
3 changed files with 146 additions and 91 deletions

@@ -1090,12 +1090,12 @@ Water, water, every where,
 Nor any drop to drink.
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'painted Ocean'));
-ts_headline
-----------------------------------
-<b>painted</b> <b>Ocean</b>. +
-Water, water, every where +
-And all the boards did shrink;+
-Water, water, every
+ts_headline
+---------------------------------------
+<b>painted</b> Ship +
+Upon a <b>painted</b> <b>Ocean</b>.+
+Water, water, every where +
+And all the boards did shrink
 (1 row)
 SELECT ts_headline('english', '
@@ -1117,6 +1117,15 @@ S. T. Coleridge (1772-1834)
 And all the boards
 (1 row)
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=1');
+ts_headline
+-------------------------------------------------------------------------------
+<b>Lorem</b> ipsum <b>urna</b>. Nullam nullam <b>ullamcorper</b> <b>urna</b>
+(1 row)
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
@@ -1153,15 +1162,15 @@ SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=2, MinWords
 (1 row)
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 & 3', 'MaxWords=4, MinWords=1');
-ts_headline
-------------------------------
-<b>1</b> 2 <b>3</b> <b>1</b>
+ts_headline
+---------------------
+<b>1</b> 2 <b>3</b>
 (1 row)
 SELECT ts_headline('simple', '1 2 3 1 3'::text, '1 <-> 3', 'MaxWords=4, MinWords=1');
-ts_headline
--------------------
-<b>1</b> <b>3</b>
+ts_headline
+----------------------------
+<b>3</b> <b>1</b> <b>3</b>
 (1 row)
 --Check if headline fragments work
@@ -1256,6 +1265,16 @@ S. T. Coleridge (1772-1834)
 S. T. <b>Coleridge</b>
 (1 row)
+--Fragments with phrase search
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxFragments=100, MaxWords=100, MinWords=1');
+ts_headline
+-------------------------------------------------------------------------------
+<b>Lorem</b> ipsum <b>urna</b>. Nullam nullam <b>ullamcorper</b> <b>urna</b>
+(1 row)
 --Rewrite sub system
 CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);
 \set ECHO none

@@ -330,6 +330,11 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', phraseto_tsquery('english', 'idle as a painted Ship'));
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxWords=100, MinWords=1');
 SELECT ts_headline('english', '
 <html>
 <!-- some comment -->
@@ -400,6 +405,12 @@ Water, water, every where,
 S. T. Coleridge (1772-1834)
 ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2,FragmentDelimiter=***');
+--Fragments with phrase search
+SELECT ts_headline('english',
+'Lorem ipsum urna. Nullam nullam ullamcorper urna.',
+to_tsquery('english','Lorem') && phraseto_tsquery('english','ullamcorper urna'),
+'MaxFragments=100, MaxWords=100, MinWords=1');
 --Rewrite sub system
 CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);