mirror of
https://github.com/MariaDB/server.git
synced 2025-07-29 05:21:33 +03:00
boolean fulltext search weighting scheme changed
@ -463,3 +463,4 @@ mysql-test/r/rpl000001.eval
|
|||||||
Docs/safe-mysql.xml
|
Docs/safe-mysql.xml
|
||||||
mysys/test_vsnprintf
|
mysys/test_vsnprintf
|
||||||
Docs/manual.de.log
|
Docs/manual.de.log
|
||||||
|
Docs/internals.info
|
||||||
|
@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
|
|||||||
* mysys functions:: Functions In The @code{mysys} Library
|
* mysys functions:: Functions In The @code{mysys} Library
|
||||||
* DBUG:: DBUG Tags To Use
|
* DBUG:: DBUG Tags To Use
|
||||||
* protocol:: MySQL Client/Server Protocol
|
* protocol:: MySQL Client/Server Protocol
|
||||||
|
* Fulltext Search:: Fulltext Search in MySQL
|
||||||
@end menu
|
@end menu
|
||||||
|
|
||||||
|
|
||||||
@ -535,7 +536,7 @@ Print query.
|
|||||||
@end table
|
@end table
|
||||||
|
|
||||||
|
|
||||||
@node protocol, , DBUG, Top
|
@node protocol, Fulltext Search, DBUG, Top
|
||||||
@chapter MySQL Client/Server Protocol
|
@chapter MySQL Client/Server Protocol
|
||||||
|
|
||||||
@menu
|
@menu
|
||||||
@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00
|
|||||||
|
|
||||||
@c @printindex fn
|
@c @printindex fn
|
||||||
|
|
||||||
|
@node Fulltext Search, , protocol, Top
|
||||||
|
@chapter Fulltext Search in MySQL
|
||||||
|
|
||||||
|
Hopefully, there will eventually be a complete description of
|
||||||
|
fulltext search algorithms.
|
||||||
|
For now, these are just unsorted notes.
|
||||||
|
|
||||||
|
@menu
|
||||||
|
* Weighting in boolean mode::
|
||||||
|
@end menu
|
||||||
|
|
||||||
|
@node Weighting in boolean mode, , , Fulltext Search
|
||||||
|
@section Weighting in boolean mode
|
||||||
|
|
||||||
|
The basic idea is as follows: in the expression
|
||||||
|
@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
|
||||||
|
is enough to match the whole expression, whereas @code{C},
|
||||||
|
@code{D}, and @code{E} should @strong{all} match. So it's
|
||||||
|
reasonable to assign weight 1 to @code{A}, @code{B}, and
|
||||||
|
@code{(C and D and E)}, while @code{C}, @code{D}, and @code{E}
|
||||||
|
should each get a weight of 1/3.
|
||||||
|
|
||||||
|
Things become more complicated when considering boolean
|
||||||
|
operators, as used in MySQL FTB. Obviously, @code{+A +B}
|
||||||
|
should be treated as @code{A and B}, and @code{A B} -
|
||||||
|
as @code{A or B}. The problem is that @code{+A B} can @strong{not}
|
||||||
|
be rewritten in and/or terms (that's the reason why this - extended -
|
||||||
|
set of operators was chosen). Still, approximations can be used.
|
||||||
|
@code{+A B C} can be approximated as @code{A or (A and (B or C))}
|
||||||
|
or as @code{A or (A and B) or (A and C) or (A and B and C)}.
|
||||||
|
Applying the above logic (and omitting mathematical
|
||||||
|
transformations and normalization) one gets that for
|
||||||
|
@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
|
||||||
|
should be: @code{A_i = 1/N}, @code{B_j=1} if @code{N==0}, and,
|
||||||
|
otherwise, in the first rewriting approach @code{B_j = 1/3},
|
||||||
|
and in the second one - @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))}.
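As a worked check (not part of the original notes), the second formula can be evaluated for small values of @code{M}. The notation below simply restates the expression above; the limit is derived from it and is not stated in the source.

```latex
B_j(M) = \frac{1 + (M-1)\,2^{M}}{M\,\bigl(2^{M+1}-1\bigr)}, \qquad
B_j(1) = \frac{1}{3}, \quad
B_j(2) = \frac{5}{14}, \quad
B_j(3) = \frac{17}{45}, \qquad
\lim_{M\to\infty} B_j(M) = \frac{1}{2}
```

For @code{M = 1} both approaches agree on 1/3; for larger @code{M} the second approach gives each optional word a weight between 1/3 and 1/2.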
|
||||||
|
|
||||||
|
The second expression gives a somewhat steeper increase in total
|
||||||
|
weight as the number of matched B's increases, because it assigns
|
||||||
|
higher weights to individual B's. Also, the first expression is
|
||||||
|
much simpler, so it is the first one that is implemented in MySQL.
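The comparison between the two approaches can be sketched in C. This is only an illustration under stated assumptions: the function names are hypothetical and do not exist in the MySQL sources, the weights of matched words are assumed to add up, and all required words are assumed to have matched (so together they contribute N * 1/N = 1).

```c
#include <assert.h>

/* Hypothetical helpers illustrating the weights described in the notes;
   none of these names come from the MySQL sources. */

/* Weight of each required word A_i in "+A_1 ... +A_N B_1 ... B_M" (N > 0). */
double required_weight(unsigned n)
{
  assert(n > 0);
  return 1.0 / n;
}

/* First rewriting approach: B_j = 1 if there are no required words, else 1/3. */
double optional_weight_first(unsigned n)
{
  return n == 0 ? 1.0 : 1.0 / 3.0;
}

/* Second rewriting approach (N > 0):
   B_j = (1 + (M-1)*2^M) / (M*(2^(M+1) - 1)). */
double optional_weight_second(unsigned m)
{
  assert(m > 0 && m < 31);
  double p = (double)(1UL << m);          /* 2^M */
  return (1.0 + (m - 1) * p) / (m * (2.0 * p - 1.0));
}

/* Assumed total for a row that matches every required word (contributing 1
   in total) plus k optional words of equal weight. */
double total_weight(double optional_weight, unsigned k)
{
  return 1.0 + k * optional_weight;
}
```

With required words present and three optional words all matched, @code{total_weight(optional_weight_first(1), 3)} gives 2, while @code{total_weight(optional_weight_second(3), 3)} gives roughly 2.13, which is the steeper growth mentioned above.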
|
||||||
|
|
||||||
@summarycontents
|
@summarycontents
|
||||||
@contents
|
@contents
|
||||||
|
|
||||||
|
@ -48933,6 +48933,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.
|
|||||||
|
|
||||||
@itemize @bullet
|
@itemize @bullet
|
||||||
@item
|
@item
|
||||||
|
Boolean fulltext search weighting scheme changed to something more reasonable.
|
||||||
|
@item
|
||||||
Fixed bug in boolean fulltext search, that caused MySQL to ignore queries of
|
Fixed bug in boolean fulltext search, that caused MySQL to ignore queries of
|
||||||
@code{ft_min_word_len} characters.
|
@code{ft_min_word_len} characters.
|
||||||
@item
|
@item
|
||||||
|
@ -322,7 +322,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
|
|||||||
break;
|
break;
|
||||||
if (yn & FTB_FLAG_YES)
|
if (yn & FTB_FLAG_YES)
|
||||||
{
|
{
|
||||||
ftbe->cur_weight+=weight;
|
ftbe->cur_weight += weight / ftbe->ythresh;
|
||||||
if (++ftbe->yesses == ythresh)
|
if (++ftbe->yesses == ythresh)
|
||||||
{
|
{
|
||||||
yn=ftbe->flags;
|
yn=ftbe->flags;
|
||||||
@ -360,7 +360,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
|
|||||||
}
|
}
|
||||||
else
|
else
|
||||||
{
|
{
|
||||||
ftbe->cur_weight+=weight;
|
ftbe->cur_weight += ftbe->ythresh ? weight/3 : weight;
|
||||||
if (ftbe->yesses < ythresh)
|
if (ftbe->yesses < ythresh)
|
||||||
break;
|
break;
|
||||||
yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
|
yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
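The effect of the two changed lines can be shown in isolation with a much simplified model. This is a sketch only: @code{struct expr_sketch} and the three functions are hypothetical stand-ins, the field names merely echo those visible in the hunks, and the tree climbing and flag handling of @code{_ftb_climb_the_tree} are not modelled.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the expression node in the hunks. */
struct expr_sketch {
  unsigned ythresh;    /* number of required (+) words, N */
  unsigned yesses;     /* required words matched so far */
  double   cur_weight; /* weight accumulated for the current row */
};

/* A required word matched: each one contributes weight / N. */
void add_required(struct expr_sketch *e, double weight)
{
  assert(e->ythresh > 0);
  e->cur_weight += weight / e->ythresh;
  e->yesses++;
}

/* An optional word matched: weight / 3 when required words exist,
   the full weight otherwise. */
void add_optional(struct expr_sketch *e, double weight)
{
  e->cur_weight += e->ythresh ? weight / 3 : weight;
}

/* In this sketch the row qualifies once every required word has matched. */
int all_required_matched(const struct expr_sketch *e)
{
  return e->yesses >= e->ythresh;
}
```

For a query shaped like @code{+A +B C} where all three words match with unit weight, the sketch accumulates 1/2 + 1/2 + 1/3; for @code{C} alone with no required words it accumulates 1.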
|
||||||
|