gloda fulltext search query performance improvement landed, potentially some qualitative fallout, be aware

Andrew Sutherland asutherland at
Wed Apr 7 21:44:06 UTC 2010

  I just landed:

This should improve performance on gloda searches where we have more 
than 400 results.  It may also qualitatively affect the results 
returned.  Our ranking function is basically based on ordering messages 
by date and taking the most recent messages; our trick is that we make 
messages that are more notable (starred message, tagged message, author 
is in an address book, recipients in address book, etc.) or have more 
fulltext matches seem more recent than they really are.

The two files that affect our ranking of messages which affects who gets 
to be in the top 400 are:

In the former, the knobs to twiddle are the arguments to glodaRank in 
RANK_USAGE.  They are the per-column weights.  In the latter, the 
arguments to tweak are the COLUMN_SATURATION values.

The main potential change in results is that we:

1) Are more accurate about knowing how many matches we have.  We were 
previously estimating in an ugly fashion by the length of the offsets() 

2) Are more accurate about knowing the columns which are matching.  This 
accordingly allows us to weight the columns (the values in msg_search.js).

3) Are able to limit the impact of matches with the COLUMN_SATURATION 
values.  Previously if the user searched for "spam" and there was a 
message that went "spam spam spam spam spammity spam" a million times in 
the body of the message, that message would get boosted way into the top 
400.  Now we saturate at a count of 10 with a weight of 1.0 for each, 
which means that the message will look at most 10 weeks more recent than 
it really is on the basis of the body match (as opposed to a jillion).

The current constant choices are primarily made out of kinda-sorta 
consistency with the JS ranking function which uses roughly the same set 
of constants itself.  The JS code is actually a bit more complicated in 
terms of how it weights and scores things.  Hm.  For some reason I 
thought it was using the proximity information that can be inferred 
provided by offsets(), but it turns out I was wrong about that... so 
actually the C++ ranking function can actually do everything the gloda 
ranking function can do in terms of fulltext stuff right now.  Interesting.

The ranking function actually has even more data available to it than we 
are using, but it's somewhat limited under 3.6.22; once we are able to 
use or whenever FTS4 lands, we'll be able to do much more 
sophisticated stuff in the function.  Based on current time limitations 
and the changes in 3.6.23(.1), it doesn't seem worth trying to integrate 
that additional data and so what we have is mainly an attempt to refine 
our existing heuristic a little bit.

This message is intended as a heads-up and request for *extremely 
well-supported feedback*.  Also, since it turns out we're not using the 
proximity information, a heads-up to anyone who wants to provide a patch 
on that; I've filed 
for that.


More information about the tb-planning mailing list