gloda fulltext search query performance improvement landed, potentially some qualitative fallout, be aware
asutherland at asutherland.org
Wed Apr 7 21:44:06 UTC 2010
I just landed:
This should improve performance on gloda searches where we have more
than 400 results. It may also qualitatively affect the results
returned. Our ranking function is basically based on ordering messages
by date and taking the most recent messages; our trick is that we make
messages that are more notable (starred message, tagged message, author
is in an address book, recipients in address book, etc.) or have more
fulltext matches seem more recent than they really are.
The two files that affect our ranking of messages which affects who gets
to be in the top 400 are:
In the former, the knobs to twiddle are the arguments to glodaRank in
RANK_USAGE. They are the per-column weights. In the latter, the
arguments to tweak are the COLUMN_SATURATION values.
The main potential change in results is that we:
1) Are more accurate about knowing how many matches we have. We were
previously estimating in an ugly fashion by the length of the offsets()
2) Are more accurate about knowing the columns which are matching. This
accordingly allows us to weight the columns (the values in msg_search.js).
3) Are able to limit the impact of matches with the COLUMN_SATURATION
values. Previously if the user searched for "spam" and there was a
message that went "spam spam spam spam spammity spam" a million times in
the body of the message, that message would get boosted way into the top
400. Now we saturate at a count of 10 with a weight of 1.0 for each,
which means that the message will look at most 10 weeks more recent than
it really is on the basis of the body match (as opposed to a jillion).
The current constant choices are primarily made out of kinda-sorta
consistency with the JS ranking function which uses roughly the same set
of constants itself. The JS code is actually a bit more complicated in
terms of how it weights and scores things. Hm. For some reason I
thought it was using the proximity information that can be inferred
provided by offsets(), but it turns out I was wrong about that... so
actually the C++ ranking function can actually do everything the gloda
ranking function can do in terms of fulltext stuff right now. Interesting.
The ranking function actually has even more data available to it than we
are using, but it's somewhat limited under 3.6.22; once we are able to
use 184.108.40.206 or whenever FTS4 lands, we'll be able to do much more
sophisticated stuff in the function. Based on current time limitations
and the changes in 3.6.23(.1), it doesn't seem worth trying to integrate
that additional data and so what we have is mainly an attempt to refine
our existing heuristic a little bit.
This message is intended as a heads-up and request for *extremely
well-supported feedback*. Also, since it turns out we're not using the
proximity information, a heads-up to anyone who wants to provide a patch
on that; I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=557915
More information about the tb-planning