Proposal to make gloda fulltext tokenizer treat '_' as punctuation without schema bump
asutherland at asutherland.org
Tue Jul 17 00:27:12 UTC 2012
In https://bugzilla.mozilla.org/show_bug.cgi?id=774188 it was determined
that the SQLite tokenizer thinks that underscore characters are part of
the word and not punctuation. So while "foo-bar-baz" tokenizes to
["foo", "bar", "baz"], "foo_bar_baz" just tokenizes to ["foo_bar_baz"].
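
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the actual gloda/SQLite tokenizer; the function name tokenize and the underscore_is_punctuation flag are made up for illustration, and a regex stands in for the real tokenizer logic.

```python
import re

def tokenize(text, underscore_is_punctuation=False):
    """Split text into lowercase word tokens (illustrative stand-in only)."""
    # \w matches letters, digits and '_', which mimics the current behaviour.
    # [^\W_] is the same class minus '_', which mimics the proposed behaviour.
    pattern = r"[^\W_]+" if underscore_is_punctuation else r"\w+"
    return [token.lower() for token in re.findall(pattern, text)]

current_hyphen = tokenize("foo-bar-baz")
current_underscore = tokenize("foo_bar_baz")
proposed_underscore = tokenize("foo_bar_baz", underscore_is_punctuation=True)

print(current_hyphen)       # ['foo', 'bar', 'baz']
print(current_underscore)   # ['foo_bar_baz']
print(proposed_underscore)  # ['foo', 'bar', 'baz']
```
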
For total correctness, if we change the tokenizer's behaviour,
we need to bump the gloda db schema revision so that all messages are
reindexed. If we don't bump the schema rev, then search terms with "_"
in them effectively become invisible to the search mechanism because the
search engine will go looking for the effective phrase "foo bar baz"
when you search for "foo_bar_baz", and so will never find the messages
that were indexed under the old tokenizer as the single token "foo_bar_baz".
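
The mismatch between an old index and a new query tokenizer can be sketched as follows. This is a toy in-memory model, not gloda code: build_index, phrase_search and the sample messages are all hypothetical, and a plain dict stands in for the fulltext table.

```python
import re

def tokenize(text, underscore_is_punctuation):
    # Regex stand-in for the real tokenizer; '_' is a word character
    # unless underscore_is_punctuation is set.
    pattern = r"[^\W_]+" if underscore_is_punctuation else r"\w+"
    return [token.lower() for token in re.findall(pattern, text)]

def build_index(messages, underscore_is_punctuation):
    """Map message id -> token list, as produced at indexing time."""
    return {mid: tokenize(body, underscore_is_punctuation)
            for mid, body in messages.items()}

def phrase_search(index, query, underscore_is_punctuation):
    """Return sorted ids whose tokens contain the query tokens consecutively."""
    wanted = tokenize(query, underscore_is_punctuation)
    n = len(wanted)
    hits = []
    for mid, tokens in index.items():
        if n and any(tokens[i:i + n] == wanted
                     for i in range(len(tokens) - n + 1)):
            hits.append(mid)
    return sorted(hits)

messages = {
    1: "please call foo_bar_baz here",
    2: "see foo-bar-baz instead",
}

# Index built before the change, queried before the change.
old_index = build_index(messages, underscore_is_punctuation=False)
before = phrase_search(old_index, "foo_bar_baz", underscore_is_punctuation=False)

# Same old index, but the query now goes through the new tokenizer
# (tokenizer changed, schema rev not bumped).
stale = phrase_search(old_index, "foo_bar_baz", underscore_is_punctuation=True)

# Index rebuilt with the new tokenizer (schema rev bumped, full reindex).
new_index = build_index(messages, underscore_is_punctuation=True)
after = phrase_search(new_index, "foo_bar_baz", underscore_is_punctuation=True)

print(before)  # [1]
print(stale)   # [2]
print(after)   # [1, 2]
```

In this toy model, message 1 is the one that drops out of the results when the tokenizer changes without a reindex, even though it is the only message that literally contains "foo_bar_baz".
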
I don't like bumping the gloda schema rev because it has the very bad UX
of "I upgraded Thunderbird and now Thunderbird is using a lot of my CPU
and if I do gloda searches right now, they might not find anything".
The argument for making the fix without bumping the schema is that the
current behaviour, treating underscores as part of the word, is arguably
already messed up.
NB: There is a possible third path where we pipe the contents of the
fulltext search table into a new table so fulltext reindexing happens
without actual gloda reindexing happening. This is arguably
super-undesirable because SQLite would not throttle itself, so it would
likely max out either CPU or disk I/O, and would generate a ridiculously
large transaction file as well as growing the database file by a lot.
Also, I never wrote any code to do that, so that would be new code.