Proposal to make gloda fulltext tokenizer treat '_' as punctuation without schema bump

Andrew Sutherland asutherland at asutherland.org
Tue Jul 17 00:27:12 UTC 2012


In https://bugzilla.mozilla.org/show_bug.cgi?id=774188 it was determined 
that the SQLite tokenizer thinks that underscore characters are part of 
the word and not punctuation.  So while "foo-bar-baz" tokenizes to 
["foo", "bar", "baz"], "foo_bar_baz" just tokenizes to ["foo_bar_baz"].
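
To make the difference concrete, here is a minimal sketch in Python. It is 
only a simplified model and not the actual tokenizer code: the two regular 
expressions and the tokenize() helper are hypothetical names invented for 
illustration, and they only handle ASCII.

```python
import re

# Hypothetical model of the two behaviours (ASCII only, not the real tokenizer):
# today "_" counts as a word character; under the proposal it is punctuation.
CURRENT_WORD = re.compile(r"[A-Za-z0-9_]+")
PROPOSED_WORD = re.compile(r"[A-Za-z0-9]+")

def tokenize(text, word_pattern):
    """Split text into lowercased tokens using the given word pattern."""
    return [m.group(0).lower() for m in word_pattern.finditer(text)]

print(tokenize("foo-bar-baz", CURRENT_WORD))   # ['foo', 'bar', 'baz']
print(tokenize("foo_bar_baz", CURRENT_WORD))   # ['foo_bar_baz']
print(tokenize("foo_bar_baz", PROPOSED_WORD))  # ['foo', 'bar', 'baz']
```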

For total correctness, if we change the tokenizer's behaviour we need to 
bump the gloda db schema revision so that all messages are reindexed.  
If we don't bump the schema rev, then already-indexed messages containing 
terms with "_" in them effectively become invisible to the search 
mechanism: when you search for "foo_bar_baz", the search engine will go 
looking for the effective phrase "foo bar baz", but the existing index 
still holds the single token "foo_bar_baz", and so the search will never 
find the messages in question.
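
The mismatch can be sketched with a toy index in Python. This is an 
illustration under stated assumptions, not gloda or SQLite code: the 
patterns, build_index() and phrase_search() are hypothetical, and the 
"index" is just a per-message token list.

```python
import re

OLD_WORD = re.compile(r"[A-Za-z0-9_]+")  # assumed old rule: "_" is a word character
NEW_WORD = re.compile(r"[A-Za-z0-9]+")   # assumed new rule: "_" is punctuation

def tokenize(text, pattern):
    return [m.group(0).lower() for m in pattern.finditer(text)]

def build_index(docs, pattern):
    """Map each message id to its token list (a stand-in for the fulltext index)."""
    return {doc_id: tokenize(body, pattern) for doc_id, body in docs.items()}

def phrase_search(index, query, pattern):
    """Return ids of messages whose tokens contain the query tokens in order."""
    wanted = tokenize(query, pattern)
    n = len(wanted)
    hits = []
    for doc_id, tokens in index.items():
        if any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits

docs = {1: "please call foo_bar_baz before exit", 2: "foo bar baz is unrelated"}
stale_index = build_index(docs, OLD_WORD)  # indexed before the tokenizer change
fresh_index = build_index(docs, NEW_WORD)  # what a full reindex would produce

print(phrase_search(stale_index, "foo_bar_baz", OLD_WORD))  # [1]
print(phrase_search(stale_index, "foo_bar_baz", NEW_WORD))  # [2], message 1 is lost
print(phrase_search(fresh_index, "foo_bar_baz", NEW_WORD))  # [1, 2]
```

The middle case is the no-schema-bump scenario: the query is split by the 
new rule while the stored tokens were produced by the old one.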

I don't like bumping the gloda schema rev because it has the very bad UX 
of "I upgraded Thunderbird and now Thunderbird is using a lot of my CPU 
and if I do gloda searches right now, they might not find anything".  
The argument for making the fix without bumping the schema is that 
treating underscores as part of the word is arguably already messed up, 
so searches involving "_" are not reliable right now either.

NB: There is a possible third path where we pipe the contents of the 
fulltext search table into a new table, so that fulltext reindexing 
happens without actual gloda reindexing happening.  This is arguably 
super-undesirable because SQLite would not throttle itself, so it would 
likely max out one of CPU or disk I/O, and would generate a ridiculously 
large transaction file as well as growing the database file by a lot.  
Also, I never wrote any code to do that, so that would be new code.
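
For reference, the shape of that third path is roughly a single 
INSERT ... SELECT from the old table into the new one. The sketch below 
uses Python's sqlite3 module with hypothetical table and column names 
(old_text, new_text, body); it does not use gloda's real schema or 
tokenizer, and it falls back to plain tables if the SQLite build lacks 
FTS4. In the real case the new table would be created with the changed 
tokenizer.

```python
import sqlite3

def make_text_table(conn, name):
    """Create an FTS4 table if this SQLite build has it, else a plain table."""
    try:
        conn.execute(f"CREATE VIRTUAL TABLE {name} USING fts4(body)")
    except sqlite3.OperationalError:
        conn.execute(f"CREATE TABLE {name} (docid INTEGER PRIMARY KEY, body TEXT)")

def copy_fulltext(conn, source, target):
    """Copy every row in one statement and one transaction (no throttling)."""
    with conn:
        conn.execute(
            f"INSERT INTO {target}(docid, body) SELECT docid, body FROM {source}"
        )
    return conn.execute(f"SELECT count(*) FROM {target}").fetchone()[0]

conn = sqlite3.connect(":memory:")
make_text_table(conn, "old_text")  # hypothetical stand-in for the existing table
make_text_table(conn, "new_text")  # hypothetical stand-in for the rebuilt table
with conn:
    conn.executemany(
        "INSERT INTO old_text(docid, body) VALUES (?, ?)",
        [(1, "call foo_bar_baz first"), (2, "foo bar baz")],
    )
copied = copy_fulltext(conn, "old_text", "new_text")
print(copied)
```

Because the whole copy is one statement inside one transaction, there is 
no point at which the application could pause it, which is the throttling 
concern described above.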

Andrew


