Proposal to make gloda fulltext tokenizer treat '_' as punctuation without schema bump

Wayne Mery vseerror at Lehigh.EDU
Tue Jul 17 18:49:01 UTC 2012

Quoting Kent James <kent at>:

> On 7/17/2012 10:15 AM, Andrew Sutherland wrote:
>> On 07/17/2012 03:34 AM, Tanstaafl wrote:
>>> But wouldn't it be better to simply write the index/reindex code  
>>> so that it simply doesn't and *can* not consume all CPU cycles? Is  
>>> there no way to throttle it so that it never uses more than, say, 20%?
>> The code does use adaptive scheduling to try and detect how much  
>> CPU/system time it is using, as well as to notice when the system  
>> appears to be under load (many thanks to rkent for this!) in order  
>> to limit its activities so it doesn't harsh the system.  
>> Unfortunately, this is a tricky thing to do given the limited  
>> platform facilities at hand and how much stuff happens and needs to  
>> happen on the main thread in Thunderbird.  It is possible that  
>> virus checkers are making this much worse on Windows, but I don't
>> have any hard numbers.

It has been reported that heavy gloda indexing combined with the
indexing load of Windows Search, or Spotlight on Mac, is quite bad.

Other cases of bad performance have been reported, as Tanstaafl alludes
to.  In some cases it's clearly inferior hardware (old, bad disk,
etc).  Beyond that, nothing widespread, but most often no solution is
found, which is obviously frustrating for the user. The throttling is
simply insufficient in these cases.  If I had to guess, I'd say most
are disk saturation, as rkent mentions. I can reproduce it. And I can
believe that in some cases AV is involved.

It would be helpful to have a tool or methodology to determine what's wrong.

>> Right now the CPU targets are for 50% utilization while the user is  
>> using Thunderbird and 83% while the user is not using Thunderbird.
>> This is an area where I would be very happy to work with someone  
>> who has the time to get some actual numbers by using profilers like  
>> Xperf and/or augmenting our telemetry reports and delving through  
>> them.
>> Andrew
> Unfortunately many times the problem is not CPU, but disk  
> saturation, and we have no real way to detect and throttle that.
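
To make the quoted CPU targets concrete, here is a minimal sketch of how a
utilization target translates into a pause between indexing batches. This is
not gloda's actual scheduler. The function name and constants are hypothetical;
only the 50% and 83% figures come from Andrew's message above.

```python
# Hypothetical duty-cycle throttle; NOT gloda's real adaptive scheduler.
# Only the two target figures are taken from the thread.
ACTIVE_TARGET = 0.50  # target CPU share while the user is using Thunderbird
IDLE_TARGET = 0.83    # target CPU share while the user is not


def pause_after(work_seconds, user_active):
    """Return how long to sleep after a batch that took work_seconds of CPU,
    so that work / (work + pause) equals the target utilization."""
    target = ACTIVE_TARGET if user_active else IDLE_TARGET
    return work_seconds * (1.0 - target) / target


if __name__ == "__main__":
    # A 100 ms batch, first with the user active, then with the user idle.
    print(pause_after(0.1, user_active=True))
    print(pause_after(0.1, user_active=False))
```

Note that a throttle of this kind only bounds CPU time. It does nothing about
disk saturation, which is the case Kent describes.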

In case someone wants to dig in and get dirty, the major open gloda perf
issues (other than the previously mentioned stopword support) are bug
551209, and two labeled as perf from [1], bug 585429 and bug 632791.

But these last two are more about speed of indexing, not so much about
Thunderbird responsiveness.  Identifiable bugs should of course be
reported, but so far none of the users I've helped since version 5 have
had enough info to make a report worthwhile.



More information about the tb-planning mailing list