Thunderbird Conversations : feedback wanted!

Andrew Sutherland asutherland at
Wed Dec 1 09:42:44 UTC 2010

On 11/30/2010 03:32 PM, Ben Bucksch wrote:
> Given that this is likely to be useful for many other callers, could 
> this be implemented in gloda? If you think that would hurt in case of 
> some callers, maybe that could be optional per connection/caller.

Yes, this would be a reasonable thing to have in the gloda core.  There 
are some queries that we currently are unable to keep up-to-date in 
memory as indexing occurs (fulltext and LIKE constraints), but for most 
relational queries that are not "frozen" to updates, it could be done.  
Because Gloda already uses weak references to keep track of collections, 
we could probably get a lot of mileage without actually impacting 
garbage collection.
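To illustrate the weak-reference idea, here is a minimal sketch in Python (Gloda itself is JavaScript, and these class and method names are hypothetical, not Gloda's actual API). Live collections sit in a WeakValueDictionary so the indexer can push matching new rows into them, yet each collection is reclaimed as soon as no caller holds it:

```python
import weakref

class Collection:
    """Holds the rows of one live query result."""
    def __init__(self, query_key, rows):
        self.query_key = query_key
        self.rows = list(rows)

class CollectionManager:
    def __init__(self):
        # Weak values: a collection vanishes from the cache as soon as
        # its last external reference is dropped, so the cache itself
        # never extends object lifetimes.
        self._live = weakref.WeakValueDictionary()

    def register(self, coll):
        self._live[coll.query_key] = coll

    def notify_new_row(self, row, matches):
        # As indexing proceeds, push each newly indexed row into every
        # still-alive collection whose query it matches.  (Fulltext and
        # LIKE queries would be excluded, per the limitation above.)
        for coll in list(self._live.values()):
            if matches(coll.query_key, row):
                coll.rows.append(row)
```

The `list(...)` copy matters: it pins each collection for the duration of the loop so garbage collection cannot mutate the dictionary mid-iteration.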

Holding collections alive in memory for longer could be problematic, 
though: the benefit of a successful cache hit is directly proportional 
to the amount of memory we are eating, which in turn is proportional to 
the potential cost of erroneously keeping that memory alive.

> Another option: Given that most gloda databases should be in the 
> 10-100MB range (mine is 600 MB, but it's a big mailbox), and many 
> machines today have 2+GB RAM *, is it possible to somehow tell the OS 
> to pre-cache certain parts of gloda into RAM disk cache?

Taras Glek has done some really fantastic research about improving 
SQLite performance that can be found on his blog.  The key 
takeaway problem points for 
me were that SQLite databases can get really fragmented (both from a 
file-system perspective and an internal btree perspective) and operating 
system pre-fetch is really limited even when you have high disk 
locality.  Given the fragmentation that can occur and that efficient 
database queries do not exhibit linear access patterns (even on 
unfragmented databases), pre-caching of the gloda database as it exists 
is not likely to be helpful unless we can cache the whole thing.
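For the internal-fragmentation half of the problem, SQLite itself exposes enough pragmas to get a rough picture; here is a small diagnostic sketch (file-system fragmentation would have to be measured with OS tools instead):

```python
import sqlite3

def db_stats(path):
    """Report page-level stats for a SQLite database file."""
    conn = sqlite3.connect(path)
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist = conn.execute("PRAGMA freelist_count").fetchone()[0]
    conn.close()
    return {
        "page_size": page_size,
        "db_bytes": page_size * page_count,
        "free_pages": freelist,
        # Rough waste estimate: the fraction of the file that is dead
        # pages a VACUUM would reclaim.  Btree-internal fragmentation
        # (half-empty pages) is not visible here.
        "free_fraction": freelist / page_count if page_count else 0.0,
    }
```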

The major solutions Taras found and advocates are increasing the page 
size to the maximum, 32K, and vacuuming periodically.  (btw, I misspoke 
in my previous message, Gloda uses a 1K page size right now, not 512 
bytes.)  
I had investigated using a larger page size of 4k previously when I 
boosted our SQLite cache size (hardcoded to a max of 8megs, but it could 
be adaptive like the Places' DB) but kept us at the (then) default of 1k 
based on input on 401985 that suggested performance was a) similar and 
actually superior for 1k on some platforms and b) the larger page size 
resulted in a significant increase in the size of the rollback journal.  
While the size of the rollback journal is not a direct concern, once the 
set of dirty pages exceeds the SQLite cache size it necessitates 
additional fsync operations to maintain correctness and that was 
seriously concerning (at least in pre-/non-write-ahead-log operation).  
The Places database did not share these problems: it tends to have 
smaller data sets and does not use the fulltext search mechanism, which 
generates a lot of the page churn.
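The two knobs discussed above look like this through Python's sqlite3 module; the specific numbers are illustrative, and note that page_size only takes effect on a fresh database (or after a VACUUM rebuilds an existing one):

```python
import sqlite3

def open_tuned(path, page_size=32768, cache_pages=2048):
    """Open a SQLite database with a larger page size and page cache."""
    conn = sqlite3.connect(path)
    # Only effective on a fresh database, or if followed by VACUUM.
    conn.execute(f"PRAGMA page_size = {page_size}")
    # Positive value = number of pages, so 2048 * 32K = 64 MiB of cache.
    # A bigger cache also delays the journal-spill fsyncs mentioned
    # above, since more dirty pages can stay in memory.
    conn.execute(f"PRAGMA cache_size = {cache_pages}")
    return conn
```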

Given the non-trivial size of most gloda databases, vacuuming is not a 
particularly viable option: it is blocking, would take a long time, and 
would generate what looks like an I/O storm.
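As a point of comparison, SQLite's built-in incremental auto-vacuum already allows reclaiming a bounded number of free pages per call, so the I/O can be throttled from an idle timer, though unlike a full VACUUM it only trims the freelist and does not defragment the btrees. A sketch:

```python
import sqlite3

def setup_incremental(path):
    """Open a database with incremental auto-vacuum enabled."""
    conn = sqlite3.connect(path)
    # Must be set before the first table exists, or be followed by a
    # one-time full VACUUM to convert an existing file.
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("VACUUM")
    return conn

def reclaim_some(conn, pages=100):
    # Frees at most `pages` pages and truncates the file; call
    # repeatedly during idle time to spread out the I/O cost.
    conn.execute(f"PRAGMA incremental_vacuum({pages})")
```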

The good news is that there are solutions to these problems.  My plan 
for the next major rev of gloda is to adopt a sharding technique that will:

- Increase disk locality by storing things you are likely to want 
together and things you don't appear to care about elsewhere.

- Reduce fragmentation by reducing turnover of page tables by storing 
old data in separate databases and new data with a good chance of 
turnover in dedicated, smaller databases.  These smaller databases can 
then be more easily vacuumed.  In fact, my plan is to do 'manual' 
incremental vacuuming by migrating blocks of records.  This will allow 
us to throttle our I/O activity while retaining a high level of 
responsiveness to new data and whatnot.

- Greatly reduce the time between issuing a fulltext query and getting 
back the first set of results (which will already have a good chance of 
being exactly what you want).
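The hot/cold split above can be mimicked with plain ATTACH. Here is a toy sketch in which recent data lives in a small, easily vacuumed "hot" database and queries hit it first, so the first results arrive quickly; the shard names and schema are hypothetical, not gloda's:

```python
import sqlite3

def open_sharded(hot_path, cold_path):
    """Open the hot shard as 'main' and attach the cold shard."""
    conn = sqlite3.connect(hot_path)
    conn.execute("ATTACH DATABASE ? AS cold", (cold_path,))
    for db in ("main", "cold"):
        conn.execute(f"CREATE TABLE IF NOT EXISTS {db}.messages"
                     "(id INTEGER PRIMARY KEY, body TEXT)")
    return conn

def search(conn, pattern):
    # Query the small hot shard first for fast initial results, then
    # fall back to the larger, read-mostly cold shard.
    yield from conn.execute(
        "SELECT id, body FROM main.messages WHERE body LIKE ?", (pattern,))
    yield from conn.execute(
        "SELECT id, body FROM cold.messages WHERE body LIKE ?", (pattern,))
```

Because the hot shard stays small, vacuuming it (or migrating its aged rows into the cold shard) remains cheap, which is the point of the scheme.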

So, to more directly answer your question:

1) We can definitely and easily increase the size allocated to the 
SQLite cache, but except for the journal spill problem, it's not clear 
that this will do a ton for us so I would need to see performance 
numbers.  There is a SQLite option that reads as much of the database 
as possible, starting from the beginning of the file, into the cache at 
startup (which we may already be using), but that won't get us much 
with a highly internally fragmented database.

2) I don't think there's an easy win on trying to get the OS to cache 
the gloda database without fixing the structural problems gloda has, 
short of encouraging it to cache the whole thing.  Also, it seems like 
the kind of thing that would prove more frustrating for users than 
having to wait a few extra seconds when they do a fulltext search.

In terms of when gloda will get these structural changes, I'm currently 
still focusing on the UI stuff that can leverage gloda's existing 
capabilities before overhauling gloda.  (No point having a backend 
without a way to get at its fanciness!)  While we are limited in our 
ability to improve the performance of fulltext search without those 
changes, there's a lot that can be accomplished with the existing gloda 
through judicious use of pre-fetching.


More information about the tb-planning mailing list