Thunderbird Conversations : feedback wanted!
asutherland at asutherland.org
Wed Dec 1 09:42:44 UTC 2010
On 11/30/2010 03:32 PM, Ben Bucksch wrote:
> Given that this is likely to be useful for many other callers, could
> this be implemented in gloda? If you think that would hurt in case of
> some callers, maybe that could be optional per connection/caller.
Yes, this would be a reasonable thing to have in the gloda core. There
are some queries that we currently are unable to keep up-to-date in
memory as indexing occurs (fulltext and LIKE constraints), but for most
relational queries that are not "frozen" to updates, it could be done.
Because Gloda already uses weak references to keep track of collections,
we could probably get a lot of mileage without actually impacting memory
usage. Holding collections alive in memory for longer could be
problematic, though; the benefit of a successful cache hit is directly
proportional to the amount of memory we are eating, which in turn is
proportional to the potential cost of erroneously keeping that memory alive.
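To make the weak-reference idea concrete, here is a minimal sketch in
Python (gloda itself is JavaScript, and these class and method names are
illustrative, not gloda's actual API):

```python
import weakref

# Sketch of a weak-reference collection registry, loosely modelling the
# idea above: live query results stay up to date as indexing occurs,
# but nothing is pinned in memory on the caller's behalf.
class Collection:
    def __init__(self, items):
        self.items = list(items)

class CollectionRegistry:
    def __init__(self):
        # Weak values: a collection drops out of the registry as soon
        # as its caller stops holding a strong reference to it.
        self._live = weakref.WeakValueDictionary()
        self._next_id = 0

    def register(self, collection):
        self._next_id += 1
        self._live[self._next_id] = collection
        return self._next_id

    def notify_new_item(self, item):
        # Only collections still referenced elsewhere receive updates;
        # dead collections cost nothing because they are already gone.
        for collection in self._live.values():
            collection.items.append(item)

registry = CollectionRegistry()
held = Collection([1, 2])           # the caller keeps this one alive
registry.register(held)
registry.register(Collection([3]))  # no strong reference; reclaimed (CPython)
registry.notify_new_item(42)
print(held.items)  # [1, 2, 42]
```

The trade-off described above shows up directly: anything the caller
still holds gets updated for free, and anything it dropped stops costing
memory.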
> Another option: Given that most gloda databases should be in the
> 10-100MB range (mine is 600 MB, but it's a big mailbox), and many
> machines today have 2+GB RAM *, is it possible to somehow tell the OS
> to pre-cache certain parts of gloda into RAM disk cache?
Taras Glek has done some really fantastic research about improving
SQLite performance that can be found on his blog
(http://blog.mozilla.com/tglek/). The key takeaway problem points for
me were that SQLite databases can get really fragmented (both from a
file-system perspective and an internal btree perspective) and operating
system pre-fetch is really limited even when you have high disk
locality. Given the fragmentation that can occur and that efficient
database queries do not exhibit linear access patterns (even on
unfragmented databases), pre-caching of the gloda database as it exists
is not likely to be helpful unless we can cache the whole thing.
The major solutions Taras found and advocates are increasing the page
size to the maximum, 32K, and vacuuming periodically. (By the way, I
misspoke in my
previous message, Gloda uses a 1K page size right now, not 512 bytes.)
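For reference, the page-size knob looks like this (a minimal sketch
using Python's sqlite3 stdlib rather than the Mozilla storage API gloda
actually goes through; the table is illustrative):

```python
import sqlite3

# Sketch: setting the SQLite page size, per Taras's recommendation.
# page_size only takes effect while the database is still empty; an
# existing database has to be rebuilt with VACUUM to adopt a new size.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA page_size = 32768")  # 32K, the maximum
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
print(page_size)  # 32768
```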
I had previously investigated using a larger 4K page size when I boosted
our SQLite cache size (hardcoded to a max of 8 MB, though it could be
made adaptive like the Places database's), but kept us at the (then)
default of 1K based on input on bug 401985 suggesting that a) performance
was similar, and actually superior for 1K on some platforms, and b) the
larger page size resulted in a significant increase in the size of the
rollback journal.
While the size of the rollback journal is not a direct concern, once the
set of dirty pages exceeds the SQLite cache size it necessitates
additional fsync operations to maintain correctness and that was
seriously concerning (at least in pre-/non-write-ahead-log operation).
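The cache-size knob mentioned above is set per connection; a minimal
sketch, again using Python's sqlite3 stdlib as a stand-in for the
Mozilla storage API:

```python
import sqlite3

# Sketch: enlarging the per-connection page cache. A negative cache_size
# is interpreted in KiB rather than in pages, so -8192 requests roughly
# 8 MB of cache regardless of the page size in use.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA cache_size = -8192")
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(cache_size)  # -8192
```

The journal-spill concern is exactly why this number matters: once the
set of dirty pages outgrows the cache, SQLite has to start writing them
out early.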
The Places database did not share these problems, since its data sets
tend to be smaller and it does not use the fulltext search mechanism,
which generates a lot of the page churn.
Given the non-trivial size of most gloda databases, vacuuming is not a
particularly viable option, since it is blocking, would take a long
time, and would generate what looks like an I/O storm.
The good news is that there are solutions to these problems. My plan
for the next major rev of gloda is to adopt a sharding technique that will:
- Increase disk locality by storing things you are likely to want
together and things you don't appear to care about elsewhere.
- Reduce fragmentation by reducing turnover of page tables by storing
old data in separate databases and new data with a good chance of
turnover in dedicated, smaller databases. These smaller databases can
then be more easily vacuumed. In fact, my plan is to do 'manual'
incremental vacuuming by migrating blocks of records. This will allow us
to throttle our I/O activity while also retaining a high level of
responsiveness to new data and whatnot.
- Greatly reduce the time between issuing a fulltext query and getting
back the first set of results (which will already have a good chance of
being exactly what you want).
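The 'manual' incremental vacuuming mentioned in the second point could
look roughly like this (a hypothetical Python sketch; the table and
column names are illustrative, not gloda's actual schema):

```python
import sqlite3

# Hypothetical sketch of incremental vacuuming by migration: copy
# records between databases in small batches so I/O can be throttled
# and the app stays responsive between batches.
def migrate_batch(src, dst, last_id, batch_size=100):
    rows = src.execute(
        "SELECT id, body FROM messages WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size)).fetchall()
    if not rows:
        return None  # nothing left; migration complete
    dst.executemany("INSERT INTO messages (id, body) VALUES (?, ?)", rows)
    dst.commit()        # each batch is its own transaction
    return rows[-1][0]  # resume point for the next batch

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
src.executemany("INSERT INTO messages VALUES (?, ?)",
                [(i, f"msg {i}") for i in range(1, 251)])
src.commit()

last = 0
while last is not None:
    last = migrate_batch(src, dst, last)
migrated = dst.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(migrated)  # 250
```

In a real implementation the batches would be spaced out over time (and
the source rows deleted as they migrate) rather than run in a tight
loop.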
So, to more directly answer your question:
1) We can definitely and easily increase the size allocated to the
SQLite cache, but except for the journal spill problem, it's not clear
that this will do a ton for us so I would need to see performance
numbers. There is a SQLite option that reads as much of the database as
will fit from the start of the file into the cache at startup (which we
may already be using), but that won't get us much with a highly
internally fragmented database.
2) I don't think there's an easy win on trying to get the OS to cache
the gloda database without fixing the structural problems gloda has,
short of encouraging it to cache the whole thing. Also, it seems like
the kind of thing that would prove more frustrating for users than
having to wait a few extra seconds when they do a fulltext search.
In terms of when gloda will get these structural changes, I'm currently
still focusing on the UI stuff that can leverage gloda's existing
capabilities before overhauling gloda. (No point having a backend
without a way to get at its fanciness!) While we are limited in our
ability to improve the performance of fulltext search without those
changes, there's a lot that can be accomplished with the existing gloda
through judicious use of pre-fetching.
More information about the tb-planning mailing list