So, I wrote some code to put places.sqlite data into Mentat

Richard Newman rnewman at mozilla.com
Tue Apr 17 15:00:41 UTC 2018


Thanks for doing this, Thom! It's been quite a while since I did something
similar.

A few quick notes on import:

   - We expect writes to be relatively slow (indeed, we expect everything
   to be relatively slow, because we haven't done much performance tuning
   yet), but that importer code is running one SQLite write transaction and
   one Mentat transact per place. Transactions are costly; fsyncs alone will
   add up. You should get a big speedup in the importer by sharing a single
   `begin_transaction` call across all places (there's a sketch of this
   after the list).

   However, don't expect to be able to tweak this code into running in
   milliseconds: a truly fast Places importer (and this is true for most
   SQL systems) will be fast because it uses knowledge of the structure of
   the input data, rather than forcing the transactor to do expensive work
   to figure it out. *E.g.*, it will probably be faster to import all the
   places, then use their IDs to import all the visits, allocating new IDs
   for each visit in advance, because we know that visits never merge. The
   transactor does lots of work to 'knit' incoming data into the existing
   graph, and the importer knows that it doesn't need to.

   There are a bunch of smaller optimizations you could make, too (*e.g.*,
   looking up attributes by keyword *once*, rather than tens of thousands
   of times), but I won't get into those here.

   - The database size is within expected parameters. A few multiples of
   Places isn't a concern for the leaders I've talked to; we're deliberately
   trading space for flexibility here. But you've used `:index true` on
   most attributes, which trades disk space for reverse lookup performance.
   There are a few issues tracked to make size/speed tradeoffs easier to
   twiddle, like this one to allow an automatic equivalent of `url_hash`
   <https://github.com/mozilla/mentat/issues/69>.

   There are two reasons why the disk size is larger than Places. The
   first is that Mentat's layout is more normalized than Places' schema;
   imagine a Places in which each visit type, date, and URL were tracked in
   a separate table. The second is that we're storing a log of all the
   data. Right after an import into an empty database the log is basically
   identical to the live data, just with different indices. If you want to
   see what a log-free mode would look like, you can `DELETE FROM
   transactions` <https://github.com/mozilla/mentat/issues/545> and vacuum;
   you should see about a 30% drop in file size.

   - It might be worth vacuuming the database after the import. Mentat
   won't do that for you, and the fragmentation might affect query times.
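
To make the batching point concrete, here's a rough sketch of the
single-transaction shape. Method names are from memory, and `read_places`,
the `Place` struct, and `:place/title` are stand-ins for whatever your
importer and schema actually use, so treat it as illustrative rather than
as the current API:

    use mentat::Store;

    struct Place { url: String, title: String }

    // Stand-in for however you read rows out of places.sqlite.
    fn read_places() -> Vec<Place> { Vec::new() }

    fn import_places(store: &mut Store) {
        // One write transaction for the whole import: sharing a single
        // begin_transaction amortizes the fsync across every place
        // instead of paying it once per row.
        let mut in_progress = store.begin_transaction().expect("begin");

        for place in read_places() {
            // Real code should escape these strings (or use the tx
            // builder) rather than interpolating them directly.
            in_progress
                .transact(format!(
                    "[{{:place/url \"{}\" :place/title \"{}\"}}]",
                    place.url, place.title
                ))
                .expect("transact");
        }

        // Nothing is durable until here: one commit, one fsync.
        in_progress.commit().expect("commit");
    }

A faster importer still would, per the above, transact all the places in
one pass and all the visits in a second, with visit IDs allocated up front.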


On the topic of the queries you ran:

   - Your 10-most-recent visits query is actually more precise than you
   think it is (it doesn't mean the same thing as the SQL!), which is why it's
   slower. You intend "fetch the most recent 10 visits", but it's actually
   "fetch me the entities which have a date, a place, and a URL, excluding
   those that are missing any, and give me the distinct URLs of the 10 most
   recent". That requires joins, and the joins are why your query is slow —
   they happen *inside* the limit, because they need to exclude partial
   data.

   The performance-equivalent way to do this is to just fetch the visit
   entity: `[:find ?visit :where [?visit :visit/date ?date] :order (desc
   ?date) :limit 10]`. If you also need the other fields, use pull, which
   I'm in the process of building now
   <https://github.com/mozilla/mentat/pull/638>, or grab them out of the
   cache or off disk in a second query. (There's a Rust sketch of this
   after the list.)

   The Mentat 2019® way of doing this would be to define a materialized
   view for this so you don't need to do any work *at all* at a
   performance-sensitive moment. We're some ways away from that.

   - For queries with times in the milliseconds, `.query_prepared` will
   give fairer timings (the sketch after this list shows the same thing
   from Rust). Mentat does strictly more work than SQLite: being a wrapper,
   it naturally does everything SQLite does, plus planning beforehand and
   result processing afterwards, so preparing a query is a little more
   important.

   - You should consider caching appropriate attributes. Adding a
   reverse-lookup cache for :place/url_hash might help your Facebook query:
   `.cache :place/url_hash reverse` in the CLI.
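
In the Rust API, the entity-only query and its prepared cousin look
roughly like this. `q_once`/`q_prepare` are from memory, so double-check
the signatures (and the exact shape of prepared-query reuse) against the
current `Queryable` trait before leaning on this:

    use mentat::Store;

    fn main() {
        let store = Store::open("places-import.db").expect("open");

        // Entity-only version of the ten-most-recent query: no joins
        // inside the limit, so evaluation can walk the :visit/date
        // index and stop after ten rows.
        let query = "[:find ?visit
                      :where [?visit :visit/date ?date]
                      :order (desc ?date)
                      :limit 10]";
        let _recent = store.q_once(query, None).expect("query");

        // Amortize parsing and planning when timing millisecond-scale
        // queries; this is roughly what the CLI's .query_prepared does.
        let mut prepared = store.q_prepare(query, None).expect("prepare");
        let _timed = prepared.run(None).expect("run");
    }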


More broadly: Places is a very relational system, and a very 'square' one
at that. For example, every visit has all three fields; there are never any
NULLs. That 'squareness' is why we originally proposed addressing Places'
workload by deriving the tabular representation from the log, preserving
the ability to write those square queries. (That could either be done inside
Mentat <https://github.com/mozilla/mentat/issues/33> or as a materialized
view in SQL.)
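
Purely to illustrate the shape rather than the mechanism, deriving a square
table can be as simple as replaying rows (however you fetched them out of
Mentat) into a plain SQLite table. The `visits` layout here is invented for
the example, and it assumes a reasonably recent `rusqlite`:

    use rusqlite::{params, Connection};

    // Replay (date, type, url) rows into a square table that
    // Places-style SQL can hit directly. The real design would
    // maintain this incrementally from the transaction log.
    fn materialize(rows: &[(i64, i64, String)]) -> rusqlite::Result<()> {
        let conn = Connection::open("square.db")?;
        conn.execute_batch(
            "CREATE TABLE IF NOT EXISTS visits (
                 visit_date INTEGER NOT NULL,
                 visit_type INTEGER NOT NULL,
                 url        TEXT NOT NULL
             );",
        )?;
        let mut insert = conn.prepare(
            "INSERT INTO visits (visit_date, visit_type, url)
             VALUES (?1, ?2, ?3)",
        )?;
        for (date, vtype, url) in rows {
            insert.execute(params![date, vtype, url])?;
        }
        Ok(())
    }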

I'm actually quite pleasantly surprised that you've been able to dump in a
non-trivial amount of data, skip vacuuming the DB, write "SQL-ey" queries,
and still get non-prepared, non-cache-using times that are within
rock-chucking distance of fast.

I think we can get significant improvements just by using what already
exists (*e.g.*, caching attributes and not writing SQL-style queries), and
there's already work tracked and underway that will go even further. The
final leap would be to derive the Places tables as we had originally
planned. Naturally that would get the same query performance as Places
itself (or better, because there'd be no writer contention!). We haven't
prioritized doing that work because leadership didn't consider it much of a
risk.

Keep researching! I'm keen to see this kind of thing continue.

Thanks,

-R

On Tue, Apr 17, 2018 at 1:18 AM, Thom Chiovoloni <tchiovoloni at mozilla.com>
wrote:

> In my 1:1 with Mark I talked about how I was a little concerned that
> (AFAIK) we hadn't tried to run Mentat at a scale equivalent to a user's
> Places database (I had also mentioned this a few times to various others).
> Eventually, if Mentat is to actually be the way forward for synced storage
> in Firefox, this seems necessary, and it would be good to find problems
> sooner rather than later.
>
> I wrote some hacky code to stuff a representative set of the
> `places.sqlite` data (the relevant-looking fields in your visit and place
> info) into the Mentat DB this evening. You can find the code here
> <https://github.com/thomcc/mentat-places-test>, and if you have a Rust dev
> environment set up it should be usable. The start of the README explains
> how to use it.
>
> I wrote a lot more details in the README, but the TLDR is that my findings
> were mixed. The performance is not very good, but it also wasn't so bad as
> to be unusable on a good machine. That said, I accidentally destroyed my
> places.sqlite while writing this code, so I only had a subset of my
> history to test on.
>
> Caveat 1: The code is hacky and doesn't handle errors robustly. It may or
> may not work for you. Ask me on IRC, file a bug, or reply if it doesn't.
> That said, its connection to Places is read-only, so your history/bookmarks
> shouldn't be at any risk.
>
> Caveat 2: I did this mostly out of curiosity, and I likely don't really
> know how to use Mentat properly, so my numbers should be taken with a
> grain of salt. Someone who knew what they were doing would probably get
> better numbers, but IDK how much better. I'm skeptical it would compare
> favorably to the current system, but it also might not need to.
>
> Anyway, thanks,
> Thom Chiovoloni