Data storage for remote content whitelist

Joshua Cranmer Pidgeot18 at gmail.com
Tue Sep 6 22:44:47 UTC 2011


On 9/6/2011 3:34 PM, Irving Reid wrote:
> I've been speaking to Blake (bwinton) and Mark (standard8) about bug 
> 457296, moving the "load remote content" white list out of the address 
> book and into a separate place. This is the list of addresses for 
> which the end user has clicked "always load remote content for 
> messages from <email-address>".
>
> Mark suggested that I open a discussion on tb-planning about how to 
> store the data; for the time being it's going to be a simple table of 
> display names and email addresses.
>
> We don't currently have much data on how end users are using this 
> list, how many entries it typically contains etc. We should probably 
> add this to test-pilot or whatever other user activity gathering tool 
> would be most appropriate.
>
> That said, we don't want to create another Mork database to hold this 
> list, so I'm looking for some architectural direction about what to 
> use. My first thoughts turned to sqlite, since that's what we're using 
> for Gloda. It could also be done with a simple text format (JSON, 
> perhaps), slurped into memory on first use and appended or rewritten 
> for updates; this would be fine unless a user had many thousands of 
> whitelist entries.
>
> Are there any other alternatives I should consider? The impression I 
> get from others is that Mork and RDF are on their way out. The profile 
> currently contains sqlite, RDF, xml, Mork, json, .js (prefs), .txt and 
> .dat files (containing name=value text entries).

For a base number, I have about 420 people in my address book, one of
the few parts of my profile that I don't take a shovel to and clean
out every once in a while. In lieu of any other numbers, that is a
first-pass answer to "how large is this dataset?" At that size, most
people's whitelists would probably not exceed 10 KiB; slurping the
data from disk for every message is actually somewhat feasible (it
would likely be hidden by network latency, and is almost surely no
more expensive than a memory cache lookup anyway).
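
A minimal sketch of what slurping on demand could look like; this is
plain Node-style TypeScript for illustration, not actual chrome code,
and the file name and blob shape are invented:

    import { readFileSync } from "fs";

    // Load the (small) whitelist and check one sender against it.
    // "remote-whitelist.json" is a hypothetical file holding a JSON
    // array of lower-cased email addresses.
    function isWhitelisted(sender: string, path: string): boolean {
      let entries: string[];
      try {
        entries = JSON.parse(readFileSync(path, "utf8"));
      } catch (e) {
        return false; // no whitelist yet, or an unreadable file
      }
      return entries.includes(sender.toLowerCase());
    }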

Of the various storage types, here are my experiences with them:
SQLite. As Taras has pointed out, SQLite has an atrocious memory lookup
pattern. There also appears to be a tendency to just waste space (right
now, empty places.sqlite files are initialized to a whopping 10 MiB,
although a single vacuum shrinks that back to a few hundred KiB).
Timing data for an SQL backend indicates that, as a {key,key}->value
store, it performs poorly, not to mention the high overhead of the API
needed to manage the database.

RDF. No, just no. From what I understand of your needs, RDF exposes the
wrong API (it is essentially a set of triples, while what you have is a
set of singlets).

XML. Again, XML is the wrong API--it performs best when working with 
semistructured data, whereas what you have is fully structured data.

Mork. The one thing going for this format is that it is insanely fast.
But the format itself is obnoxious, and everything is kept in memory.
And the API is somewhere between "diseased" and "insane."

JSON. Allows arbitrarily deep levels of hashtables as values, but has 
the downside that it needs to be kept in memory the entire time.
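
For concreteness, a hypothetical shape for such a blob (all field
names invented for illustration):

    {
      "version": 1,
      "allow": {
        "alice@example.com": { "addedBy": "user" },
        "news@example.org":  { "addedBy": "import" }
      }
    }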

Preferences. The problem here is that preferences are rather easily
accessible and are probably modified moderately often. It may also
affect the performance of other parts of the codebase by polluting the
global preference namespace.
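
For illustration (the pref name here is invented), the list would end
up as a line in prefs.js along these lines:

    user_pref("mail.remote_content.whitelist",
              "alice@example.com,news@example.org");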

Ad-hoc INI-ish file format. This places the entire management burden on
you, and makes the data harder to migrate in the future. On the other
hand, it is simple and flexible.
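
A hypothetical example of what such a file could look like:

    [whitelist]
    alice@example.com=Alice Example
    news@example.org=Example News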

IndexedDB. Judging from asuth's comments, this actually isn't usable
from chrome code. But it layers on top of a storage backend, so you
essentially get to leave the actual implementation up to other people.

LevelDB. Not in the tree yet [it currently has some issues working on
Windows], and it would only be accessible from C++ code. I've only done
a little bit of timing here, but it is quite fast.

Of all of these options, I would recommend going with a JSON blob, 
although perhaps with the atomic rename trick in case there is an 
inopportune power outage.
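
A minimal sketch of that trick, again in Node-style TypeScript rather
than actual chrome code, with invented names: write the whole blob to
a temporary file first, then rename it over the old copy, so a crash
mid-write leaves the previous whitelist intact.

    import { writeFileSync, renameSync } from "fs";

    function saveWhitelist(entries: string[], path: string): void {
      const tmp = path + ".tmp";
      // Write the full blob to a temporary file first. (A fully
      // paranoid version would also fsync it before the rename.)
      writeFileSync(tmp, JSON.stringify(entries, null, 2));
      // Renaming over the old file is atomic on the same filesystem,
      // so readers see either the old or the new whitelist, never a
      // half-written one.
      renameSync(tmp, path);
    }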

-- 
Joshua Cranmer
News submodule owner
DXR coauthor



