treilly at adobe.com
Fri Jan 4 14:27:09 PST 2008
> I'm going to have a long and rambly set of concerns here: treilly are you in
> the MV area? I will be in town Jan 29-31... is there a chance we could meet
> up to discuss this?
I'm in SF so I'll be around. I'm all for a space summit if
others are interested. Anyone else interested let me know.
> What kind of measurements are you doing for your current work?
None yet, the work flow is kinda like this:
1) meet code size goals (5kb arm)
2) meet memory usage goals (beat MMgc)
3) optimize, optimize, optimize
Most measurements obviously come at step 3.
> What are your plans for finalizers?
TT is almost at the point where it doesn't use them at all and it
really doesn't use them at all in my private branch. If you
think about it I there's no real difference between a linked list
presweep approach to finalization and GC provided finalization.
Its just a matter of who's doing the work, the collector or the
1) GC finalization is slower, it has to either check every object
or build its own lists
2) mutator finalization is better defined since there's some
control over order of operations and only objects that really
need finalization are operated on. In my experience a lot of
finalizers are noops (ie a disconnected Socket object).
3) finalization gets abused by mutators and cause problems
frequently doing things that don't need to be done in the
finalizer, forcing mutators to do handle finalizations via
presweep will probably better focus the finalizer code to what
really needs to be done.
In the player we have finalizers on every object, %99 don't need
it and the important cases (usually serializing about to die
object graphs to the filesystem/socket) have to be done in
presweep anyway because they need to see in tact object graphs.
Finalization just feels like a warm fuzzy C++ blanket that automatic
memory management grew out of.
> Does the allocator treat GC objects and explicit allocations the same? Here
> is my major concern: research indicates  that paging during GC
> drastically degrades performance.
The way the allocator works is largely up in the air, remember TT
is still in prototype mode and there's still a lot of leeway. But
the answer is yes they are stored together and that we think we only
want it this way for minimal/embedded configs.
There's two things to worry about, # of cache lines brought in
and hitting too many pages causing TLB misses or soft page
faults (let me know if my terminology isn't right). Putting a
mark bit in the object header of an opaque object would be
example of doing a bad job of the former. For non-opaque objects
I'm not sure what's better. At first blush it seems better to
store the bits with the object b/c you need to scan the object
anyway but if you have a highly connected object graph and end up
checking an objects mark status many times then it probably is
still advantageous to store the object bits elsewhere.
Spreading objects over too many pages is a problem too. MMgc
approached this by having page types. Ie all GC objects lived on
pages together and all non-GC object lived on other pages. DRC
and opaque objects also had pages of their own. MMgc started
with one page type and grew to where its at and improved each
The new library will have similiar mechanisms. It uses "regions"
which are kinda like jemalloc's run's except they are fixed
size (think ~32kb/64kb) and store objects of different sizes
contiguously. I was thinking for desktop builds opaque vs. non
opaque objects would live in separate regions. This would help
the hitting too many pages during scanning issue. At this point
I don't plan to store mark bits outside the object but I'm
willing to revisit.
The new approach has other things going for it:
1) objects are stored contiguously, if I allocate a couple 16
byte objects the chances are good they will live on the same
cache line. jemalloc can say this too but if I allocate a 32 byte
object followed by a 16 byte object and another 32 byte object in
our scheme they can be contiguous and in jemalloc they won't.
2) atom's/pointers are coalesced and stored contiguously as part
of our exact marking strategy, this should greatly reduce the #
of cache lines that need to be brought in and make them dense (ie
we'll need to read all of it). This applies only to VM laid out
objects of course which are the predominant object type in our
use cases (die C++).
I'm not worried about image data, video data, etc. Small opaque
data can be separated out into their own regions or become direct
OS mmap allocations if they are large enough (ala jemalloc huge
allocs). What I'm not sure about is allocations that aren't big
enough for huge treatment (128/256kb?), but are too big for my current
scheme (32kb). I'm not against a "large" object classification
for these object but I've got TT doing pretty well without it
thanks to the lot idea. TLSF can handle it but my currently fixed size
region scheme doesn't.
I'm convinced lots are the key to keeping external fragmentation
down and that they can be made to be performant. If I'm right
we'll never need the "large" object size and things will be
simpler. We don't know if I'm right yet ;-)
> Mozilla will not be able to switch much of its code to the lot
> abstraction immediately (and I'm a little skeptical about the
> abstraction in general)... if the primary reason for lots is to
> avoid memory fragmentation, couldn't we just use a better
> allocator? Hiding all memory access to a large object behind a
> lot abstraction layer seems like a pretty hefty price to pay.
What if it greatly reduces fragmentation and memory usage and the
performance penalties can be addressed? How many allocations
does Mozilla make that are greater than 32kb? What are they?
I'd like to see some object size frequency data on mozilla. For
the player and TT I can testify that we have a bell curve where
~%90 of allocations are between 32b and 256b. We've spent a good
bit of time doing two things:
- packing/aggregating frequent tiny allocations
- splitting large frequent allocations into a common object and
an uncommon on demand allocated auxiliary object.
I think these two things do more for performance, locality, heap
shrinkage and fragmentation reduction than allocator choice any
day of the week. That said if the player doesn't need "large" object
support and Mozilla does then I'd be happy to help you guys add it ;-)
More information about the Tamarin-devel