space management

Thomas Reilly treilly at adobe.com
Thu Dec 20 08:26:34 PST 2007


Part of our tamarin-tracing effort has been to re-evaluate MMgc.
There are aspects of MMgc's design that we knew right off the bat
would make it non-ideal for embedded devices.  We have a prototype in
the shop to replace MMgc, and I wanted to see what the folks who care
about space management think about it.

First, the problems with MMgc:

We don't think the 4kb block architecture is suitable for embedded
devices.  The size class (40) x memory type (3) model leads to pretty
large base level heaps even for trivial applications: if each size
class in each memory type touches even one 4kb block, that's 480kb
before the application has done anything interesting.  Running in a
fixed 128kb heap environment is never gonna happen with that model.

Interface/C++ issue.  MMgc didn't have a cleanly defined interface,
and a cleanly defined C interface is more appropriate for such a low
level library.  We'd like to have a clean, well-defined interface and
make it possible to reimplement MMgc on top of it alongside whatever
else we want to explore.  We hope to see more activity in the GC area
down the road (generational collection, for instance), and
pluggability will facilitate this.
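
For flavor, here's a rough sketch of the shape such a C interface
might take; these names are invented for illustration, not the
actual header:

    #include <stddef.h>

    /* Illustrative sketch only; not the real API. */
    typedef struct gc_heap gc_heap;   /* opaque heap handle */

    gc_heap *gc_heap_create(size_t reserve_bytes);
    void     gc_heap_destroy(gc_heap *heap);

    void *gc_alloc(gc_heap *heap, size_t size);        /* scanned, collected */
    void *gc_alloc_opaque(gc_heap *heap, size_t size); /* never scanned */
    void  gc_collect(gc_heap *heap);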

Too big!  Software can never be too small, but we have concrete code
size goals in mind for tamarin-tracing (hereafter TT), and MMgc was
becoming the fat kid in the room as the rest of the VM got smaller.
MMgc problem areas: a) the ZCT implementation, b) the managed
vs. unmanaged duplication (Fixed vs. GC).

DEBUG vs. Release behavior divergence.  Making things behave too
differently in DEBUG mode makes for lots of hard release-only bugs.
Things like zeroing, poisoning, and padding (for underwrite/overwrite
detection, back pointer graphs, etc.) work differently in debug and
release builds.  Padding, zeroing, and poisoning should be the same
in release and debug, and if you need any of those things there
should be switches to turn them on.  Some of this stuff may be
implementable outside the memory library using tools like valgrind
and dtrace (which works on any build with symbols).
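
As a sketch of what we mean by switches (the flag and function names
are hypothetical): poisoning compiled into every build, gated at
runtime instead of by #ifdef DEBUG:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical runtime flags; identical code in DEBUG and Release. */
    enum { GM_POISON_FREED = 1, GM_ZERO_NEW = 2 };
    static unsigned gm_flags;  /* set by the embedder or an env var */

    static void on_free(void *p, size_t size)
    {
        if (gm_flags & GM_POISON_FREED)
            memset(p, 0xBA, size);  /* same poison pattern in all builds */
    }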

Process-wide GCHeap.  We've witnessed in the player that if you have
two large applications and one goes away, it leaves virtual memory
looking like swiss cheese.  We want more compartmentalization of
address space.

MMgc spends most of its time in mark.  We want to explore exact
marking techniques to speed up marking.  Mostly this is because of
being conservative, but we can do more to speed it up than just going
exact (separating GC and non-GC pointers in the object layout,
prefetching, etc.).
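
As one illustration of the layout idea (a sketch, not actual Tamarin
code): if an object's GC pointers are grouped at the front, exact
marking becomes a tight loop over a prefix of the object, with no
per-field type tests:

    #include <stdint.h>

    typedef struct {
        uint32_t num_pointers;  /* GC pointers occupy slots [0, num_pointers) */
        /* pointer slots, then non-pointer data, follow this header */
    } exact_header;

    static void mark_object(exact_header *h, void (*mark)(void *))
    {
        void **slots = (void **)(h + 1);
        for (uint32_t i = 0; i < h->num_pointers; i++)
            if (slots[i])
                mark(slots[i]);  /* exact: every slot is a real pointer */
    }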

The DRC implementation.  The 4-byte-per-object penalty sucked, and
that space came out of the user's object; the memory manager should
manage the DRC bits internally.
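
A sketch of what "internally" could look like (an illustrative
layout, not the prototype's actual scheme): counts live in per-block
metadata found by address arithmetic, so the user's object pays
nothing:

    #include <stddef.h>
    #include <stdint.h>

    #define GM_BLOCK_SIZE 4096  /* illustrative; blocks are block-aligned */

    typedef struct {
        uint16_t obj_size;                     /* uniform object size per block */
        uint8_t  refcount[GM_BLOCK_SIZE / 8];  /* one count per object slot */
        /* object payload follows */
    } gm_block;

    static uint8_t *drc_count(void *obj)
    {
        gm_block *b =
            (gm_block *)((uintptr_t)obj & ~(uintptr_t)(GM_BLOCK_SIZE - 1));
        size_t slot = ((uintptr_t)obj - (uintptr_t)(b + 1)) / b->obj_size;
        return &b->refcount[slot];
    }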

Finalizers and the GC callbacks (presweep/postsweep/prereap) have
been problematic.  Ideally the memory manager won't have any
re-entrancy.  Presweep and finalizers are kind of redundant; we think
we want to remove finalization and force clients to make do with
presweep.

Wow that was cathartic, familiarity breeds contempt I guess.  So now
about the new hotness:

So the prototype I've been working on is basically a rework of MMgc
to address these issues.  We started with a juicy center based on
TLSF [1] and added the conservative/DRC-based collection around it.
TLSF works like traditional malloc implementations, but it scales to
larger objects well with a uniform algorithm, which keeps it small
(i.e., none of the tree stuff dlmalloc has for medium-size
allocations).
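
The two-level trick that keeps TLSF uniform is pure bit math: the
first-level index is the size's highest set bit, and the second level
linearly subdivides that power of two.  A sketch (assuming a
GCC-style count-leading-zeros builtin):

    #include <stddef.h>

    /* Sketch of TLSF's O(1) size-to-freelist mapping. */
    #define SL_BITS 4  /* 16 linear subdivisions per power of two */

    static void tlsf_mapping(size_t size, int *fl, int *sl)
    {
        if (size < (1u << SL_BITS)) {  /* tiny sizes: one linear range */
            *fl = 0;
            *sl = (int)size;
            return;
        }
        int msb = 63 - __builtin_clzll(size);  /* highest set bit */
        *fl = msb;
        *sl = (int)((size >> (msb - SL_BITS)) & ((1u << SL_BITS) - 1));
    }

Per-level bitmaps then make "find the first non-empty free list at
least this big" a couple of find-first-set instructions, which is why
the same algorithm serves small and large objects alike.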

Laundry list:

DRC was revamped and greatly simplified.  

GCHeap is gone; instead we get memory in bigger chunks, allocated and
freed directly from the OS.
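
Getting chunks straight from the OS is the usual mmap/VirtualAlloc
dance; a POSIX-flavored sketch:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Grab a chunk directly from the OS (a Windows build would use
       VirtualAlloc instead). */
    static void *os_get_chunk(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    static void os_free_chunk(void *p, size_t bytes)
    {
        munmap(p, bytes);  /* the address space goes back to the OS */
    }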

There's a splendid, clean API defined in a single header file, with
names based on Monty Python that we have divergent opinions about.

There's no more managed vs. unmanaged distinction for write barriers;
there's only one write barrier.  For Tamarin's purposes this means
AvmCore is just like any other object.  Unmanaged and managed objects
are handled the same, and both are scanned (unless they ask to be
opaque).  Unmanaged just means the GC will never free it.
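
In sketch form (the names are hypothetical), every pointer store goes
through the same function no matter what the container is:

    /* One barrier for everything; "unmanaged" containers take the
       same path.  The DRC stubs stand in for the real bookkeeping. */
    static void drc_inc(void *obj) { (void)obj; /* bump internal count */ }
    static void drc_dec(void *obj) { (void)obj; /* count may hit zero */ }

    static void gc_write(void **slot, void *value)
    {
        if (*slot)
            drc_dec(*slot);
        if (value)
            drc_inc(value);
        *slot = value;
    }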

The new library includes a "lot" abstraction to break up large
allocations.  The lot theory is that large allocations lead to
fragmentation, so the idea is to use lots for any allocation that may
be greater than kLotMaxAlloc; a lot breaks the allocation up into a
balanced tree of allocations no larger than kLotMaxAlloc.  The other
advantage is fast, controlled growth (no copying required, so you can
grow incrementally without the usual perf penalty).  Lots are the
underlying memory structure the VM uses for List/SortedMap/intern
tables/Arrays.  Hopefully InlineHashtables, Lir space, large Strings,
and large ScriptObjects will be added to the list soon.  The idea is
that with no large allocations ever hitting the allocator, it'll be
much less likely that we'll need to expand the heap to meet a memory
request.

Some embedded JVMs do this with arrays; we're just taking the concept
to the extreme.  I cooked up this spreadsheet to see how much data
can be stuffed into a lot at a particular depth:

	http://spreadsheets.google.com/pub?key=p-Z4pnL-Pu9OhYdDLIG_6Pg

Here's the math (M = kLotMaxAlloc, W = machine word size):

		depth 0: M
		depth 1: M * (M/W)
		depth 2: M * (M/W)^2
		depth n: M^(n+1) / W^n

So a kLotMaxAlloc of 4096 can stuff 4MB of data into a depth-1 lot
(taking W = 4).  Ed likes 4096 as the kLotMaxAlloc; I'm leaning
towards 2048.  We decided to let benchmarks settle the matter later
on, but whatever we pick will be some small integral division of the
page size.  The issues at play are locality (smaller lot, less
locality) and lookup speed, although the latter can be mitigated with
the lot_block_get API.  On small devices with little or no cache a
512/1024 block size would probably work, and on desktops/servers the
OS page size probably makes sense.  We've tested it down to 128
bytes.

The depth can also be thought of as the number of memory hits it
takes to get from a lot to the address of a particular piece of data.
So for depth 0 it's just math to calculate the address; at depth 1
you have to do one lookup, and so on.
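
A sketch of the lookup (a hypothetical layout, not the actual lot
API): at depth 1, one index block of pointers fans out to data
blocks:

    #include <stddef.h>

    #define M 4096  /* stand-in for kLotMaxAlloc */

    typedef struct {
        int    depth;  /* 0: data is a single M-byte block */
        void **data;   /* depth 1: M/sizeof(void*) pointers to M-byte blocks */
    } lot;

    static char *lot_byte(lot *l, size_t i)
    {
        if (l->depth == 0)
            return (char *)l->data + i;        /* depth 0: pure math */
        char *block = (char *)l->data[i / M];  /* depth 1: one memory hit */
        return block + (i % M);
    }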

So that's what's been done so far.  Ideas for improvement:

1) Accurate GC.  The heap is easy; we need to figure out native
function frames, the forth/Box stack, and the JIT stack.  The plan is
to dump our thoughts on this into a bug if you want to participate.

2) Trace JIT space optimizations.  If you're interested, read the
last couple of Andreas's blog entries [2].  Some ideas:

   Allocation folding: if two allocations have the same lifetime,
   coalesce them.  Basically objects created in a constructor and
   never updated afterwards.

   load elimination - eliminating reads of uninitialized data

   store elimination - eliminating writes of default values to
   uninitialized data

   allocation hoisting - when an object is allocated on each
   iteration and is garbage by the next, allocate it once in the loop
   header instead (see the sketch after this list)

   stack allocation - promote small non-escaping objects into local
   variables (iterators, Point/Rectangle), or for large objects
   generate an explicit delete
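
To make allocation hoisting concrete, here it is sketched at the C
level (the JIT would do this on trace IR; gc_alloc, Point, and
consume are made-up names):

    #include <stddef.h>

    typedef struct { double x, y; } Point;
    extern void *gc_alloc(size_t size);    /* hypothetical allocator */
    extern void  consume(const Point *p);  /* must not retain p */

    void before(const double *xs, const double *ys, int n)
    {
        for (int i = 0; i < n; i++) {
            Point *p = gc_alloc(sizeof(Point));  /* garbage next iteration */
            p->x = xs[i];
            p->y = ys[i];
            consume(p);
        }
    }

    void after(const double *xs, const double *ys, int n)
    {
        Point *p = gc_alloc(sizeof(Point));  /* hoisted: allocated once */
        for (int i = 0; i < n; i++) {
            p->x = xs[i];
            p->y = ys[i];
            consume(p);
        }
    }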

We're thinking aggressive memory optimizations in the JIT may even
allow us to dump DRC (although we have non-JIT targets in mind, so
DRC would just be turned off rather than removed).

3) Profile-guided optimizations:

   slot arrangement - put fields accessed together in time together
   in space

   slot demotion - use dynamic storage for infrequently accessed
   fields
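
For slot arrangement, the payoff is cache lines; an illustrative
before/after (the field names are invented):

    /* Before: declaration order splits the hot pair across cache lines. */
    struct cold_layout {
        void *hot_a;      /* touched every frame */
        char  cold[120];  /* profile says: rarely touched */
        void *hot_b;      /* touched every frame, a cache line away */
    };

    /* After profile-guided arrangement: the hot fields share a line. */
    struct hot_layout {
        void *hot_a;
        void *hot_b;
        char  cold[120];
    };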

Sorry for the long post.  Hopefully this prototype will land in
public soon as a tamarin-tracing branch; we'd like to air it out
publicly before committing it to TT.

[1] Two level segregated fit: http://rtportal.upv.es/rtmalloc/
[2] http://andreasgal.com/
