Robert O'Callahan robert at ocallahan.org
Tue May 27 15:36:54 PDT 2014

Chris' contract has finished so he'll have less time on rr, but things are
still happening :-).

I figured out that the issue causing test failures on some Haswell CPUs was
glibc using the new transactional memory ("RTM") feature for lock elision.
The underlying issue seems to be that sometimes, when a transaction aborts,
its effects on the RBC performance counter are not rolled back. rr now
works around this by overriding a few pthread locking functions to disable
lock elision.
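
To check whether a given Linux machine might be affected, one option is to
look for the "rtm" and "hle" flags in /proc/cpuinfo. The following is a
minimal sketch of that check, not part of rr; the function names tsx_flags
and read_cpuinfo are made up for illustration. It only reports what the CPU
advertises and does not show rr's actual workaround of overriding the
pthread locking functions.

```python
def tsx_flags(cpuinfo_text):
    """Return the subset of {'hle', 'rtm'} listed in a /proc/cpuinfo dump."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Each logical CPU has its own "flags" line; merge them all.
        if sep and key.strip() == "flags":
            found |= {"hle", "rtm"} & set(value.split())
    return found


def read_cpuinfo(path="/proc/cpuinfo"):
    # Linux-only; returns an empty string where the file does not exist.
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    flags = tsx_flags(read_cpuinfo())
    print("TSX features advertised:", sorted(flags) or "none")
```

If "rtm" shows up, the CPU advertises RTM, so a glibc built with lock
elision may use it on that machine.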

I discovered that on some recent Intel CPUs the hardware supports
triggering a fault when CPUID is executed. This would allow us to trap and
emulate CPUID, if only Linux had prctl support for that like it has for
RDTSC. At some point someone should write a kernel patch for that and get
it upstream, so that rr could eventually use it, both to stop advertising
support for hardware features we can't handle (like RTM) and to ensure
that CPUID results during replay match those during recording.
The latter would let us replay on any core, not just core 0, and also
improve our ability to transport traces across machines.
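
To make the idea concrete, here is a rough sketch of what an emulated CPUID
could return once the instruction can be trapped. It is purely hypothetical:
emulate_cpuid, hide_tsx and the "recorded" table are names invented for this
example, and nothing like this exists in rr today. The only hard facts used
are the bit positions: in CPUID leaf 7, subleaf 0, EBX bit 4 is HLE and EBX
bit 11 is RTM.

```python
# Feature bits in CPUID leaf 7, subleaf 0, register EBX.
HLE_BIT = 1 << 4
RTM_BIT = 1 << 11


def hide_tsx(ebx):
    """Clear the HLE and RTM bits so the tracee never sees them advertised."""
    return ebx & ~(HLE_BIT | RTM_BIT) & 0xFFFFFFFF


def emulate_cpuid(leaf, subleaf, recorded):
    """Answer a trapped CPUID from values saved at record time.

    `recorded` is a hypothetical dict mapping (leaf, subleaf) to the
    (eax, ebx, ecx, edx) tuple observed during recording. Returning the
    saved values makes replay independent of the core or machine it runs on.
    """
    eax, ebx, ecx, edx = recorded[(leaf, subleaf)]
    if leaf == 7 and subleaf == 0:
        ebx = hide_tsx(ebx)
    return eax, ebx, ecx, edx
```

For example, with recorded = {(7, 0): (0, 0x811, 0, 0)}, calling
emulate_cpuid(7, 0, recorded) gives (0, 0x1, 0, 0): the HLE and RTM bits are
gone and the remaining bit is untouched.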

We still have at least one known divergence failure on rr trunk, on Ehsan's
machine. I'm working on getting access so I can debug that.

I landed a large series of patches that enable manual checkpointing during
replay. The gdb commands "checkpoint", "delete checkpoint" and "restart"
are overridden to hook into rr's checkpointing machinery. The biggest
difficulty was making it possible to checkpoint and resume at any point in
the trace, in particular when not at an rr event boundary; this required
some refactoring and exposed some existing and new bugs. Creating
checkpoints is super fast, but resuming them (in a Firefox debug run) is
slow, because we do a gdb "run" command internally which reloads all
symbols. (We can fix this later by optimizing the case where no shared
libraries are loaded/unloaded between the current point and the
checkpoint.) But resuming a checkpoint is always faster than rerunning from
the start, of course.

We should do a new rr release soon with these improvements, but I would
like to try to fix Ehsan's bug first.
