improving replay performance
robert at ocallahan.org
Fri Nov 20 02:32:38 UTC 2015
Based on my usage of rr debugging Gecko, replay performance is a big issue.
Significant speedups of replay and reverse execution should translate
directly into productivity improvements. Previously I've assumed that most
bugs can be figured out by reverse-executing small distances in time, but
for some classes of bugs (e.g. leaks) that's not true.
There's some relatively low-hanging fruit here because we've never really
done any perf work on replay before. The main reasons replay is slow are
around syscall buffering (because almost all rr performance issues are
either fixed by or caused by syscall buffering :-) ). The #1 issue is that
during replay all system calls, including system calls that were "untraced"
during recording, trigger ptrace stops (two per syscall). So simple
syscalls like gettimeofday cause 2 context switches each during replay
instead of zero during recording. Even worse, may-block syscalls arm and
disarm the desched event counter, so e.g. each `read` syscall causes *6*
context switches during replay instead of zero during recording.
I have a fix for the desched arm/disarm issues. For these syscalls we just
want to do nothing during replay, since they have no side effects during
replay and return no results. So I made privileged untraced syscalls (the
desched arm/disarm syscalls are the only ones) no-ops during replay by providing a different
rr_page during replay where the syscall instruction for privileged untraced
syscalls is replaced by an instruction that just clears the syscall result
register. This turns out to simplify rr significantly since replay no
longer needs to be aware of descheds. My patches remove desched records
from the trace altogether. These patches have landed on master, which means
the trace format has been bumped. Tests pass but this is a somewhat risky
change. It does, however, speed up replaying my microbenchmark (continuous
non-blocking `read` syscalls) by a factor of 3, as you'd expect. This took
a couple of late nights...
Replaying that microbenchmark is still ten times slower than recording,
however. To close the gap we need to be able to replay a sequence of
buffered syscalls without any per-syscall traps to rr. This is tricky
because we want to preserve the invariant that we run the same code during
recording and replay. Here are some issues:
* Our syscall wrapper code saves the syscall result value to the syscallbuf
record during recording. Currently during replay we use ptrace to set the
syscall result register to the correct result value so that when the
wrapper code is replayed, that saving is a no-op. My idea for fixing this
is to have a global "replaying or recording" flag available to the wrapper
code and have the wrapper code use a conditional move based on that flag to
use either the value loaded from the syscallbuf record or the syscall
result register as the syscall result. This would mean small, temporary
divergences in data values between recording and replay, but control flow
and perf counters would be preserved.
* Some syscalls take in/out parameters. Currently their wrappers copy from
application parameters to the syscall buffer, pass the syscall pointers
into the syscall buffer, and then copy the results out again to the
application. Naively the first step would wipe out the syscall results
we've loaded for replay. We can fix this by using conditional moves in the
copy-to-buffer code so that during replay, the source address is replaced
with the destination so the copy is a no-op.
* After we've replayed a set of buffered syscalls, we need to induce a
ptrace stop so that rr can regain control and begin replaying the following
event. I've thought of many ways to do this, none of which are wholly
satisfying. One least-bad idea might be to create a family of N identical
no-op functions and a jump table, so that after processing syscallbuf
record #K the syscall wrappers call no-op function #K. Then rr can regain
control after K records have been processed by setting a breakpoint on
no-op function #K.
* That might be it...
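A rough C sketch of the three ideas in the list above, under stated assumptions. None of these names exist in rr: `in_replay`, `struct record`, `select_result`, `copy_to_buffer`, `finish_record` and the `stop` functions are all hypothetical. The real wrappers would probably need hand-written assembly to guarantee a conditional move rather than a branch, and the real stop functions would be pure no-ops with a breakpoint set on them. Here each one writes its index so the behaviour is observable without a debugger.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical global flag: 0 while recording, 1 while replaying. */
int in_replay = 0;

/* Hypothetical syscallbuf record: saved result plus in/out parameter data. */
struct record {
    long ret;
    char data[64];
};

/* Idea 1: choose the syscall result without changing control flow.
   Both values are loaded up front, so a compiler may lower the ternary
   to a conditional move, but C alone does not guarantee that. */
long select_result(const struct record *rec, long raw_ret) {
    long recorded = rec->ret;
    return in_replay ? recorded : raw_ret;
}

/* Idea 2: copy an in/out parameter into the buffer. During replay the
   source address is replaced by the destination, so the copy still runs
   but leaves the bytes loaded from the trace untouched. */
void copy_to_buffer(struct record *rec, const void *app_param, size_t len) {
    assert(len <= sizeof rec->data);
    const void *src = in_replay ? (const void *)rec->data : app_param;
    memmove(rec->data, src, len); /* memmove: src may equal dest */
}

/* Idea 3: a family of stop functions reached through a jump table.
   In the scheme described above these would be identical no-ops; the
   store to last_stop is only here to make the sketch testable. */
#define N_STOPS 4
int last_stop = -1;

void stop0(void) { last_stop = 0; }
void stop1(void) { last_stop = 1; }
void stop2(void) { last_stop = 2; }
void stop3(void) { last_stop = 3; }

void (*const stop_table[N_STOPS])(void) = { stop0, stop1, stop2, stop3 };

/* Called by the wrapper after processing syscallbuf record k. A debugger
   or tracer regains control after K records by breaking on stop function K. */
void finish_record(size_t k) {
    assert(k < N_STOPS);
    stop_table[k]();
}
```

With `in_replay` set, `select_result` returns the recorded value and `copy_to_buffer` leaves the record's data as loaded. With it clear, both use the live values. Either way the same instructions execute, which is the point of the invariant.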
Those improvements should make replay at least as fast as recording,
probably significantly faster in most cases. I've also got some ideas that
could significantly speed up reverse execution (without going full
omniscient). During reverse-continue I think we can use parallelism to
speed up the search for the most recent debugger stop. However there are
tradeoffs involved so it's not quite as appealing as just making replay fly.