x86 string instructions
me at kylehuey.com
Tue Feb 24 15:12:58 PST 2015
On Tue, Feb 24, 2015 at 3:00 PM, Robert O'Callahan <robert at ocallahan.org>
wrote:
> x86 REP-prefixed string instructions cause problems for rr because they
> are loops which don't increment the retired-conditional-branches counter,
> and this violates rr's assumption that the number of instruction executions
> between rbc ticks is bounded.
> For example "rep stosb" is sometimes used to fill large areas of memory.
> If we get a signal during this loop, then when advance_to tries to replay
> to the signal, it can't use breakpoints so it has to singlestep until it
> reaches the target state, which could be millions of singlesteps of the
> "rep stosb" instruction. This takes forever.
> The problem more often arises when reverse-executing to a watchpoint where
> the previous watchpoint firing was caused by a REP-prefixed string
> instruction. We need to singlestep forward to before the watchpoint firing
> (which is *after* the watchpoint firing, in reverse execution), and this
> takes forever. Even worse, the ReplayTimeline logic that orders program
> states also needs to do this kind of singlestepping to order the register
> states for a given value of the tick counter.
> To address these issues, we need to be able to execute many iterations of
> the string instruction in one go --- without overshooting specified
> destination state(s). The only way I can think of to do this is to compute
> how many iterations of the instruction we should execute, and then set a
> read/write watchpoint at a memory location that will be read/written near
> the end of those iterations. This is unfortunately quite complicated for
> several reasons:
> -- "cmpsb" and "scasb" instructions may terminate early due to comparison
> results, so we also need to set a breakpoint after the string instruction
> to catch early exits.
> -- Intel CPUs have a quirk where, when not singlestepping, multiple
> iterations of "rep stosb" (and other instructions) are coalesced to write
> up to 64 bytes at a time (on Ivy Bridge; maybe more on other
> microarchitectures). This effectively means watchpoints can fire late, so
> we need to pad.
> -- There are lots of variants of these instructions that work with words
> of different sizes. The direction can also be changed dynamically.
> -- We need this to work even when the debugger has its own
> watchpoints/breakpoints applied. This is a pain because we may not have a
> free watchpoint to use. Disambiguating user watchpoints from our internal
> watchpoint is also a problem. I think the solution has to be that we adjust
> the iteration count down so our fast-forward operation ends before any user
> watchpoint can fire ... though that means we need to encode enough of the
> string-instruction semantics to know when it would trigger arbitrary user
> watchpoints.
> I'm mainly writing this email to document the problem, but if anyone has
> any better ideas, let me know :-).
The obvious alternative is to investigate using a different counter, such
as instructions-retired. It might even be possible to use that only in
this situation, using retired-conditional-branches at all other times.
If we don't want to go that far, this solution seems good. We will need to
encode a bunch of architecture logic for the ARM port to implement single
stepping in software, so this won't be completely out of place.
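
To make the iteration-count idea above a bit more concrete, here is a rough
sketch in Python of how the fast-forward plan might be computed for a
single-pointer instruction such as "rep stosb". None of this is rr code: the
function names, the (start, length) representation of user watchpoints, and
the assumption that a watchpoint fires at most pad_bytes late (64, per the
Ivy Bridge observation) are all hypothetical. It ignores the second pointer
used by movs/cmps and the early-exit breakpoint needed for cmps/scas.

```python
def first_touching_iteration(ptr, size, backward, count, w_start, w_len):
    """Index of the first iteration whose element overlaps
    [w_start, w_start + w_len), or None if no iteration touches it."""
    w_end = w_start + w_len
    if backward:
        # DF=1: iteration i accesses [ptr - i*size, ptr - i*size + size).
        i = 0 if ptr < w_end else (ptr - w_end) // size + 1
        lo = ptr - i * size
    else:
        # DF=0: iteration i accesses [ptr + i*size, ptr + (i+1)*size).
        i = 0 if ptr + size > w_start else (w_start - ptr) // size
        lo = ptr + i * size
    hits = lo < w_end and lo + size > w_start
    return i if hits and i < count else None


def plan_fast_forward(ptr, count, size, backward, wanted,
                      user_watches=(), pad_bytes=64):
    """Return (watch_address, guaranteed_iterations), or None if the run is
    too short to fast-forward and should just be singlestepped.

    ptr          -- current value of the string pointer (e.g. EDI)
    count        -- remaining iterations (e.g. ECX)
    size         -- element size in bytes (1, 2, 4 or 8)
    wanted       -- iterations we want to execute without overshooting
    user_watches -- hypothetical list of (start, length) debugger watchpoints
    pad_bytes    -- assumed upper bound on how late a watchpoint can fire
    """
    safe = min(wanted, count)
    # Stop before any iteration that would trigger a user watchpoint.
    for w_start, w_len in user_watches:
        i = first_touching_iteration(ptr, size, backward, count,
                                     w_start, w_len)
        if i is not None:
            safe = min(safe, i)
    # The internal watchpoint may fire late, so watch an earlier element.
    pad_iters = -(-pad_bytes // size)  # ceiling division
    k = safe - 1 - pad_iters
    if k < 0:
        return None
    step = -size if backward else size
    return ptr + k * step, k + 1
```

After the internal watchpoint fires, somewhere between k + 1 and safe
iterations have run under these assumptions, and the remainder (at most
pad_iters + 1 iterations plus whatever was trimmed for user watchpoints)
would still be singlestepped.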