making bugs more reproducible under rr

Benoit Girard bgirard at mozilla.com
Sat Dec 5 23:21:00 UTC 2015


I think that's a fantastic idea. This week I ran into a bug that reproduced
intermittently, and only on the b2g emulator. The problem was that we computed
the TimeStamp for the next frame and shortly afterwards called
PostDelayedTask(nextFrame - now). The b2g emulator was intermittently so
slow that by the time we got to that call, the TimeDuration had become negative.
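
A minimal sketch of that failure mode, in Python rather than Gecko's C++.
The name delay_until is hypothetical, and clamping to zero is only one
possible way to handle a deadline that has already passed; it is not
necessarily what the real fix did.

```python
from datetime import datetime, timedelta


def delay_until(next_frame: datetime, now: datetime) -> timedelta:
    """Return the delay until next_frame, never less than zero.

    If the caller was slow and `now` is already past `next_frame`,
    the raw difference is negative, so clamp it to zero instead of
    passing a negative delay to a timer API.
    """
    return max(next_frame - now, timedelta(0))
```

On a fast machine the raw difference is a few milliseconds; on a slow one
it can go negative, which is the case the clamp covers.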

While this is one specific example, I think delaying/blocking/sleeping some
threads at various points while recording can be very useful for reproducing
this class of bugs, even if it has the net effect of slowing down the
recording. I'm starting to suspect this might partially explain why the
emulator tests are far more intermittent: execution there is so much
slower.

On Sat, Dec 5, 2015 at 2:21 PM, Robert O'Callahan <robert at ocallahan.org>
wrote:

> I think making bugs found in test automation more reproducible under rr is
> the next potentially-low-hanging fruit that could strongly benefit Mozilla,
> so I've been doing some experiments in that area. My approach is to
> identify specific intermittent test failure bugs which I think I
> understand, and figure out how to tweak rr to make them reproducible in a
> reasonable number of runs.
>
> First I made some obvious changes to the rr scheduler (conditional under
> an "rr chaos mode" switch). One is to make the context-switch interval
> random, with a distribution that provides many very short intervals and
> many very long ones. On the hunch that reproducing some bugs requires some
> threads to run often and other threads to be starved, I've also introduced
> a feature where periodically (the interval is random) we randomly reassign
> the priorities of all threads. (Those priorities are honoured strictly.)
>
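
The two scheduler changes described above can be sketched roughly as
follows. This is an illustrative Python model, not rr's actual code: the
function names, the tick ranges, the 50/50 split between short and long
intervals, and the use of a shuffle to hand out distinct priorities are
all assumptions.

```python
import random


def next_timeslice(rng, short_max=10, long_min=10_000, long_max=100_000):
    """Pick a context-switch interval in arbitrary ticks.

    The distribution is bimodal: roughly half the intervals are very
    short and half are very long (the exact ranges are made up).
    """
    if rng.random() < 0.5:
        return rng.randint(1, short_max)
    return rng.randint(long_min, long_max)


def reassign_priorities(rng, thread_ids):
    """Give every thread a fresh random priority (0 is the highest)."""
    priorities = list(range(len(thread_ids)))
    rng.shuffle(priorities)
    return dict(zip(thread_ids, priorities))


def pick_thread(runnable, priorities):
    """Honour priorities strictly: run the best-priority runnable thread."""
    return min(runnable, key=lambda t: priorities[t])
```

Because pick_thread only ever looks at runnable threads, a low-priority
thread still runs whenever it is the only runnable one.
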
> This was not enough to reproduce bug 121393 in 5000 runs. When I fixed the
> bug a while ago, I had guessed that if the ImageBridge thread stalls
> indefinitely while loading the test page, then the page's onload event can
> fire, the test script can run, and its 500ms setTimeout can expire, all before
> any video frames reach the compositor, so when we take our window snapshot
> the video is blank. I verified this by adding a sleep(2) call to the right
> place in ImageBridge (conditional so it only happens on the test page, not
> on the reference); this reproduces the bug on desktop.
>
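
A toy timing model of that race, in Python. Only the 500ms timeout and
the 2-second stall come from the description above; the onload time, the
frame latency and the function name are hypothetical values chosen for
illustration.

```python
def snapshot_is_blank(stall_ms, onload_ms, timeout_ms=500, frame_latency_ms=50):
    """Return True if the window snapshot is taken before any video frame arrives.

    stall_ms:         how long the ImageBridge thread is stalled
    onload_ms:        when the page's onload event fires (assumed value)
    timeout_ms:       the test's setTimeout delay before the snapshot
    frame_latency_ms: time for a frame to reach the compositor once the
                      ImageBridge thread runs (assumed value)
    """
    first_frame_ms = stall_ms + frame_latency_ms
    snapshot_ms = onload_ms + timeout_ms
    return first_frame_ms > snapshot_ms
```

With no stall the frame arrives well before the snapshot; with a
2000ms stall (the sleep(2) experiment) the snapshot comes first.
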
> I assume this wasn't reproducing under rr because while we wait for the
> timer to expire, the ImageBridge thread is the only runnable thread so we
> will run it no matter what random priorities have been assigned. To fix
> that, we need to sometimes ignore runnable threads while we wait for a
> timer to expire. One way to do that would be to just inject random sleeps
> into the rr scheduler, but that would slow down test running quite a bit in
> chaos mode, mostly to no effect. That might be OK when running multiple
> tests in parallel, but that's cumbersome to set up locally because of focus
> issues. On the other hand, we kinda need to actually make the test run
> longer if the "timers" that are expiring are actually outside the traced
> processes, e.g. in the test harness' HTTP server. So I'm still thinking
> about the best way to solve this problem.
>
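
One way to picture "sometimes ignore runnable threads while we wait for a
timer to expire" is the sketch below. It is a hypothetical Python model,
not a proposal for rr's real scheduler: choose_next, skip_probability and
the convention that returning None means "idle until the timer fires" are
all assumptions.

```python
import random


def choose_next(runnable, priorities, timer_pending, rng, skip_probability=0.25):
    """Return the thread to run next, or None to idle.

    When a timer is pending, occasionally pretend nothing is runnable so
    the timer can expire before a thread such as ImageBridge makes progress.
    """
    if not runnable:
        return None
    if timer_pending and rng.random() < skip_probability:
        return None  # deliberately starve runnable threads this round
    # Otherwise honour priorities strictly (0 is the highest).
    return min(runnable, key=lambda t: priorities[t])
```

Idling only when a timer is pending limits the slowdown compared with
unconditional random sleeps, but it does nothing for timers that live
outside the traced processes, such as in the test harness's HTTP server.
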
> Rob