making bugs more reproducible under rr

Robert O'Callahan robert at ocallahan.org
Sat Dec 5 19:21:16 UTC 2015


I think making bugs found in test automation more reproducible under rr is
the next potentially-low-hanging fruit that could strongly benefit Mozilla,
so I've been doing some experiments in that area. My approach is to
identify specific intermittent test failure bugs which I think I
understand, and figure out how to tweak rr to make them reproducible in a
reasonable number of runs.

First I made some obvious changes to the rr scheduler (conditional under an
"rr chaos mode" switch). One is to make the context-switch interval random,
with a distribution that provides many very short intervals and many very
long ones. On the hunch that reproducing some bugs requires some threads to
run often and other threads to be starved, I've also introduced a feature
where periodically (the interval is random) we randomly reassign the
priorities of all threads. (Those priorities are honoured strictly.)
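
To make the two scheduler changes above concrete, here is a rough Python
sketch. It is not rr's code: the function names, the tick units, the 50/50
split and the interval bounds are all made up for illustration. It only
shows the shape of the idea, which is a bimodal context-switch interval plus
a periodic random reshuffle of strictly honoured priorities.

```python
import random

def chaos_timeslice(rng, short_max=10, long_min=10_000, long_max=100_000,
                    p_short=0.5):
    """Pick a context-switch interval in abstract 'ticks' (hypothetical unit).

    The distribution is deliberately bimodal: many very short intervals
    and many very long ones, with nothing in between.
    """
    if rng.random() < p_short:
        return rng.randint(1, short_max)
    return rng.randint(long_min, long_max)

def reassign_priorities(rng, thread_ids):
    """Give every thread a fresh, distinct random priority.

    Lower number means higher priority (an assumption of this sketch).
    """
    priorities = list(range(len(thread_ids)))
    rng.shuffle(priorities)
    return dict(zip(thread_ids, priorities))

def pick_next(runnable, priorities):
    """Strict priorities: always run the best-priority runnable thread."""
    return min(runnable, key=lambda t: priorities[t])
```

A scheduler loop built on this would call reassign_priorities() at random
intervals and chaos_timeslice() each time it switches threads.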

This was not enough to reproduce bug 121393 in 5000 runs. When I fixed the
bug a while ago, I had guessed that if the ImageBridge thread stalls
indefinitely while loading the test page, then the page's onload event can
fire, the test script can run, and its 500ms setTimeout can expire, all
before any video frames reach the compositor, so when we take our window
snapshot the video is blank. I verified this by adding a sleep(2) call to the right
place in ImageBridge (conditional so it only happens on the test page, not
on the reference); this reproduces the bug on desktop.

I assume this wasn't reproducing under rr because while we wait for the
timer to expire, the ImageBridge thread is the only runnable thread so we
will run it no matter what random priorities have been assigned. To fix
that, we need to sometimes ignore runnable threads while we wait for a
timer to expire. One way to do that would be to just inject random sleeps
into the rr scheduler, but that would slow down test running quite a bit in
chaos mode, mostly to no effect. That might be OK when running multiple
tests in parallel, but that's cumbersome to set up locally because of focus
issues. On the other hand, we kinda need to actually make the test run
longer if the "timers" that are expiring are actually outside the traced
processes, e.g. in the test harness' HTTP server. So I'm still thinking
about the best way to solve this problem.
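
For illustration only, here is one possible shape for "sometimes ignore
runnable threads", as a Python sketch. This is not a design I've settled on
and not anything rr does; the names and the p_starve parameter are
hypothetical. The idea is that some threads are marked as starved for a
while, and the scheduler prefers to idle rather than run them, so that a
pending timer can expire first.

```python
import random

def choose_starved(rng, thread_ids, p_starve=0.25):
    """Randomly pick threads to refuse to run for the next chaos period.

    p_starve is a made-up tuning knob for this sketch.
    """
    return {t for t in thread_ids if rng.random() < p_starve}

def pick_next_with_starvation(runnable, priorities, starved):
    """Pick the best-priority runnable thread that is not being starved.

    Returns None when every runnable thread is starved. The caller would
    then idle and let real time pass, which is what allows a timer to
    fire before, say, the ImageBridge thread gets to run.
    Lower number means higher priority (an assumption of this sketch).
    """
    candidates = [t for t in runnable if t not in starved]
    if not candidates:
        return None
    return min(candidates, key=lambda t: priorities[t])
```

Idling whenever this returns None has the cost described above: it makes
test runs slower, and most of that extra time has no effect.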

Rob