Signal delivery missed when executed under rr

Robert O'Callahan robert at ocallahan.org
Fri Jun 19 03:23:34 UTC 2020


On Fri, Jun 19, 2020 at 3:11 PM Noah Misch <noah at leadboat.com> wrote:

> I've been using "rr record --chaos" (self-built at rr commit e89f84c7,
> running
> under kernel 4.19.98-1+deb10u1) to hunt for race conditions in a test
> suite[1].


Fun! Did you catch anything useful?

One of the traces, out of 396, appears to show a SIGUSR2 queued via
> kill(), yet never delivered to the recipient.  Have any of you seen
> symptoms
> like this?  Here's a normal sequence of events:
>
>     {
>       real_time:6015822.471460 global_time:343531, event:`SYSCALL: kill'
> (state:ENTERING_SYSCALL) tid:20801, ticks:1363861
>     rax:0xffffffffffffffda rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2
> rsi:0xc rdi:0x5143 rbp:0x3e rsp:0x681ff480 r8:0x7ffd4e9a1f90 r9:0x0 r10:0x0
> r11:0x246 r12:0x5585fb9086f0 r13:0x648753d35b00 r14:0x4 r15:0
>     x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0
> gs:0x0 orig_rax:0x3e fs_base:0x648753d35b80 gs_base:0x0
>     }
>     {
>       real_time:6015822.471489 global_time:343532, event:`SYSCALL: kill'
> (state:EXITING_SYSCALL) tid:20801, ticks:1363861
>     rax:0x0 rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc
> rdi:0x5143 rbp:0x3e rsp:0x681ff480 r8:0x7ffd4e9a1f90 r9:0x0 r10:0x0
> r11:0x246 r12:0x5585fb9086f0 r13:0x648753d35b00 r14:0x4 r15:0x0 rip:0x700000
>     02 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0
> orig_rax:0x3e fs_base:0x648753d35b80 gs_base:0x0
>     }
>     {
>       real_time:6015822.471524 global_time:343533, event:`SYSCALL:
> epoll_wait' (state:EXITING_SYSCALL) tid:20803, ticks:21329
>     rax:0xfffffffffffffffc rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x1
> rsi:0x5585fc37b6a8 rdi:0x4 rbp:0x0 rsp:0x681ff1f0 r8:0x5000004 r9:0x1
> r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37
>     b6a8 r15:0x7f12b8d3d05a rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b
> ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
>     }
>     {
>       real_time:6015822.471561 global_time:343534, event:`SIGNAL:
> SIGUSR2(async)' tid:20803, ticks:21329
>     rax:0xfffffffffffffffc rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x1
> rsi:0x5585fc37b6a8 rdi:0x4 rbp:0x0 rsp:0x681ff1f0 r8:0x5000004 r9:0x1
> r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37
>     b6a8 r15:0x7f12b8d3d05a rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b
> ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
>     }
>     {
>       real_time:6015822.471592 global_time:343535, event:`SIGNAL_HANDLER:
> SIGUSR2(async)' tid:20803, ticks:21329
>     rax:0x0 rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x681fec40
> rsi:0x681fed70 rdi:0xc rbp:0x0 rsp:0x681fec38 r8:0x5000004 r9:0x1
> r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37b6a8
> r15:0x7f12b8d3d05a rip:0x5585fb47a900 eflags:0x246 cs:0x33 ss:0x2b ds:0x0
> es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
> st0:0x0 st1:0x0 st2:0x0 st3:0x0 st4:0x0 st5:0x0 st6:0x0 st7:0x0 ymm0:0x0
> ymm1:0x0 ymm2:0x0 ymm3:0x0 ymm4:0x0 ymm5:0x0 ymm6:0x0 ymm7:0x0 ymm8:0x0
> ymm9:0x0 ymm10:0x0 ymm11:0x0 ymm12:0x0 ymm13:0x0 ymm14:0x0 ymm15:0x0
>     }
>
> In the failed run, the recipient-task (latter three) events don't appear.
>

Does the target process run after the kill() exits? Was the target process
definitely in epoll_wait()?

It inherits blocked SIGUSR2 at
>   fork, then calls sigemptyset(&set);sigprocmask(SIG_SETMASK, &set, NULL)
> to
>   unblock.  I set breakpoints on sigprocmask() and pthread_sigmask().  The
>   call triggered delivery of a pending SIGINT, then returned zero.  The
>   SIGUSR2, expected later, never arrived.  Hence, this theory did not pan
> out.
>

You mean breakpoints in the rr replay of the recording with the missing
signal? You should be able to definitely determine whether the sigprocmask
was cleared before the relevant kill() syscall was entered.

rr doesn't do anything special with SIGUSR2 and it doesn't emulate kill()
syscalls, so I'd be surprised if it had somehow caused SIGUSR2 to be lost.
I don't remember encountering any rr bugs like that.

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy ot
mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil eht. Efil
fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah ruo dna ta
dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw, draeh evah ew
hcihw, gninnigeb eht morf saw hcihw taht.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/rr-dev/attachments/20200619/9d6295ff/attachment.html>


More information about the rr-dev mailing list