Signal delivery missed when executed under rr

Noah Misch noah at leadboat.com
Fri Jun 19 03:06:08 UTC 2020


I've been using "rr record --chaos" (self-built at rr commit e89f84c7, running
under kernel 4.19.98-1+deb10u1) to hunt for race conditions in a test
suite[1].  One of the traces, out of 396, appears to show a SIGUSR2 queued via
kill(), yet never delivered to the recipient.  Have any of you seen symptoms
like this?  Here's a normal sequence of events:

    {
      real_time:6015822.471460 global_time:343531, event:`SYSCALL: kill' (state:ENTERING_SYSCALL) tid:20801, ticks:1363861
    rax:0xffffffffffffffda rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc rdi:0x5143 rbp:0x3e rsp:0x681ff480 r8:0x7ffd4e9a1f90 r9:0x0 r10:0x0 r11:0x246 r12:0x5585fb9086f0 r13:0x648753d35b00 r14:0x4 r15:0
    x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3e fs_base:0x648753d35b80 gs_base:0x0
    }
    {
      real_time:6015822.471489 global_time:343532, event:`SYSCALL: kill' (state:EXITING_SYSCALL) tid:20801, ticks:1363861
    rax:0x0 rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc rdi:0x5143 rbp:0x3e rsp:0x681ff480 r8:0x7ffd4e9a1f90 r9:0x0 r10:0x0 r11:0x246 r12:0x5585fb9086f0 r13:0x648753d35b00 r14:0x4 r15:0x0 rip:0x700000
    02 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3e fs_base:0x648753d35b80 gs_base:0x0
    }
    {
      real_time:6015822.471524 global_time:343533, event:`SYSCALL: epoll_wait' (state:EXITING_SYSCALL) tid:20803, ticks:21329
    rax:0xfffffffffffffffc rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x1 rsi:0x5585fc37b6a8 rdi:0x4 rbp:0x0 rsp:0x681ff1f0 r8:0x5000004 r9:0x1 r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37
    b6a8 r15:0x7f12b8d3d05a rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
    }
    {
      real_time:6015822.471561 global_time:343534, event:`SIGNAL: SIGUSR2(async)' tid:20803, ticks:21329
    rax:0xfffffffffffffffc rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x1 rsi:0x5585fc37b6a8 rdi:0x4 rbp:0x0 rsp:0x681ff1f0 r8:0x5000004 r9:0x1 r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37
    b6a8 r15:0x7f12b8d3d05a rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
    }
    {
      real_time:6015822.471592 global_time:343535, event:`SIGNAL_HANDLER: SIGUSR2(async)' tid:20803, ticks:21329
    rax:0x0 rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x681fec40 rsi:0x681fed70 rdi:0xc rbp:0x0 rsp:0x681fec38 r8:0x5000004 r9:0x1 r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37b6a8 r15:0x7f12b8d3d05a rip:0x5585fb47a900 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0 st0:0x0 st1:0x0 st2:0x0 st3:0x0 st4:0x0 st5:0x0 st6:0x0 st7:0x0 ymm0:0x0 ymm1:0x0 ymm2:0x0 ymm3:0x0 ymm4:0x0 ymm5:0x0 ymm6:0x0 ymm7:0x0 ymm8:0x0 ymm9:0x0 ymm10:0x0 ymm11:0x0 ymm12:0x0 ymm13:0x0 ymm14:0x0 ymm15:0x0
    }

In the failed run, the recipient-task (latter three) events don't appear.


I formed some theories that would explain the signal non-delivery as a bug in
the test suite or the software under test:

- Kernel might have failed the sender's kill().  Based on the control flow
  observed when stepping just beyond kill(), this did not happen.

- Recipient might have been blocking SIGUSR2.  It inherits blocked SIGUSR2 at
  fork, then calls sigemptyset(&set);sigprocmask(SIG_SETMASK, &set, NULL) to
  unblock.  I set breakpoints on sigprocmask() and pthread_sigmask().  The
  call triggered delivery of a pending SIGINT, then returned zero.  The
  SIGUSR2, expected later, never arrived.  Hence, this theory did not pan out.

Perhaps a kernel bug is possible.  Have you experienced bugs that suppress
signal delivery?

Thanks,
nm

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=tree;f=src/test/recovery;hb=a3b2bf1fe7


More information about the rr-dev mailing list