Signal delivery missed when executed under rr

Noah Misch noah at leadboat.com
Fri Jun 19 06:31:47 UTC 2020


On Fri, Jun 19, 2020 at 03:23:34PM +1200, Robert O'Callahan wrote:
> On Fri, Jun 19, 2020 at 3:11 PM Noah Misch <noah at leadboat.com> wrote:
> > I've been using "rr record --chaos" (self-built at rr commit e89f84c7, running
> > under kernel 4.19.98-1+deb10u1) to hunt for race conditions in a test
> > suite[1].
> > 
> Fun! Did you catch anything useful?

Not yet, but one of my collaborators exploited it recently:
https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p=6Hg@mail.gmail.com

> > One of the traces, out of 396, appears to show a SIGUSR2 queued via
> > kill(), yet never delivered to the recipient.  Have any of you seen symptoms
> > like this?  Here's a normal sequence of events:
> > 
> >     {
> >       real_time:6015822.471460 global_time:343531, event:`SYSCALL: kill' (state:ENTERING_SYSCALL) tid:20801, ticks:1363861
> >     rax:0xffffffffffffffda rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc rdi:0x5143 rbp:0x3e rsp:0x681ff480 r8:0x7ffd4e9a1f90 r9:0x0 r10:0x0 r11:0x246 r12:0x5585fb9086f0 r13:0x648753d35b00 r14:0x4 r15:0
> >     x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3e fs_base:0x648753d35b80 gs_base:0x0
> >     }
> >     {
> >       real_time:6015822.471489 global_time:343532, event:`SYSCALL: kill' (state:EXITING_SYSCALL) tid:20801, ticks:1363861
> >     rax:0x0 rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc rdi:0x5143 rbp:0x3e rsp:0x681ff480 r8:0x7ffd4e9a1f90 r9:0x0 r10:0x0 r11:0x246 r12:0x5585fb9086f0 r13:0x648753d35b00 r14:0x4 r15:0x0 rip:0x700000
> >     02 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3e fs_base:0x648753d35b80 gs_base:0x0
> >     }
> >     {
> >       real_time:6015822.471524 global_time:343533, event:`SYSCALL: epoll_wait' (state:EXITING_SYSCALL) tid:20803, ticks:21329
> >     rax:0xfffffffffffffffc rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x1 rsi:0x5585fc37b6a8 rdi:0x4 rbp:0x0 rsp:0x681ff1f0 r8:0x5000004 r9:0x1 r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37
> >     b6a8 r15:0x7f12b8d3d05a rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
> >     }
> >     {
> >       real_time:6015822.471561 global_time:343534, event:`SIGNAL: SIGUSR2(async)' tid:20803, ticks:21329
> >     rax:0xfffffffffffffffc rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x1 rsi:0x5585fc37b6a8 rdi:0x4 rbp:0x0 rsp:0x681ff1f0 r8:0x5000004 r9:0x1 r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37
> >     b6a8 r15:0x7f12b8d3d05a rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0
> >     }
> >     {
> >       real_time:6015822.471592 global_time:343535, event:`SIGNAL_HANDLER: SIGUSR2(async)' tid:20803, ticks:21329
> >     rax:0x0 rbx:0x681ff3b8 rcx:0xffffffffffffffff rdx:0x681fec40 rsi:0x681fed70 rdi:0xc rbp:0x0 rsp:0x681fec38 r8:0x5000004 r9:0x1 r10:0x493e0 r11:0x246 r12:0x7f12b8d3d04e r13:0x493e0 r14:0x5585fc37b6a8 r15:0x7f12b8d3d05a rip:0x5585fb47a900 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xe8 fs_base:0x648753d35b80 gs_base:0x0 st0:0x0 st1:0x0 st2:0x0 st3:0x0 st4:0x0 st5:0x0 st6:0x0 st7:0x0 ymm0:0x0 ymm1:0x0 ymm2:0x0 ymm3:0x0 ymm4:0x0 ymm5:0x0 ymm6:0x0 ymm7:0x0 ymm8:0x0 ymm9:0x0 ymm10:0x0 ymm11:0x0 ymm12:0x0 ymm13:0x0 ymm14:0x0 ymm15:0x0
> >     }
> > 
> > In the failed run, the recipient-task (latter three) events don't appear.
> > 
> > 
> Does the target process run after the kill() exits?

Yes.

> Was the target process definitely in epoll_wait()?

In the failed run, no, it was in select() instead.  More event detail from the
failed run (task 4322 calls kill(4324, SIGUSR2)):

{
  real_time:6016189.505132 global_time:4665052, event:`SYSCALL: openat' (state:EXITING_SYSCALL) tid:4324, ticks:19509
rax:0x5 rbx:0x681fdcf0 rcx:0xffffffffffffffff rdx:0x2 rsi:0x55ddd04078c8 rdi:0xffffff9c rbp:0x2 rsp:0x681fdb30 r8:0x20 r9:0x32e30b576d90 r10:0x0 r11:0x246 r12:0x55ddd04078c8 r13:0x55ddd04078c8 r14:0x55ddceaa91b0 r15:0x2 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x101 fs_base:0x4d5a302c4b80 gs_base:0x0
}
{
  real_time:6016189.505302 global_time:4665053, event:`SYSCALLBUF_FLUSH' tid:4324, ticks:19821
}
{
  real_time:6016189.505319 global_time:4665054, event:`SYSCALL: select' (state:ENTERING_SYSCALL) tid:4324, ticks:19821
rax:0xffffffffffffffda rbx:0x681fdf18 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x17 rsp:0x681fdd50 r8:0x681fe080 r9:0x0 r10:0x0 r11:0x246 r12:0xc0 r13:0x55ddd0407818 r14:0x2 r15:0x55ddd0407818 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x17 fs_base:0x4d5a302c4b80 gs_base:0x0
}
{
  real_time:6016189.505322 global_time:4665055, event:`SYSCALLBUF_RESET' tid:4324, ticks:19821
}

... various events of tid != 4324 ...

{
  real_time:6016189.589446 global_time:4666270, event:`SYSCALL: kill' (state:ENTERING_SYSCALL) tid:4322, ticks:1334744
rax:0xffffffffffffffda rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc rdi:0x10e4 rbp:0x3e rsp:0x681ff480 r8:0x7ffd2643ab10 r9:0x0 r10:0x0 r11:0x246 r12:0x55ddcec0e6f0 r13:0x4d5a302c4b00 r14:0x4 r15:0x2 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3e fs_base:0x4d5a302c4b80 gs_base:0x0
}
{
  real_time:6016189.589475 global_time:4666271, event:`SYSCALL: kill' (state:EXITING_SYSCALL) tid:4322, ticks:1334744
rax:0x0 rbx:0x681ff648 rcx:0xffffffffffffffff rdx:0x2 rsi:0xc rdi:0x10e4 rbp:0x3e rsp:0x681ff480 r8:0x7ffd2643ab10 r9:0x0 r10:0x0 r11:0x246 r12:0x55ddcec0e6f0 r13:0x4d5a302c4b00 r14:0x4 r15:0x2 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x3e fs_base:0x4d5a302c4b80 gs_base:0x0
}

... various events of tid != 4324 ...

{
  real_time:6016189.605956 global_time:4666291, event:`SYSCALL: select' (state:EXITING_SYSCALL) tid:4324, ticks:19821
rax:0x0 rbx:0x681fdf18 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x17 rsp:0x681fdd50 r8:0x681fe080 r9:0x0 r10:0x0 r11:0x246 r12:0xc0 r13:0x55ddd0407818 r14:0x2 r15:0x55ddd0407818 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x17 fs_base:0x4d5a302c4b80 gs_base:0x0
}
{
  real_time:6016189.606073 global_time:4666292, event:`SYSCALL: openat' (state:ENTERING_SYSCALL) tid:4324, ticks:20108
rax:0xffffffffffffffda rbx:0x681fdcf0 rcx:0xffffffffffffffff rdx:0x2 rsi:0x55ddd04078c8 rdi:0xffffff9c rbp:0x2 rsp:0x681fdb30 r8:0x20 r9:0x55ddd0401450 r10:0x0 r11:0x246 r12:0x55ddd04078c8 r13:0x55ddd04078c8 r14:0x55ddceaa91b0 r15:0x3 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x101 fs_base:0x4d5a302c4b80 gs_base:0x0
}
{
  real_time:6016189.606379 global_time:4666293, event:`SYSCALL: openat' (state:EXITING_SYSCALL) tid:4324, ticks:20108
rax:0x9 rbx:0x681fdcf0 rcx:0xffffffffffffffff rdx:0x2 rsi:0x55ddd04078c8 rdi:0xffffff9c rbp:0x2 rsp:0x681fdb30 r8:0x20 r9:0x55ddd0401450 r10:0x0 r11:0x246 r12:0x55ddd04078c8 r13:0x55ddd04078c8 r14:0x55ddceaa91b0 r15:0x3 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x101 fs_base:0x4d5a302c4b80 gs_base:0x0
}
{
  real_time:6016189.606576 global_time:4666294, event:`SYSCALLBUF_FLUSH' tid:4324, ticks:20420
}
{
  real_time:6016189.606598 global_time:4666295, event:`SYSCALL: select' (state:ENTERING_SYSCALL) tid:4324, ticks:20420
rax:0xffffffffffffffda rbx:0x681fdf18 rcx:0xffffffffffffffff rdx:0x0 rsi:0x0 rdi:0x0 rbp:0x17 rsp:0x681fdd50 r8:0x681fe080 r9:0x0 r10:0x0 r11:0x246 r12:0xc0 r13:0x55ddd0407818 r14:0x3 r15:0x55ddd0407818 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x17 fs_base:0x4d5a302c4b80 gs_base:0x0
}

... no SIGUSR2 targeting tid = 4324 ...

> > It inherits blocked SIGUSR2 at
> >   fork, then calls sigemptyset(&set);sigprocmask(SIG_SETMASK, &set, NULL) to
> >   unblock.  I set breakpoints on sigprocmask() and pthread_sigmask().  The
> >   call triggered delivery of a pending SIGINT, then returned zero.  The
> >   SIGUSR2, expected later, never arrived.  Hence, this theory did not pan out.
> > 
> > 
> You mean breakpoints in the rr replay of the recording with the missing signal?

Right.

> You should be able to definitely determine whether the sigprocmask was
> cleared before the relevant kill() syscall was entered.

sigprocmask() returns at event 4663756, so it precedes all the events listed
above.  (In some successful runs, the kill() does happen before the
sigprocmask() unblocks SIGUSR2.  That doesn't pose a problem; the SIGUSR2 is
then delivered before sigprocmask() returns.  To miss the signal this way,
SIGUSR2 would need to remain blocked for the life of the process.)


More information about the rr-dev mailing list