Optimising latch signals

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Optimising latch signals

Thomas Munro-5
Hi hackers,

Here are some more experimental patches to reduce system calls.

0001 skips sending signals when the recipient definitely isn't
waiting, using a new flag-and-memory-barrier dance.  This seems to
skip around 12% of the kill() calls for "make check", and probably
helps with some replication configurations that do a lot of
signalling.  Patch by Andres (with a small memory barrier adjustment
by me).

0002 gets rid of the latch self-pipe on Linux systems.

0003 does the same on *BSD/macOS systems.

The idea for 0002 and 0003 is to use a new dedicated signal just for
latch wakeups, and keep it blocked (Linux) or ignored (BSD), except
while waiting.  There may be other ways to achieve this without
bringing in a new signal, but it seemed important to leave SIGUSR1
unblocked for procsignals, and hard to figure out how to multiplex
with existing SIGUSR2 users, so for the first attempt at prototyping
this I arbitrarily chose SIGURG.

0001-Optimize-latches-to-send-fewer-signals.patch (3K) Download Attachment
0002-Remove-self-pipe-in-WAIT_USE_EPOLL-builds.patch (15K) Download Attachment
0003-Remove-self-pipe-in-WAIT_USE_KQUEUE-builds.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Sun, Aug 9, 2020 at 11:48 PM Thomas Munro <[hidden email]> wrote:

> Here are some more experimental patches to reduce system calls.
>
> 0001 skips sending signals when the recipient definitely isn't
> waiting, using a new flag-and-memory-barrier dance.  This seems to
> skip around 12% of the kill() calls for "make check", and probably
> helps with some replication configurations that do a lot of
> signalling.  Patch by Andres (with a small memory barrier adjustment
> by me).
>
> 0002 gets rid of the latch self-pipe on Linux systems.
>
> 0003 does the same on *BSD/macOS systems.
Here's a rebase over the recent signal handler/mask reorganisation.

Some thoughts, on looking at this again after a while:

1.  It's a bit clunky that pqinitmask() takes a new argument to say
whether SIGURG should be blocked; that's because the knowledge of
which latch implementation we're using is private to latch.c, and only
the epoll version needs to block it.  I wonder how to make that
tidier.
2.  It's a bit weird to have UnBlockSig (SIGURG remains blocked for
epoll builds) and UnBlockAllSig (SIGURG is also unblocked).  Maybe the
naming is confusing.
3.  Maybe it's strange to continue to use overloaded SIGUSR1 on
non-epoll, non-kqueue systems; perhaps we should use SIGURG
everywhere.
4.  As a nano-optimisation, SetLatch() on a latch the current process
owns might as well use raise(SIGURG) rather than kill().  This is
necessary to close races when SetLatch(MyLatch) runs in a signal
handler.  In other words, although this patch uses signal blocking to
close the race when other processes call SetLatch() and send us
SIGURG, there's still a race if, say, SIGINT is sent to the
checkpointer and it sets its own latch from its SIGINT handler
function; in the user context it may be in WaitEventSetWait() having
just seen latch->is_set == false, and now be about to enter
epoll_pwait()/kevent() after the signal handler returns, so we need to
give it a reason not to go to sleep.

By way of motivation for removing the self-pipe, and where possible
also the signal handler, here is a trace of the WAL writer handling
three requests to write data, on a FreeBSD system, with the patch:

kevent(9,0x0,0,{ SIGURG,... },1,{ 0.200000000 }) = 1 (0x1)
pwrite(4,"\b\M-Q\^E\0\^A\0\0\0\0\0\M-/\^\"...,8192,0xaf0000) = 8192 (0x2000)
kevent(9,0x0,0,{ SIGURG,... },1,{ 0.200000000 }) = 1 (0x1)
pwrite(4,"\b\M-Q\^E\0\^A\0\0\0\0 \M-/\^\\0"...,8192,0xaf2000) = 8192 (0x2000)
kevent(9,0x0,0,{ SIGURG,... },1,{ 0.200000000 }) = 1 (0x1)
pwrite(4,"\b\M-Q\^E\0\^A\0\0\0\0`\M-/\^\\0"...,8192,0xaf6000) = 8192 (0x2000)

Here is the same thing on unpatched master:

kevent(11,0x0,0,0x801c195b0,1,{ 0.200000000 })   ERR#4 'Interrupted system call'
SIGNAL 30 (SIGUSR1) code=SI_USER pid=66575 uid=1001
sigprocmask(SIG_SETMASK,{ SIGUSR1 },0x0)         = 0 (0x0)
write(10,"\0",1)                                 = 1 (0x1)
sigreturn(0x7fffffffc880)                        EJUSTRETURN
pwrite(4,"\b\M-Q\^E\0\^A\0\0\0\0`\M-w)\0\0"...,8192,0xf76000) = 8192 (0x2000)
kevent(11,0x0,0,{ 9,EVFILT_READ,0x0,0,0x1,0x801c19580 },1,{
0.200000000 }) = 1 (0x1)
read(9,"\0",16)                                  = 1 (0x1)
kevent(11,0x0,0,0x801c195b0,1,{ 0.200000000 })   ERR#4 'Interrupted system call'
SIGNAL 30 (SIGUSR1) code=SI_USER pid=66575 uid=1001
sigprocmask(SIG_SETMASK,{ SIGUSR1 },0x0)         = 0 (0x0)
write(10,"\0",1)                                 = 1 (0x1)
sigreturn(0x7fffffffc880)                        EJUSTRETURN
pwrite(4,"\b\M-Q\^E\0\^A\0\0\0\0 \M-y)\0\0"...,8192,0xf92000) = 8192 (0x2000)
kevent(11,0x0,0,{ 9,EVFILT_READ,0x0,0,0x1,0x801c19580 },1,{
0.200000000 }) = 1 (0x1)
SIGNAL 30 (SIGUSR1) code=SI_USER pid=66575 uid=1001
sigprocmask(SIG_SETMASK,{ SIGUSR1 },0x0)         = 0 (0x0)
write(10,"\0",1)                                 = 1 (0x1)
sigreturn(0x7fffffffc880)                        EJUSTRETURN
read(9,"\0\0",16)                                = 2 (0x2)
kevent(11,0x0,0,0x801c195b0,1,{ 0.200000000 })   ERR#4 'Interrupted system call'
SIGNAL 30 (SIGUSR1) code=SI_USER pid=66575 uid=1001
sigprocmask(SIG_SETMASK,{ SIGUSR1 },0x0)         = 0 (0x0)
write(10,"\0",1)                                 = 1 (0x1)
sigreturn(0x7fffffffc880)                        EJUSTRETURN
pwrite(4,"\b\M-Q\^E\0\^A\0\0\0\0\0\M-z)\0"...,8192,0xfa0000) = 8192 (0x2000)
kevent(11,0x0,0,{ 9,EVFILT_READ,0x0,0,0x1,0x801c19580 },1,{
0.200000000 }) = 1 (0x1)
read(9,"\0",16)                                  = 1 (0x1)

The improvement isn't quite as good on Linux, because as far as I can
tell you still have to have an (empty) signal handler installed and it
runs (can we find a way to avoid that?), but you still get to skip all
the pipe manipulation.

v2-0001-Optimize-latches-to-send-fewer-signals.patch (4K) Download Attachment
v2-0002-Remove-self-pipe-in-WAIT_USE_EPOLL-builds.patch (16K) Download Attachment
v2-0003-Remove-self-pipe-in-WAIT_USE_KQUEUE-builds.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Fri, Nov 13, 2020 at 12:42 PM Thomas Munro <[hidden email]> wrote:
> 1.  It's a bit clunky that pqinitmask() takes a new argument to say
> whether SIGURG should be blocked; that's because the knowledge of
> which latch implementation we're using is private to latch.c, and only
> the epoll version needs to block it.  I wonder how to make that
> tidier.

I found, I think, a better way: now InitializeLatchSupport() is in
charge of managing the signal handler and modifying the signal mask.

> 3.  Maybe it's strange to continue to use overloaded SIGUSR1 on
> non-epoll, non-kqueue systems; perhaps we should use SIGURG
> everywhere.

Fixed.

> The improvement isn't quite as good on Linux, because as far as I can
> tell you still have to have an (empty) signal handler installed and it
> runs (can we find a way to avoid that?), but you still get to skip all
> the pipe manipulation.

I received an off-list clue that we could use a signalfd, which I'd
discounted before because it still has to be drained; in fact the
overheads saved outweigh that so this seems like a better solution,
and I'm reliably informed that in a future WAIT_USE_IOURING mode it
should be possible to get rid of the read too, so it seems like a good
direction to go in.

I'll add this to the next commitfest.

v3-0001-Optimize-latches-to-send-fewer-signals.patch (4K) Download Attachment
v3-0002-Use-SIGURG-rather-than-SIGUSR1-for-latches.patch (10K) Download Attachment
v3-0003-Use-signalfd-for-epoll-latches.patch (14K) Download Attachment
v3-0004-Use-EVFILT_SIGNAL-for-kqueue-latches.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Thu, Nov 19, 2020 at 4:49 PM Thomas Munro <[hidden email]> wrote:
> I'll add this to the next commitfest.

Let's see if this version fixes the Windows compile error and warning
reported by cfbot.

v4-0001-Optimize-latches-to-send-fewer-signals.patch (3K) Download Attachment
v4-0002-Use-SIGURG-rather-than-SIGUSR1-for-latches.patch (10K) Download Attachment
v4-0003-Use-signalfd-for-epoll-latches.patch (16K) Download Attachment
v4-0004-Use-EVFILT_SIGNAL-for-kqueue-latches.patch (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
Here's a new version with two small changes from Andres:
1.  Reorder InitPostmasterChild() slightly to avoid hanging on
EXEC_BACKEND builds.
2.  Revert v2's use of raise(x) instead of kill(MyProcPid, x); glibc
manages to generate 5 syscalls for raise().

I'm planning to commit this soon if there are no objections.

v5-0001-Optimize-latches-to-send-fewer-signals.patch (3K) Download Attachment
v5-0002-Use-SIGURG-rather-than-SIGUSR1-for-latches.patch (10K) Download Attachment
v5-0003-Use-signalfd-for-epoll-latches.patch (18K) Download Attachment
v5-0004-Use-EVFILT_SIGNAL-for-kqueue-latches.patch (7K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Sat, Feb 27, 2021 at 12:04 AM Thomas Munro <[hidden email]> wrote:
> I'm planning to commit this soon if there are no objections.

Pushed, with the addition of an SFD_CLOEXEC flag for the signalfd.
Time to watch the buildfarm to find out if my speculation about
illumos is correct...


Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Mon, Mar 1, 2021 at 2:29 PM Thomas Munro <[hidden email]> wrote:
> Time to watch the buildfarm to find out if my speculation about
> illumos is correct...

I just heard that damselfly's host has been decommissioned with no
immediate plan for a replacement.  That was the last of the
Solaris-family animals testing master.  It may be some time before I
find out if my assumptions broke something on that OS...


Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Álvaro Herrera
On 2021-Mar-03, Thomas Munro wrote:

> On Mon, Mar 1, 2021 at 2:29 PM Thomas Munro <[hidden email]> wrote:
> > Time to watch the buildfarm to find out if my speculation about
> > illumos is correct...
>
> I just heard that damselfly's host has been decommissioned with no
> immediate plan for a replacement.  That was the last of the
> Solaris-family animals testing master.  It may be some time before I
> find out if my assumptions broke something on that OS...

Hi, I don't know if you realized but we have two new Illumos members
now (haddock and hake), and they're both failing initdb on signalfd()
problems.


--
Álvaro Herrera       Valdivia, Chile


Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Tue, Mar 9, 2021 at 12:20 PM Alvaro Herrera <[hidden email]> wrote:

> On 2021-Mar-03, Thomas Munro wrote:
> > On Mon, Mar 1, 2021 at 2:29 PM Thomas Munro <[hidden email]> wrote:
> > > Time to watch the buildfarm to find out if my speculation about
> > > illumos is correct...
> >
> > I just heard that damselfly's host has been decommissioned with no
> > immediate plan for a replacement.  That was the last of the
> > Solaris-family animals testing master.  It may be some time before I
> > find out if my assumptions broke something on that OS...
>
> Hi, I don't know if you realized but we have two new Illumos members
> now (haddock and hake), and they're both failing initdb on signalfd()
> problems.

Ah, cool.  I'd been discussing this with their owner, who saw my
message and wanted to provide replacements.  Nice to see these start
up even though I don't love the colour of their first results.  In
off-list emails, we got as far as determining that signalfd() fails on
illumos when running inside a zone (= container), because
/dev/signalfd is somehow not present.  Apparently it works when
running on the main host.  Tracing revealed that it's trying to open
that device and getting ENOENT here:

running bootstrap script ... FATAL:  XX000: signalfd() failed
LOCATION:  InitializeLatchSupport, latch.c:279

I'll wait a short time while he tries to see if that can be fixed (I
have no clue if it's a configuration problem in some kind of zone
creation scripts, or a bug, or what).  If not, my fallback plan will
be to change it to default to WAIT_USE_POLL on that OS until it can be
fixed.


Reply | Threaded
Open this post in threaded view
|

Re: Optimising latch signals

Thomas Munro-5
On Tue, Mar 9, 2021 at 1:09 PM Thomas Munro <[hidden email]> wrote:
> On Tue, Mar 9, 2021 at 12:20 PM Alvaro Herrera <[hidden email]> wrote:
> > Hi, I don't know if you realized but we have two new Illumos members
> > now (haddock and hake), and they're both failing initdb on signalfd()
> > problems.

> I'll wait a short time while he tries to see if that can be fixed (I

They're green now.  For the record: https://www.illumos.org/issues/13613