MultiXact\SLRU buffers configuration


MultiXact\SLRU buffers configuration

Andrey M. Borodin
Hi, hackers!

*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation, both of which are effectively full table scans.
The database workload contains a lot of select for share\select for update queries. I've tried to construct a synthetic workload generator but could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.

*** Question 1 ***
Is it safe to increase the number of buffers of MultiXact\All SLRUs, recompile and run the database as usual?
I cannot experiment much with production, but I'm mostly sure that bigger buffers will solve the problem.
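
Just for reference, these are the compile-time constants I mean (paraphrased from src/include/access/multixact.h, so treat the exact spelling as approximate; the defaults match what I see):

	/* Paraphrased from src/include/access/multixact.h: the SLRU buffer
	 * counts I would bump before recompiling. */
	#define NUM_MXACTOFFSET_BUFFERS		8	/* MultiXact offsets SLRU pages */
	#define NUM_MXACTMEMBER_BUFFERS		16	/* MultiXact members SLRU pages */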

*** Question 2 ***
Perhaps we could add GUCs for the SLRU sizes? Are there any reasons not to make them configurable? I think multixacts, clog, subtransactions and others would benefit from bigger buffers. But, probably, too many knobs can be confusing.
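
To make it concrete, here is a rough sketch of what such a GUC could look like in guc.c's ConfigureNamesInt[] (the name, context and bounds below are placeholders only, not a worked-out proposal):

	{
		{"multixact_offsets_slru_buffers", PGC_POSTMASTER, RESOURCES_MEM,
			gettext_noop("Sets the number of shared buffers used for the MultiXact offsets SLRU."),
			NULL
		},
		&multixact_offsets_slru_buffers,	/* would replace NUM_MXACTOFFSET_BUFFERS */
		8, 4, 1024,							/* boot value, min, max - placeholders */
		NULL, NULL, NULL
	},

MultiXactShmemInit() would then pass the variable to SimpleLruInit() instead of the compile-time constant.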

*** Question 3 ***
The MultiXact offset lock is always taken as an exclusive lock, which effectively makes the MultiXact Offset subsystem single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
Probably, I could just add LWLocks for each offset buffer page. Is that something worth doing? Or are there any hidden caveats and difficulties?
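
For reference, the exclusive lock I'm talking about is taken even for a pure read of the offsets SLRU; the path looks roughly like this (paraphrasing GetMultiXactIdMembers() in multixact.c, simplified):

	/* Even a read takes the control lock exclusively, because
	 * SimpleLruReadPage() may have to evict and replace a buffer page. */
	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);

	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
	offptr += entryno;
	offset = *offptr;

	LWLockRelease(MultiXactOffsetControlLock);

So every reader serialises on MultiXactOffsetControlLock, which is what I'd like to relax.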

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Andrey M. Borodin


> On 8 May 2020, at 21:36, Andrey M. Borodin <[hidden email]> wrote:
>
> *** The problem ***
> I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation, both of which are effectively full table scans.
> The database workload contains a lot of select for share\select for update queries. I've tried to construct a synthetic workload generator but could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
>
> *** Question 1 ***
> Is it safe to increase the number of buffers of MultiXact\All SLRUs, recompile and run the database as usual?
> I cannot experiment much with production, but I'm mostly sure that bigger buffers will solve the problem.
>
> *** Question 2 ***
> Perhaps we could add GUCs for the SLRU sizes? Are there any reasons not to make them configurable? I think multixacts, clog, subtransactions and others would benefit from bigger buffers. But, probably, too many knobs can be confusing.
>
> *** Question 3 ***
> The MultiXact offset lock is always taken as an exclusive lock, which effectively makes the MultiXact Offset subsystem single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
> Probably, I could just add LWLocks for each offset buffer page. Is that something worth doing? Or are there any hidden caveats and difficulties?
I've created a benchmark[0] imitating MultiXact pressure on my laptop: 7 clients concurrently run the query "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers, so that each tuple in the table gets locked by a different set of XIDs. During this benchmark I observe contention on MultiXactOffsetControlLock in pg_stat_activity:

                                    Friday, 8 May 2020 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                       query                        
-------+----------------------------+-----------------+--------+----------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark measures the time needed to execute the select for update 42 times.

I went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduce locking level to shared on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark times grow very high. Yet, very big buffers slow down the benchmark too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16 respectively).
2. The lock optimisation increases performance by 5% on default SLRU sizes. Actually, the benchmark does not explicitly read MultiXactId members, but when it replaces one multixact with another it has to read the previous member set. I understand that we can construct a benchmark to demonstrate the dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it just makes sense to try to take a shared lock for reading.
3. Manipulating the cache size does not affect the benchmark at all. That's somewhat expected: the benchmark is designed to defeat the cache, otherwise OffsetControlLock would not be stressed.

For our workload, I think we will just increase the SLRU sizes. But the patch set may be useful for tuning and as a performance optimisation of MultiXact.

Also, MultiXacts seem not to be a very good fit for the SLRU design. I think it would be better to use a B-tree as a container. Or at least make MultiXact members extendable in place (reserve some space when a multixact is created).
Currently, when we want to extend the set of locks for a tuple, we:
1. Iterate through all SLRU buffers for offsets to read the current offset (holding the exclusive lock for offsets)
2. Iterate through all buffers for members to find the current members (holding the exclusive lock for members)
3. Create a new members array with one more xid
4. Iterate through the local cache to see whether an entry like the one we are about to create already exists
5. Repeat step 1, this time for the write
6. Repeat step 2, this time for the write

Obviously this does not scale well - we cannot keep increasing SLRU sizes forever.

Thanks! I'd be happy to hear any feedback.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress

v1-0001-Add-GUCs-to-tune-MultiXact-SLRUs.patch (8K)
v1-0002-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (10K)
v1-0003-Make-MultiXact-local-cache-size-configurable.patch (5K)

Re: MultiXact\SLRU buffers configuration

Andrey M. Borodin


> On 11 May 2020, at 16:17, Andrey M. Borodin <[hidden email]> wrote:
>
> I went ahead and created 3 patches:
> 1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
> 2. Reduce locking level to shared on read of MultiXactId members
> 3. Configurable cache size

I'm looking more at MultiXact and it seems to me that we have a race condition there.

When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock

When we read MultiXact we do:
1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
3. Retrieve members from SLRU under MultiXactMemberControlLock
4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
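
The sleep in step 2 corresponds to this existing retry in GetMultiXactIdMembers() (paraphrased from multixact.c):

	if (nextMXOffset == 0)
	{
		/* Corner case 2: next multixact is still being filled in */
		LWLockRelease(MultiXactOffsetControlLock);
		CHECK_FOR_INTERRUPTS();
		pg_usleep(1000L);
		goto retry;
	}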

What am I missing?

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

>
>
> > On 11 May 2020, at 16:17, Andrey M. Borodin <[hidden email]> wrote:
> >
> > I went ahead and created 3 patches:
> > 1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
> > 2. Reduce locking level to shared on read of MultiXactId members
> > 3. Configurable cache size
>
> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>
> When we create a new MultiXact we do:
> 1. Generate new MultiXactId under MultiXactGenLock
> 2. Record new mxid with members and offset to WAL
> 3. Write offset to SLRU under MultiXactOffsetControlLock
> 4. Write members to SLRU under MultiXactMemberControlLock

But, don't we hold exclusive lock on the buffer through all the steps
above?

> When we read MultiXact we do:
> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.

So transactions never see such incomplete mxids, I believe.

> What am I missing?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

Andrey M. Borodin


> On 14 May 2020, at 06:25, Kyotaro Horiguchi <[hidden email]> wrote:
>
> At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
>>
>>
>>> On 11 May 2020, at 16:17, Andrey M. Borodin <[hidden email]> wrote:
>>>
>>> I went ahead and created 3 patches:
>>> 1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
>>> 2. Reduce locking level to shared on read of MultiXactId members
>>> 3. Configurable cache size
>>
>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>
>> When we create a new MultiXact we do:
>> 1. Generate new MultiXactId under MultiXactGenLock
>> 2. Record new mxid with members and offset to WAL
>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>> 4. Write members to SLRU under MultiXactMemberControlLock
>
> But, don't we hold exclusive lock on the buffer through all the steps
> above?
Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.

>> When we read MultiXact we do:
>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>
> So transactions never see such incomplete mxids, I believe.
I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
Maybe we could add a lock on the standby to get rid of this 1000us wait? Sometimes it hits standbys hard: if someone is locking a whole table on the primary, all seq scans on the standbys follow it with MultiXactOffsetControlLock contention.

It looks like this:
#0  0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x000056186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2  0x000056186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=<optimized out>, members=0x7ffd83377080, multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370
#3  GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4  0x000056186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at ./build/../src/backend/access/heap/heapam.c:7039
#5  0x000056186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at ./build/../src/backend/access/heap/heapam.c:7080
#6  0x000056186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0x56186f44a058, buffer=230684) at ./build/../src/backend/utils/time/tqual.c:1091
#7  0x000056186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at ./build/../src/backend/access/heap/heapam.c:439
#8  0x000056186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at ./build/../src/backend/access/heap/heapam.c:1034
#9  heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at ./build/../src/backend/access/heap/heapam.c:1801
#10 0x000056186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x000056186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 <SeqRecheck>, accessMtd=0x56186de84f20 <SeqNext>, node=0x56186f4a4f78) at ./build/../src/backend/executor/execScan.c:97
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 <SeqNext>, recheckMtd=0x56186de84ef0 <SeqRecheck>) at ./build/../src/backend/executor/execScan.c:164


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

> >> I'm looking more at MultiXact and it seems to me that we have a race condition there.
> >>
> >> When we create a new MultiXact we do:
> >> 1. Generate new MultiXactId under MultiXactGenLock
> >> 2. Record new mxid with members and offset to WAL
> >> 3. Write offset to SLRU under MultiXactOffsetControlLock
> >> 4. Write members to SLRU under MultiXactMemberControlLock
> >
> > But, don't we hold exclusive lock on the buffer through all the steps
> > above?
> Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.

Ah, right. I looked from GetNewMultiXactId. Actually
XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
the creating mxact id. And GetMultiXactIdMembers is considering that
case.

> >> When we read MultiXact we do:
> >> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> >> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
> >> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> >> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
> >
> > So transactions never see such incomplete mxids, I believe.
> I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
> Maybe we could add a lock on the standby to get rid of this 1000us wait? Sometimes it hits standbys hard: if someone is locking a whole table on the primary, all seq scans on the standbys follow it with MultiXactOffsetControlLock contention.

GetMultiXactIdMembers believes that step 4 is successfully done if step 2
returned a valid offset, but actually that is not obvious.

If we add a single giant lock just to isolate, say,
GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
unnecessarily.  Perhaps we need a finer-grained locking key for standby
that works similarly to the buffer lock on primary, one that doesn't cause
conflicts between irrelevant mxids.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

Andrey M. Borodin


> On 14 May 2020, at 11:16, Kyotaro Horiguchi <[hidden email]> wrote:
>
> At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
>>>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>>>
>>>> When we create a new MultiXact we do:
>>>> 1. Generate new MultiXactId under MultiXactGenLock
>>>> 2. Record new mxid with members and offset to WAL
>>>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>>>> 4. Write members to SLRU under MultiXactMemberControlLock
>>>
>>> But, don't we hold exclusive lock on the buffer through all the steps
>>> above?
>> Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby sees the tuple as alive.
>
> Ah, right. I looked from GetNewMultiXactId. Actually
> XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
> the creating mxact id. And GetMultiXactIdMembers is considering that
> case.
>
>>>> When we read MultiXact we do:
>>>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>>>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>>>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>>>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>>>
>>> So transactions never see such incomplete mxids, I believe.
>> I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
>> Maybe we could add a lock on the standby to get rid of this 1000us wait? Sometimes it hits standbys hard: if someone is locking a whole table on the primary, all seq scans on the standbys follow it with MultiXactOffsetControlLock contention.
>
> GetMultiXactIdMembers believes that step 4 is successfully done if step 2
> returned a valid offset, but actually that is not obvious.
>
> If we add a single giant lock just to isolate, say,
> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> unnecessarily.  Perhaps we need a finer-grained locking key for standby
> that works similarly to the buffer lock on primary, one that doesn't cause
> conflicts between irrelevant mxids.
>
We can just replay members before offsets. If offset is already there - members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.

Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
So instead of fix, we, probably, can just add a comment. If this reasoning is correct.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

> > GetMultiXactIdMembers believes that step 4 is successfully done if step 2
> > returned a valid offset, but actually that is not obvious.
> >
> > If we add a single giant lock just to isolate, say,
> > GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> > unnecessarily.  Perhaps we need a finer-grained locking key for standby
> > that works similarly to the buffer lock on primary, one that doesn't cause
> > conflicts between irrelevant mxids.
> >
> We can just replay members before offsets. If offset is already there - members are there too.
> But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.
Generally in such cases, condition variables would work.  In the
attached PoC, the reader side gets no penalty in the "likely" code
path.  The writer side always calls ConditionVariableBroadcast but the
waiter list is empty in almost all cases.  But I couldn't cause the
situation where the sleep 1000u is reached.

> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
> So instead of fix, we, probably, can just add a comment. If this reasoning is correct.

The step 4 of the reader side reads the members of the target mxid. It
is already written if the offset of the *next* mxid is filled-in.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..9db8f6cddd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
  /* support for members anti-wraparound measures */
  MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
 
+ ConditionVariable nextoff_cv;
  /*
  * Per-backend data starts here.  We have two arrays stored in the area
  * immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
  /* Exchange our lock */
  LWLockRelease(MultiXactOffsetControlLock);
 
+ /*
+ *  Let everybody know that the offset of this mxid is recorded now. The waiters
+ *  are waiting for the offset of the mxid following the target, to learn the
+ *  number of members of the target mxid, so we don't need to wait until the
+ *  members of this mxid are recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
  LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
 
  prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
  if (nextMXOffset == 0)
  {
  /* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just about to write the offset.
+ * Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
  LWLockRelease(MultiXactOffsetControlLock);
  CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoff_cv,
+   WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
  goto retry;
  }
 
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
 
  /* Make sure we zero out the per-backend state */
  MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoff_cv);
  }
  else
  Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0ecd29a1d9..1ac6b37188 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3879,6 +3879,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
  case WAIT_EVENT_SYNC_REP:
  event_name = "SyncRep";
  break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "Mact/WaitNextXactMembers";
+ break;
  /* no default case, so that compiler will warn */
  }
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..e79bba0bef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -886,7 +886,8 @@ typedef enum
  WAIT_EVENT_REPLICATION_ORIGIN_DROP,
  WAIT_EVENT_REPLICATION_SLOT_DROP,
  WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS
 } WaitEventIPC;
 
 /* ----------

Re: MultiXact\SLRU buffers configuration

Andrey M. Borodin


> On 15 May 2020, at 05:03, Kyotaro Horiguchi <[hidden email]> wrote:
>
> At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
>>> GetMultiXactIdMembers believes that step 4 is successfully done if step 2
>>> returned a valid offset, but actually that is not obvious.
>>>
>>> If we add a single giant lock just to isolate, say,
>>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
>>> unnecessarily.  Perhaps we need a finer-grained locking key for standby
>>> that works similarly to the buffer lock on primary, one that doesn't cause
>>> conflicts between irrelevant mxids.
>>>
>> We can just replay members before offsets. If offset is already there - members are there too.
>> But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.
>
> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.
Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.
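
Something like this minimal sketch is what I have in mind (the extension and function names are made up, and it is untested):

	/* mxtest.c - hypothetical test extension exposing MultiXact internals to SQL */
	#include "postgres.h"
	#include "access/multixact.h"
	#include "fmgr.h"

	PG_MODULE_MAGIC;

	PG_FUNCTION_INFO_V1(mxtest_nmembers);

	/* Return the number of members of a given MultiXactId, going through the
	 * same GetMultiXactIdMembers() path the heap code uses, so corner cases
	 * (including the retry/sleep) can be exercised directly from SQL. */
	Datum
	mxtest_nmembers(PG_FUNCTION_ARGS)
	{
		MultiXactId		multi = (MultiXactId) PG_GETARG_UINT32(0);
		MultiXactMember *members = NULL;
		int				nmembers;

		nmembers = GetMultiXactIdMembers(multi, &members, false, false);

		if (members != NULL)
			pfree(members);

		PG_RETURN_INT32(nmembers);
	}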

>> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
>> So instead of fix, we, probably, can just add a comment. If this reasoning is correct.
>
> The step 4 of the reader side reads the members of the target mxid. It
> is already written if the offset of the *next* mxid is filled-in.
Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first.
But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

>
>
> > On 15 May 2020, at 05:03, Kyotaro Horiguchi <[hidden email]> wrote:
> >
> > At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
> >>> GetMultiXactIdMembers believes that step 4 is successfully done if step 2
> >>> returned a valid offset, but actually that is not obvious.
> >>>
> >>> If we add a single giant lock just to isolate, say,
> >>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> >>> unnecessarily.  Perhaps we need a finer-grained locking key for standby
> >>> that works similarly to the buffer lock on primary, one that doesn't cause
> >>> conflicts between irrelevant mxids.
> >>>
> >> We can just replay members before offsets. If offset is already there - members are there too.
> >> But I'd be happy if we could mitigate those 1000us too - with a hint about last mxid state in a shared MX state, for example.
> >
> > Generally in such cases, condition variables would work.  In the
> > attached PoC, the reader side gets no penalty in the "likely" code
> > path.  The writer side always calls ConditionVariableBroadcast but the
> > waiter list is empty in almost all cases.  But I couldn't cause the
> > situation where the sleep 1000u is reached.
> Thanks! That really looks like a good solution without magic timeouts. Beautiful!
> I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
> This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.

Happy to hear that. It would need to use a timeout just in case, though.

> >> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
> >> So instead of fix, we, probably, can just add a comment. If this reasoning is correct.
> >
> > The step 4 of the reader side reads the members of the target mxid. It
> > is already written if the offset of the *next* mxid is filled-in.
> Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first.
> But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.

Yeah, right.  Otherwise an assertion failure happens.


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

Daniel Gustafsson
In reply to this post by Kyotaro Horiguchi-4
> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:

> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.

The submitted patch no longer applies, can you please submit an updated
version?  I'm marking the patch Waiting on Author in the meantime.

cheers ./daniel


Re: MultiXact\SLRU buffers configuration

Andrey M. Borodin


> On 2 July 2020, at 17:02, Daniel Gustafsson <[hidden email]> wrote:
>
>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:
>
>> Generally in such cases, condition variables would work.  In the
>> attached PoC, the reader side gets no penalty in the "likely" code
>> path.  The writer side always calls ConditionVariableBroadcast but the
>> waiter list is empty in almost all cases.  But I couldn't cause the
>> situation where the sleep 1000u is reached.
>
> The submitted patch no longer applies, can you please submit an updated
> version?  I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2

Best regards, Andrey Borodin.




v2-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (10K)
v2-0002-Make-MultiXact-local-cache-size-configurable.patch (5K)
v2-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch (4K)

Re: MultiXact\SLRU buffers configuration

Daniel Gustafsson


> On 8 Jul 2020, at 09:03, Andrey M. Borodin <[hidden email]> wrote:
>
>
>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <[hidden email]> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:
>>
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>>
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2

This version too has stopped applying according to the CFbot.  I've moved it to
the next commitfest as we're out of time in this one and it was only pointed
out now, but kept it Waiting on Author.

cheers ./daniel


Re: MultiXact\SLRU buffers configuration

Anastasia Lubennikova
In reply to this post by Andrey M. Borodin
On 08.07.2020 10:03, Andrey M. Borodin wrote:

>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <[hidden email]> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2
>
> Best regards, Andrey Borodin.
1) The first patch is sensible and harmless, so I think it is ready for
committer. I haven't tested the performance impact, though.

2) I like the initial proposal to make various SLRU buffers
configurable; however, I doubt whether it really solves the problem or
just moves it to another place.

The previous patch you sent was based on some version that contained
changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers'
and 'multixact_members_slru_buffers'. Have you just forgotten to attach
them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The
changes are trivial.

2.1) I think that both min and max values for this parameter are too
extreme. Have you tested them?

+               &multixact_local_cache_entries,
+               256, 2, INT_MAX / 2,
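
For illustration only (I have not benchmarked these numbers, they are just an example of a tighter range):

+               &multixact_local_cache_entries,
+               256, 16, 8192,

The exact ceiling is of course up for discussion.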

2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.

3) No changes for the third patch. I just renamed it for consistency.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v3-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (7K)
v3-0002-Make-MultiXact-local-cache-size-configurable.patch (3K)
v3-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch (3K)