MultiXact\SLRU buffers configuration

MultiXact\SLRU buffers configuration

Andrey Borodin-2
Hi, hackers!

*** The problem ***
I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
The problem manifested itself during index repack and constraint validation, both of which are effectively full table scans.
The database workload contains a lot of select for share\select for update queries. I've tried to construct a synthetic workload generator but could not achieve a similar lock configuration: I see a lot of different locks in wait events, in particular a lot more MultiXactMemberLocks. But from my experiments with the synthetic workload, contention on MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS from its default of 8 to bigger numbers.

*** Question 1 ***
Is it safe to increase the number of buffers of the MultiXact\all SLRUs, recompile, and run the database as usual?
I cannot experiment much with production, but I'm fairly sure that bigger buffers will solve the problem.
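(For reference, this is what I mean by recompiling - a sketch of the compile-time defaults; the header location and the members macro name are my guesses, the values match what I report later in this thread:

    /* sketch of the current compile-time SLRU sizing, presumably in multixact.h */
    #define NUM_MXACTOFFSET_BUFFERS    8     /* offsets SLRU, 8 pages */
    #define NUM_MXACTMEMBER_BUFFERS    16    /* members SLRU, 16 pages */

"Bigger buffers" here would simply mean bumping these constants and rebuilding.)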

*** Question 2 ***
Could we perhaps add GUCs for the SLRU sizes? Are there any reasons not to make them configurable? I think multixacts, clog, subtransactions and others would benefit from bigger buffers. But, probably, too many knobs can be confusing.

*** Question 3 ***
The MultiXact offset lock is always taken as an exclusive lock, which makes the MultiXact offset subsystem effectively single-threaded. If someone has a good idea how to make it more concurrency-friendly, I'm willing to put some effort into this.
Probably I could just add LWLocks for each offset buffer page. Is that worth doing? Or are there hidden caveats and difficulties?

Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 8 May 2020, at 21:36, Andrey M. Borodin <[hidden email]> wrote:
>
> *** The problem ***
> I'm investigating some cases of reduced database performance due to MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation. Both being effectively full table scans.
> The database workload contains a lot of select for share\select for update queries. I've tried to construct synthetic world generator and could not achieve similar lock configuration: I see a lot of different locks in wait events, particularly a lot more MultiXactMemberLocks. But from my experiments with synthetic workload, contention of MultiXactOffsetLock can be reduced by increasing NUM_MXACTOFFSET_BUFFERS=8 to bigger numbers.
>
> *** Question 1 ***
> Is it safe to increase number of buffers of MultiXact\All SLRUs, recompile and run database as usual?
> I cannot experiment much with production. But I'm mostly sure that bigger buffers will solve the problem.
>
> *** Question 2 ***
> Probably, we could do GUCs for SLRU sizes? Are there any reasons not to do them configurable? I think multis, clog, subtransactions and others will benefit from bigger buffer. But, probably, too much of knobs can be confusing.
>
> *** Question 3 ***
> MultiXact offset lock is always taken as exclusive lock. It turns MultiXact Offset subsystem to single threaded. If someone have good idea how to make it more concurrency-friendly, I'm willing to put some efforts into this.
> Probably, I could just add LWlocks for each offset buffer page. Is it something worth doing? Or are there any hidden cavers and difficulties?
I've created a benchmark[0] imitating MultiXact pressure on my laptop: 7 clients concurrently run "select * from table where primary_key = ANY ($1) for share", where $1 is an array of identifiers, so that each tuple in the table gets locked by a different set of XIDs. During this benchmark I observe contention on MultiXactOffsetControlLock in pg_stat_activity:

                                    Friday, 8 May 2020, 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                       query                        
-------+----------------------------+-----------------+--------+----------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark measures the time needed to execute the select for update 42 times.
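In essence the benchmark boils down to something like this (a simplified sketch of [0]; the exact schema and parameters there differ, this is guessed from the pg_stat_activity output above):

-- simplified sketch of the benchmark in [0]
create table t1(i int primary key);
insert into t1 select generate_series(1, 1000000, 1);

-- each of the 7 clients repeatedly runs a statement like this with
-- overlapping id arrays, so every tuple gets locked by several
-- transactions at once and its xmax becomes a MultiXact
begin;
select * from t1 where i = any (array(select generate_series(1, 1000))) for share;
commit;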

I've gone ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduced locking level (shared instead of exclusive) on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark run times grow very high. Yet very big buffers slow the benchmark down too. For this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for members (defaults are 8 and 16, respectively).
2. The lock optimisation improves performance by 5% with default SLRU sizes. The benchmark does not explicitly read MultiXactId members, but when it replaces one multixact with another it has to read the previous member set. I understand that we can construct a benchmark to demonstrate the dominance of any algorithm, and 5% on a synthetic workload is not a very big number. But it simply makes sense to take a shared lock for reading.
3. Manipulating the cache size does not affect the benchmark at all. That is somewhat expected: the benchmark is designed to defeat the cache; either way the OffsetControlLock would not be stressed.

For our workload, I think we will just increase the SLRU sizes. But the patchset may be useful for tuning and as a performance optimisation of MultiXact.
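With the patchset applied, the tuning for this benchmark would look roughly like this (a sketch only - parameter names as used in these patches, not a committed interface, and the final spelling may differ; values are the optima found above):

# postgresql.conf sketch
multixact_offsets_slru_buffers = 32    # default 8 pages
multixact_members_slru_buffers = 64    # default 16 pages
multixact_local_cache_entries = 256    # default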

Also, MultiXacts do not seem to be a very good fit for the SLRU design. I think it would be better to use a B-tree as the container, or at least make MultiXact members extendable in place (reserve some space when a multixact is created).
When we want to extend the set of locks on a tuple, currently we:
1. Iterate through all SLRU buffers for offsets to read the current offset (holding the exclusive lock for offsets)
2. Iterate through all buffers for members to find the current members (holding the exclusive lock for members)
3. Create a new members array with one more xid
4. Iterate through all cache entries to check whether an entry equal to the one we are about to create already exists
5. Iterate over 1 again for the write
6. Iterate over 2 again for the write

Obviously this does not scale well - we cannot keep increasing SLRU sizes forever.
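For illustration, the SLRU page lookup is a linear scan of every buffer slot under the control lock, roughly like this (paraphrased sketch of the logic in slru.c, not the exact code):

/*
 * Paraphrased sketch of the buffer lookup in slru.c.  The search is
 * O(number of buffers) and runs under the control lock, which is why
 * simply growing the buffer count eventually costs more than it saves.
 */
static int
slru_find_page_sketch(SlruShared shared, int pageno)
{
    int     slotno;

    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        if (shared->page_number[slotno] == pageno &&
            shared->page_status[slotno] != SLRU_PAGE_EMPTY)
            return slotno;      /* page is cached in this slot */
    }
    return -1;                  /* not cached; caller must read it from disk */
}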

Thanks! I'd be happy to hear any feedback.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress

v1-0001-Add-GUCs-to-tune-MultiXact-SLRUs.patch (8K)
v1-0002-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (10K)
v1-0003-Make-MultiXact-local-cache-size-configurable.patch (5K)

Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 11 May 2020, at 16:17, Andrey M. Borodin <[hidden email]> wrote:
>
> I've went ahead and created 3 patches:
> 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
> 2. Reduce locking level to shared on read of MultiXactId members
> 3. Configurable cache size

I'm looking more at MultiXact and it seems to me that we have a race condition there.

When we create a new MultiXact we do:
1. Generate new MultiXactId under MultiXactGenLock
2. Record new mxid with members and offset to WAL
3. Write offset to SLRU under MultiXactOffsetControlLock
4. Write members to SLRU under MultiXactMemberControlLock

When we read a MultiXact we do:
1. Retrieve the offset by mxid from the SLRU under MultiXactOffsetControlLock
2. If the offset is 0 - it's not filled in at step 4 of the previous algorithm yet - we sleep and goto 1
3. Retrieve the members from the SLRU under MultiXactMemberControlLock
4. ..... what do we do if there are just zeroes because step 4 has not been executed yet? Nothing, we return an empty members list.
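The sleep in step 2 corresponds to this retry loop in GetMultiXactIdMembers() (abridged from multixact.c; in the actual code it is the offset of the *next* mxid that is checked, and the same lines appear as context in the PoC patch later in this thread):

    if (nextMXOffset == 0)
    {
        /* Corner case 2: next multixact is still being filled in */
        LWLockRelease(MultiXactOffsetControlLock);
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L);
        goto retry;
    }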

What am I missing?

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

>
>
> > On 11 May 2020, at 16:17, Andrey M. Borodin <[hidden email]> wrote:
> >
> > I've went ahead and created 3 patches:
> > 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
> > 2. Reduce locking level to shared on read of MultiXactId members
> > 3. Configurable cache size
>
> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>
> When we create a new MultiXact we do:
> 1. Generate new MultiXactId under MultiXactGenLock
> 2. Record new mxid with members and offset to WAL
> 3. Write offset to SLRU under MultiXactOffsetControlLock
> 4. Write members to SLRU under MultiXactMemberControlLock

But, don't we hold exclusive lock on the buffer through all the steps
above?

> When we read MultiXact we do:
> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.

So transactions never see such incomplete mxids, I believe.

> What am I missing?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 14 May 2020, at 06:25, Kyotaro Horiguchi <[hidden email]> wrote:
>
> At Wed, 13 May 2020 23:08:37 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
>>
>>
>>> On 11 May 2020, at 16:17, Andrey M. Borodin <[hidden email]> wrote:
>>>
>>> I've went ahead and created 3 patches:
>>> 1. Configurable SLRU buffer sizes for MultiXacOffsets and MultiXactMembers
>>> 2. Reduce locking level to shared on read of MultiXactId members
>>> 3. Configurable cache size
>>
>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>
>> When we create a new MultiXact we do:
>> 1. Generate new MultiXactId under MultiXactGenLock
>> 2. Record new mxid with members and offset to WAL
>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>> 4. Write members to SLRU under MultiXactMemberControlLock
>
> But, don't we hold exclusive lock on the buffer through all the steps
> above?
Yes... unless the MultiXact is observed on a standby. This could lead to observing an inconsistent snapshot: one of the lockers committed a tuple delete, but the standby still sees the tuple as alive.

>> When we read MultiXact we do:
>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>
> So transactions never see such incomplete mxids, I believe.
I've observed the sleep in step 2. I believe it's possible to observe the special effects of step 4 too.
Maybe we could add a lock on the standby to get rid of this 1000us wait? Sometimes it hits standbys hard: if someone is locking a whole table on the primary, all seq scans on the standbys follow it with MultiXactOffsetControlLock contention.

It looks like this:
#0  0x00007fcd56896ff7 in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffd83376fe0) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x000056186e0d54bd in pg_usleep (microsec=microsec@entry=1000) at ./build/../src/port/pgsleep.c:56
#2  0x000056186dd5edf2 in GetMultiXactIdMembers (from_pgupgrade=0 '\000', onlyLock=<optimized out>, members=0x7ffd83377080, multi=3106214809) at ./build/../src/backend/access/transam/multixact.c:1370
#3  GetMultiXactIdMembers () at ./build/../src/backend/access/transam/multixact.c:1202
#4  0x000056186dd2d2d9 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at ./build/../src/backend/access/heap/heapam.c:7039
#5  0x000056186dd35098 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7fcba3b63d58) at ./build/../src/backend/access/heap/heapam.c:7080
#6  0x000056186e0cd0f8 in HeapTupleSatisfiesMVCC (htup=<optimized out>, snapshot=0x56186f44a058, buffer=230684) at ./build/../src/backend/utils/time/tqual.c:1091
#7  0x000056186dd2d922 in heapgetpage (scan=scan@entry=0x56186f4c8e78, page=page@entry=3620) at ./build/../src/backend/access/heap/heapam.c:439
#8  0x000056186dd2ea7c in heapgettup_pagemode (key=0x0, nkeys=0, dir=ForwardScanDirection, scan=0x56186f4c8e78) at ./build/../src/backend/access/heap/heapam.c:1034
#9  heap_getnext (scan=scan@entry=0x56186f4c8e78, direction=direction@entry=ForwardScanDirection) at ./build/../src/backend/access/heap/heapam.c:1801
#10 0x000056186de84f51 in SeqNext (node=node@entry=0x56186f4a4f78) at ./build/../src/backend/executor/nodeSeqscan.c:81
#11 0x000056186de6a3f1 in ExecScanFetch (recheckMtd=0x56186de84ef0 <SeqRecheck>, accessMtd=0x56186de84f20 <SeqNext>, node=0x56186f4a4f78) at ./build/../src/backend/executor/execScan.c:97
#12 ExecScan (node=0x56186f4a4f78, accessMtd=0x56186de84f20 <SeqNext>, recheckMtd=0x56186de84ef0 <SeqRecheck>) at ./build/../src/backend/executor/execScan.c:164


Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

> >> I'm looking more at MultiXact and it seems to me that we have a race condition there.
> >>
> >> When we create a new MultiXact we do:
> >> 1. Generate new MultiXactId under MultiXactGenLock
> >> 2. Record new mxid with members and offset to WAL
> >> 3. Write offset to SLRU under MultiXactOffsetControlLock
> >> 4. Write members to SLRU under MultiXactMemberControlLock
> >
> > But, don't we hold exclusive lock on the buffer through all the steps
> > above?
> Yes...Unless MultiXact is observed on StandBy. This could lead to observing inconsistent snapshot: one of lockers committed tuple delete, but standby sees it as alive.

Ah, right. I was looking at GetNewMultiXactId. Actually,
XLOG_MULTIXACT_CREATE_ID is not protected against concurrent references to
the mxact id being created. And GetMultiXactIdMembers takes that
case into account.

> >> When we read MultiXact we do:
> >> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
> >> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
> >> 3. Retrieve members from SLRU under MultiXactMemberControlLock
> >> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
> >
> > So transactions never see such incomplete mxids, I believe.
> I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
> Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it hits hard on Standbys: if someone is locking whole table on primary - all seq scans on standbys follow him with MultiXactOffsetControlLock contention.

GetMultiXactIdMembers believes that step 4 has been done successfully if step 2
returned a valid offset, but actually that is not obvious.

If we add a single giant lock just to isolate, say,
GetMultiXactIdMembers and RecordNewMultiXact, it reduces concurrency
unnecessarily.  Perhaps we need a finer-grained locking key for the standby
that works similarly to the buffer lock on the primary and doesn't cause
conflicts between unrelated mxids.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 14 May 2020, at 11:16, Kyotaro Horiguchi <[hidden email]> wrote:
>
> At Thu, 14 May 2020 10:19:42 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
>>>> I'm looking more at MultiXact and it seems to me that we have a race condition there.
>>>>
>>>> When we create a new MultiXact we do:
>>>> 1. Generate new MultiXactId under MultiXactGenLock
>>>> 2. Record new mxid with members and offset to WAL
>>>> 3. Write offset to SLRU under MultiXactOffsetControlLock
>>>> 4. Write members to SLRU under MultiXactMemberControlLock
>>>
>>> But, don't we hold exclusive lock on the buffer through all the steps
>>> above?
>> Yes...Unless MultiXact is observed on StandBy. This could lead to observing inconsistent snapshot: one of lockers committed tuple delete, but standby sees it as alive.
>
> Ah, right. I looked from GetNewMultiXactId. Actually
> XLOG_MULTIXACT_CREATE_ID is not protected from concurrent reference to
> the creating mxact id. And GetMultiXactIdMembers is considering that
> case.
>
>>>> When we read MultiXact we do:
>>>> 1. Retrieve offset by mxid from SLRU under MultiXactOffsetControlLock
>>>> 2. If offset is 0 - it's not filled in at step 4 of previous algorithm, we sleep and goto 1
>>>> 3. Retrieve members from SLRU under MultiXactMemberControlLock
>>>> 4. ..... what we do if there are just zeroes because step 4 is not executed yet? Nothing, return empty members list.
>>>
>>> So transactions never see such incomplete mxids, I believe.
>> I've observed sleep in step 2. I believe it's possible to observe special effects of step 4 too.
>> Maybe we could add lock on standby to dismiss this 1000us wait? Sometimes it hits hard on Standbys: if someone is locking whole table on primary - all seq scans on standbys follow him with MultiXactOffsetControlLock contention.
>
> GetMultiXactIdMembers believes that 4 is successfully done if 2
> returned valid offset, but actually that is not obvious.
>
> If we add a single giant lock just to isolate ,say,
> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> unnecessarily.  Perhaps we need finer-grained locking-key for standby
> that works similary to buffer lock on primary, that doesn't cause
> confilicts between irrelevant mxids.
>
We can just replay members before offsets. If the offset is already there, the members are there too.
But I'd be happy if we could mitigate those 1000us too - with a hint about the last mxid's state in the shared MultiXact state, for example.

Actually, if we read an empty mxid array instead of something that is being replayed just now, it's not an inconsistency problem, because a transaction in this mxid could not have committed before we started. ISTM.
So instead of a fix we could probably just add a comment, if this reasoning is correct.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

> > GetMultiXactIdMembers believes that 4 is successfully done if 2
> > returned valid offset, but actually that is not obvious.
> >
> > If we add a single giant lock just to isolate ,say,
> > GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> > unnecessarily.  Perhaps we need finer-grained locking-key for standby
> > that works similary to buffer lock on primary, that doesn't cause
> > confilicts between irrelevant mxids.
> >
> We can just replay members before offsets. If offset is already there - members are there too.
> But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.
Generally in such cases, condition variables would work.  In the
attached PoC, the reader side gets no penalty in the "likely" code
path.  The writer side always calls ConditionVariableBroadcast, but the
waiter list is empty in almost all cases.  However, I couldn't reproduce
the situation where the 1000us sleep is reached.

> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
> So instead of fix, we, probably, can just add a comment. If this reasoning is correct.

Step 4 on the reader side reads the members of the target mxid.  They
are already written if the offset of the *next* mxid is filled in.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e2aa5c9ce4..9db8f6cddd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -82,6 +82,7 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
@@ -233,6 +234,7 @@ typedef struct MultiXactStateData
  /* support for members anti-wraparound measures */
  MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
 
+ ConditionVariable nextoff_cv;
  /*
  * Per-backend data starts here.  We have two arrays stored in the area
  * immediately following the MultiXactStateData struct. Each is indexed by
@@ -873,6 +875,14 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
  /* Exchange our lock */
  LWLockRelease(MultiXactOffsetControlLock);
 
+ /*
+ *  Let everybody know the offset of this mxid is recorded now. The waiters
+ *  are waiting for the offset of the mxid next of the target to know the
+ *  number of members of the target mxid, so we don't need to wait for
+ *  members of this mxid are recorded.
+ */
+ ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
+
  LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
 
  prev_pageno = -1;
@@ -1367,9 +1377,19 @@ retry:
  if (nextMXOffset == 0)
  {
  /* Corner case 2: next multixact is still being filled in */
+
+ /*
+ * The recorder of the next mxid is just before writing the offset.
+ * Wait for the offset to be written.
+ */
+ ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
+
  LWLockRelease(MultiXactOffsetControlLock);
  CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
+
+ ConditionVariableSleep(&MultiXactState->nextoff_cv,
+   WAIT_EVENT_WAIT_NEXT_MXMEMBERS);
+ ConditionVariableCancelSleep();
  goto retry;
  }
 
@@ -1847,6 +1867,7 @@ MultiXactShmemInit(void)
 
  /* Make sure we zero out the per-backend state */
  MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
+ ConditionVariableInit(&MultiXactState->nextoff_cv);
  }
  else
  Assert(found);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0ecd29a1d9..1ac6b37188 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3879,6 +3879,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
  case WAIT_EVENT_SYNC_REP:
  event_name = "SyncRep";
  break;
+ case WAIT_EVENT_WAIT_NEXT_MXMEMBERS:
+ event_name = "Mact/WaitNextXactMembers";
+ break;
  /* no default case, so that compiler will warn */
  }
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..e79bba0bef 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -886,7 +886,8 @@ typedef enum
  WAIT_EVENT_REPLICATION_ORIGIN_DROP,
  WAIT_EVENT_REPLICATION_SLOT_DROP,
  WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_WAIT_NEXT_MXMEMBERS
 } WaitEventIPC;
 
 /* ----------

Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 15 May 2020, at 05:03, Kyotaro Horiguchi <[hidden email]> wrote:
>
> At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
>>> GetMultiXactIdMembers believes that 4 is successfully done if 2
>>> returned valid offset, but actually that is not obvious.
>>>
>>> If we add a single giant lock just to isolate ,say,
>>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
>>> unnecessarily.  Perhaps we need finer-grained locking-key for standby
>>> that works similary to buffer lock on primary, that doesn't cause
>>> confilicts between irrelevant mxids.
>>>
>> We can just replay members before offsets. If offset is already there - members are there too.
>> But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.
>
> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.
Thanks! That really looks like a good solution without magic timeouts. Beautiful!
I think I can create a temporary extension which calls the MultiXact API and tests edge cases like this 1000us wait.
This extension will also be useful for me to assess the impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.

>> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
>> So instead of fix, we, probably, can just add a comment. If this reasoning is correct.
>
> The step 4 of the reader side reads the members of the target mxid. It
> is already written if the offset of the *next* mxid is filled-in.
Most often, yes, but members are not guaranteed to be filled in order. Whoever wins MultiXactMemberControlLock first will write first.
But nobody can read the members of an MXID before it is returned, and its members will be written before the MXID is returned.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Kyotaro Horiguchi-4
At Fri, 15 May 2020 14:01:46 +0500, "Andrey M. Borodin" <[hidden email]> wrote in

>
>
> > On 15 May 2020, at 05:03, Kyotaro Horiguchi <[hidden email]> wrote:
> >
> > At Thu, 14 May 2020 11:44:01 +0500, "Andrey M. Borodin" <[hidden email]> wrote in
> >>> GetMultiXactIdMembers believes that 4 is successfully done if 2
> >>> returned valid offset, but actually that is not obvious.
> >>>
> >>> If we add a single giant lock just to isolate ,say,
> >>> GetMultiXactIdMember and RecordNewMultiXact, it reduces concurrency
> >>> unnecessarily.  Perhaps we need finer-grained locking-key for standby
> >>> that works similary to buffer lock on primary, that doesn't cause
> >>> confilicts between irrelevant mxids.
> >>>
> >> We can just replay members before offsets. If offset is already there - members are there too.
> >> But I'd be happy if we could mitigate those 1000us too - with a hint about last maixd state in a shared MX state, for example.
> >
> > Generally in such cases, condition variables would work.  In the
> > attached PoC, the reader side gets no penalty in the "likely" code
> > path.  The writer side always calls ConditionVariableBroadcast but the
> > waiter list is empty in almost all cases.  But I couldn't cause the
> > situation where the sleep 1000u is reached.
> Thanks! That really looks like a good solution without magic timeouts. Beautiful!
> I think I can create temporary extension which calls MultiXact API and tests edge-cases like this 1000us wait.
> This extension will also be also useful for me to assess impact of bigger buffers, reduced read locking (as in my 2nd patch) and other tweaks.

Happy to hear that. It would need to use a timeout just in case, though.

> >> Actually, if we read empty mxid array instead of something that is replayed just yet - it's not a problem of inconsistency, because transaction in this mxid could not commit before we started. ISTM.
> >> So instead of fix, we, probably, can just add a comment. If this reasoning is correct.
> >
> > The step 4 of the reader side reads the members of the target mxid. It
> > is already written if the offset of the *next* mxid is filled-in.
> Most often - yes, but members are not guaranteed to be filled in order. Those who win MXMemberControlLock will write first.
> But nobody can read members of MXID before it is returned. And its members will be written before returning MXID.

Yeah, right.  Otherwise an assertion failure happens.


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: MultiXact\SLRU buffers configuration

Daniel Gustafsson
In reply to this post by Kyotaro Horiguchi-4
> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:

> Generally in such cases, condition variables would work.  In the
> attached PoC, the reader side gets no penalty in the "likely" code
> path.  The writer side always calls ConditionVariableBroadcast but the
> waiter list is empty in almost all cases.  But I couldn't cause the
> situation where the sleep 1000u is reached.

The submitted patch no longer applies, can you please submit an updated
version?  I'm marking the patch Waiting on Author in the meantime.

cheers ./daniel


Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 2 July 2020, at 17:02, Daniel Gustafsson <[hidden email]> wrote:
>
>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:
>
>> Generally in such cases, condition variables would work.  In the
>> attached PoC, the reader side gets no penalty in the "likely" code
>> path.  The writer side always calls ConditionVariableBroadcast but the
>> waiter list is empty in almost all cases.  But I couldn't cause the
>> situation where the sleep 1000u is reached.
>
> The submitted patch no longer applies, can you please submit an updated
> version?  I'm marking the patch Waiting on Author in the meantime.
Thanks, Daniel! PFA V2

Best regards, Andrey Borodin.




v2-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (10K)
v2-0002-Make-MultiXact-local-cache-size-configurable.patch (5K)
v2-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch (4K)

Re: MultiXact\SLRU buffers configuration

Daniel Gustafsson


> On 8 Jul 2020, at 09:03, Andrey M. Borodin <[hidden email]> wrote:
>
>
>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <[hidden email]> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:
>>
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>>
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2

This version too has stopped applying according to the CFbot.  I've moved it to
the next commitfest as we're out of time in this one and it was only pointed
out now, but kept it Waiting on Author.

cheers ./daniel


Re: MultiXact\SLRU buffers configuration

Anastasia Lubennikova
In reply to this post by Andrey Borodin-2
On 08.07.2020 10:03, Andrey M. Borodin wrote:

>
>> On 2 July 2020, at 17:02, Daniel Gustafsson <[hidden email]> wrote:
>>
>>> On 15 May 2020, at 02:03, Kyotaro Horiguchi <[hidden email]> wrote:
>>> Generally in such cases, condition variables would work.  In the
>>> attached PoC, the reader side gets no penalty in the "likely" code
>>> path.  The writer side always calls ConditionVariableBroadcast but the
>>> waiter list is empty in almost all cases.  But I couldn't cause the
>>> situation where the sleep 1000u is reached.
>> The submitted patch no longer applies, can you please submit an updated
>> version?  I'm marking the patch Waiting on Author in the meantime.
> Thanks, Daniel! PFA V2
>
> Best regards, Andrey Borodin.
1) The first patch is sensible and harmless, so I think it is ready for
committer. I haven't tested the performance impact, though.

2) I like the initial proposal to make various SLRU buffers
configurable, however, I doubt if it really solves the problem, or just
moves it to another place?

The previous patch you sent was based on some version that contained
changes for other SLRU buffer numbers: 'multixact_offsets_slru_buffers'
and 'multixact_members_slru_buffers'. Have you just forgotten to attach
them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
Meanwhile, I attach the rebased patch to calm down the CFbot. The
changes are trivial.

2.1) I think that both min and max values for this parameter are too
extreme. Have you tested them?

+               &multixact_local_cache_entries,
+               256, 2, INT_MAX / 2,

2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.

3) No changes for third patch. I just renamed it for consistency.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


v3-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (7K)
v3-0002-Make-MultiXact-local-cache-size-configurable.patch (3K)
v3-0003-Add-conditional-variable-to-wait-for-next-MultXac.patch (3K)

Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2
Hi, Anastasia!

> On 28 Aug 2020, at 23:08, Anastasia Lubennikova <[hidden email]> wrote:
>
> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
>
> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
>
> The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>
> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>
> +               &multixact_local_cache_entries,
> +               256, 2, INT_MAX / 2,
>
> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>
> 3) No changes for third patch. I just renamed it for consistency.

Thank you for your review.

Indeed, I had a 4th patch with tests, but those tests didn't work well: I still have not managed to stress the SLRUs enough to reproduce the problem from production...

You are absolutely correct in point 2: I only ran tests with sane values, and observed extreme performance degradation with values of ~64 megabytes. I do not know which maximum value we should pick. 1Gb? Or the highest value that still functions?

I greatly appreciate your review, sorry for such a long delay. Thanks!

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Anastasia Lubennikova
On 28.09.2020 17:41, Andrey M. Borodin wrote:

> Hi, Anastasia!
>
>> On 28 Aug 2020, at 23:08, Anastasia Lubennikova <[hidden email]> wrote:
>>
>> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
>>
>> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
>>
>> The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
>> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>>
>> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>>
>> +               &multixact_local_cache_entries,
>> +               256, 2, INT_MAX / 2,
>>
>> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>>
>> 3) No changes for third patch. I just renamed it for consistency.
> Thank you for your review.
>
> Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
>
> You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?

I would go with values that we consider adequate for this setting.
As far as I can see, there is no strict rule about it in guc.c, and many variables
have large boundary values. Anyway, we need to explain it at least in the
documentation and code comments.

It seems that the default was conservative enough, so it could also serve as
the minimal value. As for the maximum, can you provide any benchmark
results? If there is a peak followed by a noticeable performance degradation,
we can use that to calculate the preferable maxvalue.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: MultiXact\SLRU buffers configuration

Alexander Korotkov-4
In reply to this post by Andrey Borodin-2
Hi!

On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <[hidden email]> wrote:

> > On 28 Aug 2020, at 23:08, Anastasia Lubennikova <[hidden email]> wrote:
> >
> > 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
> >
> > 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
> >
> > The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
> > Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
> >
> > 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
> >
> > +               &multixact_local_cache_entries,
> > +               256, 2, INT_MAX / 2,
> >
> > 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
> >
> > 3) No changes for third patch. I just renamed it for consistency.
>
> Thank you for your review.
>
> Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
>
> You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
>
> I greatly appreciate your review, sorry for so long delay. Thanks!
I took a look at this patchset.

The 1st and 3rd patches look good to me.  I made just minor improvements.
1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
SLRU lock, which is already taken in exclusive mode.  I've avoided this
by passing the lock mode as a parameter to
SimpleLruReadPage_ReadOnly() (a rough sketch of this idea follows the list below).
3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
inside ConditionVariableSleep() if needed.  Also, no current wait
events use slashes, and I don't think we should introduce slashes
here.  Even if we should, then we should also rename existing wait
events to be consistent with a new one.  So, I've renamed the new wait
event to remove the slash.
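A rough sketch of the idea in 1) - this is not the code from v4-0001, just an illustration of passing the caller's lock mode down so the function does not relock:

/*
 * Hypothetical sketch only; the function and its callers in the actual
 * patch differ in detail.  The caller reports which mode it already holds
 * the SLRU control lock in, so a caller holding it exclusively is not
 * forced through a release/re-acquire cycle.
 */
static int
SimpleLruReadPage_ReadOnly_sketch(SlruCtl ctl, int pageno, TransactionId xid,
                                  LWLockMode held_mode)
{
    SlruShared  shared = ctl->shared;
    int         slotno;

    /* Fast path: the page is already buffered; any lock mode lets us read it. */
    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        if (shared->page_number[slotno] == pageno &&
            shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
            return slotno;
    }

    /* Slow path needs the exclusive lock; upgrade only if we held it shared. */
    if (held_mode != LW_EXCLUSIVE)
    {
        LWLockRelease(shared->ControlLock);
        LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
    }
    return SimpleLruReadPage(ctl, pageno, true, xid);
}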

Regarding the patch 2.  I see the current documentation in the patch
doesn't explain to the user how to set the new parameter.  I think we
should give users an idea of which workloads need high values of the
multixact_local_cache_entries parameter and which don't.  Also, we
should explain the negative aspects of high
multixact_local_cache_entries values.  Ideally, we should get the advantage
without overloading users with new nontrivial parameters, but I don't
have a particular idea here.

I'd like to propose committing 1 and 3, but leave 2 for further review.

------
Regards,
Alexander Korotkov

v4-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offsets.patch (16K)
v4-0002-Make-MultiXact-local-cache-size-configurable.patch (5K)
v4-0003-Add-conditional-variable-to-wait-for-next-MultXact-o.patch (5K)

Re: MultiXact\SLRU buffers configuration

Andrey Borodin-2


> On 26 Oct 2020, at 06:05, Alexander Korotkov <[hidden email]> wrote:
>
> Hi!
>
> On Mon, Sep 28, 2020 at 5:41 PM Andrey M. Borodin <[hidden email]> wrote:
>>> On 28 Aug 2020, at 23:08, Anastasia Lubennikova <[hidden email]> wrote:
>>>
>>> 1) The first patch is sensible and harmless, so I think it is ready for committer. I haven't tested the performance impact, though.
>>>
>>> 2) I like the initial proposal to make various SLRU buffers configurable, however, I doubt if it really solves the problem, or just moves it to another place?
>>>
>>> The previous patch you sent was based on some version that contained changes for other slru buffers numbers: 'multixact_offsets_slru_buffers' and 'multixact_members_slru_buffers'. Have you just forgot to attach them? The patch message "[PATCH v2 2/4]" hints that you had 4 patches)
>>> Meanwhile, I attach the rebased patch to calm down the CFbot. The changes are trivial.
>>>
>>> 2.1) I think that both min and max values for this parameter are too extreme. Have you tested them?
>>>
>>> +               &multixact_local_cache_entries,
>>> +               256, 2, INT_MAX / 2,
>>>
>>> 2.2) MAX_CACHE_ENTRIES is not used anymore, so it can be deleted.
>>>
>>> 3) No changes for third patch. I just renamed it for consistency.
>>
>> Thank you for your review.
>>
>> Indeed, I had 4th patch with tests, but these tests didn't work well: I still did not manage to stress SLRUs to reproduce problem from production...
>>
>> You are absolutely correct in point 2: I did only tests with sane values. And observed extreme performance degradation with values ~ 64 megabytes. I do not know which highest values should we pick? 1Gb? Or highest possible functioning value?
>>
>> I greatly appreciate your review, sorry for so long delay. Thanks!
>
> I took a look at this patchset.
>
> The 1st and 3rd patches look good to me.  I made just minor improvements.
> 1) There is still a case when SimpleLruReadPage_ReadOnly() relocks the
> SLRU lock, which is already taken in exclusive mode.  I've evaded this
> by passing the lock mode as a parameter to
> SimpleLruReadPage_ReadOnly().
> 3) CHECK_FOR_INTERRUPTS() is not needed anymore, because it's called
> inside ConditionVariableSleep() if needed.  Also, no current wait
> events use slashes, and I don't think we should introduce slashes
> here.  Even if we should, then we should also rename existing wait
> events to be consistent with a new one.  So, I've renamed the new wait
> event to remove the slash.
>
> Regarding the patch 2.  I see the current documentation in the patch
> doesn't explain to the user how to set the new parameter.  I think we
> should give users an idea what workloads need high values of
> multixact_local_cache_entries parameter and what doesn't.  Also, we
> should explain the negative aspects of high values
> multixact_local_cache_entries.  Ideally, we should get the advantage
> without overloading users with new nontrivial parameters, but I don't
> have a particular idea here.
>
> I'd like to propose committing 1 and 3, but leave 2 for further review.

Thanks for your review, Alexander!
+1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
Other changes seem correct to me too.


I've tried to find the optimal value for the cache size, and it seems to me that it affects multixact scalability much less than the sizes of the offsets\members buffers. I concur that patch 2 of the patchset does not seem sufficiently documented.

Best regards, Andrey Borodin.


Re: MultiXact\SLRU buffers configuration

Alexander Korotkov-4
On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <[hidden email]> wrote:
> Thanks for your review, Alexander!
> +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> Other changes seem correct to me too.
>
>
> I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less than sizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.

Thank you.  I've made a few more minor adjustments to the patchset.
I'm going to push 0001 and 0003 if there are no objections.

------
Regards,
Alexander Korotkov

v5-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (17K)
v5-0002-Make-MultiXact-local-cache-size-configurable.patch (5K)
v5-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch (6K)

Re: MultiXact\SLRU buffers configuration

Alexander Korotkov-4
On Tue, Oct 27, 2020 at 8:02 PM Alexander Korotkov <[hidden email]> wrote:

> On Mon, Oct 26, 2020 at 6:45 PM Andrey Borodin <[hidden email]> wrote:
> > Thanks for your review, Alexander!
> > +1 for avoiding double locking in SimpleLruReadPage_ReadOnly().
> > Other changes seem correct to me too.
> >
> >
> > I've tried to find optimal value for cache size and it seems to me that it affects multixact scalability much less than sizes of offsets\members buffers. I concur that patch 2 of the patchset does not seem documented enough.
>
> Thank you.  I've made a few more minor adjustments to the patchset.
> I'm going to push 0001 and 0003 if there are no objections.
I see that patchset v5 doesn't pass the tests due to a typo in an assert.
The fixed version is attached.

------
Regards,
Alexander Korotkov

v6-0001-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch (17K)
v6-0002-Make-MultiXact-local-cache-size-configurable.patch (5K)
v6-0003-Add-conditional-variable-to-wait-for-next-MultXact.patch (6K)