CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie

Using V12, Linux [Ubuntu 16.04LTS]

I have a system which implements a message queue with the basic pattern that a process selects a group of, for example 250, rows for processing via SELECT .. LIMIT 250 FOR UPDATE SKIP LOCKED.

When there are a small number of concurrent connections to process the queue, this seems to work as expected and connections quickly obtain a unique block of 250 rows for processing.

However, as I scale up the number of concurrent connections, I see a spike in CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes and the select processes wait for multiple minutes (10-20 minutes) before completing.  My use case requires around 256 concurrent processors for the queue but I've been unable to scale beyond 128 without everything grinding to a halt.

The queue table itself fits in RAM (with 2M hugepages) and during the wait, all the performance counters drop to almost 0 - no disk read or write (semi-expected due to the table fitting in memory) with 100% buffer hit rate in pg_top and row read around 100/s which is much smaller than expected.

After processes complete the select and the number of waiting selects starts to fall, CPU load falls and then suddenly the remaining processes all complete within a few seconds and things perform normally until the next time there are a group of SELECT  FOR UPDATE statements which bunch together and things then repeat.

I found that performing extremely frequent vacuum analyze (every 30 minutes) helps a small amount but this is not that helpful so problems are still very apparent.

I've exhausted all the performance tuning and analysis results I can find that seem even a little bit relevant but cannot get this cracked.

Is anyone on the list able to help with suggestions of what I can do to track why this CPU hogging happens as this does seem to be the root of the problem?

Thanks in advance,

Jim


Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Michael Lewis
Message queue...
Are rows deleted? Are they updated once or many times? Have you adjusted fillfactor on table or indexes? How many rows in the table currently or on average? Is there any ordering to which rows you update?

It seems likely that one of the experts/code contributors will chime in and explain about how locking that many rows in that many concurrent connections means that some resource is overrun and so you are escalating to a table lock instead of actually truly locking only the 250 rows you wanted.

On the other hand, you say 80 cores and you are trying to increase the number of concurrent processes well beyond that without (much) disk I/O being involved. I wouldn't expect that to perform awesome.

Is there a chance to modify the code to permit each process to lock 1000 rows at a time and be content with 64 concurrent processes?
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie

Thank you for the quick response.

No adjustments of fill factors.  Hadn't though of that - I'll investigate and try some options to see if I can measure an effect.

There is some ordering on the select [ ORDER BY q_id] so each block of 250 is sequential-ish queue items; I just need them more or less in the order they were queued so as near FIFO as possible without being totally strict on absolute sequential order.

Table has around 192K rows, as a row is processed it is deleted as part of the transaction with a commit at the end after all 250 are processed [partitioned table, state changes and it migrates to a different partition] and as the queue drops to 64K it is added to with 128K rows at a time.

I've tuned the LIMIT value both up and down.  As I move the limit up, the problem becomes substantially worse; 300 swamps it and the selects take > 1 hour to complete; at 600 they just all lock everything up and it stops processing.  I did try 1,000 but it basically resulted in nothing being processed.

Less processes does not give the throughput required because the queue sends data elsewhere which has a long round trip time but does permit over 1K concurrent connections as their work-round for throughput.  I'm stuck having to scale up my concurrent processes in order to compensate for the long processing time of an individual queue item.



On 18-Aug.-2020 20:08, Michael Lewis wrote:
Message queue...
Are rows deleted? Are they updated once or many times? Have you adjusted
fillfactor on table or indexes? How many rows in the table currently or on
average? Is there any ordering to which rows you update?

It seems likely that one of the experts/code contributors will chime in and
explain about how locking that many rows in that many concurrent
connections means that some resource is overrun and so you are escalating
to a table lock instead of actually truly locking only the 250 rows you
wanted.

On the other hand, you say 80 cores and you are trying to increase the
number of concurrent processes well beyond that without (much) disk I/O
being involved. I wouldn't expect that to perform awesome.

Is there a chance to modify the code to permit each process to lock 1000
rows at a time and be content with 64 concurrent processes?

Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Henrique Montenegro
Did you try using NOWAIT instead of SKIP LOCKED to see if the behavior still shows up?

On Tue, Aug 18, 2020, 8:22 PM Jim Jarvie <[hidden email]> wrote:

Thank you for the quick response.

No adjustments of fill factors.  Hadn't though of that - I'll investigate and try some options to see if I can measure an effect.

There is some ordering on the select [ ORDER BY q_id] so each block of 250 is sequential-ish queue items; I just need them more or less in the order they were queued so as near FIFO as possible without being totally strict on absolute sequential order.

Table has around 192K rows, as a row is processed it is deleted as part of the transaction with a commit at the end after all 250 are processed [partitioned table, state changes and it migrates to a different partition] and as the queue drops to 64K it is added to with 128K rows at a time.

I've tuned the LIMIT value both up and down.  As I move the limit up, the problem becomes substantially worse; 300 swamps it and the selects take > 1 hour to complete; at 600 they just all lock everything up and it stops processing.  I did try 1,000 but it basically resulted in nothing being processed.

Less processes does not give the throughput required because the queue sends data elsewhere which has a long round trip time but does permit over 1K concurrent connections as their work-round for throughput.  I'm stuck having to scale up my concurrent processes in order to compensate for the long processing time of an individual queue item.



On 18-Aug.-2020 20:08, Michael Lewis wrote:
Message queue...
Are rows deleted? Are they updated once or many times? Have you adjusted
fillfactor on table or indexes? How many rows in the table currently or on
average? Is there any ordering to which rows you update?

It seems likely that one of the experts/code contributors will chime in and
explain about how locking that many rows in that many concurrent
connections means that some resource is overrun and so you are escalating
to a table lock instead of actually truly locking only the 250 rows you
wanted.

On the other hand, you say 80 cores and you are trying to increase the
number of concurrent processes well beyond that without (much) disk I/O
being involved. I wouldn't expect that to perform awesome.

Is there a chance to modify the code to permit each process to lock 1000
rows at a time and be content with 64 concurrent processes?

Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Michael Lewis
In reply to this post by Jim Jarvie
On Tue, Aug 18, 2020 at 6:22 PM Jim Jarvie <[hidden email]> wrote:

There is some ordering on the select [ ORDER BY q_id] so each block of 250 is sequential-ish queue items; I just need them more or less in the order they were queued so as near FIFO as possible without being totally strict on absolute sequential order.

How long does each process take in total? How strict does that FIFO really need to be when you are already doing SKIP LOCKED anyway?

Table has around 192K rows, as a row is processed it is deleted as part of the transaction with a commit at the end after all 250 are processed [partitioned table, state changes and it migrates to a different partition] and as the queue drops to 64K it is added to with 128K rows at a time.

Can you expound on the partitioning? Are all consumers of the queue always hitting one active partition and anytime a row is processed, it always moves to one of many? archived type partitions?

Less processes does not give the throughput required because the queue sends data elsewhere which has a long round trip time

Is that done via FDW or otherwise within the same database transaction? Are you connecting some queue consumer application code to Postgres, select for update, doing work on some remote system that is slow, and then coming back and committing the DB work?

By the way, top-posting is discouraged here and partial quotes with interspersed comments are common practice.
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Michael Lewis
Also, have you checked how bloated your indexes are getting? Do you run default autovacuum settings? Did you update to the new default 2ms cost delay value? With a destructive queue, it would be very important to ensure autovacuum is keeping up with the churn. Share your basic table structure and indexes, sanitized if need be.
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Laurenz Albe
In reply to this post by Jim Jarvie
On Tue, 2020-08-18 at 19:52 -0400, Jim Jarvie wrote:

> I have a system which implements a message queue with the basic pattern that a process selects a group of,
>  for example 250, rows for processing via SELECT .. LIMIT 250 FOR UPDATE SKIP LOCKED.
>
> When there are a small number of concurrent connections to process the queue, this seems to work as
>  expected and connections quickly obtain a unique block of 250 rows for processing.
> However, as I scale up the number of concurrent connections, I see a spike in CPU (to 100% across 80 cores)
>  when the SELECT FOR UPDATE SKIP LOCKED executes and the select processes wait for multiple minutes
>  (10-20 minutes) before completing.  My use case requires around 256 concurrent processors for the queue
>  but I've been unable to scale beyond 128 without everything grinding to a halt.
>
> The queue table itself fits in RAM (with 2M hugepages) and during the wait, all the performance counters
>  drop to almost 0 - no disk read or write (semi-expected due to the table fitting in memory) with 100%
>  buffer hit rate in pg_top and row read around 100/s which is much smaller than expected.
>
> After processes complete the select and the number of waiting selects starts to fall, CPU load falls and
>  then suddenly the remaining processes all complete within a few seconds and things perform normally until
>  the next time there are a group of SELECT  FOR UPDATE statements which bunch together and things then repeat.
>
> I found that performing extremely frequent vacuum analyze (every 30 minutes) helps a small amount but
>  this is not that helpful so problems are still very apparent.
>
> I've exhausted all the performance tuning and analysis results I can find that seem even a little bit
>  relevant but cannot get this cracked.
>
> Is anyone on the list able to help with suggestions of what I can do to track why this CPU hogging happens
>  as this does seem to be the root of the problem?

You should

- check with "pgstattuple" if the table is bloated.

- use "perf" to see where the CPU time is spent.

- look at "pg_stat_activity" for wait events (unlikely if the CPU is busy).

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com



Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie
In reply to this post by Michael Lewis

Updates added, mixture of good and bad news:

On 18-Aug.-2020 20:39, Michael Lewis wrote:

      
How long does each process take in total? How strict does that FIFO really
need to be when you are already doing SKIP LOCKED anyway?

The processes all bottleneck[ed] for several minutes, approximately exponential to the number above the threshold where the problem happened.  Up to around 60 concurrent worked with minimal delay but beyond that a few more added about a minute, 70 about 5 minutes, 80 about 30 minutes and beyond that was hours (I left up to 12 hours one time).

However, I removed the order by clause which eliminated [only] the high CPU.  The processes all stopped in the same pattern, just without the high CPU use.  So the ordering was the factor in the CPU use, but was not responsible for the - forgive the pun - lock up.

I then added a random few seconds of delay to each process before it executes the select in order to prevent too many of them colliding on simultaneous selects.  That was enough to make the lock-up disappear and individual selects complete in a few ms, regardless of how many other concurrent transactions are in progress (tested up to 192 concurrent).  But still not quite out the woods - read below.

Can you expound on the partitioning? Are all consumers of the queue always
hitting one active partition and anytime a row is processed, it always
moves to one of many? archived type partitions?

Partitions are list partitioned as 'incoming', 'processing', 'retry', 'ok', 'failed':

Incoming: This is itself hash partitioned (64 partitions) approx 10M rows added/day so partitioned to allow incoming throughput; this works well.

Processing: Simple partition, data is moved into this in blocks as the rows count drops below a threshold, another block is added, coming from the incoming.

Retry: simple partition, non fatal errors go in here and go back into the processing queue for retries later.

Failed: simple partition, fatal errors go here.  Thankfully very few.

OK: hash partition, as everything that was in incoming should eventually end up here.  64 partitions currently.

There is one interesting thing about this.  When the SELECT FOR UPDATE SKIP LOCKED is executed, reasonably frequently, the select aborts with the error:

Tuple to be locked was already moved to another partition due to concurrent update.

This error still persists even though the lock-up has been removed by the time delay, so there is a regular stream of transactions aborting due to this (I just re-run the transaction to recover).

Now, if locking worked as I understand it, if another process locked and migrated, this should still have left the lock in place on the original partition and created a new one on the newly inserted partition until a commit was done.  The second process should not have visibility on the newly inserted row and the skip locked should simply have skipped over the locked but deleted row on the original partition.

What am I missing?  All of this feels like some locking/partitioning issue but I'm unsure exactly what.

Is that done via FDW or otherwise within the same database transaction? Are
you connecting some queue consumer application code to Postgres, select for
update, doing work on some remote system that is slow, and then coming back
and committing the DB work?
Alas not FDW, an actual external system elsewhere in the world which sends an ACK when it has processed the message.  I have no control or influence on this.
By the way, top-posting is discouraged here and partial quotes with
interspersed comments are common practice.
Noted!

    
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Michael Lewis
Great to hear that some of the issues are now mitigated. Though, perhaps you actually require that ORDER BY if items are expected to be sitting in the queue quite some time because you have incoming queue items in a burst pattern and have to play catch up sometimes. If so, I highly suspect the index on q_id is becoming very bloated and reindex concurrently would help.

> Partitions are list partitioned as 'incoming', 'processing', 'retry', 'ok', 'failed':

I am unclear on what purpose a "processing" status would have. Shouldn't a row be in the incoming status & locked by select for update, until it either gets updated to ok or failed (or left alone if retry is needed)? What purpose do the retry and processing statuses serve? I don't understand your full workflow to venture a guess on how you are hitting that error regarding a row being in the wrong partition, but fewer main level partitions and removing unneeded updates seems likely to help or resolve the issue perhaps.

I don't know if you might have missed my last message, and the suggestion from Laurenz to check pgstattuple.

At a high level, it seems like any needed update to the rows would result in it being removed from the current partition and moved to another partition. If you are doing this in a transaction block, then you could just as well skip the select for update and just DELETE [] RETURNING from the existing partition and insert into the new partition later (use a select for update if you want to order the deletes*). If your transaction fails and gets rolled back, then the delete won't have happened and the row will get picked up by the next consumer.

Another thought is that I don't know how performant that hash partitioning will be for select for update, particularly if that targets many partitions potentially. Would it be feasible to match the number of partitions to the number of consumers and actually have each of them working on one?


*https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie


On 20-Aug.-2020 13:30, Michael Lewis wrote:
Great to hear that some of the issues are now mitigated. Though, perhaps
you actually require that ORDER BY if items are expected to be sitting in
the queue quite some time because you have incoming queue items in a burst
pattern and have to play catch up sometimes. If so, I highly suspect the
index on q_id is becoming very bloated and reindex concurrently would help.
I managed to bypass the need for the sort by relying on the active feed only sending the oldest items in for processing (it was always doing that) but based on some of the earlier e-mails in this thread, it prompted the revelation that my order by when processing was really pretty pointless because I need more-or-less ordered rather than strictly ordered and that was already happening due to how the process list was being fed.

I don't know if you might have missed my last message, and the suggestion
from Laurenz to check pgstattuple.
I still need to look at that, but since I had made some progress, I got pretty exited and have not got round to this yet.
*
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/

This does warn about the overhead, but I've also upgraded pg_top on my system today and saw a useful additional data point that it displays - the number of locks held by a process.

What I see happening is that when the select statements collide, they are holding about 10-12 locks each and then begin to very slowly acquire more locks every few seconds.  One process will grow quicker than others then reach the target (250) and start processing.  Then another takes the lead and so on until a critical mass is reached and then the remaining all acquire their locks in a few seconds.

I still keep thinking there is some scaling type issue here in the locking and possibly due to it being a partitioned table (due to that tuple moved error).


    
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Michael Lewis
Can you share an explain analyze for the query that does the select for update? I wouldn't assume that partition pruning is possible at all with hash, and it would be interesting to see how it is finding those rows.

    
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie


On 20-Aug.-2020 17:42, Michael Lewis wrote:
Can you share an explain analyze for the query that does the select for
update? I wouldn't assume that partition pruning is possible at all with
hash, and it would be interesting to see how it is finding those rows.

Well this got interesting  - the already moved error showed up:  Note, the actual process partitions are regular table partitions, these are not hashed.  Only the incoming and completed are hashed due to row counts at either end of the processing; in flight (where the issue shows up) is quite small:

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 579.131 ms
[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 568.008 ms
[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..25.71 rows=250 width=34) (actual time=1306.041..1306.338 rows=250 loops=1)
   ->  LockRows  (cost=0.00..7934.38 rows=77150 width=34) (actual time=1306.040..1306.315 rows=250 loops=1)
         ->  Append  (cost=0.00..7162.88 rows=77150 width=34) (actual time=520.685..1148.347 rows=31500 loops=1)
               ->  Seq Scan on queue_tx_active  (cost=0.00..6764.50 rows=77000 width=34) (actual time=520.683..1145.258 rows=31500 loops=1)
                     Filter: ((txobject = 'ticket'::mq.queue_object) AND ((state = 'tx_active'::mq.tx_state) OR (state = 'tx_fail_retryable'::mq.tx_state)))
               ->  Seq Scan on queue_tx_fail_retryable  (cost=0.00..12.62 rows=150 width=34) (never executed)
                     Filter: ((txobject = 'ticket'::mq.queue_object) AND ((state = 'tx_active'::mq.tx_state) OR (state = 'tx_fail_retryable'::mq.tx_state)))
 Planning Time: 0.274 ms
 Execution Time: 1306.380 ms
(9 rows)

Time: 1317.150 ms (00:01.317)
[queuedb] #


    
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Michael Lewis
On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie <[hidden email]> wrote:

On 20-Aug.-2020 17:42, Michael Lewis wrote:

Can you share an explain analyze for the query that does the select for
update? I wouldn't assume that partition pruning is possible at all with
hash, and it would be interesting to see how it is finding those rows.

Well this got interesting  - the already moved error showed up:  Note, the actual process partitions are regular table partitions, these are not hashed.  Only the incoming and completed are hashed due to row counts at either end of the processing; in flight (where the issue shows up) is quite small:

[queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
LOCATION:  heapam_tuple_lock, heapam_handler.c:405
Time: 579.131 ms

That is super curious. I hope that someone will jump in with an explanation or theory on this.

I still wonder why the move between partitions is needed though if the work is either done (failed or successful) or not done... not started, retry needed or in progress... it doesn't matter. It needs to get picked up by the next process if it isn't already row locked.

    
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

David Rowley
On Fri, 21 Aug 2020 at 11:01, Michael Lewis <[hidden email]> wrote:

>
> On Thu, Aug 20, 2020 at 4:40 PM Jim Jarvie <[hidden email]> wrote:
>>
>> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>>
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>>
>> Well this got interesting  - the already moved error showed up:  Note, the actual process partitions are regular table partitions, these are not hashed.  Only the incoming and completed are hashed due to row counts at either end of the processing; in flight (where the issue shows up) is quite small:
>>
>> [queuedb] # explain analyze select queueid,txobject,objectid,state from mq.queue where (state = 'tx_active' or state='tx_fail_retryable') and txobject = 'ticket' limit 250 for update skip locked;
>> ERROR:  40001: tuple to be locked was already moved to another partition due to concurrent update
>> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
>> Time: 579.131 ms
>
> That is super curious. I hope that someone will jump in with an explanation or theory on this.
>
> I still wonder why the move between partitions is needed though if the work is either done (failed or successful) or not done... not started, retry needed or in progress... it doesn't matter. It needs to get picked up by the next process if it isn't already row locked.

I may be heading off in the wrong direction as I'm not fully sure I
understand what the complaint is about, but isn't the executor just
hitting dead rows in one of the active or failed partitions that have
been moved off to some other partition?

When updates occur in a non-partitioned table we can follow item
pointer chains to find the live row and check if the WHERE clause
still matches to determine if the row should be updated, or in this
case just locked since it's a SELECT FOR UPDATE. However, with
partitioned table, a concurrent UPDATE may have caused the row to have
been moved off to another partition, in which case the tuple's item
pointer cannot point to it since we don't have enough address space,
we only have 6 bytes for a TID. To get around the fact that we can't
follow these update chains, we just throw the serialization error,
which is what you're getting. Ideally, we'd figure out where the live
version of the tuple is and check if it matches the WHERE clause and
lock it if it does, but we've no means to do that with the current
design.

If the complaint is about the fact that you're getting the error and
you think you shouldn't be because you said "SKIP LOCKED" then I'm not
really sure the fact that you said "SKIP LOCKED" gives us the right to
ignore this case. The user only gave us the go-ahead to skip locked
tuples, not skip tuples that we just failed to follow item pointer
chains for.   It might be okay to do this for rows that have since
been marked as complete since they no longer match your WHERE clause,
however, if the row has gone from the queue_tx_active partition into
the queue_tx_fail_retryable partition then I don't see why we'd have
the right to skip the tuple. Your query says you want tuples that need
to be retried. We can't go skipping them.

So isn't the fix just to code the application to retry on 40001 errors?

David


Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Thomas Munro-5
In reply to this post by Jim Jarvie
On Fri, Aug 21, 2020 at 9:58 AM Jim Jarvie <[hidden email]> wrote:
> However, as I scale up the number of concurrent connections, I see a spike in CPU (to 100% across 80 cores) when the SELECT FOR UPDATE SKIP LOCKED executes and the select processes wait for multiple minutes (10-20 minutes) before completing.  My use case requires around 256 concurrent processors for the queue but I've been unable to scale beyond 128 without everything grinding to a halt.

Maybe it's just getting algorithmically ugly.  To claim some job rows,
you have to skip all dead/non-matching tuples left behind so far at
the start of the table by all the other sessions, and then also all
currently locked tuples, and you have to do update-chain walks on some
of them too.  It all gets a bit explosive once you have such high
numbers of workers.

I think I'd experiment with splitting the job table up into N tables
and feed jobs into all of them about evenly (by hashing, at random,
whatever), and then I'd assign each consumer a "preferred" table where
it looks for jobs first (perhaps my_worker_id % ntables), before
trying the others in round robin order.  Then they won't trample on
each other's toes so much.

In the past I've wondered about a hypothetical circular_seqscan
option, which would cause table scans to start where they left off
last time in each backend, so SELECT * FROM t LIMIT 1 repeated would
show you a different row each time until you get all the way around to
the start again (as we're entirely within our rights to do for a query
with no ORDER BY).  That'd give the system a chance to vacuum and
start refilling the start of the table before you get around to it
again, instead of repeatedly having to step over the same useless
pages every time you need a new job.  Combined with the N tables
thing, you'd be approaching a sweet spot for contention and dead tuple
avoidance.  The synchronized_seqscans setting is related to this idea,
but more expensive, different, and probably not useful.

Hmm.  I guess another way to avoid colliding with others' work would
be to try to use SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR
UPDATE SKIP LOCKED LIMIT ....  It's less cache-friendly, and less
order-preserving, but way more contention-friendly.  That has another
complication though; how do you pick 10?  And if it doesn't return any
or enough rows, it doesn't mean there isn't enough, so you may need to
be ready to fall back to the plain approach if having 250 rows is
really important to you and TABLESAMPLE doesn't give you enough.  Or
something.

By the way, when working with around 64 consumer processes I was also
annoyed by the thundering herd problem when using NOTIFY.  I found
various workaround solutions to that, but ultimately I think we need
more precise wakeups for that sort of thing, which I hope to revisit
one day.


Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie
Merging a couple of emails:

On 20-Aug.-2020 19:37, David Rowley wrote:

> When updates occur in a non-partitioned table we can follow item
> pointer chains to find the live row and check if the WHERE clause
> still matches to determine if the row should be updated, or in this
> case just locked since it's a SELECT FOR UPDATE. However, with
> partitioned table, a concurrent UPDATE may have caused the row to have
> been moved off to another partition, in which case the tuple's item
> pointer cannot point to it since we don't have enough address space,
> we only have 6 bytes for a TID. To get around the fact that we can't
> follow these update chains, we just throw the serialization error,
> which is what you're getting. Ideally, we'd figure out where the live
> version of the tuple is and check if it matches the WHERE clause and
> lock it if it does, but we've no means to do that with the current
> design.
This is the absolute best description of what causes the "tuple to be
locked was already moved to another partition due to concurrent update"
message I have ever seen.  It totally makes sense why this happens given
your explanation.  Thank you for giving the detail.

I am backing off the query with a time delay and then retrying and that
seems to be the correct approach as well, but I only stumbled upon that
by accident.  Hopefully when the google gnomes index the mailing list
this message will come up to save others time and worry about the message.

On 21-Aug.-2020 00:34, Thomas Munro wrote:
>
> Maybe it's just getting algorithmically ugly.  To claim some job rows,
> you have to skip all dead/non-matching tuples left behind so far at
> the start of the table by all the other sessions, and then also all
> currently locked tuples, and you have to do update-chain walks on some
> of them too.  It all gets a bit explosive once you have such high
> numbers of workers.
Yes, fundamentally it seems to come down to traffic volume.  When I get
over 128 connections all selecting and locking that one table, applying
the locks seems to struggle and the problem grows exponentially.  I'm an
extreme edge case, so not really a scenario that locking could ever have
been expected to handle but it's pretty good that it gets up to 128 at all.
> I think I'd experiment with splitting the job table up into N tables
> and feed jobs into all of them about evenly (by hashing, at random,
> whatever), and then I'd assign each consumer a "preferred" table where
> it looks for jobs first (perhaps my_worker_id % ntables), before
> trying the others in round robin order.  Then they won't trample on
> each other's toes so much.
I think this is a good idea, but (in my case) I think this is where it
will need v13 which is going to give that via "Allow ROW values to be
used as partitioning expressions" ? (e.g. will v13 permit queueid mod
200 as the partition expression to make 200 partitions to allow 200
[contention free] consumers?).
> In the past I've wondered about a hypothetical circular_seqscan
> option, which would cause table scans to start where they left off
> last time in each backend, so SELECT * FROM t LIMIT 1 repeated would
> show you a different row each time until you get all the way around to
> the start again (as we're entirely within our rights to do for a query
> with no ORDER BY).  That'd give the system a chance to vacuum and
> start refilling the start of the table before you get around to it
> again, instead of repeatedly having to step over the same useless
> pages every time you need a new job.
I like the sound of this; if I understand correctly, it would
essentially walk in insertion(-ish) order which would be OK for me. As
long as it was documented clearly; perhaps I should put a page online
about high traffic message queues with PostgreSQL for people to find
when they try the same thing.

>
> Hmm.  I guess another way to avoid colliding with others' work would
> be to try to use SELECT * FROM t TABLESAMPLE SYSTEM (10) WHERE ... FOR
> UPDATE SKIP LOCKED LIMIT ....  It's less cache-friendly, and less
> order-preserving, but way more contention-friendly.  That has another
> complication though; how do you pick 10?  And if it doesn't return any
> or enough rows, it doesn't mean there isn't enough, so you may need to
> be ready to fall back to the plain approach if having 250 rows is
> really important to you and TABLESAMPLE doesn't give you enough.  Or
> something.

For me, this is an acceptable compromise.  I just need to consume
something on each pass and up to 250 items but getting something less
than 250 would still allow progress and if it removed wait time, could
even be an overall greater throughput.  I may try this just to see how
it performs.

If someone has a strictly ordered queue, they will never really have
this issue as they must start at the beginning and go sequentially using
only 1 consumer.  Because I only need loose/approximate ordering and
throughput is the objective, all the locking and contention comes into play.

> By the way, when working with around 64 consumer processes I was also
> annoyed by the thundering herd problem when using NOTIFY.  I found
> various workaround solutions to that, but ultimately I think we need
> more precise wakeups for that sort of thing, which I hope to revisit
> one day.

I'm making a note of this because I also happen to have a different
scenario which has NOTIFY with well over 100 LISTEN consumers...  That's
not given me problems - yet - but I'm now aware of this should problems
arise.



Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jeff Janes
In reply to this post by Jim Jarvie
On Tue, Aug 18, 2020 at 8:22 PM Jim Jarvie <[hidden email]> wrote:

I've tuned the LIMIT value both up and down.  As I move the limit up, the problem becomes substantially worse; 300 swamps it and the selects take > 1 hour to complete; at 600 they just all lock everything up and it stops processing.  I did try 1,000 but it basically resulted in nothing being processed.

You've only described what happens when you turn the LIMIT up.  What happens when you turn it down?  Why did you pick 250 in the first place?  I don't see the rationale for having 250*256 rows locked simultaneously.  I can see reasons you might want a LIMIT as high as 250, or for having 256 processes.  I just don't see why you would want to do both in the same system.

Less processes does not give the throughput required because the queue sends data elsewhre which has a long round trip time but does permit over 1K concurrent connections as their work-round for throughput.  I'm stuck having to scale up my concurrent processes in order to compensate for the long processing time of an individual queue item.

You've tied the database concurrency to the external process concurrency.  While this might be convenient, there is no reason to think it will be optimal.  If you achieve concurrency by having 256 processes, why does each process need to lock 250 rows at  time.  Having 64,000 rows locked to obtain 256-fold concurrency seems like a poor design.

With modern tools it should not be too hard to have just one process obtain 1000 rows, and launch 1000 concurrent external tasks.  Either with threads (making sure only one thread deals with the database), or with asynchronous operations.  (Then the problem would be how to harvest the results, it couldn't unlock the rows until all external tasks have finished, which would be a problem if some took much longer than others).

It is easy to reproduce scaling problems when you have a large number of processes trying to do ORDER BY id LIMIT 250 FOR UPDATE SKIP LOCKED without all the partitioning and stuff.  I don't know if the problems are as severe as you describe with your very elaborate setup--or even if they have the same bottleneck.  But in the simple case, there seems to be a lot of spin-lock contention, as every selecting query needs to figure out if every marked-as-locked row is truly locked, by asking if the apparently-locking transaction is still valid.

Cheers,

Jeff

    
Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Gunter
In reply to this post by Jim Jarvie
Hi all, especially Jim Jarvie, I saw your email to me only now on my
related issue. My issue remains this one:

> Well this got interesting  - the already moved error showed up:
and I have already gone through all those index pruning and all that
good stuff.

I remain with my original question from 30th of June, to me it feels
like a bug of some sort:

> "tuple to be locked was already moved to another partition due to
> concurrent update"
>
> This would not exactly look like a bug, because the message says "to
> be locked", so at least it's not allowing two workers to lock the same
> tuple. But it seems that the skip-locked mode should not make an error
> out of this, but treat it as the tuple was already locked. Why would
> it want to lock the tuple (representing the job) if another worker has
> already finished his UPDATE of the job to mark it as "done" (which is
> what makes the tuple move to the "completed" partition.)
>
> Either the SELECT for jobs to do returned a wrong tuple, which was
> already updated, or there is some lapse in the locking.
>
> Either way it would seem to be a waste of time throwing all these
> errors when the tuple should not even have been selected for update
> and locking.
>
> I wonder if anybody knows anything about that issue? Of course you'll
> want to see the DDL and SQL queries, etc. but you can't really try it
> out unless you do some massively parallel magic.

I still think that it should simply not happen. Don't waste time on old
tuples trying to fetch and lock something that's no longer there. It's a
waste of resources.

regards,
-Gunther

On 8/20/2020 6:39 PM, Jim Jarvie wrote:

>
>
> On 20-Aug.-2020 17:42, Michael Lewis wrote:
>> Can you share an explain analyze for the query that does the select for
>> update? I wouldn't assume that partition pruning is possible at all with
>> hash, and it would be interesting to see how it is finding those rows.
>
> Well this got interesting  - the already moved error showed up:  Note,
> the actual process partitions are regular table partitions, these are
> not hashed.  Only the incoming and completed are hashed due to row
> counts at either end of the processing; in flight (where the issue
> shows up) is quite small:
>
> [queuedb] # explain analyze select queueid,txobject,objectid,state
> from mq.queue where (state = 'tx_active' or state='tx_fail_retryable')
> and txobject = 'ticket' limit 250 for update skip locked;
> ERROR:  40001: tuple to be locked was already moved to another
> partition due to concurrent update
> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
> Time: 579.131 ms
> [queuedb] # explain analyze select queueid,txobject,objectid,state
> from mq.queue where (state = 'tx_active' or state='tx_fail_retryable')
> and txobject = 'ticket' limit 250 for update skip locked;
> ERROR:  40001: tuple to be locked was already moved to another
> partition due to concurrent update
> LOCATION:  heapam_tuple_lock, heapam_handler.c:405
> Time: 568.008 ms
> [queuedb] # explain analyze select queueid,txobject,objectid,state
> from mq.queue where (state = 'tx_active' or state='tx_fail_retryable')
> and txobject = 'ticket' limit 250 for update skip locked;
>         QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..25.71 rows=250 width=34) (actual
> time=1306.041..1306.338 rows=250 loops=1)
>    ->  LockRows  (cost=0.00..7934.38 rows=77150 width=34) (actual
> time=1306.040..1306.315 rows=250 loops=1)
>          ->  Append  (cost=0.00..7162.88 rows=77150 width=34) (actual
> time=520.685..1148.347 rows=31500 loops=1)
>                ->  Seq Scan on queue_tx_active  (cost=0.00..6764.50
> rows=77000 width=34) (actual time=520.683..1145.258 rows=31500 loops=1)
>                      Filter: ((txobject = 'ticket'::mq.queue_object)
> AND ((state = 'tx_active'::mq.tx_state) OR (state =
> 'tx_fail_retryable'::mq.tx_state)))
>                ->  Seq Scan on queue_tx_fail_retryable
>  (cost=0.00..12.62 rows=150 width=34) (never executed)
>                      Filter: ((txobject = 'ticket'::mq.queue_object)
> AND ((state = 'tx_active'::mq.tx_state) OR (state =
> 'tx_fail_retryable'::mq.tx_state)))
>  Planning Time: 0.274 ms
>  Execution Time: 1306.380 ms
> (9 rows)
>
> Time: 1317.150 ms (00:01.317)
> [queuedb] #
>


Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

Jim Jarvie

Hi Gunther

On 07-Sep.-2020 14:04, [hidden email] wrote:
Hi all, especially Jim Jarvie, I saw your email to me only now on my related issue. My issue remains this one:
---8<---
I remain with my original question from 30th of June, to me it feels like a bug of some sort:

"tuple to be locked was already moved to another partition due to concurrent update"

---8<---
I still think that it should simply not happen. Don't waste time on old tuples trying to fetch and lock something that's no longer there. It's a waste of resources.

I'm inclined to agree that the error seems to indicate PostgreSQL knows the row was locked & migrated, so attempting to lock it should not really result in the error when SKIP LOCKED is set (but it should behave as it does when there is no SKIP LOCKED).  For the SKIP LOCKED case, it should treat the migrated as being exactly the same as already locked.

I think this is an edge case on the SKIP LOCKED that is not handled as it should be.

Do others agree?

Jim



regards,
-Gunther

Reply | Threaded
Open this post in threaded view
|

Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

David Rowley
In reply to this post by Gunter
On Tue, 8 Sep 2020 at 06:05, Raj <[hidden email]> wrote:
>
> > This would not exactly look like a bug, because the message says "to
> > be locked", so at least it's not allowing two workers to lock the same
> > tuple. But it seems that the skip-locked mode should not make an error
> > out of this, but treat it as the tuple was already locked. Why would
> > it want to lock the tuple (representing the job) if another worker has
> > already finished his UPDATE of the job to mark it as "done" (which is
> > what makes the tuple move to the "completed" partition.)

(It's not very clear who wrote the above text since the quote does not
mention who the author is and the original email didn't appear to have
made it to the list)

It's not a bug. I think the quoted text is expecting a bit too much
from the database. It does not know that if the tuple is updated and
moved to another partition that it can be safely ignored.   For all
the database knows, the new version of the tuple that's in the new
partition still matches the query's WHERE clause and should be locked.
If we just go and ignore moved off tuples then we could miss
processing tuples that still need to be processed.

It's perhaps not impossible to make it work slightly better if it were
somehow possible to inform heapam_tuple_lock() that it's operating on
a partition and the query queried a partitioned table and that all but
1 partition was pruned with partition pruning.  In this case we could
be certain the new verison of the tuple can't match the WHERE clause
of the SELECT since partition pruning determined that all other
partitions don't match the WHERE clause. However, that's:

a) a pretty horrid thing to have to teach heapam_tuple_lock() about, and;
b) only going to work when 1 partition survives partition pruning,
which is pretty horrible since doing ATTACH PARTITION could suddenly
cause your queries to fail randomly.

If you had 3 partitions, one for "pending", "retry" and "complete",
and you wanted to lock all rows that are in a "pending" or "retry"
state, then when we encounter an updated row in the "pending"
partition, we have no knowledge if it was moved into the "retry" or
the "completed" partition.   If it's in "retry", then we do want to
find it and process it, but if it's in "completed", then it does not
match the WHERE clause of the query and we can ignore it.  Since we
don't know which, we can't make assumptions and must force the user to
try again, hence the serialisation failure error.

> > Either the SELECT for jobs to do returned a wrong tuple, which was
> > already updated, or there is some lapse in the locking.
> >
> > Either way it would seem to be a waste of time throwing all these
> > errors when the tuple should not even have been selected for update
> > and locking.
> >
> > I wonder if anybody knows anything about that issue? Of course you'll
> > want to see the DDL and SQL queries, etc. but you can't really try it
> > out unless you do some massively parallel magic.

I ready mentioned why this cannot work that way [1].  If you have some
idea on how to make it work correctly, then it would be interesting to
hear. Otherwise, I'm sorry to say that we can't just ignore these
tuples because it happens to suit your use case.

The solution is just to make the application retry on serialisation failures.

David

[1] https://www.postgresql.org/message-id/CAApHDvrDH6TQeLxTqnnAnhjrs55ru5g2_QMG=ME+WvD5MmpHQg@...


12