How to avoid UPDATE performance degradation in a transaction

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

How to avoid UPDATE performance degradation in a transaction

Karl Düüna
Hi

I recently came across a performance problem with a big transaction block, which doesn't make sense to me and hopefully someone more knowledgeable can explain the reasons and point out a direction for a solution.

-- TL; DR;

UPDATE on a row takes relatively constant amount of time outside a transaction block, but running UPDATE on a single row over and over inside a transaction gets slower and slower as the number of UPDATE operations increases.

Why is updating the same row large number of times progressively slower inside a transaction? And is there a way to avoid this performance degradation?

I set up a POC repository to demonstrate the problem: https://github.com/DeadAlready/pg-test

-- Backstory

Needed to run a large block of operations (a mix of inserts and updates) on a table. It took a considerable amount of time inside a transaction and was about 10x faster without the transaction. Since I need all the operations to run as a single block that can be rolled back this was unsatisfactory. Thus began my quest to locate the problem. Since the actual data structure is complex and involves a bunch of triggers, foreign keys etc it took some time to narrow down, but in the end I found that the structure itself is irrelevant. The issue occurs even if you have a single two column table with a handful of rows. The only requirement seems to be that the NR of UPDATEs per single row is large. While the update performance inside a transaction starts out faster than outside, the performance starts to degrade from the get go. It really isn't noticeable until about 5k UPDATEs on a single row. At around 100k UPDATEs it is about 2.5x slower than the same operation outside the transaction block and about 4x slower than at the beginning of the transaction.

Thanks,
Karl
Reply | Threaded
Open this post in threaded view
|

Re: How to avoid UPDATE performance degradation in a transaction

David G Johnston
On Thu, Feb 13, 2020 at 1:42 PM Karl Düüna <[hidden email]> wrote:
It really isn't noticeable until about 5k UPDATEs on a single row.

Don't know why, and never dealt with a scenario where this would even come up, but that this doesn't perform well inside a transaction isn't surprising to me.  Kinda surprised it works well at all actually.  I'd probably try and rework the processing algorithm to create an unlogged temporary table with data from the row's initial state, manipulate until my heart's content, then take the final result and update the single live row with the final state.

David J.

Reply | Threaded
Open this post in threaded view
|

Re: How to avoid UPDATE performance degradation in a transaction

Tom Lane-2
In reply to this post by Karl Düüna
=?UTF-8?B?S2FybCBEw7zDvG5h?= <[hidden email]> writes:
> -- TL; DR;
> UPDATE on a row takes relatively constant amount of time outside a
> transaction block, but running UPDATE on a single row over and over inside
> a transaction gets slower and slower as the number of UPDATE operations
> increases.

Yeah, that's unsurprising.  Each new update creates a new version of
its row.  When you do them in separate transactions, then as soon as
transaction N+1 commits the system can recognize that the row version
created by transaction N is dead (no longer visible to anybody) and
recycle it, allowing the number of row versions present on-disk to
stay more or less constant.  However, there's not equivalently good
housekeeping for row versions created by a transaction that's still
running.  So when you do N updates in one transaction, there are going
to be N doomed-but-not-yet-recyclable row versions on disk.

Aside from the disk-space bloat, this is bad because the later updates
have to scan through all the row versions created by earlier updates,
looking for the version they're supposed to update.  So you have an O(N^2)
cost associated with that, which no doubt is what you're observing.

There isn't any really good fix for this, other than "don't do that".
David's nearby suggestion of using a temp table won't help, because
this behavior is the same whether the table is temp or regular.

In principle perhaps we could improve the granularity of dead-row
detection, so that if a row version is both created and deleted by
the current transaction, and we have no live snapshots that could
see it, we could go ahead and mark the row dead.  But it's not clear
that that'd be worth the extra cost to do.  Certainly no existing PG
release tries to do it.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: How to avoid UPDATE performance degradation in a transaction

Karl Düüna
Thank you for the explanation.

That is pretty much what I suspected, but I held out hope that there is some functionality I could use to clear the bloat as the transaction progresses and bring the UPDATE time back down again.
"dont do that" is sensible, but much more easily said than done, as the in the actual use case I have, the single row updates are caused by various triggers running on separate operations - 
which means I will have to muck about with conditional trigger disabling and/or change a large part of the database logic around these tables.
But I guess that is a whole other issue.

Anyways thank you for the time and explanation,
Karl


On Thu, 13 Feb 2020 at 23:16, Tom Lane <[hidden email]> wrote:
=?UTF-8?B?S2FybCBEw7zDvG5h?= <[hidden email]> writes:
> -- TL; DR;
> UPDATE on a row takes relatively constant amount of time outside a
> transaction block, but running UPDATE on a single row over and over inside
> a transaction gets slower and slower as the number of UPDATE operations
> increases.

Yeah, that's unsurprising.  Each new update creates a new version of
its row.  When you do them in separate transactions, then as soon as
transaction N+1 commits the system can recognize that the row version
created by transaction N is dead (no longer visible to anybody) and
recycle it, allowing the number of row versions present on-disk to
stay more or less constant.  However, there's not equivalently good
housekeeping for row versions created by a transaction that's still
running.  So when you do N updates in one transaction, there are going
to be N doomed-but-not-yet-recyclable row versions on disk.

Aside from the disk-space bloat, this is bad because the later updates
have to scan through all the row versions created by earlier updates,
looking for the version they're supposed to update.  So you have an O(N^2)
cost associated with that, which no doubt is what you're observing.

There isn't any really good fix for this, other than "don't do that".
David's nearby suggestion of using a temp table won't help, because
this behavior is the same whether the table is temp or regular.

In principle perhaps we could improve the granularity of dead-row
detection, so that if a row version is both created and deleted by
the current transaction, and we have no live snapshots that could
see it, we could go ahead and mark the row dead.  But it's not clear
that that'd be worth the extra cost to do.  Certainly no existing PG
release tries to do it.

                        regards, tom lane
Reply | Threaded
Open this post in threaded view
|

Re: How to avoid UPDATE performance degradation in a transaction

Michael Lewis
If your trigger is supposed to change certain fields, you could return OLD instead of NEW if those fields have not been changed by the trigger. You could also check an updated_on timestamp field to verify if the row has already been modified and potentially skip the trigger altogether. Just a couple thoughts to avoid the bloat.
Reply | Threaded
Open this post in threaded view
|

Re: How to avoid UPDATE performance degradation in a transaction

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2020-02-13 16:16:14 -0500, Tom Lane wrote:
> In principle perhaps we could improve the granularity of dead-row
> detection, so that if a row version is both created and deleted by
> the current transaction, and we have no live snapshots that could
> see it, we could go ahead and mark the row dead.  But it's not clear
> that that'd be worth the extra cost to do.  Certainly no existing PG
> release tries to do it.

I've repeatedly wondered about improving our logic around this. There's
a lot of cases where we deal with a lot of bloat solely because our
simplistic liveliness analysis.

It's not just within a single transaction, but also makes the impact of
longrunning transactions significantly worse. It's common to have
"areas" of some tables that change quickly, without normally causing a
lot of problems - but once there is a single longrunning transaction the
amount of bloat created is huge. It's not that bad to have the "hot
areas" increased in size by 2-3x, but right now it'll often be several
orders of magnitude.

But perhaps it doesn't make sense to conflate your suggestion above with
what I brought up: There'd might not be a lot of common
code/infrastructure between deleting row versions that are invisible due
to no backend having a snapshot to see them (presumably inferred via
xmin/xmax), and newly created row versions within a transaction that are
invisible because there's no snapshot with that cid.

Greetings,

Andres Freund