PATCH: logical_work_mem and logical streaming of large in-progress transactions


PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
Hi all,

Attached is a patch series that adds two features to logical
replication - the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.


PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.

Secondly, it adds simple memory accounting, tracking the amount of
memory used in total (for the whole reorder buffer, to compare against
logical_work_mem) and per transaction (so that we can quickly pick a
transaction to spill to disk).
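
To illustrate the accounting idea, here is a minimal standalone sketch
(made-up names, not the actual patch code): adding or removing a change
updates two counters, one per transaction and one for the whole buffer,
so both the limit check and picking a transaction to spill stay cheap.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical, simplified structures - not the actual ReorderBuffer code. */
    typedef struct TxnAccounting
    {
        size_t      size;           /* memory used by this transaction's changes */
    } TxnAccounting;

    typedef struct BufferAccounting
    {
        size_t      total_size;     /* memory used by all transactions together */
        size_t      limit;          /* logical_work_mem, in bytes */
    } BufferAccounting;

    /* Update both counters whenever a change is added or removed. */
    static void
    update_accounting(BufferAccounting *buf, TxnAccounting *txn,
                      size_t change_size, bool addition)
    {
        if (addition)
        {
            txn->size += change_size;
            buf->total_size += change_size;
        }
        else
        {
            txn->size -= change_size;
            buf->total_size -= change_size;
        }
    }

    /* Checking the limit is then a single comparison on the global counter. */
    static bool
    exceeds_limit(const BufferAccounting *buf)
    {
        return buf->total_size > buf->limit;
    }

    int
    main(void)
    {
        BufferAccounting buf = {0, 4 * 1024 * 1024};    /* 4MB limit */
        TxnAccounting txn = {0};

        update_accounting(&buf, &txn, 100, true);       /* add a 100B change */
        printf("over limit: %d\n", exceeds_limit(&buf));
        return 0;
    }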

The one wrinkle in the patch is that the memory limit can't be enforced
when reading changes spilled to disk - with multiple subtransactions, we
can't easily predict how many changes to pre-read for each of them. At
that point we still use the existing max_changes_in_memory limit.

Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.


PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------

Note: This part is split into multiple smaller chunks, addressing
different parts of the logical decoding infrastructure. That's mostly to
allow easier reviews, though. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag,
for a couple of reasons. One reason is network bandwidth - while we do
decode the changes incrementally (as we read the WAL), we keep them
locally, either in memory, or spilled to files. Then at commit time, all
the changes get sent to the downstream (and applied) at the same time.
For large transactions the time to do the network transfer may be
significant, causing apply lag.

This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.

The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
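
To give a rough idea of the shape of that API (the names and signatures
below are invented for illustration, they need not match the patch), the
set of callbacks would grow along these lines:

    #include <stddef.h>

    /* Hypothetical callback signatures - illustrative only, not the patch's API. */
    typedef unsigned int XidType;

    typedef struct StreamingCallbacks
    {
        /* demarcate a block of streamed changes of one toplevel transaction */
        void        (*stream_start) (void *ctx, XidType toplevel_xid);
        void        (*stream_stop) (void *ctx, XidType toplevel_xid);

        /* stream an individual change or logical message */
        void        (*stream_change) (void *ctx, XidType toplevel_xid,
                                      const void *change);
        void        (*stream_message) (void *ctx, XidType toplevel_xid,
                                       const char *prefix, size_t sz,
                                       const char *payload);

        /* commit a streamed transaction, or abort a streamed (sub)transaction */
        void        (*stream_commit) (void *ctx, XidType toplevel_xid);
        void        (*stream_abort) (void *ctx, XidType xid, XidType toplevel_xid);
    } StreamingCallbacks;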

To decode a transaction, we need to know all its subtransactions, and
invalidations. Currently, those are only known at commit time (although
some assignments may be known earlier), but invalidations are only ever
written in the commit record.

So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).

But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).

For assignments that was fairly simple, thanks to existing caching. For
invalidations, it requires a new WAL record type and a couple of changes
in inval.c.

On the apply side, we simply receive the streamed changes and write them
into a file (one file per toplevel transaction, which is possible thanks
to the assignments being known immediately). Then at commit time the
changes are replayed locally, without having to copy a large chunk of
data over the network.
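
Roughly, the apply side behaves like the sketch below (the file naming
and helper functions are hypothetical, just to show the flow): streamed
changes are appended to a single per-transaction file as they arrive,
and only replayed once the commit arrives.

    #include <stdio.h>

    /* Hypothetical helpers - not the actual apply worker code. */
    static FILE *
    open_stream_file(unsigned int subid, unsigned int toplevel_xid, const char *mode)
    {
        char        path[64];

        /* one file per (subscription, toplevel transaction) pair */
        snprintf(path, sizeof(path), "stream-%u-%u.changes", subid, toplevel_xid);
        return fopen(path, mode);
    }

    /* On receiving a streamed change, just append it to the file. */
    static void
    handle_stream_change(unsigned int subid, unsigned int xid,
                         const char *change, size_t len)
    {
        FILE       *f = open_stream_file(subid, xid, "ab");

        if (f == NULL)
            return;                 /* error handling elided */
        fwrite(change, 1, len, f);
        fclose(f);
    }

    /* On commit, read the file back and apply the changes locally. */
    static void
    handle_stream_commit(unsigned int subid, unsigned int xid)
    {
        FILE       *f = open_stream_file(subid, xid, "rb");

        if (f == NULL)
            return;
        /* replay_changes_from(f);  - apply each buffered change locally */
        fclose(f);
        /* the file can be removed afterwards */
    }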


WAL overhead
------------

Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.

By an "extreme corner case" I mean a workloads intentionally triggering
many assignments/invalidations, without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a
transaction like this one:

    BEGIN;
    UPDATE t SET v = v + 1;
    SAVEPOINT s1;
    UPDATE t SET v = v + 1;
    SAVEPOINT s2;
    UPDATE t SET v = v + 1;
    SAVEPOINT s3;
    ...
    UPDATE t SET v = v + 1;
    SAVEPOINT s10;
    UPDATE t SET v = v + 1;
    COMMIT;

For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.

For more realistic workloads (large table with indexes, runs long enough
to generate FPIs, etc.) the overhead drops below 5%, which is much more
acceptable, of course, although not perfect.

In both cases, there was pretty much no measurable impact on performance
(as measured by tps).

I do not think there's a way around this requirement (having assignments
and invalidations), if we want to decode in-progress transactions. But
perhaps it would be possible to do some sort of caching (say, at command
level), to reduce the xlog record overhead? Not sure.

All ideas are welcome, of course. In the worst case, I think we can add
a GUC enabling this additional logging - when disabled, streaming of
in-progress transactions would not be possible.


Simplifying ReorderBuffer
-------------------------

One interesting consequence of having assignments is that we could get
rid of the ReorderBuffer iterator, used to merge changes from subxacts.
The assignments allow us to keep changes for each toplevel transaction
in a single list, in LSN order, and just walk it. Abort can be performed
by remembering the position of the first change in each subxact, and
just discarding the tail.

This is what the apply worker does with the streamed changes and aborts.

It would also allow us to enforce the memory limit while restoring
transactions spilled to disk, because we would not have the problem with
restoring changes for many subtransactions.
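
A minimal sketch of that scheme (simplified, hypothetical structures,
not the ReorderBuffer code itself): all changes of a toplevel transaction
live on one LSN-ordered list, the link where each subxact started is
remembered, and aborting a subxact just truncates the tail from there.

    #include <stdint.h>
    #include <stdlib.h>

    /* Simplified, hypothetical structures - illustrative only. */
    typedef struct Change
    {
        uint64_t    lsn;
        struct Change *next;
    } Change;

    typedef struct TopLevelTxn
    {
        Change     *head;           /* all changes, kept in LSN order */
        Change    **tail;           /* append position, initially &head */
    } TopLevelTxn;

    /* Append a change; thanks to the assignments, changes arrive in LSN order. */
    static void
    append_change(TopLevelTxn *txn, Change *c)
    {
        c->next = NULL;
        *txn->tail = c;
        txn->tail = &c->next;
    }

    /*
     * Abort a subtransaction: 'subxact_start' is the link remembered when the
     * subxact's first change arrived. Everything from there to the end of the
     * list belongs to the aborted subxact (or its children), so free it and
     * truncate the list.
     */
    static void
    abort_subxact(TopLevelTxn *txn, Change **subxact_start)
    {
        Change     *c = *subxact_start;

        while (c != NULL)
        {
            Change     *next = c->next;

            free(c);
            c = next;
        }
        *subxact_start = NULL;
        txn->tail = subxact_start;
    }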


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-me.patch.gz (9K)
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical.patch.gz (3K)
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz (6K)
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz (7K)
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz (14K)
0006-Add-support-for-streaming-to-built-in-replication.patch.gz (18K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Erik Rijkers
On 2017-12-23 05:57, Tomas Vondra wrote:
> Hi all,
>
> Attached is a patch series that implements two features to the logical
> replication - ability to define a memory limit for the reorderbuffer
> (responsible for building the decoded transactions), and ability to
> stream large in-progress transactions (exceeding the memory limit).
>

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion
to see what's going wrong...


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4


On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:

> On 2017-12-23 05:57, Tomas Vondra wrote:
>> Hi all,
>>
>> Attached is a patch series that implements two features to the logical
>> replication - ability to define a memory limit for the reorderbuffer
>> (responsible for building the decoded transactions), and ability to
>> stream large in-progress transactions (exceeding the memory limit).
>>
>
> logical replication of 2 instances is OK but 3 and up fail with:
>
> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
> "reorderbuffer.c", Line: 1773)
>
> I can cobble up a script but I hope you have enough from the assertion
> to see what's going wrong...

The assertion says that the iterator produces changes in an order that
does not correlate with LSN. But I have a hard time understanding how
that could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Attached is v2 of the patch series, fixing two bugs I discovered today.
I don't think either of these is related to your issue, though.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch.gz (9K)
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch.gz (3K)
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch.gz (6K)
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch.gz (7K)
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch.gz (14K)
0006-Add-support-for-streaming-to-built-in-replication-v2.patch.gz (18K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Erik Rijkers
On 2017-12-23 21:06, Tomas Vondra wrote:

> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>> Hi all,
>>>
>>> Attached is a patch series that implements two features to the
>>> logical
>>> replication - ability to define a memory limit for the reorderbuffer
>>> (responsible for building the decoded transactions), and ability to
>>> stream large in-progress transactions (exceeding the memory limit).
>>>
>>
>> logical replication of 2 instances is OK but 3 and up fail with:
>>
>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>> "reorderbuffer.c", Line: 1773)
>>
>> I can cobble up a script but I hope you have enough from the assertion
>> to see what's going wrong...
>
> The assertion says that the iterator produces changes in order that
> does
> not correlate with LSN. But I have a hard time understanding how that
> could happen, particularly because according to the line number this
> happens in ReorderBufferCommit(), i.e. the current (non-streaming)
> case.
>
> So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail.  I guess
it fails here 3 times out of 4.  But the laptop I'm using at the moment
is old and slow -- it may well be a factor as we've seen before [1].

Attached is the bash that I put together.  I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
often.  This same program run with HEAD never seems to fail (I tried a
few dozen times).

thanks,

Erik Rijkers


[1]
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl


test.sh (10K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4


On 12/23/2017 11:23 PM, Erik Rijkers wrote:

> On 2017-12-23 21:06, Tomas Vondra wrote:
>> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>>> Hi all,
>>>>
>>>> Attached is a patch series that implements two features to the logical
>>>> replication - ability to define a memory limit for the reorderbuffer
>>>> (responsible for building the decoded transactions), and ability to
>>>> stream large in-progress transactions (exceeding the memory limit).
>>>>
>>>
>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>
>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>> "reorderbuffer.c", Line: 1773)
>>>
>>> I can cobble up a script but I hope you have enough from the assertion
>>> to see what's going wrong...
>>
>> The assertion says that the iterator produces changes in order that does
>> not correlate with LSN. But I have a hard time understanding how that
>> could happen, particularly because according to the line number this
>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>
>> So instructions to reproduce the issue would be very helpful.
>
> Using:
>
> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>
> As you expected the problem is the same with these new patches.
>
> I have now tested more, and seen that it not always fails.  I guess that
> it here fails 3 times out of 4.  But the laptop I'm using at the moment
> is old and slow -- it may well be a factor as we've seen before [1].
>
> Attached is the bash that I put together.  I tested with
> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
> often.  This same program run with HEAD never seems to fail (I tried a
> few dozen times).
>

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Craig Ringer-3
In reply to this post by Tomas Vondra-4
On 23 December 2017 at 12:57, Tomas Vondra <[hidden email]> wrote:
> Hi all,
>
> Attached is a patch series that implements two features to the logical
> replication - ability to define a memory limit for the reorderbuffer
> (responsible for building the decoded transactions), and ability to
> stream large in-progress transactions (exceeding the memory limit).
>
> I'm submitting those two changes together, because one builds on the
> other, and it's beneficial to discuss them together.
>
>
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:
>
> * The value is hard-coded, so it's not quite possible to customize it.
>
> * The amount of decoded changes to keep in memory is restricted by
> number of changes. It's not very unclear how this relates to memory
> consumption, as the change size depends on table structure, etc.
>
> * The number is "per (sub)transaction", so a transaction with many
> subtransactions may easily consume significant amount of memory without
> actually hitting the limit.

Also, even without subtransactions, we assemble a ReorderBufferTXN per transaction. Since transactions usually occur concurrently, systems with many concurrent txns can face lots of memory use.

We can't exclude tables that won't actually be replicated at the reorder buffering phase either. So txns use memory whether or not they do anything interesting as far as a given logical decoding session is concerned. Even if we'll throw all the data away we must buffer and assemble it first so we can make that decision.

Because logical decoding considers snapshots and cid increments even from other DBs (at least when the txn makes catalog changes) the memory use can get BIG too. I was recently working with a system that had accumulated 2GB of snapshots ... on each slot. With 7 slots, one for each DB.

So there's lots of room for difficulty with unpredictable memory use.

> So the patch does two things. Firstly, it introduces logical_work_mem, a
> GUC restricting memory consumed by all transactions currently kept in
> the reorder buffer

Does this consider the (currently high, IIRC) overhead of tracking serialized changes?
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Erik Rijkers
In reply to this post by Tomas Vondra-4
>>>>
>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>
>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>> "reorderbuffer.c", Line: 1773)
>>>>
>>>> I can cobble up a script but I hope you have enough from the
>>>> assertion
>>>> to see what's going wrong...
>>>
>>> The assertion says that the iterator produces changes in order that
>>> does
>>> not correlate with LSN. But I have a hard time understanding how that
>>> could happen, particularly because according to the line number this
>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming)
>>> case.
>>>
>>> So instructions to reproduce the issue would be very helpful.
>>
>> Using:
>>
>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>
>> As you expected the problem is the same with these new patches.
>>
>> I have now tested more, and seen that it not always fails.  I guess
>> that
>> it here fails 3 times out of 4.  But the laptop I'm using at the
>> moment
>> is old and slow -- it may well be a factor as we've seen before [1].
>>
>> Attached is the bash that I put together.  I tested with
>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which
>> fails
>> often.  This same program run with HEAD never seems to fail (I tried a
>> few dozen times).
>>
>
> Thanks. Unfortunately I still can't reproduce the issue. I even tried
> running it in valgrind, to see if there are some memory access issues
> (which should also slow it down significantly).

One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)

Another Good Thing would be if there was a provision in the buildfarm to
test patches like these.

But I'm probably not the first one to suggest that; no doubt it'll be
possible someday.  In the meantime I'll try to repeat this crash on
other machines (but that will be after the holidays).


Erik Rijkers


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by Craig Ringer-3


On 12/24/2017 05:51 AM, Craig Ringer wrote:

> On 23 December 2017 at 12:57, Tomas Vondra <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi all,
>
>     Attached is a patch series that implements two features to the logical
>     replication - ability to define a memory limit for the reorderbuffer
>     (responsible for building the decoded transactions), and ability to
>     stream large in-progress transactions (exceeding the memory limit).
>
>     I'm submitting those two changes together, because one builds on the
>     other, and it's beneficial to discuss them together.
>
>
>     PART 1: adding logical_work_mem memory limit (0001)
>     ---------------------------------------------------
>
>     Currently, limiting the amount of memory consumed by logical decoding is
>     tricky (or you might say impossible) for several reasons:
>
>     * The value is hard-coded, so it's not quite possible to customize it.
>
>     * The amount of decoded changes to keep in memory is restricted by
>     number of changes. It's not very unclear how this relates to memory
>     consumption, as the change size depends on table structure, etc.
>
>     * The number is "per (sub)transaction", so a transaction with many
>     subtransactions may easily consume significant amount of memory without
>     actually hitting the limit.
>
>
> Also, even without subtransactions, we assemble a ReorderBufferTXN
> per transaction. Since transactions usually occur concurrently,
> systems with many concurrent txns can face lots of memory use.
>

I don't see how that could be a problem, considering the number of
toplevel transactions is rather limited (to max_connections or so).

> We can't exclude tables that won't actually be replicated at the reorder
> buffering phase either. So txns use memory whether or not they do
> anything interesting as far as a given logical decoding session is
> concerned. Even if we'll throw all the data away we must buffer and
> assemble it first so we can make that decision.

Yep.

> Because logical decoding considers snapshots and cid increments even
> from other DBs (at least when the txn makes catalog changes) the memory
> use can get BIG too. I was recently working with a system that had
> accumulated 2GB of snapshots ... on each slot. With 7 slots, one for
> each DB.
>
> So there's lots of room for difficulty with unpredictable memory use.
>

Yep.

>     So the patch does two things. Firstly, it introduces logical_work_mem, a
>     GUC restricting memory consumed by all transactions currently kept in
>     the reorder buffer
>
>
> Does this consider the (currently high, IIRC) overhead of tracking
> serialized changes?
>  

Consider in what sense?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by Erik Rijkers


On 12/24/2017 10:00 AM, Erik Rijkers wrote:

>>>>>
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>>> "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion
>>>>> to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in order that
>>>> does
>>>> not correlate with LSN. But I have a hard time understanding how that
>>>> could happen, particularly because according to the line number this
>>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming)
>>>> case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it not always fails.  I guess that
>>> it here fails 3 times out of 4.  But the laptop I'm using at the moment
>>> is old and slow -- it may well be a factor as we've seen before [1].
>>>
>>> Attached is the bash that I put together.  I tested with
>>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>>> often.  This same program run with HEAD never seems to fail (I tried a
>>> few dozen times).
>>>
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried
>> running it in valgrind, to see if there are some memory access issues
>> (which should also slow it down significantly).
>
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
>
Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.

Turns out the asserts are overly strict - instead of

  Assert(prev_lsn < current_lsn);

it should have been

  Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, using filenames based
on the subscription OID and the XID of the toplevel transaction. That's
fine, as long as there's just a single replica running - if there are
more, the filenames will clash, causing really strange failures. So the
patch now moves the files to base/pgsql_tmp, where regular temporary
files are written. I'm not claiming this is perfect; perhaps we need to
invent another location.

FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v3.patch.gz (9K)
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v3.patch.gz (3K)
0003-Issue-individual-invalidations-with-wal_level-log-v3.patch.gz (6K)
0004-Extend-the-output-plugin-API-with-stream-methods-v3.patch.gz (7K)
0005-Implement-streaming-mode-in-ReorderBuffer-v3.patch.gz (14K)
0006-Add-support-for-streaming-to-built-in-replication-v3.patch.gz (19K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Erik Rijkers
That indeed fixed the problem: running that same pgbench test, I see no
crashes anymore (on any of 3 different machines, and with several
pgbench parameters).

Thank you,

Erik Rijkers


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Dmitry Dolgov
> On 25 December 2017 at 18:40, Tomas Vondra <[hidden email]> wrote:
> The attached v3 fixes this issue, and also a couple of other thinkos

Thank you for the patch, it looks quite interesting. After a quick look at it
(mostly the first one so far, but I'm going to continue) I have a few questions:

> + * XXX With many subtransactions this might be quite slow, because we'll have
> + * to walk through all of them. There are some options how we could improve
> + * that: (a) maintain some secondary structure with transactions sorted by
> + * amount of changes, (b) not looking for the entirely largest transaction,
> + * but e.g. for transaction using at least some fraction of the memory limit,
> + * and (c) evicting multiple transactions at once, e.g. to free a given portion
> + * of the memory limit (e.g. 50%).

Do you want to address these possible alternatives somehow in this patch,
or do you want to leave them outside? Maybe it makes sense to apply some
combination of them, e.g. maintain a secondary structure with relatively
large transactions, and then start evicting them. If that's somehow not
enough, then start to evict multiple transactions at once (option "c").

> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
> +

I'm not sure what the recommended practice here is, but maybe it makes
sense to emit a warning when clamping this value to 64kB? Otherwise it
can be unexpected.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
In reply to this post by Tomas Vondra-4
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------

The documentation in this patch contains some references to later
features (streaming).  Perhaps that could be separated so that the
patches can be applied independently.

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could then
have undesirable side effects on this use.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem.  Maybe we could think of another name.

I think we need a way to report on how much memory is actually used, so
the setting can be tuned.  Something analogous to log_temp_files perhaps.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4


On 01/02/2018 04:07 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>
> The documentation in this patch contains some references to later
> features (streaming).  Perhaps that could be separated so that the
> patches can be applied independently.
>

Yeah, that's probably a good idea. But now that you mention it, I wonder
if "streaming" is really a good term. We already use it for "streaming
replication" and it may be quite confusing to use it for another feature
(particularly when it's streaming within logical streaming replication).

But I can't really think of a better name ...

> I don't see the need to tie this setting to maintenance_work_mem.
> maintenance_work_mem is often set to very large values, which could
> then have undesirable side effects on this use.
>

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer, which seems
closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

> Moreover, the name logical_work_mem makes it sound like it's a logical
> version of work_mem.  Maybe we could think of another name.
>

I won't object to a better name, of course. Any proposals?

> I think we need a way to report on how much memory is actually used,
> so the setting can be tuned. Something analogous to log_temp_files
> perhaps.
>

Yes, I agree. I'm just about to submit an updated version of the patch
series, which also introduces a few columns into pg_stat_replication,
tracking this type of stats (amount of data spilled to disk or streamed,
etc.).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by Tomas Vondra-4
Hi,

attached is v4 of the patch series, with a couple of changes:

1) Fixes a bunch of bugs I discovered during stress testing.

I'm not going to go into details, but the main fixes are related to
properly updating progress from the worker, and not streaming when
creating the logical replication slot.

2) Introduces columns into pg_stat_replication.

The new columns track various kinds of statistics (number of xacts,
bytes, ...) about spill-to-disk/streaming. This will be useful when
tuning the GUC memory limit.

3) Two temporary bugfixes that make the patch series work.

The first one (0008) makes sure is_known_subxact is set properly for all
subtransactions, and there's a separate fix in the CF. So this will
eventually go away.

The second one (0009) fixes an issue that is specific to streaming. It
does fix the issue, but I need a bit more time to think about it before
merging it into 0005.

This does pass extensive stress testing with a workload mixing DML, DDL,
subtransactions, aborts, etc. under valgrind. I'm working on extending
the test coverage, and introducing various error conditions (e.g.
walsender/walreceiver timeouts, failures on both ends, etc.).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v4.patch.gz (13K)
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v4.patch.gz (3K)
0003-Issue-individual-invalidations-with-wal_level-log-v4.patch.gz (6K)
0004-Extend-the-output-plugin-API-with-stream-methods-v4.patch.gz (7K)
0005-Implement-streaming-mode-in-ReorderBuffer-v4.patch.gz (14K)
0006-Add-support-for-streaming-to-built-in-replication-v4.patch.gz (25K)
0007-Track-statistics-for-streaming-spilling-v4.patch.gz (5K)
0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as-v4.patch.gz (784 bytes)
0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup-v4.patch.gz (862 bytes)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4


On 01/03/2018 09:06 PM, Tomas Vondra wrote:

> Hi,
>
> attached is v4 of the patch series, with a couple of changes:
>
> 1) Fixes a bunch of bugs I discovered during stress testing.
>
> I'm not going to go into details, but the main fixes are related to
> properly updating progress from the worker, and not streaming when
> creating the logical replication slot.
>
> 2) Introduces columns into pg_stat_replication.
>
> The new columns track various kinds of statistics (number of xacts,
> bytes, ...) about spill-to-disk/streaming. This will be useful when
> tuning the GUC memory limit.
>
> 3) Two temporary bugfixes that make the patch series work.
>

Forgot to mention that the v4 also extends CREATE SUBSCRIPTION to
allow customizing the streaming and the memory limit. So you can do

    CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)

and this subscription will allow streaming, and the logical_work_mem (on
the provider) will be set to 1MB.

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
In reply to this post by Tomas Vondra-4
On 1/3/18 14:53, Tomas Vondra wrote:

>> I don't see the need to tie this setting to maintenance_work_mem.
>> maintenance_work_mem is often set to very large values, which could
>> then have undesirable side effects on this use.
>
> Well, we need to pick some default value, and we can either use a fixed
> value (not sure what would be a good default) or tie it to an existing
> GUC. We only really have work_mem and maintenance_work_mem, and the
> walsender process will never use more than one such buffer. Which seems
> to be closer to maintenance_work_mem.
>
> Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better.  We
don't have a lot of settings that depend on other settings, and the ones
we do have involve a very specific relationship.

>> Moreover, the name logical_work_mem makes it sound like it's a logical
>> version of work_mem.  Maybe we could think of another name.
>
> I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

>> I think we need a way to report on how much memory is actually used,
>> so the setting can be tuned. Something analogous to log_temp_files
>> perhaps.
>
> Yes, I agree. I'm just about to submit an updated version of the patch
> series, that also introduces a few columns pg_stat_replication, tracking
> this type of stats (amount of data spilled to disk or streamed, etc.).

That seems OK.  Perhaps we could bring forward the part of that patch
that applies to this feature.

That would also help test *this* feature and determine what
appropriate settings are.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
In reply to this post by Tomas Vondra-4
On 1/3/18 15:13, Tomas Vondra wrote:
> Forgot to mention that the v4 also extends the CREATE SUBSCRIPTION to
> allow customizing the streaming and memory limit. So you can do
>
>     CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
>
> and this subscription will allow streaming, and the logical_work_mem (on
> provider) will be set to 1MB.

I was wondering already during PG10 development whether we should give
subscriptions a generic configuration array, like databases and roles
have, so we don't have to hardcode a bunch of similar stuff every time
we add an option like this.  At the time we only had synchronous_commit,
but now we're adding more.

Also, instead of sticking this into the START_REPLICATION command, could
we just run a SET command?  That should work over replication
connections as well.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
In reply to this post by Tomas Vondra-4
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
>
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here.  This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together.  One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction.  The top-level data structure for that is the infamous
ReorderBuffer.  So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange).  When the transaction commits, the
accumulated changes are passed to the output plugin and then freed.  If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction.  When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.
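
To make that a bit more concrete, here is a deliberately simplified model
(not the actual definitions in reorderbuffer.c, just the general shape):

    /* Deliberately simplified - not the actual structs in reorderbuffer.c. */
    typedef struct DecodedChange
    {
        /* one decoded row change (insert/update/delete), kept in WAL order */
        struct DecodedChange *next;
    } DecodedChange;

    typedef struct DecodedTxn
    {
        unsigned int    xid;            /* the (sub)transaction the changes belong to */
        DecodedChange  *changes;        /* accumulated changes, possibly spilled to disk */
        struct DecodedTxn *subxacts;    /* subtransactions, for a toplevel txn */
        struct DecodedTxn *next;
    } DecodedTxn;

    /*
     * The reorder buffer keeps one DecodedTxn per (sub)transaction seen in
     * the WAL, indexed by xid.  At commit, the toplevel transaction and its
     * subxacts are merged back into LSN order and handed to the output
     * plugin, then freed; on abort the changes are simply thrown away.
     */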

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk.  The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk.  When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example.  Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example).  If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender).  And so on.  If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB.  And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time.  That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096.  The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction.  (So we also keep a running tally for
the entire ReorderBuffer.)

There are two open issues with this patch:

One, this mechanism only applies when recording changes.  The processing
at commit time still uses the previous hardcoded mechanism.  The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing.  There are some
proposals to change this as well, but they are more involved.  Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached.  With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit.  Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way.  The proposed patch searches
through the entire list of transactions to find the largest one.  But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive.  Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
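
For the sake of discussion, a sketch of what a combination of (b) and (c)
might look like (hypothetical structures and names): evict any transaction
using at least some threshold, until a target amount has been freed.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical structures, only to illustrate the eviction policy. */
    typedef struct DecodingTxn
    {
        size_t      size;           /* memory used by this transaction */
        bool        serialized;     /* already spilled to disk at least once */
        struct DecodingTxn *next;
    } DecodingTxn;

    /* Placeholder for the real spill-to-disk routine. */
    static void
    spill_to_disk(DecodingTxn *txn)
    {
        txn->size = 0;
        txn->serialized = true;
    }

    /*
     * Evict transactions until at least 'target' bytes are freed (option c).
     * Rather than hunting for the single largest transaction, any transaction
     * using at least 'threshold' bytes is an acceptable victim (option b).
     */
    static size_t
    evict_until(DecodingTxn *txns, size_t target, size_t threshold)
    {
        size_t      freed = 0;
        DecodingTxn *txn;

        for (txn = txns; txn != NULL && freed < target; txn = txn->next)
        {
            if (txn->size >= threshold)
            {
                freed += txn->size;
                spill_to_disk(txn);
            }
        }
        return freed;
    }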

Thoughts?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Greg Stark
On 11 January 2018 at 19:41, Peter Eisentraut
<[hidden email]> wrote:

> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It
means fewer spills and fewer unspills at commit time, but that's just a
bigger spike of I/O and more of a chance of spilling more than
necessary to get by. In the end it'll be more or less the same amount
of data read back, just all in one big spike when spilling and one big
spike when committing. If you spilled smaller transactions you would
have a small amount of I/O more frequently and have to read back small
amounts for many commits. But it would add up to the same amount of
I/O (or less, if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.



--
greg


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
On 1/11/18 18:23, Greg Stark wrote:
> AIUI spilling to disk doesn't affect absorbing future updates, we
> would just keep accumulating them in memory right? We won't need to
> unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again.  But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in
the extreme cases that either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.

> The real aim should be to try to pick the transaction that will be
> committed furthest in the future. That gives you the most memory to
> use for live transactions for the longest time and could let you
> process the maximum amount of transactions without spilling them. So
> either the oldest transaction (in the expectation that it's been open
> a while and appears to be a long-lived batch job that will stay open
> for a long time) or the youngest transaction (in the expectation that
> all transactions are more or less equally long-lived) might make
> sense.

Yes, that makes sense.  We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
