PATCH: logical_work_mem and logical streaming of large in-progress transactions


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Andres Freund
On 2018-03-03 02:34:06 +0100, Tomas Vondra wrote:

> On 03/03/2018 02:01 AM, Andres Freund wrote:
> > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> >> That is somewhat misleading, I think. You're right the last version
> >> was submitted on 2018-01-19, but the next review arrived on
> >> 2018-01-31, i.e. right at the end of the CF. So it's not like the
> >> patch was sitting there with unresolved issues. Based on that
> >> review the patch was marked as RWF and thus not moved to 2018-03
> >> automatically.
> >
> > I don't see how this changes anything.
> >
>
> You've used "The patch hasn't moved forward since 2018-01-19," as an
> argument why the patch is not eligible for 2018-03. I suggest that
> argument is misleading, because patches generally do not move without
> reviews, and it's difficult to respond to a review that arrives on the
> last day of a commitfest.
>
> Consider that without the review, the patch would end up with NR status,
> and would be moved to the next CF automatically. Isn't that a bit weird?

Not sure I follow. The point is that nobody would have complained if
you'd moved the patch into this fest if you'd updated it *before* it
started?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

David Steele
In reply to this post by Andres Freund
On 3/2/18 8:01 PM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version was
>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>> right at the end of the CF. So it's not like the patch was sitting there
>> with unresolved issues. Based on that review the patch was marked as RWF
>> and thus not moved to 2018-03 automatically.
>
> I don't see how this changes anything.

I agree that things could be clearer, and Andres has produced a great
document that we can build on.  The old one had gotten a bit stale.

However, I think it's pretty obvious that a CF entry should be
accompanied with a patch.  It sounds like the timing was awkward but you
still had 28 days to produce a new patch.

I also notice that you submitted 7 patches in this CF but are reviewing
zero.

--
-David


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
On 03/03/2018 02:37 AM, David Steele wrote:

> On 3/2/18 8:01 PM, Andres Freund wrote:
>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>> That is somewhat misleading, I think. You're right the last version was
>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>> right at the end of the CF. So it's not like the patch was sitting there
>>> with unresolved issues. Based on that review the patch was marked as RWF
>>> and thus not moved to 2018-03 automatically.
>>
>> I don't see how this changes anything.
>
> I agree that things could be clearer, and Andres has produced a great
> document that we can build on.  The old one had gotten a bit stale.
>
> However, I think it's pretty obvious that a CF entry should be
> accompanied with a patch. It sounds like the timing was awkward but
> you still had 28 days to produce a new patch.
>

Based on internal discussion I'm not so sure about the "pretty obvious"
part. It certainly wasn't that obvious to me, otherwise I'd have submitted
the revised patch earlier - hindsight is 20/20.

> I also notice that you submitted 7 patches in this CF but are
> reviewing zero.
>

I've volunteered to review a couple of patches at the FOSDEM Developer
Meeting - I thought Stephen was entering that into the CF app, not sure
where it got lost.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

David Steele
On 3/2/18 8:54 PM, Tomas Vondra wrote:

> On 03/03/2018 02:37 AM, David Steele wrote:
>> On 3/2/18 8:01 PM, Andres Freund wrote:
>>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>>> That is somewhat misleading, I think. You're right the last version was
>>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>>> right at the end of the CF. So it's not like the patch was sitting there
>>>> with unresolved issues. Based on that review the patch was marked as RWF
>>>> and thus not moved to 2018-03 automatically.
>>>
>>> I don't see how this changes anything.
>>
>> I agree that things could be clearer, and Andres has produced a great
>> document that we can build on.  The old one had gotten a bit stale.
>>
>> However, I think it's pretty obvious that a CF entry should be
>> accompanied with a patch. It sounds like the timing was awkward but
>> you still had 28 days to produce a new patch.
>
> Based on internal discussion I'm not so sure about the "pretty obvious"
> part. It certainly wasn't that obvious to me, otherwise I'd submit the
> revised patch earlier - hindsight is 20/20.

Indeed it is.  Be assured that nobody takes pleasure in pushing patches,
but we have limited resources and must make some choices.

>> I also notice that you submitted 7 patches in this CF but are
>> reviewing zero.
>
> I've volunteered to review a couple of patches at the FOSDEM Developer
> Meeting - I thought Stephen was entering that into the CF app, not sure
> where it got lost.

There are plenty of patches that need review, so go for it.

Regards,
--
-David


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Erik Rijkers
In reply to this post by Tomas Vondra-4
On 2018-03-03 01:55, Tomas Vondra wrote:
> Hi there,
>
> attached is an updated patch fixing all the reported issues (a bit more
> about those below).

Hi,

0007-Track-statistics-for-streaming-spilling.patch won't apply. All
the other patches apply ok.

patch complains with:

patching file doc/src/sgml/monitoring.sgml
patching file src/backend/catalog/system_views.sql
Hunk #1 succeeded at 734 (offset 2 lines).
patching file src/backend/replication/logical/reorderbuffer.c
patching file src/backend/replication/walsender.c
patching file src/include/catalog/pg_proc.h
Hunk #1 FAILED at 2903.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/catalog/pg_proc.h.rej
patching file src/include/replication/reorderbuffer.h
patching file src/include/replication/walsender_private.h
patching file src/test/regress/expected/rules.out
Hunk #1 succeeded at 1861 (offset 2 lines).

Attached the produced reject file.


thanks,

Erik Rijkers

Attachment: pg_proc.h.rej (2K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4


On 03/03/2018 06:19 AM, Erik Rijkers wrote:

> On 2018-03-03 01:55, Tomas Vondra wrote:
>> Hi there,
>>
>> attached is an updated patch fixing all the reported issues (a bit more
>> about those below).
>
> Hi,
>
> 0007-Track-statistics-for-streaming-spilling.patch  won't apply.  All
> the other patches apply ok.
>
> patch complains with:
>
> patching file doc/src/sgml/monitoring.sgml
> patching file src/backend/catalog/system_views.sql
> Hunk #1 succeeded at 734 (offset 2 lines).
> patching file src/backend/replication/logical/reorderbuffer.c
> patching file src/backend/replication/walsender.c
> patching file src/include/catalog/pg_proc.h
> Hunk #1 FAILED at 2903.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/include/catalog/pg_proc.h.rej
> patching file src/include/replication/reorderbuffer.h
> patching file src/include/replication/walsender_private.h
> patching file src/test/regress/expected/rules.out
> Hunk #1 succeeded at 1861 (offset 2 lines).
>
> Attached the produced reject file.
>
Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
Attached is a rebased patch, fixing this.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer.patch.gz (13K)
Attachment: 0002-Immediatel-WAL-log-assignments.patch.gz (9K)
Attachment: 0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz (6K)
Attachment: 0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz (7K)
Attachment: 0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz (14K)
Attachment: 0006-Add-support-for-streaming-to-built-in-replication.patch.gz (26K)
Attachment: 0007-Track-statistics-for-streaming-spilling.patch.gz (5K)
Attachment: 0008-BUGFIX-make-sure-subxact-is-marked-as-is_known_as_su.patch.gz (780 bytes)
Attachment: 0009-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.patch.gz (858 bytes)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
I think this patch is not going to be ready for PG11.

- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.

- Various details in the logical_work_mem patch (0001) are unresolved.

- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).

That said, the feature seems useful and desirable, and the
implementation makes sense.  There are documentation and tests.  But
there is a significant amount of design and coding work still necessary.

Attached is a fixup patch that I needed to make it compile.

The last two patches in your series (0008, 0009) are labeled as bug
fixes.  Would you like to argue that they should be applied
independently of the rest of the feature?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: 0001-fixup-Track-statistics-for-streaming-spilling.patch (2K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

konstantin knizhnik
In reply to this post by Peter Eisentraut-6


On 11.01.2018 22:41, Peter Eisentraut wrote:

> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> The data in the WAL is written as it happens, so the changes belonging
> to different transactions are all mixed together.  One of the jobs of
> logical decoding is to reassemble the changes belonging to each
> transaction.  The top-level data structure for that is the infamous
> ReorderBuffer.  So as it reads the WAL and sees something about a
> transaction, it keeps a copy of that change in memory, indexed by
> transaction ID (ReorderBufferChange).  When the transaction commits, the
> accumulated changes are passed to the output plugin and then freed.  If
> the transaction aborts, then changes are just thrown away.
>
> So when logical decoding is active, a copy of the changes for each
> active transaction is kept in memory (once per walsender).
>
> More precisely, the above happens for each subtransaction.  When the
> top-level transaction commits, it finds all its subtransactions in the
> ReorderBuffer, reassembles everything in the right order, then invokes
> the output plugin.
>
> All this could end up using an unbounded amount of memory, so there is a
> mechanism to spill changes to disk.  The way this currently works is
> hardcoded, and this patch proposes to change that.
>
> Currently, when a transaction or subtransaction has accumulated 4096
> changes, it is spilled to disk.  When the top-level transaction commits,
> things are read back from disk to do the final processing mentioned above.
>
> This all works mostly fine, but you can construct some more extreme
> cases where this can blow up.
>
> Here is a mundane example.  Let's say a change entry takes 100 bytes (it
> might contain a new row, or an update key and some new column values,
> for example).  If you have 100 concurrent active sessions and no
> subtransactions, then logical decoding memory is bounded by 4096 * 100 *
> 100 = 40 MB (per walsender) before things spill to disk.
>
> Now let's say you are using a lot of subtransactions, because you are
> using PL functions, exception handling, triggers, doing batch updates.
> If you have 200 subtransactions on average per concurrent session, the
> memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
> (per walsender).  And so on.  If you have more concurrent sessions or
> larger changes or more subtransactions, you'll use much more than those
> 8 GB.  And if you don't have those 8 GB, then you're stuck at this point.
>
> That is the consideration when we record changes, but we also need
> memory when we do the final processing at commit time.  That is slightly
> less problematic because we only process one top-level transaction at a
> time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
> (without the concurrent sessions factor).
>
> So, this patch proposes to improve this as follows:
>
> - We compute the actual size of each ReorderBufferChange and keep a
> running tally for each transaction, instead of just counting the number
> of changes.
>
> - We have a configuration setting that allows us to change the limit
> instead of the hardcoded 4096.  The configuration setting is also in
> terms of memory, not in number of changes.
>
> - The configuration setting is for the total memory usage per decoding
> session, not per subtransaction.  (So we also keep a running tally for
> the entire ReorderBuffer.)
>
> There are two open issues with this patch:
>
> One, this mechanism only applies when recording changes.  The processing
> at commit time still uses the previous hardcoded mechanism.  The reason
> for this is, AFAIU, that as things currently work, you have to have all
> subtransactions in memory to do the final processing.  There are some
> proposals to change this as well, but they are more involved.  Arguably,
> per my explanation above, memory use at commit time is less likely to be
> a problem.
>
> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
>
> Thoughts?
>

I am very sorry that I did not notice this thread before.
Spilling to files in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW with redundancy provided by LR).
This is why we have thought a lot about possible ways of addressing this
issue. Right now the data of a huge transaction is written to disk three
times before it is applied at the replica, and obviously read three
times as well: first it is saved in the WAL, then spilled to disk by the
reorder buffer, and once again spilled to disk at the replica before
being assigned to a particular apply worker (the last step is specific
to multimaster, which can apply received transactions concurrently).

We considered three different approaches:
1. Streaming. It is similar to the proposed patch; the main difference
is that we do not want to spill the transaction into a temporary file at
the replica, but to apply it immediately in a separate backend and abort
that transaction if it is aborted at the master. Certainly this will
work only with 2PC.
2. Eliminating spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer changes and
propagate them immediately to the replica, applying them in a dedicated
backend. I have implemented a prototype of such replication. With one
replica it shows about a 1.5x slowdown compared with standalone/async LR
and about a 2-3x improvement compared with sync LR. For two replicas the
result is 2x slower than async LR and 2-8 times faster than sync LR
(depending on the number of concurrent connections).

Approach 3 seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR. So I want to compare 1 and 2.
Did you ever think about something like 2?

Right now the proposed patch just moves spilling to files from the
master to the replica. It can still make sense as a way to avoid memory
overflow and reduce disk IO at the master. But if we have just one huge
transaction (COPY) importing gigabytes of data into the database, then
performance will be almost the same with or without your patch. The only
difference is where we serialize the transaction: at the master or at
the replica side. In this sense the patch doesn't solve the problem of
slow loading of large bulks of data through LR.

Alternatively (approach 2) we can have a small in-memory buffer for the
transaction being decoded and remember the LSN and snapshot of that
transaction's start. In case of buffer overflow we just continue the WAL
traversal until we reach the end of the transaction. After that we
restart scanning the WAL from the beginning of this transaction, and on
this second pass send the changes directly to the output plugin. So we
have to scan the WAL several times, but we do not need to spill anything
to disk, either at the publisher or at the subscriber side. Certainly
this approach will be inefficient if we have several long interleaving
transactions, but in most customer use cases we have observed so far
there is just one huge transaction performing a bulk load.
Maybe I missed something, but this approach seems easier to implement
than transaction streaming, and it doesn't require any changes to the
output plugin API. I realize that it is a little bit late to ask this
question once your patch is almost ready, but what do you think about
it? Are there any pitfalls with this approach?
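
To sketch what approach 2 would need to track, here is a purely
illustrative pseudo-structure (an assumption of mine; no patch
implements it this way):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "utils/snapshot.h"

/*
 * Hedged, purely illustrative sketch of approach 2 (rescanning WAL
 * instead of spilling to disk).
 */
typedef struct RescanState
{
    TransactionId xid;        /* the oversized transaction */
    XLogRecPtr    start_lsn;  /* LSN of its first decoded change */
    Snapshot      snapshot;   /* snapshot as of the transaction start */
    bool          overflowed; /* the in-memory buffer limit was hit */
} RescanState;

/*
 * First pass: on overflow, stop buffering changes of this transaction
 * and keep scanning forward until its commit record.  Second pass:
 * restart reading WAL at start_lsn with the remembered snapshot and hand
 * each change directly to the output plugin, so nothing is spilled to
 * disk on either side.
 */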

There is one more aspect and performance problem with LR that we have
faced with shardman: if there are several publications for different
subsets of tables at one instance, then WAL senders have to do a lot of
useless work. They decode transactions which have no relation to their
publication, but a WAL sender doesn't know that until it reaches the end
of the transaction. What is worse: if a transaction is huge, then all
WAL senders will spill it to disk even though only one of them actually
needs it. So the data of a huge transaction is written not three times,
but N times, where N is the number of publications. The only solution to
this problem we can imagine is to let the backend somehow inform the WAL
sender (through a shared message queue?) about the LSNs it should
consider. In this case the WAL sender can skip large portions of WAL
without decoding. We would also like to know 2ndQuadrant's opinion about
this idea.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Peter Eisentraut-6
In reply to this post by Peter Eisentraut-6
This patch set was not updated for the 2018-07 commitfest, so it has been moved to 2018-09.


On 09.03.18 17:07, Peter Eisentraut wrote:

> I think this patch is not going to be ready for PG11.
>
> - It depends on some work in the thread "logical decoding of two-phase
> transactions", which is still in progress.
>
> - Various details in the logical_work_mem patch (0001) are unresolved.
>
> - This being partially a performance feature, we haven't seen any
> performance tests (e.g., which settings result in which latencies under
> which workloads).
>
> That said, the feature seems useful and desirable, and the
> implementation makes sense.  There are documentation and tests.  But
> there is a significant amount of design and coding work still necessary.
>
> Attached is a fixup patch that I needed to make it compile.
>
> The last two patches in your series (0008, 0009) are labeled as bug
> fixes.  Would you like to argue that they should be applied
> independently of the rest of the feature?
>


--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Michael Paquier-2
In reply to this post by Tomas Vondra-4
On Sat, Mar 03, 2018 at 03:52:40PM +0100, Tomas Vondra wrote:
> Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
> Attached is a rebased patch, fixing this.

The latest patch set does not apply anymore, and had no activity for the
last two months, so I am marking it as returned with feedback.
--
Michael


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by Peter Eisentraut-6
Hi,

Attached is an updated version of this patch series. It's meant to be
applied on top of the 2pc decoding patch [1], because streaming of
in-progress transactions requires handling of concurrent aborts. So it
may or may not apply directly to master, I'm not sure - unfortunately
that's likely to confuse the cputube thing, but I don't want to include
the 2pc decoding bits here because that would be just confusing.

If needed, the part introducing logical_work_mem limit for ReorderBuffer
can be separated and committed independently, but I do expect this to be
committed after the 2pc decoding patch so I've left it like this.

This new version is mostly just a rebase to current master (or almost,
because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
but it also addresses the new stuff committed since last version (most
importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
subxact assignments, where the assignment was included in records with
XID=0, essentially failing to track the subxact properly.

For the logical_work_mem part, I think this is quite solid. The main
question is how to pick transactions for eviction. For now it uses the
same approach as master (i.e. picking the largest top-level transaction,
although measured by amount of memory and not just number of changes).

But I've realized that may not work that great with the Generation
context, because unlike AllocSet it does not reuse memory. That's nice
as it allows freeing old blocks (which AllocSet can't), but it means a
small transaction can have a change on an old block, preventing it from
being freed. That is something we have in pg11 already, because that's
where the Generation context got introduced - I haven't seen this issue
in practice, but we might need to do something about it.

In any case, I'm thinking we may need to pick a different eviction
algorithm - say using the transaction with the oldest change (and
looping until we release at least one block in the Generation context),
or maybe looking for blocks mixing changes from the smallest number of
transactions, or something like that. Other ideas are welcome. I don't
think the exact algorithm is particularly critical, because it's meant
to be triggered only very rarely (i.e. pick logical_work_mem high
enough).
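
To illustrate the current strategy, here is a minimal sketch of picking
the largest top-level transaction by accumulated memory, assuming
ReorderBuffer-style structures (field names such as toplevel_by_lsn and
size are assumptions here, not necessarily what the patch uses):

#include "postgres.h"
#include "lib/ilist.h"
#include "replication/reorderbuffer.h"

/*
 * Hedged sketch, not the actual patch code: once the logical_work_mem
 * limit is exceeded, walk all top-level transactions and return the one
 * with the most accumulated change data, to be evicted (spilled or
 * streamed) first.
 */
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)
{
    ReorderBufferTXN *largest = NULL;
    dlist_iter  iter;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        if (largest == NULL || txn->size > largest->size)
            largest = txn;
    }

    return largest;
}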

The in-progress streaming is mostly a mechanical extension of existing
functionality (new methods in various APIs, ...) and a refactoring of
ReorderBuffer to handle incremental decoding. I'm sure it'd benefit from
reviews, of course.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181216.patch.gz (13K)
Attachment: 0002-Immediately-WAL-log-assignments-20181216.patch.gz (8K)
Attachment: 0003-Issue-individual-invalidations-with-wal_lev-20181216.patch.gz (6K)
Attachment: 0004-Extend-the-output-plugin-API-with-stream-me-20181216.patch.gz (7K)
Attachment: 0005-Implement-streaming-mode-in-ReorderBuffer-20181216.patch.gz (15K)
Attachment: 0006-Add-support-for-streaming-to-built-in-repli-20181216.patch.gz (26K)
Attachment: 0007-Track-statistics-for-streaming-spilling-20181216.patch.gz (5K)
Attachment: 0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181216.patch.gz (874 bytes)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
FWIW the original CF entry in 2018-07 [1] was marked as RWF. I'm not
sure what's the right way to resubmit such patches, so I've created a
new entry in 2019-01 [2] referencing the same hackers thread (and with
the same authors/reviewers metadata).

[1] https://commitfest.postgresql.org/19/1429/
[2] https://commitfest.postgresql.org/21/1927/

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
In reply to this post by Tomas Vondra-4
Hi Tomas,

> This new version is mostly just a rebase to current master (or almost,
> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
> but it also addresses the new stuff committed since last version (most
> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
> subxact assignments, where the assignment was included in records with
> XID=0, essentially failing to track the subxact properly.

I started reviewing your patch about a month ago and tried to do an
in-depth review, since I am very interested in this patch too. The new
version does not apply to master at 29180e5d78, but everything is OK
after applying the 2pc patch first. Anyway, I guess it may complicate
further testing and review, since any potential reviewer has to take
both patches into account at once. The previous version applied to
master and was working fine for me on its own (except for a few
patch-specific issues, which I try to explain below).


Patch review
========

First of all, I want to say thank you for such a huge amount of work.
Here are some problems which I have found and hopefully fixed with my
additional patch (please find it attached; it should apply to the last
commit of your newest patch version):

1) The most important issue is that your TAP tests were broken: the
option "WITH (streaming=true)" was missing in the subscription creation
statement. Therefore, the spilling mechanism was being tested rather
than streaming.

2) After fixing the tests, the first one with simple streaming
immediately fails because of a logical replication worker segmentation
fault. It happens since the worker tries to call stream_cleanup_files
inside stream_open_file at the stream start while nxids is zero; it then
goes negative and everything crashes. Something similar may happen with
the xids array, so I added two checks there.

3) The next problem is much more critical and concerns historic MVCC
visibility rules. Previously, the walsender started decoding a
transaction at commit, so we were able to resolve all xmin, xmax,
combocids to cmin/cmax, build the tuplecids hash and so on; but now we
start doing all these things on the fly.

Thus a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
tries to validate catalog tuples which are currently in the future
relative to the current decoder position inside the transaction. E.g. we
may want to resolve the cmin/cmax of a tuple which was created with cid
3 and deleted with cid 5 while we are currently at cid 4, so our
tuplecids hash is not complete enough to handle such a case.

I have updated the HeapTupleSatisfiesHistoricMVCC visibility rules in
two places:

/*
 * If we accidentally see a tuple from our transaction but cannot resolve
 * its cmin, it is probably from the future, so drop it.
 */
if (!resolved)
    return false;

and

/*
 * If we accidentally see a tuple from our transaction but cannot resolve
 * its cmax, or cmax == InvalidCommandId, it is probably still valid, so
 * accept it.
 */
if (!resolved || cmax == InvalidCommandId)
    return true;

4) There was a problem with marking a top-level transaction as having
catalog changes if one of its subtransactions has them. It was causing a
problem with DDL statements just after a subtransaction start
(savepoint), so data from new columns was not replicated (see the sketch
after this list).

5) A similar issue with schema sending. You send the schema only once
per sub/transaction (IIRC), while we have to update the schema on each
catalog change: invalidation execution, snapshot rebuild, adding new
tuple cids. So I ended up adding an is_schema_sent flag to
ReorderBufferTXN, since it is easy to set it inside the RB and read it
in the output plugin. Probably we should choose a better place for this
flag.

6) To better handle all these tricky cases I added a new TAP test,
014_stream_tough_ddl.pl, which consists of a really tough combination of
DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
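
As a rough illustration of fix 4), the idea is to propagate the flag
upwards when a subxact is marked. A hedged sketch follows; the field
names toptxn and has_catalog_changes are assumptions, not necessarily
what my fix uses:

#include "postgres.h"
#include "replication/reorderbuffer.h"

/*
 * Hedged sketch of fix 4), not the actual fix: when a subtransaction is
 * marked as having catalog changes, also mark its top-level transaction,
 * so that decoding of later changes builds a correct historic snapshot.
 */
static void
MarkTxnHasCatalogChanges(ReorderBufferTXN *txn)
{
    txn->has_catalog_changes = true;

    /* propagate to the top-level transaction if this is a subxact */
    if (txn->toptxn != NULL)
        txn->toptxn->has_catalog_changes = true;
}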

I marked all my fixes and every questionable place with a comment and a
"TOCHECK:" label for easy search. Removing pretty much any of these
fixes leads to test failures due to segmentation faults or replication
mismatches. Though I mostly read and tested the old version of the
patch, after a quick look it seems that all these fixes are applicable
to the new version as well.


Performance
========

I have also performed a series of performance tests, and found that the
patch adds a huge overhead in the case of a large transaction consisting
of many small rows, e.g.:

CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
precision);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);

Execution Time: 2407.709 ms
Total Time: 11494,238 ms (00:11,494)

With synchronous_standby_names and 64 MB of logical_work_mem it takes
up to 5x longer, while without the patch it is about 2x. Thus, logical
replication streaming is approximately 4x slower for similar
transactions.

However, dealing with large transactions consisting of a small number of
large rows is much better:

CREATE TABLE large_text (t TEXT);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 125);

Execution Time: 3545.642 ms
Total Time: 7678,617 ms (00:07,679)

It is around the same 2x as without the patch. If someone is interested,
I have also attached flame graphs of the walsender and the logical
replication worker during processing of the first (numerical)
transaction.


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachment: logical_repl_worker_new_perf.svg.zip (76K)
Attachment: walsender_new_perf.svg.zip (47K)
Attachment: 0009-Fix-worker-historic-MVCC-visibility-rules-subxacts-s.patch (20K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
Hi Alexey,

Thanks for the thorough and extremely valuable review!

On 12/17/18 5:23 PM, Alexey Kondratov wrote:

> Hi Tomas,
>
>> This new version is mostly just a rebase to current master (or almost,
>> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
>> but it also addresses the new stuff committed since last version (most
>> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
>> subxact assignments, where the assignment was included in records with
>> XID=0, essentially failing to track the subxact properly.
>
> I started reviewing your patch about a month ago and tried to do an
> in-depth review, since I am very interested in this patch too. The new
> version does not apply to master at 29180e5d78, but everything is OK
> after applying the 2pc patch first. Anyway, I guess it may complicate
> further testing and review, since any potential reviewer has to take
> both patches into account at once. The previous version applied to
> master and was working fine for me on its own (except for a few
> patch-specific issues, which I try to explain below).
>

I agree it's somewhat annoying, but I don't think there's a better way,
unfortunately. Decoding in-progress transactions does require safe
handling of concurrent aborts, so it has to be committed after the 2pc
decoding patch (which makes that possible). But the 2pc patch also
touches the same places as this patch series (it reworks the reorder
buffer for example).

>
> Patch review
> ========
>
> First of all, I want to say thank you for such a huge amount of work.
> Here are some problems which I have found and hopefully fixed with my
> additional patch (please find it attached; it should apply to the last
> commit of your newest patch version):
>
> 1) The most important issue is that your TAP tests were broken: the
> option "WITH (streaming=true)" was missing in the subscription creation
> statement. Therefore, the spilling mechanism was being tested rather
> than streaming.
>

D'oh!

> 2) After fixing the tests, the first one with simple streaming
> immediately fails because of a logical replication worker segmentation
> fault. It happens since the worker tries to call stream_cleanup_files
> inside stream_open_file at the stream start while nxids is zero; it
> then goes negative and everything crashes. Something similar may happen
> with the xids array, so I added two checks there.
>
> 3) The next problem is much more critical and concerns historic MVCC
> visibility rules. Previously, the walsender started decoding a
> transaction at commit, so we were able to resolve all xmin, xmax,
> combocids to cmin/cmax, build the tuplecids hash and so on; but now we
> start doing all these things on the fly.
>
> Thus a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
> tries to validate catalog tuples which are currently in the future
> relative to the current decoder position inside the transaction. E.g.
> we may want to resolve the cmin/cmax of a tuple which was created with
> cid 3 and deleted with cid 5 while we are currently at cid 4, so our
> tuplecids hash is not complete enough to handle such a case.
>

Damn it! I ran into those two issues some time ago and fixed them, but
I've forgotten to merge that fix into the patch. I'll merge those fixes,
compare them to your proposed fix, and send a new version tomorrow.

>
> 4) There was a problem with marking a top-level transaction as having
> catalog changes if one of its subtransactions has them. It was causing
> a problem with DDL statements just after a subtransaction start
> (savepoint), so data from new columns was not replicated.
>
> 5) A similar issue with schema sending. You send the schema only once
> per sub/transaction (IIRC), while we have to update the schema on each
> catalog change: invalidation execution, snapshot rebuild, adding new
> tuple cids. So I ended up adding an is_schema_sent flag to
> ReorderBufferTXN, since it is easy to set it inside the RB and read it
> in the output plugin. Probably we should choose a better place for this
> flag.
>

Hmm. Can you share an example of how to trigger these issues?

> 6) To better handle all these tricky cases I added a new TAP test,
> 014_stream_tough_ddl.pl, which consists of a really tough combination
> of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
>

Thanks!

> I marked all my fixes and every questionable place with a comment and
> a "TOCHECK:" label for easy search. Removing pretty much any of these
> fixes leads to test failures due to segmentation faults or replication
> mismatches. Though I mostly read and tested the old version of the
> patch, after a quick look it seems that all these fixes are applicable
> to the new version as well.
>

Thanks. I'll go through your patch tomorrow.

>
> Performance
> ========
>
> I have also performed a series of performance tests, and found that
> the patch adds a huge overhead in the case of a large transaction
> consisting of many small rows, e.g.:
>
> CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
> precision);
>
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
> SELECT round(random()*10), random(), random()*142
> FROM generate_series(1, 1000000) s(i);
>
> Execution Time: 2407.709 ms
> Total Time: 11494,238 ms (00:11,494)
>
> With synchronous_standby_names and 64 MB of logical_work_mem it takes
> up to 5x longer, while without the patch it is about 2x. Thus, logical
> replication streaming is approximately 4x slower for similar
> transactions.
>
> However, dealing with large transactions consisting of a small number
> of large rows is much better:
>
> CREATE TABLE large_text (t TEXT);
>
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 125);
>
> Execution Time: 3545.642 ms
> Total Time: 7678,617 ms (00:07,679)
>
> It is around the same 2x as without the patch. If someone is
> interested, I have also attached flame graphs of the walsender and the
> logical replication worker during processing of the first (numerical)
> transaction.
>

Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from the single flame graph, when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).

I'll investigate this (probably not this week), but in general it's good
to keep in mind a couple of things:

1) Some overhead is expected, due to doing things incrementally.

2) The memory limit should be set to sufficiently high value to be hit
only infrequently.

3) And when the limit is actually hit, it's an alternative to spilling
large amounts of data locally (to disk) or incurring significant
replication lag later.

So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
On 18.12.2018 1:28, Tomas Vondra wrote:

>> 4) There was a problem with marking a top-level transaction as having
>> catalog changes if one of its subtransactions has them. It was causing
>> a problem with DDL statements just after a subtransaction start
>> (savepoint), so data from new columns was not replicated.
>>
>> 5) A similar issue with schema sending. You send the schema only once
>> per sub/transaction (IIRC), while we have to update the schema on each
>> catalog change: invalidation execution, snapshot rebuild, adding new
>> tuple cids. So I ended up adding an is_schema_sent flag to
>> ReorderBufferTXN, since it is easy to set it inside the RB and read it
>> in the output plugin. Probably we should choose a better place for
>> this flag.
>>
> Hmm. Can you share an example of how to trigger these issues?
Test cases inside 014_stream_tough_ddl.pl and the old ones (with the
streaming=true option added) should reproduce all these issues. In
general, it happens in a txn like:

INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT

then the second insert may discover an old version of the catalog.

> Interesting. Any idea where the extra overhead in this particular case
> comes from? It's hard to deduce that from the single flame graph, when
> I don't have anything to compare it with (i.e. the flame graph for the
> "normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
tests.

Therefore, you could probably write changes on the receiver in bigger
chunks, rather than each change separately.
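
To make that idea concrete, here is a hedged sketch (my assumption, not
code from the patch) of how the apply worker could buffer incoming
changes and flush them to the stream file in larger chunks:

#include "postgres.h"
#include <unistd.h>
#include "lib/stringinfo.h"

#define STREAM_WRITE_BATCH_SIZE (64 * 1024) /* illustrative threshold */

static StringInfoData stream_buf;   /* initStringInfo() at stream start */

/* Flush whatever has accumulated so far to the stream file. */
static void
stream_flush(int fd)
{
    if (stream_buf.len > 0)
    {
        if (write(fd, stream_buf.data, stream_buf.len) != stream_buf.len)
            ereport(ERROR, (errmsg("could not write streamed changes")));
        resetStringInfo(&stream_buf);
    }
}

/* Instead of one write() per change, batch changes up to the threshold. */
static void
stream_buffered_write(int fd, const char *data, int len)
{
    appendBinaryStringInfo(&stream_buf, data, len);

    if (stream_buf.len >= STREAM_WRITE_BATCH_SIZE)
        stream_flush(fd);
}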

> So I'm not particularly worried, but I'll look into that. I'd be much
> more worried if there was measurable overhead in cases when there's no
> streaming happening (either because it's disabled or the memory limit
> was not hit).

What I have also just found is that if a table row is large enough to
be TOASTed, e.g.:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

then the logical_work_mem limit is not hit and we neither stream nor
spill this transaction to disk, while it is still large. In contrast,
the transaction above (with 1000000 smaller rows), being comparable in
size, is streamed. Not sure that it is easy to add proper accounting of
TOAST-able columns, but it is worth it.

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachment: logical_repl_worker_text_stream_perf.zip (57K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
Hi Alexey,

Attached is an updated version of the patches, with all the fixes I've
done in the past. I believe it should fix at least some of the issues
you reported - certainly the problem with stream_cleanup_files, but
perhaps some of the other issues too.

I'm a bit confused by the changes to TAP tests. Per the patch summary,
some .pl files get renamed (not sure why), a new one is added, etc. So
I've instead enabled streaming subscriptions in all tests, which with
this patch produces two failures:

Test Summary Report
-------------------
t/004_sync.pl                    (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 7 tests but ran 1.
t/011_stream_ddl.pl              (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1

So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.


On 12/18/18 3:07 PM, Alexey Kondratov wrote:

> On 18.12.2018 1:28, Tomas Vondra wrote:
>>> 4) There was a problem with marking a top-level transaction as having
>>> catalog changes if one of its subtransactions has them. It was
>>> causing a problem with DDL statements just after a subtransaction
>>> start (savepoint), so data from new columns was not replicated.
>>>
>>> 5) A similar issue with schema sending. You send the schema only once
>>> per sub/transaction (IIRC), while we have to update the schema on
>>> each catalog change: invalidation execution, snapshot rebuild, adding
>>> new tuple cids. So I ended up adding an is_schema_sent flag to
>>> ReorderBufferTXN, since it is easy to set it inside the RB and read
>>> it in the output plugin. Probably we should choose a better place for
>>> this flag.
>>>
>> Hmm. Can you share an example of how to trigger these issues?
>
> Test cases inside 014_stream_tough_ddl.pl and the old ones (with the
> streaming=true option added) should reproduce all these issues. In
> general, it happens in a txn like:
>
> INSERT
> SAVEPOINT
> ALTER TABLE ... ADD COLUMN
> INSERT
>
> then the second insert may discover an old version of the catalog.
>
Yeah, that's the issue I discovered before and thought had been fixed.

>> Interesting. Any idea where the extra overhead in this particular case
>> comes from? It's hard to deduce that from the single flame graph, when
>> I don't have anything to compare it with (i.e. the flame graph for the
>> "normal" case).
>
> I guess the bottleneck is in disk operations. You can check the
> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
> writes (~26%) take around 35% of CPU time in total. For comparison,
> please see the attached flame graph for the following transaction:
>
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>
> Execution Time: 44519.816 ms
> Time: 98333,642 ms (01:38,334)
>
> where disk IO is only ~7-8% in total. So we get very roughly the same
> ~4-5x performance drop here. JFYI, I am using a machine with an SSD
> for tests.
>
> Therefore, you could probably write changes on the receiver in bigger
> chunks, rather than each change separately.
>
Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm not
sure why it would be cheaper to do the writes in batches.

BTW, does this mean you see the overhead on the apply side? Or are you
running this on a single machine, where it's difficult to tell?

>> So I'm not particularly worried, but I'll look into that. I'd be much
>> more worried if there was measurable overhead in cases when there's no
>> streaming happening (either because it's disabled or the memory limit
>> was not hit).
>
> What I have also just found is that if a table row is large enough to
> be TOASTed, e.g.:
>
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>
> then the logical_work_mem limit is not hit and we neither stream nor
> spill this transaction to disk, while it is still large. In contrast,
> the transaction above (with 1000000 smaller rows), being comparable in
> size, is streamed. Not sure that it is easy to add proper accounting
> of TOAST-able columns, but it is worth it.
>
That's certainly strange and possibly a bug in the memory accounting
code. I'm not sure why that would happen, though, because TOAST data
looks just like regular INSERT changes. Interesting. I wonder if it's
already fixed in this updated version, but it's a bit too late to
investigate that today.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20181219.patch.gz (13K)
Attachment: 0002-Immediately-WAL-log-assignments-20181219.patch.gz (8K)
Attachment: 0003-Issue-individual-invalidations-with-wal_lev-20181219.patch.gz (6K)
Attachment: 0004-Extend-the-output-plugin-API-with-stream-me-20181219.patch.gz (8K)
Attachment: 0005-Implement-streaming-mode-in-ReorderBuffer-20181219.patch.gz (16K)
Attachment: 0006-Add-support-for-streaming-to-built-in-repli-20181219.patch.gz (27K)
Attachment: 0007-Track-statistics-for-streaming-spilling-20181219.patch.gz (5K)
Attachment: 0008-BUGFIX-set-final_lsn-for-subxacts-before-cl-20181219.patch.gz (870 bytes)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
Hi Tomas,

> I'm a bit confused by the changes to TAP tests. Per the patch summary,
> some .pl files get renamed (not sure why), a new one is added, etc.

I added a new TAP test case, added the streaming=true option inside the
old stream_* ones, and incremented the streaming test numbers (+2)
because of the collision between 009_matviews.pl / 009_stream_simple.pl
and 010_truncate.pl / 010_stream_subxact.pl. At least in the previous
version of the patch they were under the same numbers. Nothing special,
but for simplicity, please find my new TAP test attached separately.

> So I've instead enabled streaming subscriptions in all tests, which
> with this patch produces two failures:
>
> Test Summary Report
> -------------------
> t/004_sync.pl                    (Wstat: 7424 Tests: 1 Failed: 0)
>    Non-zero exit status: 29
>    Parse errors: Bad plan.  You planned 7 tests but ran 1.
> t/011_stream_ddl.pl              (Wstat: 256 Tests: 2 Failed: 1)
>    Failed test:  2
>    Non-zero exit status: 1
>
> So yeah, there's more stuff to fix. But I can't directly apply your
> fixes because the updated patches are somewhat different.
The fixes should apply cleanly to the previous version of your patch.
Also, I am not sure that it is a good idea to simply enable streaming
subscriptions in all tests (e.g. the pre-streaming t/004_sync.pl), since
then they do not exercise the non-streaming code.

>>> Interesting. Any idea where the extra overhead in this particular
>>> case comes from? It's hard to deduce that from the single flame
>>> graph, when I don't have anything to compare it with (i.e. the flame
>>> graph for the "normal" case).
>> I guess the bottleneck is in disk operations. You can check the
>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>> writes (~26%) take around 35% of CPU time in total. For comparison,
>> please see the attached flame graph for the following transaction:
>>
>> INSERT INTO large_text
>> SELECT (SELECT string_agg('x', ',')
>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>
>> Execution Time: 44519.816 ms
>> Time: 98333,642 ms (01:38,334)
>>
>> where disk IO is only ~7-8% in total. So we get very roughly the same
>> ~4-5x performance drop here. JFYI, I am using a machine with an SSD
>> for tests.
>>
>> Therefore, you could probably write changes on the receiver in bigger
>> chunks, rather than each change separately.
>>
> Possibly. I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are no fsyncs here. So I'm not
> sure why it would be cheaper to do the writes in batches.
>
> BTW, does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, where it's difficult to tell?
I run this on a single machine, but the walsender and worker are each
utilizing almost 100% of a CPU all the time, and at the apply side I/O
syscalls take about 1/3 of CPU time. Though I am still not sure, for me
this result somehow links the performance drop with problems at the
receiver side.

Writing in batches was just a hypothesis; to validate it I performed a
test with a large txn consisting of a smaller number of wide rows. This
test does not exhibit any significant performance drop, even though it
was streamed too, so the hypothesis seems valid. Anyway, I do not have
other reasonable ideas besides that right now.


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachment: 0xx_stream_tough_ddl.pl (3K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
Hi,

Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.

I'm a bit unhappy with two aspects of the current patch series:

1) We now track schema changes in two ways - using the pre-existing
schema_sent flag in RelationSyncEntry, and the (newly added) flag in
ReorderBuffer. While those options are used for regular vs. streamed
transactions, fundamentally it's the same thing and so having two
competing ways seems like a bad idea. Not sure what's the best way to
resolve this, though.

2) We've removed quite a few asserts, particularly ensuring sanity of
cmin/cmax values. To some extent that's expected, because allowing
decoding of in-progress transactions relaxes some of those rules. But
I'd be much happier if some of those asserts could be reinstated, even
if only in a weaker form.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: 0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190114.patch.gz (13K)
Attachment: 0002-Immediately-WAL-log-assignments-20190114.patch.gz (8K)
Attachment: 0003-Issue-individual-invalidations-with-wal_lev-20190114.patch.gz (6K)
Attachment: 0004-Extend-the-output-plugin-API-with-stream-me-20190114.patch.gz (8K)
Attachment: 0005-Implement-streaming-mode-in-ReorderBuffer-20190114.patch.gz (17K)
Attachment: 0006-Add-support-for-streaming-to-built-in-repli-20190114.patch.gz (27K)
Attachment: 0007-Track-statistics-for-streaming-spilling-20190114.patch.gz (5K)
Attachment: 0008-Enable-streaming-for-all-subscription-TAP-t-20190114.patch.gz (2K)
Attachment: 0009-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190114.patch.gz (874 bytes)
Attachment: 0010-Add-TAP-test-for-streaming-vs.-DDL-20190114.patch.gz (2K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Michael Paquier-2
On Mon, Jan 14, 2019 at 07:23:31PM +0100, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.

Patch 4 of the latest set fails to apply, so I have moved the patch to
next CF, waiting on author.
--
Michael


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
In reply to this post by Tomas Vondra-4
Hi Tomas,

On 14.01.2019 21:23, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.

I had problems applying this patch along with the 2pc streaming one to
the current master, but everything applied well on 97c39498e5, and
regression tests pass. What I personally do not like in the current TAP
test set is that you have added "WITH (streaming=on)" to all tests,
including old non-streaming ones. It is unclear which mechanism is
tested there: streaming, but those transactions probably do not hit the
memory limit, so it depends on default server parameters; or
non-streaming, but then what is the need for (streaming=on)? I would
prefer to add (streaming=on) only to the new tests, where it is clearly
necessary.

> I'm a bit unhappy with two aspects of the current patch series:
>
> 1) We now track schema changes in two ways - using the pre-existing
> schema_sent flag in RelationSyncEntry, and the (newly added) flag in
> ReorderBuffer. While those options are used for regular vs. streamed
> transactions, fundamentally it's the same thing and so having two
> competing ways seems like a bad idea. Not sure what's the best way to
> resolve this, though.

Yes, sure; when I found problems with streaming of extensive DDL, I
added the new flag in the simplest way, and it worked. Now, the old
schema_sent flag is per-relation, while the new one, is_schema_sent, is
per top-level transaction. If I get it correctly, the former seems to be
more thrifty, since a new schema is sent only if we are streaming a
change for a relation whose schema is outdated. In contrast, in the
latter case we will send a new schema even if there are no new changes
belonging to this relation.

I guess it would be better to stick to the old behavior. I will try to
investigate how to better use it in streaming mode as well.
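
To make the comparison concrete, here is a heavily simplified sketch of
the two flags being discussed (only the relevant fields are shown, and
their exact shape in the patch may differ):

#include "postgres.h"

/* (a) pre-existing: per-relation flag kept by the output plugin */
typedef struct RelationSyncEntry
{
    Oid     relid;          /* hash key */
    bool    schema_sent;    /* reset on relcache invalidation, so the
                             * schema is re-sent only for relations whose
                             * cached schema became outdated */
} RelationSyncEntry;

/* (b) newly added: per top-level transaction flag in the reorder buffer */
typedef struct ReorderBufferTXN
{
    /* ... other fields ... */
    bool    is_schema_sent; /* cleared on any catalog change, so a schema
                             * message may be re-sent even for relations
                             * that have no further changes */
} ReorderBufferTXN;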

> 2) We've removed quite a few asserts, particularly ensuring sanity of
> cmin/cmax values. To some extent that's expected, because allowing
> decoding of in-progress transactions relaxes some of those rules. But
> I'd be much happier if some of those asserts could be reinstated, even
> if only in a weaker form.


Asserts have been removed from two places: (1)
HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are
touching the essence of the MVCC visibility rules when trying to decode
an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash,
which is probably not directly related to the topic of the ongoing
patch, since Arseny Sher recently faced the same issue with simple
repetitive DDL decoding [1].

Not many, but I agree that replacing them with some softer asserts would
be better than just removing them, especially for point (1).


[1] https://www.postgresql.org/message-id/flat/874l9p8hyw.fsf%40ars-thinkpad


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

