PATCH: logical_work_mem and logical streaming of large in-progress transactions


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
Hi Tomas,

>>>> Interesting. Any idea where does the extra overhead in this particular
>>>> case come from? It's hard to deduce that from the single flame graph,
>>>> when I don't have anything to compare it with (i.e. the flame graph
>>>> for
>>>> the "normal" case).
>>> I guess that bottleneck is in disk operations. You can check
>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>> please, see attached flame graph for the following transaction:
>>>
>>> INSERT INTO large_text
>>> SELECT (SELECT string_agg('x', ',')
>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>
>>> Execution Time: 44519.816 ms
>>> Time: 98333,642 ms (01:38,334)
>>>
>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for
>>> tests.
>>>
>>> Therefore, probably you may write changes on receiver in bigger chunks,
>>> not each change separately.
>>>
>> Possibly, I/O is certainly a possible culprit, although we should be
>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>> not sure why would it be cheaper to do the writes in batches.
>>
>> BTW does this mean you see the overhead on the apply side? Or are you
>> running this on a single machine, and it's difficult to decide?
>
> I run this on a single machine, but walsender and worker are utilizing
> almost 100% of CPU per each process all the time, and at apply side
> I/O syscalls take about 1/3 of CPU time. Though I am still not sure,
> but for me this result somehow links performance drop with problems at
> receiver side.
>
> Writing in batches was just a hypothesis and to validate it I have
> performed test with large txn, but consisting of a smaller number of
> wide rows. This test does not exhibit any significant performance
> drop, while it was streamed too. So it seems to be valid. Anyway, I do
> not have other reasonable ideas beside that right now.
I've recently checked this patch again and tried to improve it in terms
of performance. As a result I've implemented a new POC version of the
applier (attached). Almost everything in the streaming logic stayed
intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk on the
applier side adds extra overhead, but it is possible to get rid of it.
In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that are
connected to the main logical apply worker via shm_mq's. Each worker is
dedicated to processing a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check
whether there is an existing dedicated bgworker in a HTAB (xid ->
bgworker); otherwise we take one from the idle list or spawn a new one
(see the sketch after this list).

3) We pass all changes (between STREAM START/STOP) to that bgworker via
shm_mq_send without any intermediate waiting. However, we do wait for the
bgworker to apply the entire chunk of changes at STREAM STOP, since we
don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the
idle list and waits for a reassignment message.

5) I have reused the same apply_dispatch machinery in the bgworkers,
since most of the actions are practically identical.
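
A rough sketch of the routing logic in the main apply worker might look
like this. It is only an illustration: StreamWorkerHash,
StreamWorkerEntry, lookup_idle_worker() and start_stream_worker() are
invented names, not the patch's API; hash_search() and shm_mq_send() are
the real primitives involved.

/* Illustrative only -- not the patch's actual code. */
typedef struct StreamWorkerEntry
{
    TransactionId xid;              /* hash key */
    shm_mq_handle *mq;              /* queue to the dedicated bgworker */
} StreamWorkerEntry;

static void
route_streamed_change(TransactionId xid, const void *change, Size len)
{
    StreamWorkerEntry *entry;
    bool        found;

    /* (2) find the dedicated bgworker, reuse an idle one, or spawn one */
    entry = hash_search(StreamWorkerHash, &xid, HASH_ENTER, &found);
    if (!found)
    {
        entry->mq = lookup_idle_worker();       /* invented helper */
        if (entry->mq == NULL)
            entry->mq = start_stream_worker();  /* invented helper */
    }

    /* (3) hand the change over without waiting for it to be applied */
    shm_mq_send(entry->mq, len, change, false);
}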

Thus, we do not spill anything on the applier side, and transaction
changes are processed by bgworkers just as normal backends would process
them. At the same time, change processing is strictly serial, which
prevents transaction reordering and possible conflicts/anomalies. Even
though we trade off performance in favor of stability, the result is
rather impressive. I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
     SELECT round(random()*10), random(), random()*142
     FROM generate_series(1, 1000000) s(i);

with 1kk (1,000,000), 3kk and 5kk rows; logical_work_mem = 64MB and
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is
the following:

CREATE TABLE large_test (
     id serial primary key,
     num1 bigint,
     num2 double precision,
     num3 double precision
);

Here are the results:

-------------------------------------------------------------------
| N | Time on master, sec | Total xact time, sec |     Ratio      |
-------------------------------------------------------------------
|                        On commit (master, v13)                  |
-------------------------------------------------------------------
| 1kk | 6.5               | 17.6                 | x2.74          |
-------------------------------------------------------------------
| 3kk | 21                | 55.4                 | x2.64          |
-------------------------------------------------------------------
| 5kk | 38.3              | 91.5                 | x2.39          |
-------------------------------------------------------------------
|                        Stream + spill                           |
-------------------------------------------------------------------
| 1kk | 5.9               | 18                   | x3             |
-------------------------------------------------------------------
| 3kk | 19.5              | 52.4                 | x2.7           |
-------------------------------------------------------------------
| 5kk | 33.3              | 86.7                 | x2.86          |
-------------------------------------------------------------------
|                        Stream + BGW pool                        |
-------------------------------------------------------------------
| 1kk | 6                 | 12                   | x2             |
-------------------------------------------------------------------
| 3kk | 18.5              | 30.5                 | x1.65          |
-------------------------------------------------------------------
| 5kk | 35.6              | 53.9                 | x1.51          |
-------------------------------------------------------------------

It seems that the overhead added by the synchronous replica is 2-3 times
lower compared with Postgres master (on-commit apply) and with streaming
+ spilling. Therefore, the original patch eliminated the delay before the
sender starts processing a large transaction, while this additional patch
speeds up the applier side.

Although the overall speed-up is surely measurable, there is still room
for improvement:

1) Currently bgworkers are only spawned on demand, without any initial
pool, and are never stopped. Maybe we should create a small pool at
replication start and release some idle bgworkers if they exceed some
limit?

2) Probably we can somehow track whether an incoming change conflicts
with one of the xacts currently being processed, so that we wait for
specific bgworkers only in that case?

3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it should probably be possible to wait and set/check flags
without locks, using just atomics (sketched below).
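
For point 3, a minimal sketch of such a lock-free handshake, assuming one
flag per bgworker slot in shared memory (illustrative only, not code from
the patch; initialisation with pg_atomic_init_u32() is omitted):

static pg_atomic_uint32 chunk_pending;     /* would live in shared memory */

/* main apply worker: after queueing the last change of a chunk */
static void
mark_chunk_queued(void)
{
    pg_atomic_write_u32(&chunk_pending, 1);
}

/* main apply worker: block at STREAM STOP until the chunk is applied */
static void
wait_for_chunk_applied(void)
{
    while (pg_atomic_read_u32(&chunk_pending) != 0)
    {
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L);
    }
}

/* bgworker: signal completion after applying the whole chunk */
static void
mark_chunk_applied(void)
{
    pg_atomic_write_u32(&chunk_pending, 0);
}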

What do you think about this concept in general? Any concerns and
criticism are welcome!


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. This patch should be applicable on top of your last patch set. I would rebase it against master, but it depends on the 2PC patch, which I don't know well enough.


0011-BGWorkers-pool-for-streamed-transactions-apply-witho.patch (62K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:

>[...]
>
>What do you think about this concept in general? Any concerns and
>criticism are welcome!
>

Hi Alexey,

I'm unable to do any in-depth review of the patch over the next two weeks
or so, but I think the idea of having a pool of apply workers is sound and
can be quite beneficial for some workloads.

I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.

There's one huge challenge, however, that I don't see mentioned in your
message or in the patch (after a cursory reading) - ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.

Surely we want to end up with the same commit order as on the upstream,
otherwise we might easily get different data on the subscriber. So when
we pass a large transaction to a separate process, that process has to
wait for the other processes applying transactions that committed first,
and similarly, other processes have to wait for this process, depending
on the commit order. I might have missed something, but I don't see
anything like that in your patch.

Essentially, this means there needs to be some sort of wait between those
apply processes, enforcing the commit order.
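
For illustration only, such a wait could take the shape of a shared
counter of the last upstream commit that has been applied, which each
apply process must catch up to before committing its own transaction
(this is not something the patch implements; 'shared' and its field are
invented names):

static void
wait_for_commit_turn(XLogRecPtr prev_commit_end_lsn)
{
    /* block until every transaction that committed before ours is applied */
    while (pg_atomic_read_u64(&shared->last_applied_commit_lsn) < prev_commit_end_lsn)
    {
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L);
    }
}

And any such wait is precisely the kind of wait that can participate in
the deadlocks described next.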

That however means we can easily introduce deadlocks into workloads where
serial apply would not have that issue - imagine multiple large
transactions touching the same set of rows. We may ship them to different
bgworkers, and those processes may deadlock.

Of course, the deadlock detector will come around (assuming the wait is
done in a way visible to the detector) and will abort one of the
processes. But we don't know it'll abort the right one - it may easily
abort the apply process that needs to commit first, while everyone else
is waiting for it, which stalls the apply forever.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
On 28.08.2019 22:06, Tomas Vondra wrote:

> [...]

Hi Tomas,

Thank you for the quick response.

> I don't think it matters very much whether the workers are started at the
> beginning or allocated ad hoc, that's IMO a minor implementation detail.

OK, I had the same vision about this point. Any minor differences here
will be negligible for a sufficiently large transaction.

>
> There's one huge challenge that I however don't see mentioned in your
> message or in the patch (after cursory reading) - ensuring the same
> commit
> order, and introducing deadlocks that would not exist in single-process
> apply.

Probably I haven't explained this part well, sorry for that. In my patch
I don't use the worker pool for concurrent transaction apply, but rather
for fast context switching between long-lived streamed transactions. In
other words, we apply all changes arriving from the sender in a
completely serial manner. Written out step by step, it looks like this:

1) Read the STREAM START message and figure out the target worker by xid.

2) Put all changes belonging to this xact into the selected worker one by
one via shm_mq_send (the worker's receiving side is sketched after this
list).

3) Read the STREAM STOP message and wait until our worker has applied all
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAMED COMMIT/ABORT, we again wait until the selected
worker either commits or aborts.
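
For steps 2-3, the consuming side in the bgworker could look roughly like
the following (illustrative only; apply_dispatch() is the existing
message-dispatch routine in worker.c, the surrounding names are made up):

/* Illustrative only -- not the patch's actual code. */
static void
stream_worker_loop(shm_mq_handle *mqh)
{
    for (;;)
    {
        Size        len;
        void       *data;
        StringInfoData s;

        /* block until the main apply worker sends the next change */
        if (shm_mq_receive(mqh, &len, &data, false) != SHM_MQ_SUCCESS)
            break;              /* queue detached, worker goes idle */

        /* wrap the raw message and reuse the normal apply machinery */
        s.data = (char *) data;
        s.len = (int) len;
        s.maxlen = -1;
        s.cursor = 0;

        apply_dispatch(&s);
    }
}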

Thus, it automatically guarantees the same commit order on the replica
as on the master. Yes, we lose some performance here, since we don't
apply transactions concurrently, but doing so would bring all those
problems you have described.

However, you helped me figure out another point I had forgotten. Although
we ensure the commit order automatically, the beginnings of streamed
xacts may be reordered. It happens if some small xacts have been
committed on the master since the streamed one started, because we do not
start streaming immediately, but only after the logical_work_mem limit is
hit. I have performed some tests with conflicting xacts and it seems that
it's not a problem, since the locking mechanism in Postgres guarantees
that if there were any deadlocks, they would have happened earlier on the
master. So if some records hit the WAL, it is safe to apply them
sequentially. Am I wrong?

Anyway, I'm going to double check the safety of this part later.


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:

>On 28.08.2019 22:06, Tomas Vondra wrote:
>> [...]
>
>Hi Tomas,
>
>Thank you for a quick response.
>
>>I don't think it matters very much whether the workers are started at the
>>beginning or allocated ad hoc, that's IMO a minor implementation detail.
>
>OK, I had the same vision about this point. Any minor differences here
>will be neglectable for a sufficiently large transaction.
>
>>
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after cursory reading) - ensuring the same
>>commit
>>order, and introducing deadlocks that would not exist in single-process
>>apply.
>
>Probably I haven't explained well this part, sorry for that. In my
>patch I don't use workers pool for a concurrent transaction apply, but
>rather for a fast context switch between long-lived streamed
>transactions. In other words we apply all changes arrived from the
>sender in a completely serial manner. Being written step-by-step it
>looks like:
>
>1) Read STREAM START message and figure out the target worker by xid.
>
>2) Put all changes, which belongs to this xact to the selected worker
>one by one via shm_mq_send.
>
>3) Read STREAM STOP message and wait until our worker will apply all
>changes in the queue.
>
>4) Process all other chunks of streamed xacts in the same manner.
>
>5) Process all non-streamed xacts immediately in the main apply worker loop.
>
>6) If we read STREAMED COMMIT/ABORT we again wait until selected
>worker either commits or aborts.
>
>Thus, it automatically guaranties the same commit order on replica as
>on master. Yes, we loose some performance here, since we don't apply
>transactions concurrently, but it would bring all those problems you
>have described.
>

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.

>However, you helped me to figure out another point I have forgotten.
>Although we ensure commit order automatically, the beginning of
>streamed xacts may reorder. It happens if some small xacts have been
>commited on master since the streamed one started, because we do not
>start streaming immediately, but only after logical_work_mem hit. I
>have performed some tests with conflicting xacts and it seems that
>it's not a problem, since locking mechanism in Postgres guarantees
>that if there would some deadlocks, they will happen earlier on
>master. So if some records hit the WAL, it is safe to apply the
>sequentially. Am I wrong?
>

I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I
don't think reordering the blocks of streamed transactions matters, as
long as the commit order is ensured in this case.

>Anyway, I'm going to double check the safety of this part later.
>

OK.

FWIW my understanding is that the speedup comes mostly from eliminating
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

konstantin knizhnik

>
> FWIW my understanding is that the speedup comes mostly from
> elimination of
> the serialization to a file. That however requires savepoints to handle
> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
> workload where this will be much slower (with many aborts of large
> subtransactions).
>
>

I think that instead of defining savepoints it is simpler and more
efficient to use

BeginInternalSubTransaction +
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr
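
A minimal sketch of what that could look like in the bgworker, assuming
these three calls from xact.c are the only pieces needed (the surrounding
structure is purely illustrative, not code from any posted patch):

/* on decoding a subtransaction start from the stream */
BeginInternalSubTransaction(NULL);

/* ... apply the subtransaction's changes ... */

/* if the subtransaction committed upstream */
ReleaseCurrentSubTransaction();

/* if the subtransaction was rolled back upstream */
RollbackAndReleaseCurrentSubTransaction();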

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alvaro Herrera-9
In reply to this post by Tomas Vondra-4
In the interest of moving things forward, how far are we from making
0001 committable?  If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>In the interest of moving things forward, how far are we from making
>0001 committable?  If I understand correctly, the rest of this patchset
>depends on https://commitfest.postgresql.org/24/944/ which seems to be
>moving at a glacial pace (or, actually, slower, because glaciers do
>move, which cannot be said of that other patch.)
>

I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.

FWIW maybe we should stop comparing things to glaciers. 50 years from now
people won't know what a glacier is, and it'll be just like the floppy
icon on the save button.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alexey Kondratov-2
In reply to this post by konstantin knizhnik
>>
>> FWIW my understanding is that the speedup comes mostly from
>> elimination of
>> the serialization to a file. That however requires savepoints to handle
>> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
>> workload where this will be much slower (with many aborts of large
>> subtransactions).
>>

Yes, and that was my main motivation to eliminate the extra serialization
to file. I've experimented a bit with large transactions + savepoints +
aborts and ended up with the following query (the same schema as before,
with 600k rows):

BEGIN;
SAVEPOINT s1;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s2;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s3;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
ROLLBACK TO SAVEPOINT s3;
ROLLBACK TO SAVEPOINT s2;
ROLLBACK TO SAVEPOINT s1;
END;

It looks like the worst-case scenario, as we do a lot of work and then
abort all subxacts one by one. As expected, it takes much longer (up to
x30) to process using a background worker instead of spilling to file.
Surely, it is much easier to truncate a file than to apply all changes
and then abort. However, I guess that this kind of load pattern is not
the most typical for real-life applications.

Also, this test helped me find a bug in my current savepoints routine, so
a new patch is attached.

On 30.08.2019 18:59, Konstantin Knizhnik wrote:

>
> I think that instead of defining savepoints it is simpler and more
> efficient to use
>
> BeginInternalSubTransaction +
> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>
> as it is done in PL/pgSQL (pl_exec.c).
> Not sure if it can pr
>
Both BeginInternalSubTransaction and DefineSavepoint use
PushTransaction() internally for a normal subtransaction start. So they
seem to be identical from a performance perspective, which is also stated
in the comment section:

/*
 * BeginInternalSubTransaction
 *        This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
 *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
 *        and therefore it can safely be used in functions that might be called
 *        when not inside a BEGIN block or when running deferred triggers at
 *        COMMIT/PREPARE time.  Also, it automatically does
 *        CommitTransactionCommand/StartTransactionCommand instead of expecting
 *        the caller to do it.
 */

Please, correct me if I'm wrong.

Anyway, I've profiled my apply worker (flamegraph is attached) and it
spends the vast majority of its time (>90%) applying changes. So the
problem is not in the savepoints themselves, but in the fact that we
first apply all changes and then abort all the work. Not sure that it is
possible to do anything about this case.


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


worker_aborts_perf.zip (95K)
v2-0011-BGWorkers-pool-for-streamed-transactions-apply-wi.patch (62K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

konstantin knizhnik


On 16.09.2019 19:54, Alexey Kondratov wrote:

> On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>
>> I think that instead of defining savepoints it is simpler and more
>> efficient to use
>>
>> BeginInternalSubTransaction +
>> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>
>> as it is done in PL/pgSQL (pl_exec.c).
>> Not sure if it can pr
>>
>
> Both BeginInternalSubTransaction and DefineSavepoint use
> PushTransaction() internally for a normal subtransaction start. So
> they seems to be identical from the performance perspective, which is
> also stated in the comment section:

Yes, they are definitely using the same mechanism and most likely provide
similar performance. But BeginInternalSubTransaction does not require
generating a savepoint name, which seems to be redundant in this case.


>
> Anyway, I've performed a profiling of my apply worker (flamegraph is
> attached) and it spends the vast amount of time (>90%) applying
> changes. So the problem is not in the savepoints their-self, but in
> the fact that we first apply all changes and then abort all the work.
> Not sure, that it is possible to do something in this case.
>

Looks like the only way to increase apply speed is to do it in parallel:
make it possible to concurrently execute non-conflicting transactions.





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote:

>[...]
>
>>Anyway, I've performed a profiling of my apply worker (flamegraph is
>>attached) and it spends the vast amount of time (>90%) applying
>>changes. So the problem is not in the savepoints their-self, but in
>>the fact that we first apply all changes and then abort all the
>>work. Not sure, that it is possible to do something in this case.
>>
>
>Looks like the only way to increase apply speed is to do it in
>parallel: make it possible to concurrently execute non-conflicting
>transactions.
>

True, although it seems like a massive can of worms to me. I'm not aware
of a way to identify non-conflicting transactions in advance, so it would
have to be implemented as optimistic apply, with detection of and
recovery from conflicts.

I'm not against doing that, and I'm willing to spend some time on reviews
etc., but it seems like a completely separate effort.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

akapila
In reply to this post by Alvaro Herrera-9
On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <[hidden email]> wrote:
>
> In the interest of moving things forward, how far are we from making
> 0001 committable?  If I understand correctly, the rest of this patchset
> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> moving at a glacial pace (or, actually, slower, because glaciers do
> move, which cannot be said of that other patch.)
>

I am not sure if it is completely correct that the other part of the
patch is dependent on that CF entry.  I have studied both the threads
(not every detail) and it seems to me it is dependent on one of the
patches from that series which handles concurrent aborts.  It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1].  Am I wrong?

So IIUC, the problem of concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter, which is clearly
explained with an example in email [2].  To solve that problem Nikhil
seems to have written a patch [1] which detects these concurrent
aborts during a system table scan and then aborts the decoding of such
a transaction.

Now, the problem is that that patch was written with 2PC transactions in
mind and might not deal with all cases for in-progress transactions,
especially when sub-transactions are involved, as alluded to by Arseny
Sher [3].  So, the problem seems to be for cases when some
sub-transaction aborts, but the main transaction still continues and we
try to decode it.  Nikhil's patch won't be able to deal with that because
I think it just checks the top-level xid, whereas for this we need to
check all subxids, which I think is possible now as Tomas seems to have
written WAL for each xid-assignment.  It might or might not be the best
solution to check the status of all subxids, but I think first we need to
agree that the problem is just for concurrent aborts and that we can
solve it by using some part of the technology being developed as part of
the patch "Logical decoding of two-phase transactions"
(https://commitfest.postgresql.org/24/944/) rather than the entire
patchset.
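
To make that concrete, the check would conceptually be something like
this (illustrative only, not code from any of the posted patches):

/* Did the decoded transaction, or any of its subtransactions, abort? */
static bool
concurrent_abort_detected(TransactionId xid,
                          TransactionId *subxids, int nsubxids)
{
    int     i;

    if (TransactionIdDidAbort(xid))
        return true;

    for (i = 0; i < nsubxids; i++)
    {
        if (TransactionIdDidAbort(subxids[i]))
            return true;
    }

    return false;
}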

I hope I am not saying something very obvious here and it helps in
moving this patch forward.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW%3DUo70iBY3P_EPdp%2BLTQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2%40postgrespro.ru
[3] - https://www.postgresql.org/message-id/87a7py4iwl.fsf%40ars-thinkpad

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

akapila
In reply to this post by Tomas Vondra-4
On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
<[hidden email]> wrote:

>
> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
> >In the interest of moving things forward, how far are we from making
> >0001 committable?  If I understand correctly, the rest of this patchset
> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >moving at a glacial pace (or, actually, slower, because glaciers do
> >move, which cannot be said of that other patch.)
> >
>
> I think 0001 is mostly there. I think there's one bug in this patch
> version, but I need to check and I'll post an updated version shortly if
> needed.
>

Did you get a chance to work on 0001?  I have a few comments on that patch:
1.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is either
+ *   serialized or streamed.

Do we need to mention 'streamed' as part of this patch?  It seems to
me that this is an independent patch which can be committed without
patches that stream the changes. So, we can remove it from here and
other places where it is used.

2.
+ *   deserializing and applying very few changes). We probably to give more
+ *   memory to the oldest subtransactions.

/We probably to/
It seems some word is missing after probably.

3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)

What is the guarantee that after evicting the largest transaction, we
won't immediately hit the memory limit again?  Say, all of the
transactions are of almost similar size, which I don't think is that
uncommon a case.  Instead, the strategy mentioned in point (c) or
something like that seems more promising.  In that strategy, there is
some risk that it might lead to many smaller disk writes, which we might
want to control via some threshold (like we should not flush more than N
xacts).  In this, we also need to ensure that the total memory freed
must be greater than the current change.

I think we had some discussion around this point but didn't reach any
conclusion, which means some more brainstorming is required.

4.
+int logical_work_mem; /* 4MB */

What does the 4MB in the comment indicate?

5.
+/*
+ * Check whether the logical_work_mem limit was reached, and if yes pick
+ * the transaction tx should spill its data to disk.

The second part of the sentence "pick the transaction tx should spill"
seems to be incomplete.

Apart from this, I see that Peter E. has raised some other points on
this patch which are not yet addressed as those also need some
discussion, so I will respond to those separately with my opinion.

These comments are based on the last patch posted by you on this
thread [1].  You might have fixed some of these already, so ignore if
that is the case.

[1] - https://www.postgresql.org/message-id/76fc440e-91c3-afe2-b78a-987205b3c758%402ndquadrant.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by akapila
Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

I've also included the patch series with decoding of 2PC transactions,
which this depends on. This way we have a chance of making the cfbot
happy. So parts 0001-0004 and 0009-0014 are "this" patch series, while
0005-0008 are the extra pieces from the other patch.

I've done it like this because the initial parts are independent, and so
might be committed irrespective of the other patch series. In practice
that's only reasonable for 0001, which adds the memory limit - the rest
is infrastructure for the streaming of in-progress transactions.

On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:

>On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <[hidden email]> wrote:
>>
>> In the interest of moving things forward, how far are we from making
>> 0001 committable?  If I understand correctly, the rest of this patchset
>> depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> moving at a glacial pace (or, actually, slower, because glaciers do
>> move, which cannot be said of that other patch.)
>>
>
>I am not sure if it is completely correct that the other part of the
>patch is dependent on that CF entry.  I have studied both the threads
>(not every detail) and it seems to me it is dependent on one of the
>patches from that series which handles concurrent aborts.  It is patch
>0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
>from what the Nikhil has posted on that thread [1].  Am, I wrong?
>
You're right - the part handling aborts is the only part required. There
are dependencies on some other changes from the 2PC patch, but those are
mostly refactorings that can be undone (e.g. the switch from independent
flags to a single bitmap in reorderbuffer).

>So IIUC, the problem of concurrent aborts is that if we allow catalog
>scans for in-progress transactions, then we might get wrong answers in
>cases where somebody has performed Alter-Abort-Alter which is clearly
>explained with an example in email [2].  To solve that problem Nikhil
>seems to have written a patch [1] which detects these concurrent
>aborts during a system table scan and then aborts the decoding of such
>a transaction.
>
>Now, the problem is that patch has written considering 2PC
>transactions and might not deal with all cases for in-progress
>transactions especially when sub-transactions are involved as alluded
>by Arseny Sher [3].  So, the problem seems to be for cases when some
>sub-transaction aborts, but the main transaction still continued and
>we try to decode it.  Nikhil's patch won't be able to deal with it
>because I think it just checks top-level xid whereas for this we need
>to check all-subxids which I think is possible now as Tomas seems to
>have written WAL for each xid-assignment.  It might or might not be
>the best solution to check the status of all-subxids, but I think
>first we need to agree that the problem is just for concurrent aborts
>and that we can solve it by using some part of the technology being
>developed as part of patch "Logical decoding of two-phase
>transactions" (https://commitfest.postgresql.org/24/944/) rather than
>the entire patchset.
>
>I hope I am not saying something very obvious here and it helps in
>moving this patch forward.
>
No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong. I need to think
about that for a while.

Maybe we should focus on the 0001 part for now - it can be committed
independently and does provide a useful feature.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

0001-Add-logical_work_mem-to-limit-ReorderBuffer-20190926.patch.gz (14K)
0002-Immediately-WAL-log-assignments-20190926.patch.gz (9K)
0003-Issue-individual-invalidations-with-wal_lev-20190926.patch.gz (6K)
0004-Extend-the-output-plugin-API-with-stream-me-20190926.patch.gz (8K)
0005-Cleaning-up-of-flags-in-ReorderBufferTXN-st-20190926.patch.gz (3K)
0006-Support-decoding-of-two-phase-transactions--20190926.patch.gz (12K)
0007-Gracefully-handle-concurrent-aborts-of-unco-20190926.patch.gz (6K)
0008-Teach-test_decoding-plugin-to-work-with-2PC-20190926.patch.gz (8K)
0009-Implement-streaming-mode-in-ReorderBuffer-20190926.patch.gz (17K)
0010-Add-support-for-streaming-to-built-in-repli-20190926.patch.gz (27K)
0011-Track-statistics-for-streaming-spilling-20190926.patch.gz (5K)
0012-Enable-streaming-for-all-subscription-TAP-t-20190926.patch.gz (2K)
0013-BUGFIX-set-final_lsn-for-subxacts-before-cl-20190926.patch.gz (868 bytes)
0014-Add-TAP-test-for-streaming-vs.-DDL-20190926.patch.gz (2K)

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by akapila
On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:

>On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
><[hidden email]> wrote:
>>
>> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>> >In the interest of moving things forward, how far are we from making
>> >0001 committable?  If I understand correctly, the rest of this patchset
>> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> >moving at a glacial pace (or, actually, slower, because glaciers do
>> >move, which cannot be said of that other patch.)
>> >
>>
>> I think 0001 is mostly there. I think there's one bug in this patch
>> version, but I need to check and I'll post an updated version shortly if
>> needed.
>>
>
>Did you get a chance to work on 0001?  I have a few comments on that patch:
>1.
>+ *   To limit the amount of memory used by decoded changes, we track memory
>+ *   used at the reorder buffer level (i.e. total amount of memory), and for
>+ *   each toplevel transaction. When the total amount of used memory exceeds
>+ *   the limit, the toplevel transaction consuming the most memory is either
>+ *   serialized or streamed.
>
>Do we need to mention 'streamed' as part of this patch?  It seems to
>me that this is an independent patch which can be committed without
>patches that stream the changes. So, we can remove it from here and
>other places where it is used.
>

You're right - this patch should not mention streaming, because the parts
adding that capability come later in the series. So it can only trigger
the serialization to disk.

>2.
>+ *   deserializing and applying very few changes). We probably to give more
>+ *   memory to the oldest subtransactions.
>
>/We probably to/
>It seems some word is missing after probably.
>

Yes.

>3.
>+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
>+ *
>+ * XXX With many subtransactions this might be quite slow, because we'll have
>+ * to walk through all of them. There are some options how we could improve
>+ * that: (a) maintain some secondary structure with transactions sorted by
>+ * amount of changes, (b) not looking for the entirely largest transaction,
>+ * but e.g. for transaction using at least some fraction of the memory limit,
>+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
>+ * of the memory limit (e.g. 50%).
>+ */
>+static ReorderBufferTXN *
>+ReorderBufferLargestTXN(ReorderBuffer *rb)
>
>What is the guarantee that after evicting largest transaction, we
>won't immediately hit the memory limit?  Say, all of the transactions
>are of almost similar size which I don't think is that uncommon a
>case.

Not sure I understand - what do you mean 'immediately hit'?

We do check the limit after queueing a change, and we know that this
change is what got us over the limit. We pick the largest transaction
(which has to be larger than the change we just entered) and evict it,
getting below the memory limit again.

The next change can get us over the memory limit again, of course, but
there's not much we could do about that.
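
For reference, the bookkeeping that makes this check cheap is essentially
per-change size accounting at two levels, roughly like the following (a
sketch only; the names approximate the patch, and ReorderBufferChangeSize()
is assumed to compute the size of a single change):

static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
                                ReorderBufferChange *change,
                                bool addition)
{
    Size    sz = ReorderBufferChangeSize(change);

    if (addition)
    {
        change->txn->size += sz;   /* per-toplevel-transaction counter */
        rb->size += sz;            /* whole reorder buffer counter */
    }
    else
    {
        change->txn->size -= sz;
        rb->size -= sz;
    }
}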

>  Instead, the strategy mentioned in point (c) or something like
>that seems more promising.  In that strategy, there is some risk that
>it might lead to many smaller disk writes which we might want to
>control via some threshold (like we should not flush more than N
>xacts).  In this, we also need to ensure that the total memory freed
>must be greater than the current change.
>
>I think we have some discussion around this point but didn't reach any
>conclusion which means some more brainstorming is required.
>

I agree it's worth investigating, but I'm not sure it's necessary before
committing v1 of the feature. I don't think there's a clear winner
strategy, and the current approach works fairly well, I think.

The comment is concerned with the cost of ReorderBufferLargestTXN with
many transactions, but we can only have a certain number of top-level
transactions (max_connections + a certain number of not-yet-assigned
subtransactions). And the 0002 patch essentially gets rid of the subxacts
entirely, further reducing the maximum number of xacts to walk.
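
For comparison, strategy (c) would essentially turn the single eviction
into a loop like the following (a sketch only, not part of the posted
patch; ReorderBufferSerializeTXN() is the existing spill-to-disk routine,
the txn->size / rb->size counters are the ones the patch introduces, and
the 50% target is arbitrary):

static void
ReorderBufferEvictUntilFreed(ReorderBuffer *rb)
{
    Size    target = (Size) logical_work_mem * 1024L / 2;  /* free ~50% */
    Size    freed = 0;

    while (freed < target && rb->size > 0)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        freed += txn->size;
        ReorderBufferSerializeTXN(rb, txn);     /* spill it to disk */
    }
}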

>4.
>+int logical_work_mem; /* 4MB */
>
>What this 4MB in comments indicate?
>

Sorry, that's a mistake.

>5.
>+/*
>+ * Check whether the logical_work_mem limit was reached, and if yes pick
>+ * the transaction tx should spill its data to disk.
>
>The second part of the sentence "pick the transaction tx should spill"
>seems to be incomplete.
>

Yeah, that's a poor wording. Will fix.

>Apart from this, I see that Peter E. has raised some other points on
>this patch which are not yet addressed as those also need some
>discussion, so I will respond to those separately with my opinion.
>

OK, thanks.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alvaro Herrera-9
In reply to this post by Tomas Vondra-4
On 2019-Sep-26, Tomas Vondra wrote:

> Hi,
>
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).

Cool.

Can we aim to get 0001 pushed during this commitfest, or is that a lost
cause?

The large new comment in reorderbuffer.c says that a transaction might
get spilled *or streamed*, but surely that second thing is not correct,
since before the subsequent patches it's not possible to stream
transactions that have not yet finished?

How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop?  I wonder if it would make more sense to use a separate context
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Alvaro Herrera-9
On 2019-Sep-26, Alvaro Herrera wrote:

> How certain are you about the approach to measure memory used by a
> reorderbuffer transaction ... does it not cause a measurable performance
> drop?  I wonder if it would make more sense to use a separate context
> per transaction and use context-level accounting (per the patch Jeff
> Davis posted elsewhere for hash joins ... though I see now that that
> only works for aset.c, not other memcxt implementations), or something
> like that.

Oh, I just noticed that that patch was posted separately in its own
thread, and that that improved version does include support for other
memory context implementations.  Excellent.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
On Thu, Sep 26, 2019 at 04:36:20PM -0300, Alvaro Herrera wrote:

>On 2019-Sep-26, Alvaro Herrera wrote:
>
>> How certain are you about the approach to measure memory used by a
>> reorderbuffer transaction ... does it not cause a measurable performance
>> drop?  I wonder if it would make more sense to use a separate context
>> per transaction and use context-level accounting (per the patch Jeff
>> Davis posted elsewhere for hash joins ... though I see now that that
>> only works for aset.c, not other memcxt implementations), or something
>> like that.
>
>Oh, I just noticed that that patch was posted separately in its own
>thread, and that that improved version does include support for other
>memory context implementations.  Excellent.
>

Unfortunately, that won't fly, for two simple reasons:

1) The memory accounting patch is known to perform poorly with many
child contexts - this was why array_agg/string_agg were problematic,
before we rewrote them not to create a memory context for each group.

It could be done differently (eager accounting) but then the overhead
for regular/common cases (with just a couple of contexts) is higher. So
that seems like a much inferior option.

2) We can't actually have a single context per transaction. Some parts
(REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID) of a transaction are not
evicted, so we'd have to keep them in a separate context.

It'd also mean higher allocation overhead, because currently we can
reuse chunks across transactions - one transaction commits or gets
serialized, and we reuse its chunks for something else. With
per-transaction contexts we'd lose some of this benefit: we could only
reuse chunks within a transaction (which matters mainly for large
transactions that get spilled to disk), but not across commits.

I don't have any numbers, of course, but I wouldn't be surprised if it
was significant e.g. for small transactions that don't get spilled. And
creating/destroying the contexts is not free either, I think.
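For reference, the accounting the patch uses instead of per-transaction
contexts is plain per-change bookkeeping, roughly along these lines
(simplified sketch, not the exact patch code):

/*
 * Simplified sketch of per-change accounting: whenever a change is added to
 * or removed from a transaction, its size is added to / subtracted from both
 * the transaction and the whole reorder buffer. This avoids a memory context
 * per transaction while still letting us find the largest transaction.
 */
static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
                                ReorderBufferChange *change,
                                bool addition)
{
    Size        sz = ReorderBufferChangeSize(change);
    ReorderBufferTXN *txn = change->txn;

    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}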


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

Tomas Vondra-4
In reply to this post by Alvaro Herrera-9
On Thu, Sep 26, 2019 at 04:33:59PM -0300, Alvaro Herrera wrote:

>On 2019-Sep-26, Tomas Vondra wrote:
>
>> Hi,
>>
>> Attached is an updated patch series, rebased on current master. It does
>> fix one memory accounting bug in ReorderBufferToastReplace (the code was
>> not properly updating the amount of memory).
>
>Cool.
>
>Can we aim to get 0001 pushed during this commitfest, or is that a lost
>cause?
>

It's tempting. The patch has been in the queue for quite a bit of time,
and I think it's solid (at least 0001). I'll address the comments from
Peter's review about separating the GUC etc. and polish it a bit more.
If I manage to do that by Monday, I'll consider pushing it.

If anyone feels I shouldn't do that, let me know.

The one open question pointed out by Amit is how the patch picks the
transaction for eviction. My feeling is that's fine and can be improved
later if necessary, but I'll try to construct a worst case
(max_connections xacts, each with 64 subxacts) to verify.

>The large new comment in reorderbuffer.c says that a transaction might
>get spilled *or streamed*, but surely that second thing is not correct,
>since before the subsequent patches it's not possible to stream
>transactions that have not yet finished?
>

True. That's a residue of reordering the patch series repeatedly, I
think. I'll fix that while polishing the patch.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

akapila
In reply to this post by Tomas Vondra-4
On Fri, Sep 27, 2019 at 12:06 AM Tomas Vondra
<[hidden email]> wrote:

>
> On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>
> >3.
> >+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
> >+ *
> >+ * XXX With many subtransactions this might be quite slow, because we'll have
> >+ * to walk through all of them. There are some options how we could improve
> >+ * that: (a) maintain some secondary structure with transactions sorted by
> >+ * amount of changes, (b) not looking for the entirely largest transaction,
> >+ * but e.g. for transaction using at least some fraction of the memory limit,
> >+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
> >+ * of the memory limit (e.g. 50%).
> >+ */
> >+static ReorderBufferTXN *
> >+ReorderBufferLargestTXN(ReorderBuffer *rb)
> >
> >What is the guarantee that after evicting largest transaction, we
> >won't immediately hit the memory limit?  Say, all of the transactions
> >are of almost similar size which I don't think is that uncommon a
> >case.
>
> Not sure I understand - what do you mean 'immediately hit'?
>
> We do check the limit after queueing a change, and we know that this
> change is what got us over the limit. We pick the largest transaction
> (which has to be larger than the change we just entered) and evict it,
> getting below the memory limit again.
>
> The next change can get us over the memory limit again, of course,
>

Yeah, this is what I wanted to say when I wrote that it can immediately
hit the limit again.

> but
> there's not much we could do about that.
>
> >  Instead, the strategy mentioned in point (c) or something like
> >that seems more promising.  In that strategy, there is some risk that
> >it might lead to many smaller disk writes which we might want to
> >control via some threshold (like we should not flush more than N
> >xacts).  In this, we also need to ensure that the total memory freed
> >must be greater than the current change.
> >
> >I think we have some discussion around this point but didn't reach any
> >conclusion which means some more brainstorming is required.
> >
>
> I agree it's worth investigating, but I'm not sure it's necessary before
> committing v1 of the feature. I don't think there's a clear winner
> strategy, and the current approach works fairly well I think.
>
> The comment is concerned with the cost of ReorderBufferLargestTXN with
> many transactions, but we can only have a certain number of top-level
> transactions (max_connections plus a certain number of not-yet-assigned
> subtransactions). And the 0002 patch essentially gets rid of the
> subxacts entirely, further reducing the maximum number of xacts to walk.
>

That would be good, but I don't understand how.  The second patch will
update the subxacts in the top-level ReorderBufferTXN, but it won't
remove them from the hash table.  It also doesn't seem to account for
the size of subxacts in the top-level xact, so I am not sure how it
will reduce the number of xacts to walk.  I might be missing something
here.  Can you explain a bit how the 0002 patch would help in reducing
the maximum number of xacts to walk?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

akapila
In reply to this post by Peter Eisentraut-6
On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<[hidden email]> wrote:

>
> On 1/3/18 14:53, Tomas Vondra wrote:
> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> maintenance_work_mem is often set to very large values, which could
> >> then have undesirable side effects on this use.
> >
> > Well, we need to pick some default value, and we can either use a fixed
> > value (not sure what would be a good default) or tie it to an existing
> > GUC. We only really have work_mem and maintenance_work_mem, and the
> > walsender process will never use more than one such buffer. Which seems
> > to be closer to maintenance_work_mem.
> >
> > Pretty much any default value can have undesirable side effects.
>
> Let's just make it an independent setting unless we know any better.  We
> don't have a lot of settings that depend on other settings, and the ones
> we do have a very specific relationship.
>
> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> version of work_mem.  Maybe we could think of another name.
> >
> > I won't object to a better name, of course. Any proposals?
>
> logical_decoding_[work_]mem?
>

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which users might not have a good idea
how to set.  What problems do we see if we directly use work_mem for
this case?

If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.
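For illustration, an independent setting along those lines would just be
another GUC entry in guc.c; a sketch only, assuming kB units, with the
default/limits as placeholders (the actual name, unit and default are
exactly what's being discussed here):

/* sketch of a guc.c ConfigureNamesInt[] entry; values are placeholders */
{
    {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Sets the maximum memory to be used for logical decoding."),
        gettext_noop("This much memory can be used by each internal "
                     "reorder buffer before spilling to disk."),
        GUC_UNIT_KB
    },
    &logical_decoding_work_mem,
    65536, 64, MAX_KILOBYTES,       /* placeholder default of 64MB */
    NULL, NULL, NULL
},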


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

