logical decoding of two-phase transactions


Stas Kelvich-3
Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
as ordinary transactions at commit time) [1]. I've slightly polished things and used the test_decoding output plugin as the client.

The general idea is quite simple:

* Write the GID along with commit/prepare records in the case of 2PC.
* Add several routines to decode prepare records in the same way as already happens in logical decoding.

I've also added an explicit LOCK statement to the test_decoding regression suite to check that it doesn't break anything. If
somebody can create a scenario that blocks decoding because of an existing dummy backend lock, that would be a great
help. Right now all my tests pass (including the TAP tests from the adjacent mail thread that check recovery of two-phase
transactions after failures).

If we agree on the current approach, then I'm ready to add this to the proposed in-core logical replication.

[1] https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru




--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment: logical_twophase.diff (24K)

Re: logical decoding of two-phase transactions

Simon Riggs
On 31 December 2016 at 08:36, Stas Kelvich <[hidden email]> wrote:
> Here is resubmission of patch to implement logical decoding of two-phase transactions (instead of treating them
> as usual transaction when commit) [1] I’ve slightly polished things and used test_decoding output plugin as client.

Sounds good.

> General idea quite simple here:
>
> * Write gid along with commit/prepare records in case of 2pc

GID is now variable sized. You seem to have added this to every
commit, not just 2PC

> * Add several routines to decode prepare records in the same way as it already happens in logical decoding.
>
> I’ve also added explicit LOCK statement in test_decoding regression suit to check that it doesn’t break thing.

Please explain that in comments in the patch.

>  If
> somebody can create scenario that will block decoding because of existing dummy backend lock that will be great
> help. Right now all my tests passing (including TAP tests to check recovery of twophase tx in case of failures from
> adjacent mail thread).
>
> If we will agree about current approach than I’m ready to add this stuff to proposed in-core logical replication.
>
> [1] https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru

We'll need some measurements about additional WAL space or mem usage
from these approaches. Thanks.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: logical decoding of two-phase transactions

Simon Riggs
On 4 January 2017 at 21:20, Simon Riggs <[hidden email]> wrote:

> On 31 December 2016 at 08:36, Stas Kelvich <[hidden email]> wrote:
>> Here is resubmission of patch to implement logical decoding of two-phase transactions (instead of treating them
>> as usual transaction when commit) [1] I’ve slightly polished things and used test_decoding output plugin as client.
>
> Sounds good.
>
>> General idea quite simple here:
>>
>> * Write gid along with commit/prepare records in case of 2pc
>
> GID is now variable sized. You seem to have added this to every
> commit, not just 2PC

I've just realised that you're adding GID because it allows you to
uniquely identify the prepared xact. But then the prepared xact will
also have a regular TransactionId, which is also unique. GID exists
for users to specify things, but it is not needed internally and we
don't need to add it here. What we do need is for the commit prepared
message to remember what the xid of the prepare was and then re-find
it using the commit WAL record's twophase_xid field. So we don't need
to add GID to any WAL records, nor to any in-memory structures.
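
A rough sketch of the lookup described above, not code from the patch: the decoder keeps what it saw at PREPARE keyed by xid, and the commit prepared record carries only that xid (called twophase_xid here, as in the text). The function names and the dictionary are hypothetical and only illustrate that no GID is needed to re-find the transaction.

```python
from typing import Dict, List

# Hypothetical decoder-side state: changes seen at PREPARE, keyed by xid.
pending_prepared: Dict[int, List[str]] = {}


def on_prepare_record(xid: int, changes: List[str]) -> None:
    """Remember a prepared transaction's changes under its xid."""
    pending_prepared[xid] = changes


def on_commit_prepared_record(twophase_xid: int) -> List[str]:
    """Re-find the prepared transaction using only the xid in the commit record."""
    return pending_prepared.pop(twophase_xid)


on_prepare_record(1234, ["INSERT INTO t VALUES (1)"])
assert on_commit_prepared_record(1234) == ["INSERT INTO t VALUES (1)"]
assert 1234 not in pending_prepared  # nothing is left once the xact is resolved
```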

Please re-work the patch to include twophase_xid, which should make
the patch smaller and much faster too.

Please add comments to explain how and why patches work. Design
comments allow us to check the design makes sense and if it does
whether all the lines in the patch are needed to follow the design.
Without that patches are much harder to commit and we all want patches
to be easier to commit.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: logical decoding of two-phase transactions

Stas Kelvich-3
Thank you for looking into this.

> On 5 Jan 2017, at 09:43, Simon Riggs <[hidden email]> wrote:
>>
>> GID is now variable sized. You seem to have added this to every
>> commit, not just 2PC
>

Hm, I didn't realise that; I'll fix it.

> I've just realised that you're adding GID because it allows you to
> uniquely identify the prepared xact. But then the prepared xact will
> also have a regular TransactionId, which is also unique. GID exists
> for users to specify things, but it is not needed internally and we
> don't need to add it here.

I think we can't avoid pushing the GID down to the client side anyway.

If we push only the local TransactionId down to the remote server, then we lose the mapping
from GID to TransactionId, and there will be no way for the user to identify their transaction on
the second server. Also, X/Open XA and lots of libraries (e.g. J2EE) assume that the
same GID is used everywhere and that it is the GID that was issued by the client.

Requirements for two-phase decoding can differ depending on what one wants
to build around it, and I believe in some situations pushing down the xid is enough. But IMO
dealing with reconnects, failures and client libraries will force the programmer to use
the same GID everywhere.

> What we do need is for the commit prepared
> message to remember what the xid of the prepare was and then re-find
> it using the commit WAL record's twophase_xid field. So we don't need
> to add GID to any WAL records, nor to any in-memory structures.

The other part of the story is how to find the GID when decoding the commit prepared record.
I did that by adding a GID field to the commit WAL record, because by the time of decoding
all the memory structures that held the xid<->gid correspondence have already been cleaned up.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: logical decoding of two-phase transactions

Simon Riggs
On 5 January 2017 at 10:21, Stas Kelvich <[hidden email]> wrote:

> Thank you for looking into this.
>
>> On 5 Jan 2017, at 09:43, Simon Riggs <[hidden email]> wrote:
>>>
>>> GID is now variable sized. You seem to have added this to every
>>> commit, not just 2PC
>>
>
> Hm, didn’t realise that, i’ll fix.
>
>> I've just realised that you're adding GID because it allows you to
>> uniquely identify the prepared xact. But then the prepared xact will
>> also have a regular TransactionId, which is also unique. GID exists
>> for users to specify things, but it is not needed internally and we
>> don't need to add it here.
>
> I think we anyway can’t avoid pushing down GID to the client side.
>
> If we will push down only local TransactionId to remote server then we will lose mapping
> of GID to TransactionId, and there will be no way for user to identify his transaction on
> second server. Also Open XA and lots of libraries (e.g. J2EE) assumes that there is
> the same GID everywhere and it’s the same GID that was issued by the client.
>
> Requirements for two-phase decoding can be different depending on what one want
> to build around it and I believe in some situations pushing down xid is enough. But IMO
> dealing with reconnects, failures and client libraries will force programmer to use
> the same GID everywhere.

Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?

I guess if you are using >2 nodes then you need to use full 2PC on each node.

But even then, if you adopt the naming convention that all in-progress
xacts will be called RepOriginId-EPOCH-XID, so they have a fully
unique GID on all of the child nodes then we don't need to add the
GID.
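
A minimal sketch of such a naming convention, assuming a dash-separated decimal encoding (the text only names the three components; the separator, the integer types and both function names are assumptions made for illustration):

```python
from typing import Tuple


def make_gid(origin_id: int, epoch: int, xid: int) -> str:
    """Build a GID of the form RepOriginId-EPOCH-XID."""
    return f"{origin_id}-{epoch}-{xid}"


def parse_gid(gid: str) -> Tuple[int, int, int]:
    """Split such a GID back into (origin_id, epoch, xid)."""
    origin_id, epoch, xid = (int(part) for part in gid.split("-"))
    return origin_id, epoch, xid


gid = make_gid(origin_id=1, epoch=0, xid=1234)
print(gid)  # 1-0-1234
```

Because the GID is derived purely from values the child node already receives, nothing extra has to be written to WAL for it; the cost is that the format is fixed rather than chosen by the user.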

Please explain precisely how you expect to use this, to check that GID
is required.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: logical decoding of two-phase transactions

Stas Kelvich-3

> On 5 Jan 2017, at 13:49, Simon Riggs <[hidden email]> wrote:
>
> Surely in this case the master server is acting as the Transaction
> Manager, and it knows the mapping, so we are good?
>
> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>
> Please explain precisely how you expect to use this, to check that GID
> is required.
>

For example, suppose we are using logical replication just for failover/HA and letting the user
be the transaction manager. Suppose the user prepared a tx on server A and server A then
crashed. After that the client may want to reconnect to server B and commit/abort that tx.
But the user only has the GID that was used during prepare.

> But even then, if you adopt the naming convention that all in-progress
> xacts will be called RepOriginId-EPOCH-XID, so they have a fully
> unique GID on all of the child nodes then we don't need to add the
> GID.

Yes, that's also possible, but it seems less flexible, since it restricts us to a
specific GID format.

Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records
to know exactly what the cost of such an approach will be.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: logical decoding of two-phase transactions

Craig Ringer-3
On 5 January 2017 at 20:43, Stas Kelvich <[hidden email]> wrote:

> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
> to know exactly what will be the cost of such approach.

Sounds like a good idea, especially if you remove any attempt to work
with GIDs for !2PC commits at the same time.

I don't think I care about having access to the GID for the use case I
have in mind, since we'd actually be wanting to hijack a normal COMMIT
and internally transform it to PREPARE TRANSACTION, <do stuff>, COMMIT
PREPARED. But for the more general case of logical decoding of 2PC I
can see the utility of having the xact identifier.

If we presume we're only interested in logically decoding 2PC xacts
that are not yet COMMIT PREPAREd, can we not avoid the WAL overhead of
writing the GID by looking it up in our shmem state at decoding-time
for PREPARE TRANSACTION? If we can't find the prepared transaction in
TwoPhaseState we know to expect a following ROLLBACK PREPARED or
COMMIT PREPARED, so we shouldn't decode it at the PREPARE TRANSACTION
stage.
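
As a sketch of that decision, assuming a plain dictionary stands in for the shared-memory state (TwoPhaseState in the text); the dictionary and the function name are hypothetical, not server APIs:

```python
from typing import Dict, Optional

# Hypothetical stand-in for the shared-memory list of prepared transactions:
# xid -> gid, containing only xacts that are still prepared.
two_phase_state: Dict[int, str] = {}


def gid_for_prepare(xid: int) -> Optional[str]:
    """Look up the GID when decoding a PREPARE TRANSACTION record.

    Returns None if the xact is no longer prepared. In that case the caller
    skips 2PC handling here and decodes the xact at the following
    COMMIT PREPARED (or drops it at ROLLBACK PREPARED).
    """
    return two_phase_state.get(xid)
```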

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: logical decoding of two-phase transactions

Simon Riggs
In reply to this post by Stas Kelvich-3
On 5 January 2017 at 12:43, Stas Kelvich <[hidden email]> wrote:

>
>> On 5 Jan 2017, at 13:49, Simon Riggs <[hidden email]> wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that GID
>> is required.
>>
>
> For example if we are using logical replication just for failover/HA and allowing user
> to be transaction manager itself. Then suppose that user prepared tx on server A and server A
> crashed. After that client may want to reconnect to server B and commit/abort that tx.
> But user only have GID that was used during prepare.

I don't think that's the case you're trying to support, and I don't think
that's a common case that we want to pay the price to put into core in
a non-optional way.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: logical decoding of two-phase transactions

Craig Ringer-3
In reply to this post by Stas Kelvich-3
On 5 January 2017 at 20:43, Stas Kelvich <[hidden email]> wrote:

>
>> On 5 Jan 2017, at 13:49, Simon Riggs <[hidden email]> wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that GID
>> is required.
>>
>
> For example if we are using logical replication just for failover/HA and allowing user
> to be transaction manager itself. Then suppose that user prepared tx on server A and server A
> crashed. After that client may want to reconnect to server B and commit/abort that tx.
> But user only have GID that was used during prepare.
>
>> But even then, if you adopt the naming convention that all in-progress
>> xacts will be called RepOriginId-EPOCH-XID, so they have a fully
>> unique GID on all of the child nodes then we don't need to add the
>> GID.
>
> Yes, that’s also possible but seems to be less flexible restricting us to some
> specific GID format.
>
> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
> to know exactly what will be the cost of such approach.

Stas,

Have you had a chance to look at this further?

I think the approach of storing just the xid and fetching the GID
during logical decoding of the PREPARE TRANSACTION is probably the
best way forward, per my prior mail. That should eliminate Simon's
objection re the cost of tracking GIDs and still let us have access to
them when we want them, which is the best of both worlds really.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: logical decoding of two-phase transactions

Stas Kelvich-3
>>
>> Yes, that’s also possible but seems to be less flexible restricting us to some
>> specific GID format.
>>
>> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
>> to know exactly what will be the cost of such approach.
>
> Stas,
>
> Have you had a chance to look at this further?

Generally I'm okay with Simon's approach and will send an updated patch. Anyway, I want to
run some tests to estimate how much disk space is actually wasted by the extra WAL records.

> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

I don't think that's possible this way. If we don't put the GID in the commit record, then by the time
logical decoding happens the transaction will already be committed/aborted and there will
be no easy way to get that GID. I thought about several possibilities:

* Tracking the xid/gid map in memory doesn't help much either: if the server reboots between prepare
and commit we'll lose that mapping.
* We can provide some hooks on prepared tx recovery during startup, but that approach also fails
if a reboot happens between the commit and the decoding of that commit.
* Logical messages are WAL-logged, but they don't have any redo function, so they don't help much.

So to support user-accessible 2PC over replication based on 2PC decoding we would have to invent
something nastier, like writing them into a table.

> That should eliminate Simon's
> objection re the cost of tracking GIDs and still let us have access to
> them when we want them, which is the best of both worlds really.

Having 2PC decoding in core is a good thing anyway, even without GID tracking =)

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: logical decoding of two-phase transactions

Craig Ringer-3


On 26 Jan. 2017 18:43, "Stas Kelvich" <[hidden email]> wrote:
>>
>> Yes, that’s also possible but seems to be less flexible restricting us to some
>> specific GID format.
>>
>> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
>> to know exactly what will be the cost of such approach.
>
> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

> I don't think that's possible this way. If we don't put the GID in the commit record, then by the time
> logical decoding happens the transaction will already be committed/aborted and there will
> be no easy way to get that GID.

My thinking is that if the 2PC xact is by that point COMMIT PREPARED or ROLLBACK PREPARED we don't care that it was ever 2pc and should just decode it as a normal xact. Its gid has ceased to be significant and no longer holds meaning since the xact is resolved.

The point of logical decoding of 2pc is to allow peers to participate in a decision on whether to commit or not, rather than only being able to decode the xact once committed, as is currently the case.

If it's already committed there's no point treating it as anything special.

So when we get to the prepare transaction in xlog we look to see if it's already committed / rolled back. If so we proceed normally like current decoding does. Only if it's still prepared do we decode it as 2pc and supply the gid to a new output plugin callback for prepared xacts.

> I thought about several possibilities:
>
> * Tracking xid/gid map in memory also doesn't help much: if server reboots between prepare
> and commit we'll lose that mapping.

Er what? That's why I suggested using the prepared xacts shmem state. It's persistent as you know from your work on prepared transaction files. It has all the required info.

Re: logical decoding of two-phase transactions

Stas Kelvich-3

> On 26 Jan 2017, at 12:51, Craig Ringer <[hidden email]> wrote:
>
> * Tracking xid/gid map in memory also doesn’t help much — if server reboots between prepare
> and commit we’ll lose that mapping.
>
> Er what? That's why I suggested using the prepared xacts shmem state. It's persistent as you know from your work on prepared transaction files. It has all the required info.

Imagine the following scenario:

1. PREPARE happened
2. PREPARE is decoded and sent where it should be sent
3. We get all responses from participating nodes and issue COMMIT/ABORT
4. COMMIT/ABORT is decoded and sent

After step 3 there is no more memory state associated with that prepared tx, so if we fail
between 3 and 4 then we can't know the GID unless we wrote it to the commit record (or a table).
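
The scenario can be modelled with a toy server, purely for illustration: the class, the record type and the `log_gid_in_commit` switch are hypothetical and only mimic the behaviour described above, where the in-memory xid/gid entry disappears at step 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CommitPreparedRecord:
    xid: int
    gid: Optional[str] = None  # set only if the GID is written to the WAL record


class ToyServer:
    def __init__(self, log_gid_in_commit: bool) -> None:
        self.log_gid_in_commit = log_gid_in_commit
        self.prepared: Dict[int, str] = {}       # in-memory xid -> gid
        self.wal: List[CommitPreparedRecord] = []

    def prepare(self, xid: int, gid: str) -> None:   # step 1
        self.prepared[xid] = gid

    def commit_prepared(self, xid: int) -> None:     # step 3
        gid = self.prepared.pop(xid)                 # memory state is cleaned up
        self.wal.append(
            CommitPreparedRecord(xid, gid if self.log_gid_in_commit else None)
        )


def gid_at_commit_decoding(server: ToyServer, xid: int) -> Optional[str]:
    """Step 4: what a decoder restarted after step 3 can learn about the GID."""
    record = next(r for r in server.wal if r.xid == xid)
    if record.gid is not None:
        return record.gid
    return server.prepared.get(xid)  # already gone by now
```

With `log_gid_in_commit=False` the lookup at step 4 returns None; with `True` the GID is recovered from the record itself.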

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: logical decoding of two-phase transactions

Craig Ringer-3
On 26 January 2017 at 19:34, Stas Kelvich <[hidden email]> wrote:

> Imagine the following scenario:
>
> 1. PREPARE happened
> 2. PREPARE is decoded and sent where it should be sent
> 3. We get all responses from participating nodes and issue COMMIT/ABORT
> 4. COMMIT/ABORT is decoded and sent
>
> After step 3 there is no more memory state associated with that prepared tx, so if we fail
> between 3 and 4 then we can't know the GID unless we wrote it to the commit record (or a table).


If the decoding session crashes/disconnects and restarts between 3 and
4, we know the xact is now committed or rolled back and we don't care
about its gid anymore; we can decode it as a normal committed xact or
skip over it if aborted. If Pg crashes between 3 and 4 the same
applies, since all decoding sessions must restart.

No decoding session can ever start up between 3 and 4 without passing
through 1 and 2, since we always restart decoding at restart_lsn and
restart_lsn cannot be advanced past the assignment (BEGIN) of a given
xid until we pass its commit record and the downstream confirms it has
flushed the results.

The reorder buffer doesn't even really need to keep track of the gid
between 3 and 4, though it should do to save the output plugin and
downstream the hassle of keeping an xid to gid mapping. All it needs
is to know if we sent a given xact's data to the output plugin at
PREPARE time, so we can suppress sending them again at COMMIT time,
and we can store that info on the ReorderBufferTxn. We can store the
gid there too.

We'll need two new output plugin callbacks

   prepare_cb
   rollback_cb

since an xact can roll back after we decode PREPARE TRANSACTION (or
during it, even) and we have to be able to tell the downstream to
throw the data away.

I don't think the rollback callback should be called
abort_prepared_cb, because we'll later want to add the ability to
decode interleaved xacts' changes as they are made, before commit, and
in that case will also need to know if they abort. We won't care if
they were prepared xacts or not, but we'll know based on the
ReorderBufferTXN anyway.

We don't need a separate commit_prepared_cb, the existing commit_cb is
sufficient. The gid will be accessible on the ReorderBufferTXN.

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?
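
To make the bookkeeping concrete, here is a toy model of the idea, not the real output plugin API: `Txn` loosely stands in for ReorderBufferTXN, the `gid` and `sent_at_prepare` fields are the hypothetical additions, and the callback signatures are simplified for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Txn:
    """Toy stand-in for ReorderBufferTXN; gid and sent_at_prepare are hypothetical."""
    xid: int
    changes: List[str] = field(default_factory=list)
    gid: Optional[str] = None
    sent_at_prepare: bool = False


class RecordingPlugin:
    """Toy output plugin that records which callbacks were invoked."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Optional[str], List[str]]] = []

    def prepare_cb(self, txn: Txn) -> None:
        self.events.append(("prepare", txn.gid, list(txn.changes)))

    def commit_cb(self, txn: Txn, send_changes: bool) -> None:
        changes = list(txn.changes) if send_changes else []
        self.events.append(("commit", txn.gid, changes))

    def rollback_cb(self, txn: Txn) -> None:
        self.events.append(("rollback", txn.gid, []))


def decode_prepare(plugin: RecordingPlugin, txn: Txn, gid: str) -> None:
    txn.gid = gid
    txn.sent_at_prepare = True
    plugin.prepare_cb(txn)


def decode_commit(plugin: RecordingPlugin, txn: Txn) -> None:
    # Suppress re-sending the changes if they already went out at PREPARE.
    plugin.commit_cb(txn, send_changes=not txn.sent_at_prepare)


def decode_rollback(plugin: RecordingPlugin, txn: Txn) -> None:
    # Only a transaction the downstream has already seen needs throwing away.
    if txn.sent_at_prepare:
        plugin.rollback_cb(txn)
```

A transaction that was never decoded at PREPARE goes through `decode_commit` with its changes attached, which matches today's behaviour for ordinary commits.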

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: logical decoding of two-phase transactions

Michael Paquier
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
> wal_level >= logical I don't think that's the end of the world. But
> since we already have almost everything we need in memory, why not
> just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
--
Michael


Re: logical decoding of two-phase transactions

Michael Paquier
On Tue, Jan 31, 2017 at 3:29 PM, Michael Paquier
<[hidden email]> wrote:

> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding. The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

By the way, I have moved this patch to next CF, you guys seem to make
the discussion move on.
--
Michael


Re: logical decoding of two-phase transactions

Craig Ringer-3
In reply to this post by Michael Paquier


On 31 Jan. 2017 19:29, "Michael Paquier" <[hidden email]> wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding.

TL;DR: this lets us decode the xact after prepare but before commit so decoding/replay outcomes can affect the commit-or-abort decision.


> The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

That's where you've misunderstood - it isn't committed yet. The point of this change is to allow us to do logical decoding at the PREPARE TRANSACTION point. The xact is not yet committed or rolled back.

This allows the results of logical decoding - or more interestingly results of replay on another node / to another app / whatever to influence the commit or rollback decision.

Stas wants this for a conflict-free logical semi-synchronous replication multi master solution. At PREPARE TRANSACTION time we replay the xact to other nodes, each of which applies it and issues PREPARE TRANSACTION, then replies to confirm it has successfully prepared the xact. When all nodes confirm the xact is prepared it is safe for the origin node to COMMIT PREPARED. The other nodes then see that the first node has committed and they commit too.

Alternately if any node replies "could not replay xact" or "could not prepare xact" the origin node knows to ROLLBACK PREPARED. All the other nodes see that and rollback too.
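
The origin node's decision rule from the last two paragraphs can be sketched as follows. Each peer is modelled as a callable that replays the xact, tries to prepare it, and reports success; that modelling and the function name are assumptions for illustration, and the function only returns the statement rather than executing anything.

```python
from typing import Callable, Iterable


def decide(gid: str, peers: Iterable[Callable[[str], bool]]) -> str:
    """Return the statement the origin node would issue for a prepared xact.

    Each peer returns True if it replayed and prepared the xact successfully,
    False if it could not replay or could not prepare it.
    """
    if all(peer(gid) for peer in peers):
        return f"COMMIT PREPARED '{gid}'"
    return f"ROLLBACK PREPARED '{gid}'"
```

A single failing peer is enough to force the rollback, which is what removes the need for after-the-fact conflict resolution.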

This makes it possible to be much more confident that what's replicated is exactly the same on all nodes, with no after-the-fact MM conflict resolution that apps must be aware of to function correctly.

To really make it rock solid you also have to send the old and new values of a row, or have row versions, or send old row hashes. Something I also want to have, but we can mostly get that already with REPLICA IDENTITY FULL.

It is of interest to me because schema changes in MM logical replication are more challenging, awkward and restrictive without it. Optimistic conflict resolution doesn't work well for schema changes, and once the conflicting schema changes are committed on different nodes there is no going back. So you need your async system to have a global locking model for schema changes to stop conflicts arising. Or expect the user not to do anything silly / misunderstand anything and know all the relevant system limitations and requirements... which we all know works just great in practice. You also need a way to ensure that schema changes don't render committed-but-not-yet-replayed row changes from other peers nonsensical. The safest way is a barrier where all row changes committed on any node before committing the schema change on the origin node must be fully replayed on every other node, making an async MM system temporarily sync single master (and requiring all nodes to be up and reachable). Otherwise you need a way to figure out how to conflict-resolve incoming rows with missing columns / added columns / changed types / renamed tables, etc., which is no fun and nearly impossible in the general case.

2PC decoding lets us avoid all this mess by sending all nodes the proposed schema change and waiting until they all confirm successful prepare before committing it. It can also be used to solve the row compatibility problems with some more lazy inter-node chat in logical WAL messages.

I think the purpose of having the GID available to the decoding output plugin at PREPARE TRANSACTION time is that it can co-operate with a global transaction manager that way. Each node can tell the GTM "I'm ready to commit [X]". It is IMO not crucial since you can otherwise use a (node-id, xid) tuple, but it'd be nice for coordinating with external systems, simplifying inter node chatter, integrating logical decoding into bigger systems with external transaction coordinators/arbitrators etc. It seems pretty silly _not_ to have it really.

Personally I don't think lack of access to the GID justifies blocking 2PC logical decoding. It can be added separately. But it'd be nice to have especially if it's cheap.

Re: logical decoding of two-phase transactions

konstantin knizhnik
In reply to this post by Michael Paquier


On 31.01.2017 09:29, Michael Paquier wrote:

> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding. The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.
Sorry, maybe I do not completely understand your arguments.
Actually our multimaster is now completely based on logical replication
and 2PC (more precisely, we are using 3PC now :)
The state of a transaction (prepared, precommitted, committed) should be
persisted in WAL to make it possible to perform recovery.
Recovery can involve transactions in any state. So there are three records
in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and
recovery can involve either all of them, or
PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED.
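
One possible reading of the three recovery cases above, as a small sketch. Only the record names come from this message; the function `records_to_replay` and the idea of keying on the last record a recovering node has applied are assumptions made for illustration, not a description of the multimaster implementation.

```python
# Record names are taken from the message above; everything else here is
# a hypothetical illustration.
RECORDS = ["PREPARE", "PRECOMMIT", "COMMIT_PREPARED"]


def records_to_replay(last_applied=None):
    """Return the records recovery still has to process for one transaction.

    last_applied is the last record the recovering node is known to have
    applied for this transaction, or None if it has applied none of them.
    """
    if last_applied is None:
        return list(RECORDS)
    # Everything after the last applied record still has to be replayed.
    return RECORDS[RECORDS.index(last_applied) + 1:]
```

Under this reading, a node that has seen nothing needs all three records, a node that already prepared needs PRECOMMIT and COMMIT_PREPARED, and a node that already precommitted needs only COMMIT_PREPARED.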



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: logical decoding of two-phase transactions

Craig Ringer-4


On 31 Jan. 2017 22:43, "Konstantin Knizhnik" <[hidden email]> wrote:


> On 31.01.2017 09:29, Michael Paquier wrote:
>> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
>>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>>> wal_level >= logical I don't think that's the end of the world. But
>>> since we already have almost everything we need in memory, why not
>>> just stash the gid on ReorderBufferTXN?
>> I have been through this thread... And to be honest, I have a hard
>> time understanding for which purpose the information of a 2PC
>> transaction is useful in the case of logical decoding. The prepare and
>> commit prepared have been received by a node which is at the root of
>> the cluster tree, a node of the cluster at an upper level, or a
>> client, being in charge of issuing all the prepare queries, and then
>> issue the commit prepared to finish the transaction across a cluster.
>> In short, even if you do logical decoding from the root node, or the
>> one at a higher level, you would care just about the fact that it has
>> been committed.
>
> [...] in any state. So there three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED and
> recovery can involve either all of them, either PRECOMMIT+COMMIT_PREPARED either just COMMIT_PREPARED.

That's your modified Pg though.

This 2PC logical decoding patch proposal is for core, and I think it just confuses things to introduce discussion of unrelated changes your product makes to the codebase.






Re: logical decoding of two-phase transactions

Michael Paquier
In reply to this post by Craig Ringer-3
On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer <[hidden email]> wrote:
> That's where you've misunderstood - it isn't committed yet. The point of
> this change is to allow us to do logical decoding at the PREPARE TRANSACTION
> point. The xact is not yet committed or rolled back.

Yes, I got that. I was looking for a why or an actual use-case.

> Stas wants this for a conflict-free logical semi-synchronous replication
> multi master solution.

This sentence is hard to decipher; it would be less so without "multi
master", as the concept basically applies to only one master node.

> At PREPARE TRANSACTION time we replay the xact to
> other nodes, each of which applies it and PREPARE TRANSACTION, then replies
> to confirm it has successfully prepared the xact. When all nodes confirm the
> xact is prepared it is safe for the origin node to COMMIT PREPARED. The
> other nodes then see that the first node has committed and they commit too.

OK, this is the argument I was looking for. So in your scheme the
origin node, the one generating the changes, is itself in charge of
deciding if the 2PC should work or not. There are two channels between
the origin node and the replicas replaying the logical changes, one is
for the logical decoder with a receiver, the second one is used to
communicate the WAL apply status. I thought about something like
postgres_fdw doing this job with a transaction that does writes across
several nodes, that's why I got confused about this feature.
Everything goes through one channel, so the failure handling is just
simplified.

> Alternately if any node replies "could not replay xact" or "could not
> prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
> nodes see that and rollback too.

The origin node could just issue the ROLLBACK or COMMIT and the
logical replicas would just apply this change.

> To really make it rock solid you also have to send the old and new values of
> a row, or have row versions, or send old row hashes. Something I also want
> to have, but we can mostly get that already with REPLICA IDENTITY FULL.

On a primary key (or a unique index), the default replica identity is
enough I think.

> It is of interest to me because schema changes in MM logical replication are
> more challenging, awkward and restrictive without it. Optimistic conflict
> resolution doesn't work well for schema changes and once the conflicting
> schema changes are committed on different nodes there is no going back. So
> you need your async system to have a global locking model for schema changes
> to stop conflicts arising. Or expect the user not to do anything silly /
> misunderstand anything and know all the relevant system limitations and
> requirements... which we all know works just great in practice. You also
> need a way to ensure that schema changes don't render
> committed-but-not-yet-replayed row changes from other peers nonsensical. The
> safest way is a barrier where all row changes committed on any node before
> committing the schema change on the origin node must be fully replayed on
> every other node, making an async MM system temporarily sync single master
> (and requiring all nodes to be up and reachable). Otherwise you need a way
> to figure out how to conflict-resolve incoming rows with missing columns /
> added columns / changed types / renamed tables  etc which is no fun and
> nearly impossible in the general case.

That's one vision of things. FDW-like approaches would be a second,
but they cannot pass down utility statements natively,
though that can be done with the utility hook.

> I think the purpose of having the GID available to the decoding output
> plugin at PREPARE TRANSACTION time is that it can co-operate with a global
> transaction manager that way. Each node can tell the GTM "I'm ready to
> commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
> xid) tuple, but it'd be nice for coordinating with external systems,
> simplifying inter node chatter, integrating logical decoding into bigger
> systems with external transaction coordinators/arbitrators etc. It seems
> pretty silly _not_ to have it really.

Well, Postgres-XC/XL save the 2PC GID in the GTM for this purpose, so
that COMMIT/ABORT PREPARED can be issued from any node, and there is
centralized conflict resolution. The latter comes at a huge cost and
creates a major bottleneck for scaling.

> Personally I don't think lack of access to the GID justifies blocking 2PC
> logical decoding. It can be added separately. But it'd be nice to have
> especially if it's cheap.

I think it should be added reading this thread.
--
Michael



Re: logical decoding of two-phase transactions

Robert Haas
On Tue, Jan 31, 2017 at 9:05 PM, Michael Paquier
<[hidden email]> wrote:
>> Personally I don't think lack of access to the GID justifies blocking 2PC
>> logical decoding. It can be added separately. But it'd be nice to have
>> especially if it's cheap.
>
> I think it should be added reading this thread.

+1.  If on the logical replication master the user executes PREPARE
TRANSACTION 'mumble', isn't it sensible to want the logical replica to
prepare the same set of changes with the same GID?  To me, that not
only seems like *a* sensible thing to want to do but probably the
*most* sensible thing to want to do.  And then, when the eventual
COMMIT PREPARED 'mumble' comes along, you want to have the replica
run the same command.  If you don't do that, then the alternative is
that the replica has to make up new names based on the master's XID.
But that kinda sucks, because now if replication stops due to a
conflict or whatever and you have to disentangle things by hand, all
the names on the replica are basically meaningless.

Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
doesn't seem inordinately expensive to me.  For that to really add up
to a significant cost, wouldn't you need to be doing LOTS of 2PC
transactions, each with very little work, so that the commit/abort
prepared records weren't swamped by everything else?  That seems like
an unlikely scenario, but if it does happen, that's exactly when
you'll be most grateful for the GID tracking.  I think.
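
To make the naming argument above concrete, here is a small sketch of the two options a replica has when it replays a decoded prepared transaction. The helper names and the `origin-xid-<xid>` fallback format are hypothetical, chosen only for illustration; `quote_literal` here is a minimal local helper that doubles single quotes and is not a PostgreSQL client API.

```python
def quote_literal(value):
    # Minimal SQL string-literal quoting: double any embedded single quote.
    return "'" + value.replace("'", "''") + "'"


def replica_gid(origin_gid, origin_xid):
    """Pick the GID the replica uses for a decoded prepared transaction."""
    if origin_gid is not None:
        # The GID was available to the output plugin: reuse it verbatim.
        return origin_gid
    # Hypothetical fallback when only the master's XID is known.
    return f"origin-xid-{origin_xid}"


def replica_commands(origin_gid, origin_xid):
    """Return the two commands the replica would run, in order."""
    gid = quote_literal(replica_gid(origin_gid, origin_xid))
    return [f"PREPARE TRANSACTION {gid}", f"COMMIT PREPARED {gid}"]
```

With the GID available, `replica_commands("mumble", 1234)` yields the same `'mumble'` name the user typed on the master; without it, the replica ends up with a made-up name derived from the XID, which is what makes manual cleanup after a stalled replication harder to reason about.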

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

