Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
as ordinary transactions at commit) [1]. I’ve slightly polished things and used the test_decoding output plugin as the client.

The general idea is quite simple:

* Write the GID along with commit/prepare records in the case of 2PC.
* Add several routines to decode prepare records in the same way as already happens in logical decoding.

I’ve also added an explicit LOCK statement to the test_decoding regression suite to check that it doesn’t break anything. If somebody can create a scenario that blocks decoding because of an existing dummy backend lock, that will be a great help. Right now all my tests pass (including the TAP tests from the adjacent mail thread that check recovery of two-phase transactions after failures).

If we agree on the current approach, then I’m ready to add this to the proposed in-core logical replication.

[1] https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 31 December 2016 at 08:36, Stas Kelvich <[hidden email]> wrote:
> Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
> as ordinary transactions at commit) [1]. I’ve slightly polished things and used the test_decoding output plugin as the client.

Sounds good.

> The general idea is quite simple:
>
> * Write the GID along with commit/prepare records in the case of 2PC.

The GID is now variable-sized. You seem to have added it to every commit, not just 2PC commits.

> * Add several routines to decode prepare records in the same way as already happens in logical decoding.
>
> I’ve also added an explicit LOCK statement to the test_decoding regression suite to check that it doesn’t break anything.

Please explain that in comments in the patch.

> If somebody can create a scenario that blocks decoding because of an existing dummy backend lock, that will be a great
> help. Right now all my tests pass (including the TAP tests from the adjacent mail thread that check recovery of
> two-phase transactions after failures).
>
> If we agree on the current approach, then I’m ready to add this to the proposed in-core logical replication.
>
> [1] https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru

We'll need some measurements of the additional WAL space or memory usage from these approaches.

Thanks.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 4 January 2017 at 21:20, Simon Riggs <[hidden email]> wrote:
> On 31 December 2016 at 08:36, Stas Kelvich <[hidden email]> wrote:
>> Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them
>> as ordinary transactions at commit) [1]. I’ve slightly polished things and used the test_decoding output plugin as the client.
>
> Sounds good.
>
>> The general idea is quite simple:
>>
>> * Write the GID along with commit/prepare records in the case of 2PC.
>
> The GID is now variable-sized. You seem to have added it to every
> commit, not just 2PC commits.

I've just realised that you're adding the GID because it allows you to uniquely identify the prepared xact. But the prepared xact will also have a regular TransactionId, which is also unique. The GID exists for users to specify things, but it is not needed internally and we don't need to add it here.

What we do need is for the commit prepared message to remember what the xid of the prepare was and then re-find it using the commit WAL record's twophase_xid field. So we don't need to add the GID to any WAL records, nor to any in-memory structures.

Please re-work the patch to include twophase_xid, which should make the patch smaller and much faster too.

Please add comments to explain how and why patches work. Design comments allow us to check that the design makes sense and, if it does, whether all the lines in the patch are needed to follow the design. Without that, patches are much harder to commit, and we all want patches to be easier to commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Thank you for looking into this.
> On 5 Jan 2017, at 09:43, Simon Riggs <[hidden email]> wrote:
>>
>> The GID is now variable-sized. You seem to have added it to every
>> commit, not just 2PC commits.
>

Hm, I didn’t realise that; I’ll fix it.

> I've just realised that you're adding the GID because it allows you to
> uniquely identify the prepared xact. But the prepared xact will
> also have a regular TransactionId, which is also unique. The GID exists
> for users to specify things, but it is not needed internally and we
> don't need to add it here.

I think we can’t avoid pushing the GID down to the client side anyway.

If we push down only the local TransactionId to the remote server, then we lose the mapping of GID to TransactionId, and there will be no way for the user to identify their transaction on the second server. Also, Open XA and lots of libraries (e.g. J2EE) assume that the same GID is used everywhere and that it’s the GID that was issued by the client.

Requirements for two-phase decoding can differ depending on what one wants to build around it, and I believe that in some situations pushing down the xid is enough. But IMO dealing with reconnects, failures and client libraries will force the programmer to use the same GID everywhere.

> What we do need is for the commit prepared
> message to remember what the xid of the prepare was and then re-find
> it using the commit WAL record's twophase_xid field. So we don't need
> to add the GID to any WAL records, nor to any in-memory structures.

The other part of the story is how to find the GID while decoding the commit prepared record. I did that by adding a GID field to the commit WAL record, because by the time of decoding all the memory structures that held the xid<->gid correspondence have already been cleaned up.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 5 January 2017 at 10:21, Stas Kelvich <[hidden email]> wrote:
> Thank you for looking into this.
>
>> On 5 Jan 2017, at 09:43, Simon Riggs <[hidden email]> wrote:
>>>
>>> The GID is now variable-sized. You seem to have added it to every
>>> commit, not just 2PC commits.
>>
>
> Hm, I didn’t realise that; I’ll fix it.
>
>> I've just realised that you're adding the GID because it allows you to
>> uniquely identify the prepared xact. But the prepared xact will
>> also have a regular TransactionId, which is also unique. The GID exists
>> for users to specify things, but it is not needed internally and we
>> don't need to add it here.
>
> I think we can’t avoid pushing the GID down to the client side anyway.
>
> If we push down only the local TransactionId to the remote server, then we lose the mapping
> of GID to TransactionId, and there will be no way for the user to identify their transaction on
> the second server. Also, Open XA and lots of libraries (e.g. J2EE) assume that the same GID
> is used everywhere and that it’s the GID that was issued by the client.
>
> Requirements for two-phase decoding can differ depending on what one wants
> to build around it, and I believe that in some situations pushing down the xid is enough. But IMO
> dealing with reconnects, failures and client libraries will force the programmer to use
> the same GID everywhere.

Surely in this case the master server is acting as the Transaction Manager, and it knows the mapping, so we are good?

I guess if you are using >2 nodes then you need to use full 2PC on each node.

But even then, if you adopt the naming convention that all in-progress xacts will be called RepOriginId-EPOCH-XID, so that they have a fully unique GID on all of the child nodes, then we don't need to add the GID.

Please explain precisely how you expect to use this, to check that the GID is required.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

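Editorial illustration: the RepOriginId-EPOCH-XID naming convention suggested above can be sketched as a pair of helpers that build and parse such an identifier. This is only a minimal sketch under assumptions: the function names, the `ReplicatedGid` type and the use of `-` as separator are hypothetical and are not taken from the patch or from PostgreSQL itself.

```python
from typing import NamedTuple


class ReplicatedGid(NamedTuple):
    """Hypothetical decomposition of a GID following RepOriginId-EPOCH-XID."""
    origin_id: int
    epoch: int
    xid: int


def format_gid(origin_id: int, epoch: int, xid: int) -> str:
    """Build a GID that is unique across nodes as long as origin ids are unique."""
    if min(origin_id, epoch, xid) < 0:
        raise ValueError("all components must be non-negative")
    return f"{origin_id}-{epoch}-{xid}"


def parse_gid(gid: str) -> ReplicatedGid:
    """Recover (origin, epoch, xid) from a GID built by format_gid()."""
    parts = gid.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a RepOriginId-EPOCH-XID identifier: {gid!r}")
    return ReplicatedGid(*(int(p) for p in parts))
```

With such a convention a downstream node can derive the GID from data it already receives, which is the reason given above for not logging the GID separately; the cost, raised in the reply that follows, is that users can no longer choose their own GID format.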
> On 5 Jan 2017, at 13:49, Simon Riggs <[hidden email]> wrote:
>
> Surely in this case the master server is acting as the Transaction
> Manager, and it knows the mapping, so we are good?
>
> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>
> Please explain precisely how you expect to use this, to check that the GID
> is required.
>

For example, suppose we are using logical replication just for failover/HA and allowing the user to act as the transaction manager. Suppose the user prepared a tx on server A and server A crashed. After that, the client may want to reconnect to server B and commit/abort that tx. But the user only has the GID that was used during prepare.

> But even then, if you adopt the naming convention that all in-progress
> xacts will be called RepOriginId-EPOCH-XID, so that they have a fully
> unique GID on all of the child nodes, then we don't need to add the
> GID.

Yes, that’s also possible, but it seems less flexible, restricting us to a specific GID format.

Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records to know exactly what the cost of such an approach will be.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 5 January 2017 at 20:43, Stas Kelvich <[hidden email]> wrote:
> Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records
> to know exactly what the cost of such an approach will be.

Sounds like a good idea, especially if you remove any attempt to work with GIDs for !2PC commits at the same time.

I don't think I care about having access to the GID for the use case I have in mind, since we'd actually be wanting to hijack a normal COMMIT and internally transform it to PREPARE TRANSACTION, <do stuff>, COMMIT PREPARED. But for the more general case of logical decoding of 2PC I can see the utility of having the xact identifier.

If we presume we're only interested in logically decoding 2PC xacts that are not yet COMMIT PREPAREd, can we not avoid the WAL overhead of writing the GID by looking it up in our shmem state at decoding time for PREPARE TRANSACTION? If we can't find the prepared transaction in TwoPhaseState, we know to expect a following ROLLBACK PREPARED or COMMIT PREPARED, so we shouldn't decode it at the PREPARE TRANSACTION stage.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

On 5 January 2017 at 12:43, Stas Kelvich <[hidden email]> wrote:
>
>> On 5 Jan 2017, at 13:49, Simon Riggs <[hidden email]> wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that the GID
>> is required.
>>
>
> For example, suppose we are using logical replication just for failover/HA and allowing the user
> to act as the transaction manager. Suppose the user prepared a tx on server A and server A
> crashed. After that, the client may want to reconnect to server B and commit/abort that tx.
> But the user only has the GID that was used during prepare.

I don't think that's the case you're trying to support, and I don't think that's a common case that we want to pay the price to put into core in a non-optional way.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 5 January 2017 at 20:43, Stas Kelvich <[hidden email]> wrote:
>
>> On 5 Jan 2017, at 13:49, Simon Riggs <[hidden email]> wrote:
>>
>> Surely in this case the master server is acting as the Transaction
>> Manager, and it knows the mapping, so we are good?
>>
>> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>>
>> Please explain precisely how you expect to use this, to check that the GID
>> is required.
>>
>
> For example, suppose we are using logical replication just for failover/HA and allowing the user
> to act as the transaction manager. Suppose the user prepared a tx on server A and server A
> crashed. After that, the client may want to reconnect to server B and commit/abort that tx.
> But the user only has the GID that was used during prepare.
>
>> But even then, if you adopt the naming convention that all in-progress
>> xacts will be called RepOriginId-EPOCH-XID, so that they have a fully
>> unique GID on all of the child nodes, then we don't need to add the
>> GID.
>
> Yes, that’s also possible, but it seems less flexible, restricting us to a
> specific GID format.
>
> Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records
> to know exactly what the cost of such an approach will be.

Stas,

Have you had a chance to look at this further?

I think the approach of storing just the xid and fetching the GID during logical decoding of the PREPARE TRANSACTION is probably the best way forward, per my prior mail. That should eliminate Simon's objection re the cost of tracking GIDs and still let us have access to them when we want them, which is the best of both worlds really.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

>>
>> Yes, that’s also possible, but it seems less flexible, restricting us to a
>> specific GID format.
>>
>> Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records
>> to know exactly what the cost of such an approach will be.
>
> Stas,
>
> Have you had a chance to look at this further?

Generally I’m okay with Simon’s approach and will send an updated patch. Anyway, I want to perform some tests to estimate how much disk space is actually wasted by the extra WAL records.

> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

I don’t think that’s possible this way. If we do not put the GID in the commit record, then by the time logical decoding happens the transaction will already be committed/aborted and there will be no easy way to get that GID. I thought about several possibilities:

* Tracking the xid/gid map in memory doesn’t help much either: if the server reboots between prepare and commit, we’ll lose that mapping.
* We can provide some hooks on prepared tx recovery during startup, but that approach also fails if the reboot happens between the commit and the decoding of that commit.
* Logical messages are WAL-logged, but they don’t have any redo function, so they don’t help much.

So to support user-accessible 2PC over replication based on 2PC decoding, we would have to invent something nastier, like writing them into a table.

> That should eliminate Simon's
> objection re the cost of tracking GIDs and still let us have access to
> them when we want them, which is the best of both worlds really.

Having 2PC decoding in core is a good thing anyway, even without GID tracking =)

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 26 Jan. 2017 18:43, "Stas Kelvich" <[hidden email]> wrote:
My thinking is that if the 2PC xact is by that point COMMIT PREPARED or ROLLBACK PREPARED, we don't care that it was ever 2PC and should just decode it as a normal xact. Its gid has ceased to be significant and no longer holds meaning, since the xact is resolved.

The point of logical decoding of 2PC is to allow peers to participate in a decision on whether to commit or not, rather than only being able to decode the xact once committed, as is currently the case.

If it's already committed there's no point treating it as anything special.

So when we get to the prepare transaction in xlog, we look to see if it's already committed / rolled back. If so, we proceed normally like current decoding does. Only if it's still prepared do we decode it as 2PC and supply the gid to a new output plugin callback for prepared xacts.

> I thought about several possibilities:

Er what? That's why I suggested using the prepared xacts shmem state. It's persistent, as you know from your work on prepared transaction files. It has all the required info.

> On 26 Jan 2017, at 12:51, Craig Ringer <[hidden email]> wrote:
>
> * Tracking the xid/gid map in memory doesn’t help much either: if the server reboots between prepare
> and commit, we’ll lose that mapping.
>
> Er what? That's why I suggested using the prepared xacts shmem state. It's persistent, as you know from your work on prepared transaction files. It has all the required info.

Imagine the following scenario:

1. PREPARE happened
2. PREPARE decoded and sent where it should be sent
3. We got all responses from participating nodes and are issuing COMMIT/ABORT
4. COMMIT/ABORT decoded and sent

After step 3 there is no more memory state associated with that prepared tx, so if we fail between 3 and 4 then we can’t know the GID unless we wrote it to the commit record (or a table).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 26 January 2017 at 19:34, Stas Kelvich <[hidden email]> wrote:
> Imagine the following scenario:
>
> 1. PREPARE happened
> 2. PREPARE decoded and sent where it should be sent
> 3. We got all responses from participating nodes and are issuing COMMIT/ABORT
> 4. COMMIT/ABORT decoded and sent
>
> After step 3 there is no more memory state associated with that prepared tx, so if we fail
> between 3 and 4 then we can’t know the GID unless we wrote it to the commit record (or a table).

If the decoding session crashes/disconnects and restarts between 3 and 4, we know the xact is now committed or rolled back and we don't care about its gid anymore; we can decode it as a normal committed xact, or skip over it if aborted. If Pg crashes between 3 and 4 the same applies, since all decoding sessions must restart.

No decoding session can ever start up between 3 and 4 without passing through 1 and 2, since we always restart decoding at restart_lsn, and restart_lsn cannot be advanced past the assignment (BEGIN) of a given xid until we pass its commit record and the downstream confirms it has flushed the results.

The reorder buffer doesn't even really need to keep track of the gid between 3 and 4, though it should do so to save the output plugin and downstream the hassle of keeping an xid to gid mapping. All it needs is to know whether we sent a given xact's data to the output plugin at PREPARE time, so we can suppress sending it again at COMMIT time, and we can store that info on the ReorderBufferTXN. We can store the gid there too.

We'll need two new output plugin callbacks

    prepare_cb
    rollback_cb

since an xact can roll back after we decode PREPARE TRANSACTION (or during it, even) and we have to be able to tell the downstream to throw the data away.

I don't think the rollback callback should be called abort_prepared_cb, because we'll later want to add the ability to decode interleaved xacts' changes as they are made, before commit, and in that case we will also need to know if they abort. We won't care if they were prepared xacts or not, but we'll know based on the ReorderBufferTXN anyway.

We don't need a separate commit_prepared_cb; the existing commit_cb is sufficient. The gid will be accessible on the ReorderBufferTXN.

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when wal_level >= logical, I don't think that's the end of the world. But since we already have almost everything we need in memory, why not just stash the gid on ReorderBufferTXN?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

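Editorial illustration: the decoding rule described in this message can be modelled in a few lines. The sketch below is a toy simulation, not PostgreSQL code. `TxnState` stands in for ReorderBufferTXN, the `still_prepared` dictionary stands in for the lookup in the prepared-xact shared memory state, and the tuples appended to `output` stand in for calls to the proposed prepare_cb / rollback_cb and the existing commit_cb. All of these names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TxnState:
    """Toy stand-in for ReorderBufferTXN."""
    xid: int
    changes: List[str] = field(default_factory=list)
    gid: Optional[str] = None
    sent_at_prepare: bool = False  # did the output plugin already get the data?


class Decoder:
    def __init__(self, still_prepared: Dict[int, str]):
        # xid -> gid for xacts that are still prepared when we decode them
        self.still_prepared = still_prepared
        self.txns: Dict[int, TxnState] = {}
        self.output: List[Tuple[str, Optional[str], Tuple[str, ...]]] = []

    def change(self, xid: int, change: str) -> None:
        self.txns.setdefault(xid, TxnState(xid)).changes.append(change)

    def prepare(self, xid: int) -> None:
        txn = self.txns.setdefault(xid, TxnState(xid))
        gid = self.still_prepared.get(xid)
        if gid is None:
            # Already resolved: a COMMIT/ROLLBACK PREPARED record follows,
            # so treat it like an ordinary xact and wait for that record.
            return
        txn.gid, txn.sent_at_prepare = gid, True
        self.output.append(("prepare", gid, tuple(txn.changes)))

    def commit(self, xid: int) -> None:
        txn = self.txns.pop(xid, TxnState(xid))
        # Suppress the changes if they were already sent at PREPARE time.
        data = () if txn.sent_at_prepare else tuple(txn.changes)
        self.output.append(("commit", txn.gid, data))

    def rollback(self, xid: int) -> None:
        txn = self.txns.pop(xid, TxnState(xid))
        if txn.sent_at_prepare:
            # Downstream holds the data and must be told to throw it away.
            self.output.append(("rollback", txn.gid, ()))
```

A transaction that is still prepared when its PREPARE record is decoded produces a "prepare" event carrying the gid and its changes, followed later by a bare "commit" or "rollback". A transaction that was already resolved produces only a normal "commit" with its changes, or nothing at all if it was rolled back.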
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
> wal_level >= logical, I don't think that's the end of the world. But
> since we already have almost everything we need in memory, why not
> just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard time understanding what purpose the information about a 2PC transaction serves in the case of logical decoding. The prepare and commit prepared have been received by a node which is at the root of the cluster tree, a node of the cluster at an upper level, or a client, being in charge of issuing all the prepare queries, and then issuing the commit prepared to finish the transaction across a cluster. In short, even if you do logical decoding from the root node, or the one at a higher level, you would care just about the fact that it has been committed.

--
Michael

On Tue, Jan 31, 2017 at 3:29 PM, Michael Paquier
<[hidden email]> wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical, I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding what purpose the information about a 2PC
> transaction serves in the case of logical decoding. The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issuing the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

By the way, I have moved this patch to the next CF, since you guys seem to be moving the discussion along.

--
Michael

On 31 Jan. 2017 19:29, "Michael Paquier" <[hidden email]> wrote:
TL;DR: this lets us decode the xact after prepare but before commit, so decoding/replay outcomes can affect the commit-or-abort decision.

> The prepare and

That's where you've misunderstood - it isn't committed yet. The point of this change is to allow us to do logical decoding at the PREPARE TRANSACTION point. The xact is not yet committed or rolled back.

This allows the results of logical decoding - or, more interestingly, the results of replay on another node / to another app / whatever - to influence the commit or rollback decision.

Stas wants this for a conflict-free logical semi-synchronous replication multi master solution. At PREPARE TRANSACTION time we replay the xact to other nodes, each of which applies it and runs PREPARE TRANSACTION, then replies to confirm it has successfully prepared the xact. When all nodes confirm the xact is prepared, it is safe for the origin node to COMMIT PREPARED. The other nodes then see that the first node has committed and they commit too.

Alternately, if any node replies "could not replay xact" or "could not prepare xact", the origin node knows to ROLLBACK PREPARED. All the other nodes see that and roll back too.

This makes it possible to be much more confident that what's replicated is exactly the same on all nodes, with no after-the-fact MM conflict resolution that apps must be aware of to function correctly.

To really make it rock solid you also have to send the old and new values of a row, or have row versions, or send old row hashes. That's something I also want to have, but we can mostly get it already with REPLICA IDENTITY FULL.

It is of interest to me because schema changes in MM logical replication are more challenging, awkward and restrictive without it. Optimistic conflict resolution doesn't work well for schema changes, and once the conflicting schema changes are committed on different nodes there is no going back. So you need your async system to have a global locking model for schema changes to stop conflicts arising. Or expect the user not to do anything silly / misunderstand anything and know all the relevant system limitations and requirements... which we all know works just great in practice.

You also need a way to ensure that schema changes don't render committed-but-not-yet-replayed row changes from other peers nonsensical. The safest way is a barrier where all row changes committed on any node before committing the schema change on the origin node must be fully replayed on every other node, making an async MM system temporarily sync single master (and requiring all nodes to be up and reachable). Otherwise you need a way to figure out how to conflict-resolve incoming rows with missing columns / added columns / changed types / renamed tables etc, which is no fun and nearly impossible in the general case.

2PC decoding lets us avoid all this mess by sending all nodes the proposed schema change and waiting until they all confirm successful prepare before committing it. It can also be used to solve the row compatibility problems with some more lazy inter-node chat in logical WAL messages.

I think the purpose of having the GID available to the decoding output plugin at PREPARE TRANSACTION time is that it can co-operate with a global transaction manager that way. Each node can tell the GTM "I'm ready to commit [X]". It is IMO not crucial since you can otherwise use a (node-id, xid) tuple, but it'd be nice for coordinating with external systems, simplifying inter node chatter, integrating logical decoding into bigger systems with external transaction coordinators/arbitrators etc. It seems pretty silly _not_ to have it really.

Personally I don't think lack of access to the GID justifies blocking 2PC logical decoding. It can be added separately. But it'd be nice to have, especially if it's cheap.

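Editorial illustration: the origin node's commit-or-rollback decision described in this message reduces to "commit only if every peer confirms its prepare". The sketch below assumes each peer is represented by a hypothetical callable that replays the decoded transaction, runs PREPARE TRANSACTION locally and returns whether that succeeded. The function name and the peer interface are invented for the example; only the SQL command strings correspond to real PostgreSQL syntax.

```python
from typing import Callable, Dict


def resolve_prepared(gid: str, peers: Dict[str, Callable[[str], bool]]) -> str:
    """Return the SQL command the origin node should run for `gid`.

    `peers` maps a node name to a callable that applies the decoded xact
    on that node, prepares it there, and reports success (True) or a
    "could not replay/prepare xact" failure (False).
    """
    for node, apply_and_prepare in peers.items():
        if not apply_and_prepare(gid):
            # Any single failure means the whole xact must be rolled back;
            # peers later decode the rollback and discard their copy.
            return f"ROLLBACK PREPARED '{gid}'"
    # Every peer holds a prepared copy, so committing is now safe.
    return f"COMMIT PREPARED '{gid}'"


# Usage with two stand-in peers, one of which cannot prepare the xact.
ok_peer = lambda gid: True
failing_peer = lambda gid: False
print(resolve_prepared("mumble", {"node_b": ok_peer, "node_c": ok_peer}))
print(resolve_prepared("mumble", {"node_b": ok_peer, "node_c": failing_peer}))
```

The sketch ignores timeouts, unreachable nodes and crash recovery, which are exactly the cases where the rest of the thread argues about whether the GID needs to survive in WAL.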
On 31.01.2017 09:29, Michael Paquier wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <[hidden email]> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical, I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
> I have been through this thread... And to be honest, I have a hard
> time understanding what purpose the information about a 2PC
> transaction serves in the case of logical decoding. The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issuing the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

Actually, our multimaster is now completely based on logical replication and 2PC (more precisely, we are using 3PC now :)

The state of a transaction (prepared, precommitted, committed) should be persisted in WAL to make it possible to perform recovery. Recovery can involve transactions in any state. So there are three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 31 Jan. 2017 22:43, "Konstantin Knizhnik" <[hidden email]> wrote:
> in any state. So there are three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED and

That's your modified Pg though.

This 2PC logical decoding patch proposal is for core, and I think it just confuses things to introduce discussion of unrelated changes made by your product to the codebase.

On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer <[hidden email]> wrote:
> That's where you've misunderstood - it isn't committed yet. The point or > this change is to allow us to do logical decoding at the PREPARE TRANSACTION > point. The xact is not yet committed or rolled back. Yes, I got that. I was looking for a why or an actual use-case. > Stas wants this for a conflict-free logical semi-synchronous replication > multi master solution. This sentence is hard to decrypt, less without "multi master" as the concept applies basically only to only one master node. > At PREPARE TRANSACTION time we replay the xact to > other nodes, each of which applies it and PREPARE TRANSACTION, then replies > to confirm it has successfully prepared the xact. When all nodes confirm the > xact is prepared it is safe for the origin node to COMMIT PREPARED. The > other nodes then see hat the first node has committed and they commit too. OK, this is the argument I was looking for. So in your schema the origin node, the one generating the changes, is itself in charge of deciding if the 2PC should work or not. There are two channels between the origin node and the replicas replaying the logical changes, one is for the logical decoder with a receiver, the second one is used to communicate the WAL apply status. I thought about something like postgres_fdw doing this job with a transaction that does writes across several nodes, that's why I got confused about this feature. Everything goes through one channel, so the failure handling is just simplified. > Alternately if any node replies "could not replay xact" or "could not > prepare xact" the origin node knows to ROLLBACK PREPARED. All the other > nodes see that and rollback too. The origin node could just issue the ROLLBACK or COMMIT and the logical replicas would just apply this change. > To really make it rock solid you also have to send the old and new values of > a row, or have row versions, or send old row hashes. 
Something I also want > to have, but we can mostly get that already with REPLICA IDENTITY FULL. On a primary key (or a unique index), the default replica identity is enough I think. > It is of interest to me because schema changes in MM logical replication are > more challenging awkward and restrictive without it. Optimistic conflict > resolution doesn't work well for schema changes and once the conflicting > schema changes are committed on different nodes there is no going back. So > you need your async system to have a global locking model for schema changes > to stop conflicts arising. Or expect the user not to do anything silly / > misunderstand anything and know all the relevant system limitations and > requirements... which we all know works just great in practice. You also > need a way to ensure that schema changes don't render > committed-but-not-yet-replayed row changes from other peers nonsensical. The > safest way is a barrier where all row changes committed on any node before > committing the schema change on the origin node must be fully replayed on > every other node, making an async MM system temporarily sync single master > (and requiring all nodes to be up and reachable). Otherwise you need a way > to figure out how to conflict-resolve incoming rows with missing columns / > added columns / changed types / renamed tables etc which is no fun and > nearly impossible in the general case. That's one vision of things, FDW-like approaches would be a second, but those are not able to pass down utility statements natively, though this stuff can be done with the utility hook. > I think the purpose of having the GID available to the decoding output > plugin at PREPARE TRANSACTION time is that it can co-operate with a global > transaction manager that way. Each node can tell the GTM "I'm ready to > commit [X]". 
> It is IMO not crucial since you can otherwise use a (node-id,
> xid) tuple, but it'd be nice for coordinating with external systems,
> simplifying inter node chatter, integrating logical decoding into bigger
> systems with external transaction coordinators/arbitrators etc. It seems
> pretty silly _not_ to have it really.

Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM. This way the COMMIT/ABORT PREPARED can be issued from any node, and there is centralized conflict resolution, the latter coming at a huge cost and causing a major bottleneck in scaling performance.

> Personally I don't think lack of access to the GID justifies blocking 2PC
> logical decoding. It can be added separately. But it'd be nice to have
> especially if it's cheap.

Reading this thread, I think it should be added.
--
Michael
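The commit decision discussed in this message can be sketched roughly as follows. This is only an illustration of the rule "COMMIT PREPARED when every node confirms, ROLLBACK PREPARED otherwise", not code from the patch: the `Reply` enum, the `decide` function and the node names are hypothetical, and the reply strings are simply the ones quoted above.

```python
from enum import Enum
from typing import Mapping


class Reply(Enum):
    """Hypothetical per-node answers, named after the replies quoted above."""
    PREPARED = "prepared"
    COULD_NOT_REPLAY = "could not replay xact"
    COULD_NOT_PREPARE = "could not prepare xact"


def decide(gid: str, replies: Mapping[str, Reply]) -> str:
    """Return the statement the origin node would issue for transaction `gid`.

    Assumption: `replies` holds exactly one answer per participating node.
    """
    if not replies:
        raise ValueError("need at least one reply before deciding")
    # Double any single quote so the GID stays a valid SQL string literal.
    literal = gid.replace("'", "''")
    if all(reply is Reply.PREPARED for reply in replies.values()):
        return f"COMMIT PREPARED '{literal}'"
    # Any single failure is enough to abort on every node.
    return f"ROLLBACK PREPARED '{literal}'"
```

With the origin node making this decision, the other nodes only have to apply whichever of the two statements they later see in the decoded stream.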
On Tue, Jan 31, 2017 at 9:05 PM, Michael Paquier <[hidden email]> wrote:
>> Personally I don't think lack of access to the GID justifies blocking 2PC
>> logical decoding. It can be added separately. But it'd be nice to have
>> especially if it's cheap.
>
> I think it should be added reading this thread.

+1. If on the logical replication master the user executes PREPARE TRANSACTION 'mumble', isn't it sensible to want the logical replica to prepare the same set of changes with the same GID? To me, that not only seems like *a* sensible thing to want to do but probably the *most* sensible thing to want to do. And then, when the eventual COMMIT PREPARED 'mumble' comes along, you want to have the replica run the same command. If you don't do that, then the alternative is that the replica has to make up new names based on the master's XID. But that kinda sucks, because now if replication stops due to a conflict or whatever and you have to disentangle things by hand, all the names on the replica are basically meaningless.

Also, including the GID in the WAL for each COMMIT/ABORT PREPARED doesn't seem inordinately expensive to me. For that to really add up to a significant cost, wouldn't you need to be doing LOTS of 2PC transactions, each with very little work, so that the commit/abort prepared records weren't swamped by everything else? That seems like an unlikely scenario, but if it does happen, that's exactly when you'll be most grateful for the GID tracking. I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
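The cost argument above can be made concrete with a back-of-the-envelope sketch. Everything here is an assumption for illustration: `gid_overhead_fraction` is a hypothetical helper, the byte counts are invented inputs rather than measurements of the patch, and only the raw GID bytes are counted, ignoring any record framing.

```python
def gid_overhead_fraction(gid_bytes: int, wal_bytes_per_xact: int) -> float:
    """Share of a transaction's WAL volume taken by one extra copy of its GID.

    `wal_bytes_per_xact` is the WAL the transaction would generate without
    the extra GID copy; both arguments are illustrative inputs.
    """
    # PostgreSQL documents that a GID must be less than 200 bytes long.
    if not 0 <= gid_bytes < 200:
        raise ValueError("GID must be shorter than 200 bytes")
    if wal_bytes_per_xact <= 0:
        raise ValueError("transaction must generate some WAL")
    return gid_bytes / (wal_bytes_per_xact + gid_bytes)


# Invented workloads: a tiny transaction versus one that does real work.
tiny = gid_overhead_fraction(gid_bytes=20, wal_bytes_per_xact=500)
busy = gid_overhead_fraction(gid_bytes=20, wal_bytes_per_xact=50_000)
print(f"tiny xact: {tiny:.2%}, busy xact: {busy:.2%}")
```

Under these made-up inputs the extra GID is a few percent of the WAL for the tiny transaction and a small fraction of a percent for the busy one, which is the shape of the argument: the cost only becomes visible when there are many 2PC transactions that each do very little work.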