Hi hackers,
A few days ago I finished my work on the WAITLSN utility statement, so I'd like to share it.

Introduction
============
Our clients who run 9.5 with asynchronous master-slave replication asked for a wait mechanism on the slave side, to prevent the situation where the slave handles a query that needs data (an LSN) that has been received and flushed, but not yet replayed.

Problem description
===================
The implementation:
- must not rely on a pg_sleep() polling loop, so as not to load the system
- must avoid race conditions when different backends want to wait for different LSNs
- must not take a snapshot of the DB, to avoid trouble with a sudden minXID change
- must have an optional timeout parameter, in case LSN traffic has stalled
- must release on postmaster death or interrupts

Implementation
==============
To avoid trouble with snapshots, WAITLSN was implemented as a utility statement; this allows us to circumvent the snapshot-taking mechanism. We tried different variants, and the most effective way was to use latches. To handle interprocess interaction, all latches are stored in shared memory, and to cope with race conditions each latch is protected by a spinlock. The timeout is an optional parameter, set in milliseconds.

What works
==========
It works well even with significant timeout or wait-period values, but of course there might be things I've overlooked.

How to use it
=============
WAITLSN 'LSN' [, timeout in ms];

# Wait until LSN 0/303EC60 is replayed, or 10 seconds have passed.
WAITLSN '0/303EC60', 10000;

# The same, without a timeout.
WAITLSN '0/303EC60';

Notice: WAITLSN will release on postmaster death or interrupt events if they arrive earlier than the LSN or the timeout.

Testing the implementation
==========================
The implementation was tested with the testgres and unittest Python modules.

How to test it:
1. Start the master server.
2. Create a table "test" and insert one tuple.
3. Set up an asynchronous replication slave (9.5: wal_level = hot_standby; 9.6 or higher: wal_level = replica).
4. Slave: START TRANSACTION ISOLATION LEVEL REPEATABLE READ; SELECT * FROM test;
5. Master: delete the tuple, run VACUUM, and get the new LSN.
6. Slave: WAITLSN 'newLSN', 60000; -- finishes with FALSE: "LSN was not reached"
7. Slave: COMMIT; then WAITLSN 'newLSN', 60000; -- finishes successfully (without a NOTICE message)

WAITLSN waits for the LSN as expected, and releases on postmaster death, interrupts, or timeout.

Your feedback is welcome!

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
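For illustration, steps 5-7 above might look like this in SQL (a sketch: the LSN is whatever the master reports, and the first wait is assumed to time out because replay of the VACUUM's cleanup conflicts with the slave's open snapshot, holding recovery back until that transaction ends):

    -- Master: delete the tuple, vacuum, and note the new LSN.
    DELETE FROM test;
    VACUUM test;
    SELECT pg_current_xlog_location();   -- suppose it returns '0/303EC60'

    -- Slave (REPEATABLE READ transaction still open):
    WAITLSN '0/303EC60', 60000;          -- finishes with FALSE
    COMMIT;
    WAITLSN '0/303EC60', 60000;          -- succeeds once the LSN is replayed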
On 31 August 2016 at 22:16, Ivan Kartyshov <[hidden email]> wrote:
> Our clients who run 9.5 with asynchronous master-slave replication
> asked for a wait mechanism on the slave side, to prevent the situation
> where the slave handles a query that needs data (an LSN) that has been
> received and flushed, but not yet replayed.

I like the broad idea - I've wanted something like it for a while. BDR has pg_xlog_wait_remote_receive() and pg_xlog_wait_remote_apply() for use in tests for this reason, but they act on the *upstream* side, waiting until the downstream has acked the data. That's not as useful for ensuring that apps connected to both the master and one or more replicas get a consistent view of data.

How do you get the commit LSN to watch for? Grab pg_current_xlog_insert_location() just after the commit and figure that replaying to that point guarantees you get the commit?

Some time ago[1] I raised the idea of reporting the commit LSN on the wire to clients. That didn't go anywhere due to compatibility and security concerns. I think those were resolvable, but it wasn't enough of a priority to push hard on at the time. A truly "right" solution has to wait for a protocol bump, but I think good-enough solutions are possible now. So you might want to read that thread.

It also mentions hesitations about exposing LSNs to clients even more. I think we're *way* past that now - we have replication origins and replication slots relying on it, it's exposed in a pg_lsn datatype, a bunch of views expose it, etc. But it might be reasonable to ask "should the client instead be expected to wait for the confirmed commit of a 64-bit epoch-extended xid, like that returned by txid_current()?". One advantage of using an xid is that you can get it while you're still in the xact, so there's no race between the commit and checking the LSN after the commit.

Are you specifically trying to ensure "this commit has replayed on the replica before we run queries on it"? Or something else?

(Also, on a side note, Kevin mentioned that it may be possible to use SSI data to achieve SERIALIZABLE read-only queries on replicas, where they get the same protection from commit-order-related anomalies as queries on the master. You might want to look more deeply into that too at some stage, if you're trying to ensure the app can do read-only queries on the replica and expect fully consistent results.)

[1] https://www.postgresql.org/message-id/flat/53E41EC1.5050603%402ndquadrant.com#53E41EC1.5050603@...

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
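For illustration, the pattern in question might look like this with WAITLSN (a sketch: table t is hypothetical and the LSN is made up; the LSN captured after COMMIT may lie slightly past the commit record, which is safe but conservative):

    -- On the master:
    BEGIN;
    UPDATE t SET v = v + 1 WHERE id = 42;
    COMMIT;
    SELECT pg_current_xlog_insert_location();  -- suppose it returns '0/16B3BD0'

    -- On the replica, before reading:
    WAITLSN '0/16B3BD0', 10000;
    SELECT v FROM t WHERE id = 42;             -- sees the committed update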
On 08/31/2016 05:54 PM, Craig Ringer wrote:
> How do you get the commit LSN to watch for? Grab
> pg_current_xlog_insert_location() just after the commit and figure
> that replaying to that point guarantees you get the commit?

That's the point: it was created in order to provide a consistent view of data between master and replica. You almost guessed it; I used GetXLogReplayRecPtr() right after the LSN was physically replayed on the downstream.

> Some time ago[1] I raised the idea of reporting the commit LSN on the
> wire to clients. That didn't go anywhere due to compatibility and
> security concerns. I think those were resolvable, but it wasn't enough
> of a priority to push hard on at the time. A truly "right" solution
> has to wait for a protocol bump, but I think good-enough solutions are
> possible now. So you might want to read that thread.

Thank you for pointing to your thread, it was very informative! It seems that you have solved a very similar problem.

> It also mentions hesitations about exposing LSNs to clients even more.
> I think we're *way* past that now - we have replication origins and
> replication slots relying on it, it's exposed in a pg_lsn datatype, a
> bunch of views expose it, etc. But it might be reasonable to ask
> "should the client instead be expected to wait for the confirmed
> commit of a 64-bit epoch-extended xid, like that returned by
> txid_current()?". One advantage of using an xid is that you can get it
> while you're still in the xact, so there's no race between the commit
> and checking the LSN after the commit.

That sounds reasonable, but I don't think it will give us any considerable benefits. I'll work through this variant, though.

> Are you specifically trying to ensure "this commit has replayed on the
> replica before we run queries on it"? Or something else?

Yes, you are right: I want to ensure data consistency on the downstream before running queries on it. Our clients would use it as part of a background worker, and maybe directly in apps too.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
In reply to this post by Kartyshov Ivan
On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
<[hidden email]> wrote:
> Hi hackers,
>
> A few days ago I finished my work on the WAITLSN utility statement,
> so I'd like to share it.
> [...]
> Your feedback is welcome!
>
> [waitlsn_10dev.patch]

Hi Ivan,

Thanks for working on this. Here are some general thoughts on the feature, and an initial review.

+1 for this feature. Explicitly waiting for a given commit to be applied is one of several approaches to achieve "causal consistency" for reads on replica nodes, and I think it will be very useful if combined with a convenient way to get the values to wait for when you run COMMIT. This could be used either by applications directly, or by middleware that somehow keeps track of dependencies between transactions and inserts waits.

I liked the way Heikki Linnakangas imagined this feature[1]:

    BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;

... or perhaps it could be spelled like this:

    BEGIN [isolation stuff] WAIT FOR COMMIT TOKEN <xxx> TIMEOUT <timeout>;

That allows waiting only at the start of a transaction, whereas your idea of making it a utility command would allow a single READ COMMITTED transaction to wait multiple times for transactions it has heard about through side channels, which may be useful. Perhaps we could support the same syntax in a standalone statement inside a transaction OR as part of a BEGIN ... statement. Being able to do it as part of BEGIN means that you can use this feature for single-snapshot transactions, i.e. REPEATABLE READ and SERIALIZABLE (of course you can't use SERIALIZABLE on hot standbys yet, but that'll be fixed one day). Otherwise you'd be waiting for the LSN in the middle of your transaction but not be able to see the result, because you don't take a new snapshot. Or maybe it's enough to use a standalone WAIT ... statement inside a REPEATABLE READ or SERIALIZABLE transaction as long as it's the first statement, and it should be an error to do so any time later?

I think working in terms of LSNs or XIDs explicitly is not a good idea: encouraging clients to think in terms of anything other than opaque 'commit tokens' seems like a bad idea because it limits future changes. For example, there is ongoing discussion about introducing CSNs (commit sequence numbers), and there are some related concepts lurking in the SSI code; maybe we'd want to use those one day. Do you think it would make sense to have a concept of a commit token that is a non-analysable string as far as clients are concerned, so that clients are not encouraged to do anything at all with them except use them in a WAIT FOR COMMIT TOKEN <xxx> statement?

INITIAL FEEDBACK ON THE PATCH

I didn't get as far as testing or detailed review, because the patch has some obvious bugs and compiler warnings which I figured we should talk about first, and I also have some higher-level questions about the design.

    gram.y:12882:15: error: assignment makes pointer from integer
    without a cast [-Werror=int-conversion]
      n->delay = $3;

It looks like struct WaitLSNStmt accidentally has 'delay' as a pointer to int. Perhaps you want an int? Maybe it would be useful to include the units (milliseconds, ms) in the name?

    waitlsn.c: In function 'WLDisownLatch':
    waitlsn.c:82:2: error: suggest parentheses around assignment used as
    truth value [-Werror=parentheses]
      if (MyBackendId = state->backend_maxid)
                      ^~

Pretty sure you want == here.
    waitlsn.c: In function 'WaitLSNUtility':
    waitlsn.c:153:17: error: initialization makes integer from pointer
    without a cast [-Werror=int-conversion]
      int tdelay = delay;
                   ^~~~~

Another place where I think you wanted an int but used a pointer to int? To fix that warning you'd need tdelay = *delay, but I think delay should really not be taken by pointer at all.

    @@ -6922,6 +6923,11 @@ StartupXLOG(void)
    +  /*
    +   * After update lastReplayedEndRecPtr set Latches in SHMEM array
    +   */
    +  WaitLSNSetLatch();
    +

I think you should try to do this only after commit records are replayed, not after every record. Only commit records can make transactions visible, and the raison d'être for this feature is to let users wait for transactions they know about to become visible. You probably can't do it directly in xact_redo_commit though, because at that point XLogCtl->lastReplayedEndRecPtr hasn't been updated yet, so a backend that wakes up might not see that it has advanced and go back to sleep. It is updated in the StartupXLOG loop after the redo function runs. That is the reason why WalRcvForceReply() is called from there rather than in xact_redo_commit, to implement remote_apply for replication. Perhaps you need something similar?

    + tdelay -= (GetCurrentTimestamp() - timer);

You can't do arithmetic with TimestampTz like this. Depending on the configure option --disable-integer-datetimes (which controls the macro HAVE_INT64_TIMESTAMP), it may be a floating-point number of seconds since the epoch, or an integer number of microseconds since the epoch. It looks like maybe the above code assumes it works in milliseconds, since you're using it to adjust your delay, which is in milliseconds? I would try to figure out how to express the logic you want with TimestampTzPlusMilliseconds and TimestampDifference. I'd probably do something like compute the absolute timeout time with TimestampTzPlusMilliseconds(GetCurrentTimestamp(), delay) at the start, and then compute the remaining delay each time through the latch wait loop with TimestampDifference(GetCurrentTimestamp(), timeout, ...), though that requires converting (seconds, microseconds) to milliseconds.

    +void
    +WaitLSNSetLatch(void)
    +{
    +  uint i;
    +  for (i = 1; i < (state->backend_maxid + 1); i++)
    +  {
    +    SpinLockAcquire(&state->l_arr[i].slock);
    +    if (state->l_arr[i].pid != 0)
    +      SetLatch(&state->l_arr[i].latch);
    +    SpinLockRelease(&state->l_arr[i].slock);
    +  }
    +}

So your approach here is to let regular backends each have their own slot indexed by backend ID, which seems good because it means that they don't have to contend for a lock, but it's bad because it means that the recovery process has to spin through the array looking for backends to wake up every time it advances, and wake them all up no matter whether they are interested in the current LSN or not. That means that they may get woken up many times before they see the value they're waiting for. Did you also consider a design where there would be a wait list/queue, and the recovery process would wake up only those backends that are waiting for LSNs <= the current replayed location? That would make the work for the recovery process cheaper (it just has to pop waiters from one end of a list sorted by the LSN they're waiting for), and let the waiting backends sleep without so many spurious wake-ups, but it would create potential for contention between backends that are calling WAITLSN at the same time, because they all need to add themselves to that queue, which would involve some kind of mutex.
I don't know if that would be better or not, but it's probably the first way that I would have tried to do this. See syncrep.c, which does something similar.

    +static void
    +WLOwnLatch(void)
    +{
    +  SpinLockAcquire(&state->l_arr[MyBackendId].slock);
    +  OwnLatch(&state->l_arr[MyBackendId].latch);
    +  is_latch_owned = true;
    +  if (MyBackendId > state->backend_maxid)
    +    state->backend_maxid += 1;
    +  state->l_arr[MyBackendId].pid = MyProcPid;
    +  SpinLockRelease(&state->l_arr[MyBackendId].slock);
    +}

I'm a bit confused about state->backend_maxid. It looks like you are using it to limit the range of slots that WaitLSNSetLatch has to scan. Isn't it supposed to contain the highest MyBackendId that has ever been seen? You appear to be incrementing backend_maxid by one, instead of recording the index of the highest slot in use, but then WaitLSNSetLatch uses it to restrict the range of indexes. Then there is the question of synchronisation of access to backend_maxid. You hold a spinlock in one arbitrary slot, but that doesn't seem sufficient: another backend may also read it, compute a new value and then write it, while holding a different spinlock. Or am I missing something?

    + if (delay > 0)
    +   latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
    + else
    +   latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;

Here you are using delay <= 0 to mean 'wait forever'. I wonder if it would be useful to have two different special values: one meaning 'wait forever', and another meaning 'don't wait at all: if it's not applied yet, then time out immediately'. In any case I'd consider defining names for the special wait times and using those for clarity: WAITLSN_INFINITE_WAIT, WAITLSN_NO_WAIT.

Later I'll have feedback on the error messages, documentation and comments, but let's talk just about the design and code for now. I hope this is helpful!

[1] https://www.postgresql.org/message-id/5642FF8F.4080803%40iki.fi

--
Thomas Munro
http://www.enterprisedb.com
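For illustration, a minimal sketch of the deadline-based timeout loop suggested above (the function name wait_for_lsn is made up; the WaitLatch call uses the 9.6-era signature, which gained an extra wait-event argument in later releases, and it assumes recovery sets the waiter's latch when replay advances):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "utils/timestamp.h"

    /*
     * Wait until target_lsn has been replayed or delay_ms elapses.
     * Returns true if the LSN was reached, false on timeout or
     * postmaster death.
     */
    static bool
    wait_for_lsn(XLogRecPtr target_lsn, long delay_ms)
    {
        TimestampTz deadline = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                                           delay_ms);

        for (;;)
        {
            long    secs;
            int     usecs;
            long    remaining_ms;
            int     rc;

            if (GetXLogReplayRecPtr(NULL) >= target_lsn)
                return true;

            /* TimestampDifference clamps to zero once the deadline passes. */
            TimestampDifference(GetCurrentTimestamp(), deadline, &secs, &usecs);
            remaining_ms = secs * 1000 + usecs / 1000;
            if (remaining_ms <= 0)
                return false;       /* timed out */

            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           remaining_ms);
            if (rc & WL_POSTMASTER_DEATH)
                return false;

            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }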
On Thu, Sep 15, 2016 at 2:41 PM, Thomas Munro
<[hidden email]> wrote:
> On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
> <[hidden email]> wrote:
>> Hi hackers,
>>
>> A few days ago I finished my work on the WAITLSN utility statement,
>> so I'd like to share it.
>> [...]
>> Your feedback is welcome!
>>
>> [waitlsn_10dev.patch]
>
> Thanks for working on this. Here are some general thoughts on the
> feature, and an initial review.

Hi Ivan,

I'm marking the patch Returned with Feedback, since there hasn't been any response or a new patch. I encourage you to keep working on this feature, and I'll be happy to review future patches.

--
Thomas Munro
http://www.enterprisedb.com
In reply to this post by Thomas Munro-3
Thank you for the reviews and suggested improvements.
I rewrote the patch to make it more stable.

Changes
=======
I've made a few changes:
1) WAITLSN now doesn't depend on a snapshot.
2) The currently replayed LSN is checked, rather than checking in xact_redo_commit.
3) Added syntax WAITLSN_INFINITE '0/693FF800' for an infinite wait, and WAITLSN_NO_WAIT '0/693FF800' to check whether an LSN has been replayed, as you advised.
4) Reduced the count of loops with GUCs (WalRcvForceReply(), which doesn't exist in 9.5).
5) Optimized the loop that sets latches.
6) Added two GUCs that help configure the influence on StartupXLOG:
   count_waitlsn (denominator, so that not every LSN is checked)
   interval_waitlsn (interval in milliseconds between additional LSN checks)

Feedback
========
On 09/15/2016 05:41 AM, Thomas Munro wrote:
> You hold a spinlock in one arbitrary slot, but that
> doesn't seem sufficient: another backend may also read it, compute a
> new value and then write it, while holding a different spinlock. Or
> am I missing something?

We acquire an individual spinlock on each member of the array, so you cannot compute a new value and write it concurrently.

Tested
======
We have tested it on different servers and OSes, in different cases and workloads. The new version is nearly as fast as vanilla on the primary, and has only a tiny influence on standby performance.

Hardware:
144 Intel cores with HT
3TB RAM
all data on ramdisk
primary + hot standby on the same node

A dataset was created with the "pgbench -i -s 1000" command. For each round of tests we pause replay on the standby, run 1,000,000 transactions on the primary with pgbench, resume replay on the standby, and measure how long the replication gap takes to disappear under different standby workloads. The workload was "WAITLSN ('Very/FarLSN', 1000ms timeout)" followed by "select abalance from pgbench_accounts where aid = random_aid;". For vanilla, the 1000ms timeout was enforced on the pgbench side with the -R option. The waitlsn GUC parameters were adapted for a 1000ms timeout on the standby with a 35000 tps rate on the primary:
interval_waitlsn = 500 (ms)
count_waitlsn = 30000

With 200 clients, the slave catches up with the master as vanilla does, without significant delay.
With 500 clients, the slave catches up 3% slower than vanilla.
With 1000 clients, 12% slower.
With 5000 clients, three times slower, because that is far beyond our hardware's ability.

How to use it
=============
WAITLSN 'LSN' [, timeout in ms];
WAITLSN_INFINITE 'LSN';
WAITLSN_NO_WAIT 'LSN';

# Wait until LSN 0/303EC60 is replayed, or 10 seconds have passed.
WAITLSN '0/303EC60', 10000;

# The same, without a timeout.
WAITLSN '0/303EC60';
WAITLSN_INFINITE '0/693FF800';

# To check whether an LSN has already been replayed:
WAITLSN_NO_WAIT '0/693FF800';

Notice: WAITLSN will release on postmaster death or interrupt events if they arrive earlier than the target LSN or the timeout.

Thank you for reading, I will be glad to get your feedback.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
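For reference, the benchmark settings above would presumably be set in the standby's postgresql.conf; the comments paraphrase the GUC descriptions given earlier:

    # standby postgresql.conf -- values used in the benchmark above
    interval_waitlsn = 500     # ms between additional LSN checks
    count_waitlsn = 30000      # check only every Nth replayed LSN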
On 23 January 2017 at 11:56, Ivan Kartyshov <[hidden email]> wrote:
> Thank you for the reviews and suggested improvements.
> I rewrote the patch to make it more stable.
> [...]
> Thank you for reading, I will be glad to get your feedback.

Could you please rebase your patch, as it no longer applies cleanly?

Thanks

Thom
On Thu, Feb 23, 2017 at 3:08 AM, Thom Brown <[hidden email]> wrote:
> On 23 January 2017 at 11:56, Ivan Kartyshov <[hidden email]> wrote:
>>
>> Thank you for reading, I will be glad to get your feedback.
>
> Could you please rebase your patch, as it no longer applies cleanly?

+1

--
Thomas Munro
http://www.enterprisedb.com
Hi Ivan,
On 2/27/17 3:52 PM, Thomas Munro wrote:
> On Thu, Feb 23, 2017 at 3:08 AM, Thom Brown <[hidden email]> wrote:
>> On 23 January 2017 at 11:56, Ivan Kartyshov <[hidden email]> wrote:
>>>
>>> Thank you for reading, I will be glad to get your feedback.
>>
>> Could you please rebase your patch, as it no longer applies cleanly?
>
> +1

Please provide a rebased patch as soon as possible.

Thanks,

--
-David
[hidden email]
Rebase done.
Meanwhile I made some more changes.

Changes
=======
1) WAITLSN is now implemented as an extension called "pg_waitlsn".
2) The new hook "lsn_updated_hook" is called right after xact_redo_commit (xlog.c).
3) The corresponding functions are:
   pg_waitlsn('0/693FF800', 10000) - wait 10 seconds
   pg_waitlsn_infinite('0/693FF800') - wait infinitely
   pg_waitlsn_no_wait('0/693FF800') - check once whether the LSN has been replayed
4) Added two GUCs that help tune the influence on StartupXLOG:
   count_waitlsn (denominator, so that not every LSN is checked): int count_waitlsn = 10;
   interval_waitlsn (interval in milliseconds between additional LSN checks): int interval_waitlsn = 100;
5) Optimized the loop that sets latches.

How to use it
=============
Master:
1) Set wal_level = replica.
Slave:
2) Add shared_preload_libraries = 'pg_waitlsn' and hot_standby = on (in postgresql.conf).
3) CREATE EXTENSION pg_waitlsn;
4) On the hot standby you can then wait for an LSN (similar to pg_sleep); when the LSN has been replayed on the slave, pg_waitlsn will release:

select pg_waitlsn('LSN' [, timeout in ms]);
select pg_waitlsn_infinite('LSN');
select pg_waitlsn_no_wait('LSN');

# Wait until LSN 0/303EC60 is replayed, or 10 seconds have passed.
select pg_waitlsn('0/303EC60', 10000);

# The same, without a timeout.
select pg_waitlsn('0/303EC60');
select pg_waitlsn_infinite('0/693FF800');

# To check whether an LSN has already been replayed:
select pg_waitlsn_no_wait('0/693FF800');

Notice: pg_waitlsn will release on postmaster death or interrupt events if they arrive earlier than the target LSN or the timeout.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
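Putting the setup steps together, a minimal sketch (the LSN shown is just an example):

    # standby postgresql.conf
    shared_preload_libraries = 'pg_waitlsn'
    hot_standby = on

    -- then, on the standby:
    CREATE EXTENSION pg_waitlsn;
    SELECT pg_waitlsn('0/303EC60', 10000);  -- wait up to 10 seconds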
On Tue, Mar 7, 2017 at 8:48 PM, Ivan Kartyshov
<[hidden email]> wrote:
> Rebase done.

Thank you for updating the patch.

> Meanwhile I made some more changes.
>
> Changes
> =======
> 1) WAITLSN is now implemented as an extension called "pg_waitlsn".

I've read the discussion so far, but I didn't see the reason why you've changed it to a contrib module. Could you tell me about that? I guess this feature would be more useful if provided as a core feature, and we need to discuss the syntax, as Thomas mentioned.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Mar 8, 2017 at 1:58 AM, Masahiko Sawada <[hidden email]> wrote:
> On Tue, Mar 7, 2017 at 8:48 PM, Ivan Kartyshov
> <[hidden email]> wrote:
>> Rebase done.
>
> Thank you for updating the patch.
>
>> Meanwhile I made some more changes.
>>
>> Changes
>> =======
>> 1) WAITLSN is now implemented as an extension called "pg_waitlsn".
>
> I've read the discussion so far, but I didn't see the reason why you've
> changed it to a contrib module. Could you tell me about that? I
> guess this feature would be more useful if provided as a core feature,
> and we need to discuss the syntax, as Thomas mentioned.

The problem with using functions like pg_waitlsn('LSN' [, timeout in ms]) instead of new syntax for transaction-starting commands like BEGIN TRANSACTION ... WAIT FOR ... is that it doesn't work for the higher isolation levels. In READ COMMITTED it's fine, because every statement runs with its own snapshot, so when SELECT pg_waitlsn(some_lsn) returns, the next statement will run with a snapshot that can see the effects of some_lsn being applied. But in REPEATABLE READ and SERIALIZABLE, even though pg_waitlsn(some_lsn) waits for the LSN to be applied, the next statement will run with the snapshot from before and never see the transaction you were waiting for.

--
Thomas Munro
http://www.enterprisedb.com
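To spell out the problem on a standby (hypothetical table t; the LSN is made up):

    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT count(*) FROM t;                  -- snapshot is taken here
    SELECT pg_waitlsn('0/693FF800', 10000);  -- waits until the LSN is applied...
    SELECT count(*) FROM t;                  -- ...but still uses the old snapshot
    COMMIT;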
In reply to this post by Masahiko Sawada
Hello
On 07.03.2017 15:58, Masahiko Sawada wrote:
> I've read the discussion so far, but I didn't see the reason why you've
> changed it to a contrib module. Could you tell me about that?

I did it on the initiative of our customer, who preferred an implementation in the form of a contrib module. The contrib realization of WAITLSN adds only a single hook to core.

On 07.03.2017 15:58, Masahiko Sawada wrote:
> I guess this feature would be more useful if provided as a core
> feature, and we need to discuss the syntax, as Thomas mentioned.

Here I have attached the rebased patch waitlsn_10dev_v3 (core feature). I will leave the choice of implementation (core/contrib) to the discretion of the community.

I will be glad to hear your suggestions about the syntax, code and patch.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
<[hidden email]> wrote:
> Here I have attached the rebased patch waitlsn_10dev_v3 (core feature).
> I will leave the choice of implementation (core/contrib) to the
> discretion of the community.
>
> I will be glad to hear your suggestions about the syntax, code and patch.

Hi Ivan,

Here is some feedback based on a first read-through of the v4 patch. I haven't tested it yet.

First, I'll restate my view of the syntax-vs-function question: I think an fmgr function is the wrong place to do this, because it doesn't work for our two higher isolation levels, as mentioned. Most people probably use READ COMMITTED most of the time, so the extension version you've proposed is certainly useful for many people and I like it, but I will vote against inclusion in core of any feature that doesn't consider all three of our isolation levels, especially if there is no way to extend support to other levels later. I don't want PostgreSQL to keep adding features that eventually force everyone to use READ COMMITTED because they want to use feature X, Y or Z. Maybe someone can think of a clever way for an extension to insert a wait for a user-supplied LSN *before* acquiring a snapshot so it can work for the higher levels, or maybe the hooks should go into core PostgreSQL so that the extension can exist as an external project not requiring a patched PostgreSQL installation, or maybe this should be done with new core syntax that extends transaction commands. Do other people have views on this?

    + * Portions Copyright (c) 2012-2017, PostgresPro Global Development Group

This should say PostgreSQL.

    +wl_lsn_updated_hook(void)
    +{
    +  uint i;
    +  /*
    +   * After update lastReplayedEndRecPtr set Latches in SHMEM array
    +   */
    +  if (counter_waitlsn % count_waitlsn == 0
    +      || TimestampDifferenceExceeds(time_waitlsn, GetCurrentTimestamp(), interval_waitlsn))
    +  {

Doesn't this mean that if you are waiting for LSN 1234, and the primary sends that LSN and then doesn't send anything for another hour, a standby waiting in pg_waitlsn is quite likely to skip that update (either because of count_waitlsn or interval_waitlsn), and then finish up waiting for a very long time?

I'm not sure if this is a good idea, but it's an idea: you could keep your update-skipping logic, but make sure you always catch the important case where recovery hits the end of WAL, by invoking your callback from WaitForWALToBecomeAvailable. It could have a boolean parameter that means 'don't skip this one!'. In other words, it's OK to skip updates, but not if you know there is no more WAL available to apply (since you have no idea how long it will be before more arrives).

Calling GetCurrentTimestamp() at high frequency (after every WAL record is replayed) may be a bad idea. It's a slow system call on some operating systems. Can we use an existing timestamp for free, like recoveryLastXTime? Remembering that the motivation for using this feature is to wait for *whole transactions* to be replayed and become visible, there is no sensible reason to wait for random WAL positions inside a transaction, so if you used that then you would effectively skip all non-COMMIT records and also skip some COMMIT records that are coming down the pipe too fast.
    +static void
    +wl_own_latch(void)
    +{
    +  SpinLockAcquire(&state->l_arr[MyBackendId].slock);
    +  OwnLatch(&state->l_arr[MyBackendId].latch);
    +  is_latch_owned = true;
    +
    +  if (state->backend_maxid < MyBackendId)
    +    state->backend_maxid = MyBackendId;
    +
    +  state->l_arr[MyBackendId].pid = MyProcPid;
    +  SpinLockRelease(&state->l_arr[MyBackendId].slock);
    +}

What is the point of using extra latches for this? Why not just use the standard latch? Syncrep and many other things do that. I'm not actually sure there is ever a reason to create more latches in regular backends. SIGUSR1 will be delivered and set the main latch anyway.

There are cases of specialised latches in the system, like the wal receiver latch, and I'm reliably informed that that solves problems like delivering a wakeup message without having to know which backend is currently the wal receiver (a problem we might consider solving today with a condition variable?). I don't think anything like that applies here.

    + for (i = 0; i <= state->backend_maxid; i++)
    + {
    +   SpinLockAcquire(&state->l_arr[i].slock);
    +   if (state->l_arr[i].pid != 0)
    +     SetLatch(&state->l_arr[i].latch);
    +   SpinLockRelease(&state->l_arr[i].slock);
    + }

Once we get through the update-skipping logic above, we hit this loop, which acquires spinlocks for every backend one after another and sets the latches of every backend, no matter whether they are waiting for the LSN that has been applied. Assuming we go with this scan-the-whole-array approach, why not include the LSN waited for in the array slots, so that we can avoid disturbing processes that are waiting for a later LSN?

Could you talk a bit about the trade-off between this approach and a queue-based approach? I would like to understand whether this really is the best way to do it. One way to use a queue would be ConditionVariableBroadcast(&state->lsn_moved), and then waiters would use a condition variable wait loop. That would make your code much simpler (you wouldn't even need your array of spinlocks + latches), but would still have the problem of processes waking up even though they're actually waiting for a later LSN. Another choice would be to create a custom wait list which actually holds the LSNs waited for in sorted order, so that we wake up exactly the right processes, cheaply, or in arbitrary order, which makes insertion cheaper but search more expensive.

--
Thomas Munro
http://www.enterprisedb.com
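For illustration, a minimal sketch of the condition-variable variant mentioned above, using the condition variable API added during the PostgreSQL 10 cycle (lsn_cv, wait_for_lsn_cv and wake_lsn_waiters are made-up names, and PG_WAIT_EXTENSION stands in for whatever wait event a real patch would report):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "pgstat.h"
    #include "storage/condition_variable.h"

    /* Lives in shared memory; initialized once with ConditionVariableInit(). */
    extern ConditionVariable lsn_cv;

    /* Waiter side: sleep until the target LSN has been replayed. */
    static void
    wait_for_lsn_cv(XLogRecPtr target_lsn)
    {
        ConditionVariablePrepareToSleep(&lsn_cv);
        while (GetXLogReplayRecPtr(NULL) < target_lsn)
            ConditionVariableSleep(&lsn_cv, PG_WAIT_EXTENSION);
        ConditionVariableCancelSleep();
    }

    /* Recovery side: call after lastReplayedEndRecPtr advances past a commit. */
    static void
    wake_lsn_waiters(void)
    {
        ConditionVariableBroadcast(&lsn_cv);
    }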
Hi Ivan,
On 3/12/17 10:20 PM, Thomas Munro wrote:
> On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
> <[hidden email]> wrote:
>> Here I have attached the rebased patch waitlsn_10dev_v3 (core feature).
>> I will leave the choice of implementation (core/contrib) to the
>> discretion of the community.
>>
>> I will be glad to hear your suggestions about the syntax, code and patch.
>
> Hi Ivan,
>
> Here is some feedback based on a first read-through of the v4 patch.
> I haven't tested it yet.

This thread has been idle for over a week. Please respond and/or post a new patch by 2017-03-24 00:00 AoE (UTC-12), or this submission will be marked "Returned with Feedback".

Thanks,

--
-David
[hidden email]
In reply to this post by Thomas Munro-3
On Sun, Mar 12, 2017 at 10:20 PM, Thomas Munro
<[hidden email]> wrote:
> Maybe someone can think of a clever way for an extension to insert a
> wait for a user-supplied LSN *before* acquiring a snapshot so it can
> work for the higher levels, or maybe the hooks should go into core
> PostgreSQL so that the extension can exist as an external project not
> requiring a patched PostgreSQL installation, or maybe this should be
> done with new core syntax that extends transaction commands. Do other
> people have views on this?

IMHO, trying to do this using a function-based interface is a really bad idea, for exactly the reasons you mention. I don't see why we'd resist the idea of core syntax here; transactions are a core part of PostgreSQL. There is, of course, the question of whether making LSNs such a user-visible thing is a good idea in the first place, but that's a separate question from the issue of what syntax for such a thing is best.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
In reply to this post by David Steele
Hi Ivan,
On 3/21/17 1:06 PM, David Steele wrote:
> Hi Ivan,
>
> On 3/12/17 10:20 PM, Thomas Munro wrote:
>> On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
>> <[hidden email]> wrote:
>>> Here I have attached the rebased patch waitlsn_10dev_v3 (core feature).
>>> I will leave the choice of implementation (core/contrib) to the
>>> discretion of the community.
>>>
>>> I will be glad to hear your suggestions about the syntax, code and patch.
>>
>> Hi Ivan,
>>
>> Here is some feedback based on a first read-through of the v4 patch.
>> I haven't tested it yet.
>
> This thread has been idle for over a week. Please respond and/or post a
> new patch by 2017-03-24 00:00 AoE (UTC-12), or this submission will be
> marked "Returned with Feedback".

This submission has been marked "Returned with Feedback". Please feel free to resubmit to a future commitfest.

Regards,

--
-David
[hidden email]
In reply to this post by Kartyshov Ivan
On 3/9/17 07:49, Ivan Kartyshov wrote:
> Here I have attached the rebased patch waitlsn_10dev_v3 (core feature).
> I will leave the choice of implementation (core/contrib) to the
> discretion of the community.

This patch is registered in the upcoming commitfest, but it needs to be rebased.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
In reply to this post by Robert Haas
On 22 March 2017 at 01:17, Robert Haas <[hidden email]> wrote:
> On Sun, Mar 12, 2017 at 10:20 PM, Thomas Munro wrote:
> [...]
> There is, of course, the question of whether making LSNs such a
> user-visible thing is a good idea in the first place, but that's a
> separate question from the issue of what syntax for such a thing is best.

(I know this is old, but:) That ship sailed a long time ago, unfortunately; they're all over pg_stat_replication, pg_replication_slots and so on. They're already routinely used for monitoring replication lag in bytes, waiting for a peer to catch up, etc.
In reply to this post by Thomas Munro-3
Hello, thank you for your comments on the main idea and the code.
On 13.03.2017 05:20, Thomas Munro wrote:

1)
> First, I'll restate my view of the syntax-vs-function question: I
> think an fmgr function is the wrong place to do this, because it
> doesn't work for our two higher isolation levels, as mentioned. Most
> people probably use READ COMMITTED most of the time, so the extension
> version you've proposed is certainly useful for many people and I like
> it, but I will vote against inclusion in core of any feature that
> doesn't consider all three of our isolation levels, especially if
> there is no way to extend support to other levels later. I don't want
> PostgreSQL to keep adding features that eventually force everyone to
> use READ COMMITTED because they want to use feature X, Y or Z.

Waiting for an LSN is expected to be used on a hot standby with READ ONLY replication. Only READ COMMITTED and REPEATABLE READ are allowed on a hot standby.

> Maybe someone can think of a clever way for an extension to insert a
> wait for a user-supplied LSN *before* acquiring a snapshot so it can
> work for the higher levels, or maybe the hooks should go into core
> PostgreSQL so that the extension can exist as an external project not
> requiring a patched PostgreSQL installation, or maybe this should be
> done with new core syntax that extends transaction commands. Do other
> people have views on this?

I think it is a good idea, but it is not clear to me how to do it properly.

2)
> +wl_lsn_updated_hook(void)
> +{
> +  uint i;
> +  /*
> +   * After update lastReplayedEndRecPtr set Latches in SHMEM array
> +   */
> +  if (counter_waitlsn % count_waitlsn == 0
> +      || TimestampDifferenceExceeds(time_waitlsn, GetCurrentTimestamp(), interval_waitlsn))
> +  {
>
> Doesn't this mean that if you are waiting for LSN 1234, and the
> primary sends that LSN and then doesn't send anything for another
> hour, a standby waiting in pg_waitlsn is quite likely to skip that
> update (either because of count_waitlsn or interval_waitlsn), and then
> finish up waiting for a very long time?
>
> I'm not sure if this is a good idea, but it's an idea: you could keep
> your update-skipping logic, but make sure you always catch the
> important case where recovery hits the end of WAL, by invoking your
> callback from WaitForWALToBecomeAvailable. It could have a boolean
> parameter that means 'don't skip this one!'. In other words, it's OK
> to skip updates, but not if you know there is no more WAL available to
> apply (since you have no idea how long it will be before more arrives).
>
> Calling GetCurrentTimestamp() at high frequency (after every WAL
> record is replayed) may be a bad idea. It's a slow system call on
> some operating systems. Can we use an existing timestamp for free,
> like recoveryLastXTime? Remembering that the motivation for using
> this feature is to wait for *whole transactions* to be replayed and
> become visible, there is no sensible reason to wait for random WAL
> positions inside a transaction, so if you used that then you would
> effectively skip all non-COMMIT records and also skip some COMMIT
> records that are coming down the pipe too fast.

Yes, I applied this idea and prepared a new patch.
3)
> +static void
> +wl_own_latch(void)
> +{
> +  SpinLockAcquire(&state->l_arr[MyBackendId].slock);
> +  OwnLatch(&state->l_arr[MyBackendId].latch);
> +  is_latch_owned = true;
> +
> +  if (state->backend_maxid < MyBackendId)
> +    state->backend_maxid = MyBackendId;
> +
> +  state->l_arr[MyBackendId].pid = MyProcPid;
> +  SpinLockRelease(&state->l_arr[MyBackendId].slock);
> +}
>
> What is the point of using extra latches for this? Why not just use
> the standard latch? Syncrep and many other things do that. I'm not
> actually sure there is ever a reason to create more latches in
> regular backends. SIGUSR1 will be delivered and set the main latch
> anyway.
>
> There are cases of specialised latches in the system, like the wal
> receiver latch, and I'm reliably informed that that solves problems
> like delivering a wakeup message without having to know which backend
> is currently the wal receiver (a problem we might consider solving
> today with a condition variable?). I don't think anything like that
> applies here.

In my case I create a bunch of shared latches, one for each backend. I'll think about how to use standard latches in an efficient way. As for specialised latches on the standby, they are already busy with WAL-replay functions.

4)
> + for (i = 0; i <= state->backend_maxid; i++)
> + {
> +   SpinLockAcquire(&state->l_arr[i].slock);
> +   if (state->l_arr[i].pid != 0)
> +     SetLatch(&state->l_arr[i].latch);
> +   SpinLockRelease(&state->l_arr[i].slock);
> + }
>
> Once we get through the update-skipping logic above, we hit this loop,
> which acquires spinlocks for every backend one after another and sets
> the latches of every backend, no matter whether they are waiting for
> the LSN that has been applied. Assuming we go with this
> scan-the-whole-array approach, why not include the LSN waited for in
> the array slots, so that we can avoid disturbing processes that are
> waiting for a later LSN?

Done.

> Could you talk a bit about the trade-off between this approach and a
> queue-based approach? I would like to understand whether this really
> is the best way to do it.
> One way to use a queue would be
> ConditionVariableBroadcast(&state->lsn_moved), and then waiters would
> use a condition variable wait loop. That would make your code much
> simpler (you wouldn't even need your array of spinlocks + latches),
> but would still have the problem of processes waking up even though
> they're actually waiting for a later LSN. Another choice would be to
> create a custom wait list which actually holds the LSNs waited for in
> sorted order, so that we wake up exactly the right processes, cheaply,
> or in arbitrary order, which makes insertion cheaper but search more
> expensive.

I'll think about how to implement waiting for an LSN with a queue in the next patch.

Thank you for your review.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company