make async slave to wait for lsn to be replayed

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
59 messages Options
123
Reply | Threaded
Open this post in threaded view
|

make async slave to wait for lsn to be replayed

Kartyshov Ivan
Hi hackers,

Few days earlier I've finished my work on WAITLSN statement utility, so
I’d like to share it.


Introduction
============

Our clients who deal with 9.5 and use asynchronous master-slave
replication, asked to make the wait-mechanism on the slave side to
prevent the situation when slave handles query which needs data (LSN)
that was received, flushed, but still not replayed.


Problem description
===================

The implementation:
Must handle the wait-mechanism using pg_sleep() in order not to load system
Must avoid race conditions if different backend want to wait for
different LSN
Must not take snapshot of DB, to avoid troubles with sudden minXID change
Must have optional timeout parameter if LSN traffic has stalled.
Must release on postmaster’s death or interrupts.


Implementation
==============

To avoid troubles with snapshots, WAITLSN was implemented as a utility
statement, this allows us to circumvent the snapshot-taking mechanism.
We tried different variants and the most effective way was to use Latches.
To handle interprocess interaction all Latches are stored in shared
memory and to cope with race conditions, each Latch is protected by a
Spinlock.
Timeout was made optional parameter, it is set in milliseconds.


What works
==========

Actually, it works well even with significant timeout or wait period
values, but of course there might be things I've overlooked.

How to use it
==========

WAITLSN ‘LSN’ [, timeout in ms];

#Wait until LSN 0/303EC60 will be replayed, or 10 second passed.
WAITLSN ‘0/303EC60’, 10000;

#Or same without timeout.
WAITLSN ‘0/303EC60’;

Notice: WAITLSN will release on PostmasterDeath or Interruption events
if they come earlier then LSN or timeout.

Testing the implementation
======================

The implementation was tested with testgres and unittest python modules.

How to test this implementation:
Start master server
Make table test, insert tuple 1
Make asynchronous slave replication (9.5 wal_level = standby, 9.6 or
higher wal_level =  replica)
Slave: START TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
                SELECT * FROM test;
Master: delete tuple + make vacuum + get new LSN
Slave: WAITLSN ‘newLSN’, 60000;
                Waitlsn finished with FALSE “LSN doesn`t reached”
Slave: COMMIT;
                WAITLSN ‘newLSN’, 60000;
                Waitlsn finished with success (without NOTICE message)

The WAITLSN as expected wait LSN, and interrupts on PostmasterDeath,
interrupts or timeout.

Your feedback is welcome!


---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

waitlsn_10dev.patch (16K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Craig Ringer-3
On 31 August 2016 at 22:16, Ivan Kartyshov <[hidden email]> wrote:

> Our clients who deal with 9.5 and use asynchronous master-slave replication,
> asked to make the wait-mechanism on the slave side to prevent the situation
> when slave handles query which needs data (LSN) that was received, flushed,
> but still not replayed.

I like the broad idea - I've wanted something like it for a while. BDR
has pg_xlog_wait_remote_receive() and pg_xlog_wait_remote_apply() for
use in tests for this reason, but they act on the *upstream* side,
waiting until the downstream has acked the data. Not as useful for
ensuring that apps connected to both master and one or more replicas
get a consistent view of data.

How do you get the commit LSN to watch for? Grab
pg_current_xlog_insert_location() just after the commit and figure
that replaying to that point guarantees you get the commit?

Some time ago[1] I raised the idea of reporting commit LSN on the wire
to clients. That didn't go anywhere due to compatibility and security
concerns. I think those were resolvable, but it wasn't enough of a
priority to push hard on at the time. A truly "right" solution has to
wait for a protocol bump, but I think good-enough solutions are
possible now. So you might want to read that thread.

It also mentions hesitations about exposing LSN to clients even more.
I think we're *way* past that now - we have replication origins and
replication slots relying on it, it's exposed in a pg_lsn datatype, a
bunch of views expose it, etc. But it might be reasonable to ask
"should the client instead be expected to wait for the confirmed
commit of a 64-bit epoch-extended xid, like that returned by
txid_current()?" . One advantage of using xid is that you can get it
while you're still in the xact, so there's no race between commit and
checking the lsn after commit.

Are you specifically trying to ensure "this commit has replayed on the
replica before we run queries on it" ? Or something else?

(Also, on a side note, Kevin mentioned that it may be possible to use
SSI data to achieve SERIALIZABLE read-only queries on replicas, where
they get the same protection from commit-order related anomalies as
queries on the master. You might want to look more deeply into that
too at some stage, if you're trying to ensure the app can do read only
queries on the master and expect fully consistent results).

[1] https://www.postgresql.org/message-id/flat/53E41EC1.5050603%402ndquadrant.com#53E41EC1.5050603@...

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Kartyshov Ivan
On 08/31/2016 05:54 PM, Craig Ringer wrote:
> How do you get the commit LSN to watch for? Grab
> pg_current_xlog_insert_location() just after the commit and figure
> that replaying to that point guarantees you get the commit?

That's the point, it was created in order to provide the cosistent view
of data between master and replica. You almost guessed, I used
GetXLogReplayRecPtr() right after LSN was physically replayed on downstream.

> Some time ago[1] I raised the idea of reporting commit LSN on the wire
> to clients. That didn't go anywhere due to compatibility and security
> concerns. I think those were resolvable, but it wasn't enough of a
> priority to push hard on at the time. A truly "right" solution has to
> wait for a protocol bump, but I think good-enough solutions are
> possible now. So you might want to read that thread.

Thank you for pointing to your thread, it was very informative!
It seems that you have solved the very similar problem.

> It also mentions hesitations about exposing LSN to clients even more.
> I think we're *way* past that now - we have replication origins and
> replication slots relying on it, it's exposed in a pg_lsn datatype, a
> bunch of views expose it, etc. But it might be reasonable to ask
> "should the client instead be expected to wait for the confirmed
> commit of a 64-bit epoch-extended xid, like that returned by
> txid_current()?" . One advantage of using xid is that you can get it
> while you're still in the xact, so there's no race between commit and
> checking the lsn after commit.

That sounds reasonable, but I dont think it will give us any
considerable benefits. But I`ll work out this variant.

> Are you specifically trying to ensure "this commit has replayed on the
> replica before we run queries on it" ? Or something else?

Yes you are right, I want to ensure data consistency on downstream
before running queries on it. Our clients would use it as a part of
background worker and maybe directly in apps too.


---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Thomas Munro-3
In reply to this post by Kartyshov Ivan
On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
<[hidden email]> wrote:
> Hi hackers,
>
> Few days earlier I've finished my work on WAITLSN statement utility, so I’d
> like to share it.
> [...]
> Your feedback is welcome!
>
> [waitlsn_10dev.patch]

Hi Ivan,

Thanks for working on this.  Here are some general thoughts on the
feature, and an initial review.

+1 for this feature.  Explicitly waiting for a given commit to be
applied is one of several approaches to achieve "causal consistency"
for reads on replica nodes, and I think it will be very useful if
combined with a convenient way to get the values to wait for when you
run COMMIT.  This could be used either by applications directly, or by
middleware that somehow keeps track of dependencies between
transactions and inserts waits.

I liked the way Heikki Linnakangas imagined this feature[1]:

  BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;

... or perhaps it could be spelled like this:

  BEGIN [isolation stuff] WAIT FOR COMMIT TOKEN <xxx> TIMEOUT <timeout>;

That allows waiting only at the start of a transaction, whereas your
idea of making a utility command would allow a single READ COMMITTED
transaction to wait multiple times for transactions it has heard about
through side channels, which may be useful.  Perhaps we could support
the same syntax in a stand alone statement inside a transaction OR as
part of a BEGIN ... statement.  Being able to do it as part of BEGIN
means that you can use this feature for single-snapshot transactions,
ie REPEATABLE READ and SERIALIZABLE (of course you can't use
SERIALIZABLE on hot standbys yet but that'll be fixed one day).
Otherwise you'd be waiting for the LSN in the middle of your
transaction but not be able to see the result because you don't take a
new snapshot.  Or maybe it's enough to use a standalone WAIT ...
statement inside a REPEATABLE READ or SERIALIZABLE transaction as long
as it's the first statement, and should be an error to do so any time
later?

I think working in terms of LSNs or XIDs explicitly is not a good
idea: encouraging clients to think in terms of anything other than
opaque 'commit tokens' seems like a bad idea because it limits future
changes.  For example, there is on-going discussion about introducing
CSNs (commit sequence numbers), and there are some related concepts
lurking in the SSI code; maybe we'd want to use those one day.  Do you
think it would make sense to have a concept of a commit token that is
a non-analysable string as far as clients are concerned, so that
clients are not encouraged to do anything at all with them except use
them in a WAIT FOR COMMIT TOKEN <xxx> statement?

INITIAL FEEDBACK ON THE PATCH

I didn't get as far as testing or detailed review because it has some
obvious bugs and compiler warnings which I figured we should talk
about first, and I also have some higher level questions about the
design.

gram.y:12882:15: error: assignment makes pointer from integer without
a cast [-Werror=int-conversion]
      n->delay = $3;

It looks like struct WaitLSNStmt accidentally has 'delay' as a pointer
to int.  Perhaps you want an int?  Maybe it would be useful to include
the units (millisecond, ms) in the name?

waitlsn.c: In function 'WLDisownLatch':
waitlsn.c:82:2: error: suggest parentheses around assignment used as
truth value [-Werror=parentheses]
  if (MyBackendId = state->backend_maxid)
  ^~

Pretty sure you want == here.

waitlsn.c: In function 'WaitLSNUtility':
waitlsn.c:153:17: error: initialization makes integer from pointer
without a cast [-Werror=int-conversion]
  int   tdelay = delay;
                 ^~~~~

Another place where I think you wanted an int but used a pointer to
int?  To fix that warning you need tdelay = *delay, but I think delay
should really not be taken by pointer at all.

@@ -6922,6 +6923,11 @@ StartupXLOG(void)
+ /*
+ * After update lastReplayedEndRecPtr set Latches in SHMEM array
+ */
+ WaitLSNSetLatch();
+

I think you should try to do this only after commit records are
replayed, not after every record.  Only commit records can make
transactions visible, and the raison d'être for this feature is to let
users wait for transactions they know about to become visible.  You
probably can't do it directly in xact_redo_commit though, because at
that point XLogCtl->lastReplayedEndRecPtr hasn't been updated yet so a
backend that wakes up might not see that it has advanced and go back
to sleep.  It is updated in the StartupXLOG loop after the redo
function runs.  That is the reason why WalRcvForceReply() is called
from there rather than in xact_redo_commit, to implement remote_apply
for replication.  Perhaps you need something similar?

+ tdelay -= (GetCurrentTimestamp() - timer);

You can't do arithmetic with TimestampTz like this.  Depending on
configure option --disable-integer-datetimes (which controls macro
HAVE_INT64_TIMESTAMP), it may be a floating point number of seconds
since the epoch, or an integer number of microseconds since the epoch.
It looks like maybe the above code assumes it works in milliseconds,
since you're using that to adjust your delay which is in milliseconds?

I would try to figure out how to express the logic you want with
TimestampTzPlusMilliseconds and TimestampDifference.  I'd probably do
something like compute the absolute timeout time with
TimestampTzPlusMilliseconds(GetCurrentTimestamp(), delay) at the start
and then compute the remaining delay each time through the latch wait
loop with TimestampDifference(GetCurrentTimestamp(), timeout, ...),
though that requires converting (seconds, microseconds) to
millisecond.

+void
+WaitLSNSetLatch(void)
+{
+ uint i;
+ for (i = 1; i < (state->backend_maxid+1); i++)
+ {
+ SpinLockAcquire(&state->l_arr[i].slock);
+ if (state->l_arr[i].pid != 0)
+ SetLatch(&state->l_arr[i].latch);
+ SpinLockRelease(&state->l_arr[i].slock);
+ }
+}

So your approach here is to let regular backends each have their own
slot indexed by backend ID, which seems good because it means that
they don't have to contend for a lock, but it's bad because it means
that the recovery process has to spin through the array looking for
backends to wake up every time it advances, and wake them all up no
matter whether they are interested in the current LSN or not.  That
means that they may get woken up many times before they see the value
they're waiting for.

Did you also consider a design where there would be a wait list/queue,
and the recovery process would wake up only those backends that are
waiting for LSNs <= the current replayed location?  That would make
the work for the recovery process cheaper (it just has to pop waiters
from one end of a list sorted by the LSN they're waiting for), and let
the waiting backends sleep without so many spurious wake-ups, but it
would create potential for contention between backends that are
calling WAITLSN at the same time because they all need to add
themselves to that queue which would involve some kind of mutex.  I
don't know if that would be better or not, but it's probably the first
way that I would have tried to do this.  See syncrep.c which does
something similar.

+static void
+WLOwnLatch(void)
+{
+ SpinLockAcquire(&state->l_arr[MyBackendId].slock);
+ OwnLatch(&state->l_arr[MyBackendId].latch);
+ is_latch_owned = true;
+ if (MyBackendId > state->backend_maxid)
+ state->backend_maxid += 1;
+ state->l_arr[MyBackendId].pid = MyProcPid;
+ SpinLockRelease(&state->l_arr[MyBackendId].slock);
+}

I'm a bit confused about state->backend_maxid.  It looks like you are
using that to limit the range of slots that WaitLSNSetLatch has to
scan.  Isn't it supposed to contain the highest MyBackendId that has
ever been seen?  You appear to be incrementing backend_maxid by one,
instead of recording the index of the highest slot in use, but then
WaitLSNSetLatch is using it to restrict the range of indexes.  Then
there is the question of the synchronisation of access to
backend_maxid.  You hold a spinlock in one arbitrary slot, but that
doesn't seem sufficient: another backend may also read it, compute a
new value and then write it, while holding a different spin lock.  Or
am I missing something?

+ if (delay > 0)
+ latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
+ else
+ latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;

Here you are using delay <= 0 as 'wait forever'.  I wonder if it would
be useful to have two different special values: one meaning 'wait
forever', and another meaning 'don't wait at all: if it's not applied
yet, then timeout immediately'.  In any case I'd consider using names
for special wait times and using those for clarity:
WAITLSN_INFINITE_WAIT, WAITLSN_NO_WAIT.

Later I'll have feedback on the error messages, documentation and
comments but let's talk just about the design and code for now.

I hope this is helpful!

[1] https://www.postgresql.org/message-id/5642FF8F.4080803%40iki.fi

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Thomas Munro-3
On Thu, Sep 15, 2016 at 2:41 PM, Thomas Munro
<[hidden email]> wrote:

> On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
> <[hidden email]> wrote:
>> Hi hackers,
>>
>> Few days earlier I've finished my work on WAITLSN statement utility, so I’d
>> like to share it.
>> [...]
>> Your feedback is welcome!
>>
>> [waitlsn_10dev.patch]
>
> Thanks for working on this.  Here are some general thoughts on the
> feature, and an initial review.

Hi Ivan

I'm marking the patch Returned with Feedback, since there hasn't been
any response or a new patch.  I encourage you to keep working on this
feature, and I'll be happy to review future patches.

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Kartyshov Ivan
In reply to this post by Thomas Munro-3
Thank you for reviews and suggested improvements.
I rewrote patch to make it more stable.

Changes
=======
I've made a few changes:
1) WAITLSN now doesn`t depend on snapshot
2) Check current replayed LSN rather than in xact_redo_commit
3) Add syntax WAITLSN_INFINITE '0/693FF800' - for infinite wait and
WAITLSN_NO_WAIT '0/693FF800' for check if LSN was replayed as you
advised.
4) Reduce the count of loops with GUCs (WalRcvForceReply() which in 9.5
doesn`t exist).
5) Optimize loop that set latches.
6) Add two GUCs that helps us to configure influence on StartupXLOG:
count_waitlsn (denominator to check not each LSN)
interval_waitlsn (Interval in milliseconds to additional LSN check)

Feedback
========
On 09/15/2016 05:41 AM, Thomas Munro wrote:
> You hold a spinlock in one arbitrary slot, but that
> doesn't seem sufficient: another backend may also read it, compute a
> new value and then write it, while holding a different spin lock.  Or
> am I missing something?

We acquire an individual spinlock on each member of array, so you cannot
compute new value and write it concurrently.

Tested
======
We have been tested it on different servers and OS`s, in different cases
and workloads. New version is nearly as fast as vanilla on primary and
bring tiny influence on standby performance.

Hardware:
144 Intel Cores with HT
3TB RAM
all data on ramdisk
primary + hotstandby  on the same node.

A dataset was created with "pgbench -i -s 1000" command. For each round
of test we pause replay on standby, make 1000000 transaction on primary
with pgbench, start replay on standby and measure replication gap
disappearing time under different standby workload. The workload was
"WAITLSN ('Very/FarLSN', 1000ms timeout)" followed by "select abalance
from pgbench_accounts there aid = random_aid;"
For vanilla 1000ms timeout was enforced on pgbench side by -R option.
GUC waitlsn parameters was adopted for 1000ms timeout on standby with
35000 tps rate on primary.
interval_waitlsn = 500 (ms)
count_waitlsn = 30000

On 200 clients, slave caching up master as vanilla without significant
delay.
On 500 clients, slave caching up master 3% slower then vanilla.
On 1000 clients, 12% slower.
On 5000 clients, 3 time slower because it far above our hardware ability.

How to use it
==========
WAITLSN ‘LSN’ [, timeout in ms];
WAITLSN_INFINITE ‘LSN’;
WAITLSN_NO_WAIT ‘LSN’;

#Wait until LSN 0/303EC60 will be replayed, or 10 second passed.
WAITLSN ‘0/303EC60’, 10000;

#Or same without timeout.
WAITLSN ‘0/303EC60’;
orfile:///home/vis/Downloads/waitlsn_10dev_v2.patch
WAITLSN_INFINITE '0/693FF800';

#To check if LSN is replayed can be used.
WAITLSN_NO_WAIT '0/693FF800';

Notice: WAITLSN will release on PostmasterDeath or Interruption events
if they come earlier then target LSN or timeout.


Thank you for reading, will be glad to get your feedback.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

waitlsn_10dev_v3.patch (18K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Thom Brown-2
On 23 January 2017 at 11:56, Ivan Kartyshov <[hidden email]> wrote:

> Thank you for reviews and suggested improvements.
> I rewrote patch to make it more stable.
>
> Changes
> =======
> I've made a few changes:
> 1) WAITLSN now doesn`t depend on snapshot
> 2) Check current replayed LSN rather than in xact_redo_commit
> 3) Add syntax WAITLSN_INFINITE '0/693FF800' - for infinite wait and
> WAITLSN_NO_WAIT '0/693FF800' for check if LSN was replayed as you
> advised.
> 4) Reduce the count of loops with GUCs (WalRcvForceReply() which in 9.5
> doesn`t exist).
> 5) Optimize loop that set latches.
> 6) Add two GUCs that helps us to configure influence on StartupXLOG:
> count_waitlsn (denominator to check not each LSN)
> interval_waitlsn (Interval in milliseconds to additional LSN check)
>
> Feedback
> ========
> On 09/15/2016 05:41 AM, Thomas Munro wrote:
>>
>> You hold a spinlock in one arbitrary slot, but that
>> doesn't seem sufficient: another backend may also read it, compute a
>> new value and then write it, while holding a different spin lock.  Or
>> am I missing something?
>
>
> We acquire an individual spinlock on each member of array, so you cannot
> compute new value and write it concurrently.
>
> Tested
> ======
> We have been tested it on different servers and OS`s, in different cases and
> workloads. New version is nearly as fast as vanilla on primary and bring
> tiny influence on standby performance.
>
> Hardware:
> 144 Intel Cores with HT
> 3TB RAM
> all data on ramdisk
> primary + hotstandby  on the same node.
>
> A dataset was created with "pgbench -i -s 1000" command. For each round of
> test we pause replay on standby, make 1000000 transaction on primary with
> pgbench, start replay on standby and measure replication gap disappearing
> time under different standby workload. The workload was "WAITLSN
> ('Very/FarLSN', 1000ms timeout)" followed by "select abalance from
> pgbench_accounts there aid = random_aid;"
> For vanilla 1000ms timeout was enforced on pgbench side by -R option.
> GUC waitlsn parameters was adopted for 1000ms timeout on standby with 35000
> tps rate on primary.
> interval_waitlsn = 500 (ms)
> count_waitlsn = 30000
>
> On 200 clients, slave caching up master as vanilla without significant
> delay.
> On 500 clients, slave caching up master 3% slower then vanilla.
> On 1000 clients, 12% slower.
> On 5000 clients, 3 time slower because it far above our hardware ability.
>
> How to use it
> ==========
> WAITLSN ‘LSN’ [, timeout in ms];
> WAITLSN_INFINITE ‘LSN’;
> WAITLSN_NO_WAIT ‘LSN’;
>
> #Wait until LSN 0/303EC60 will be replayed, or 10 second passed.
> WAITLSN ‘0/303EC60’, 10000;
>
> #Or same without timeout.
> WAITLSN ‘0/303EC60’;
> orfile:///home/vis/Downloads/waitlsn_10dev_v2.patch
> WAITLSN_INFINITE '0/693FF800';
>
> #To check if LSN is replayed can be used.
> WAITLSN_NO_WAIT '0/693FF800';
>
> Notice: WAITLSN will release on PostmasterDeath or Interruption events
> if they come earlier then target LSN or timeout.
>
>
> Thank you for reading, will be glad to get your feedback.

Could you please rebase your patch as it no longer applies cleanly.

Thanks

Thom


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Thomas Munro-3
On Thu, Feb 23, 2017 at 3:08 AM, Thom Brown <[hidden email]> wrote:
> On 23 January 2017 at 11:56, Ivan Kartyshov <[hidden email]> wrote:
>>
>> Thank you for reading, will be glad to get your feedback.
>
> Could you please rebase your patch as it no longer applies cleanly.

+1

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

David Steele
Hi Ivan,

On 2/27/17 3:52 PM, Thomas Munro wrote:
> On Thu, Feb 23, 2017 at 3:08 AM, Thom Brown <[hidden email]> wrote:
>> On 23 January 2017 at 11:56, Ivan Kartyshov <[hidden email]> wrote:
>>>
>>> Thank you for reading, will be glad to get your feedback.
>>
>> Could you please rebase your patch as it no longer applies cleanly.
>
> +1

Please provide a rebased patch as soon as possible.

Thanks,
--
-David
[hidden email]


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Kartyshov Ivan
Rebase done.

Meanwhile I made some more changes.

Changes
=======
1) WAITLSN is now implemented as an extension called "pg_waitlsn"

2) Call new hook "lsn_updated_hook" right after xact_redo_commit (xlog.c)

3) Corresponding functions:
pg_waitlsn('0/693FF800', 10000) - wait 10 seconds
pg_waitlsn_infinite('0/693FF800') - for infinite wait
pg_waitlsn_no_wait('0/693FF800') - once check if LSN was replayed or not.

4) Add two GUCs which help tuning influence on StartupXLOG:
count_waitlsn (denominator to check not each LSN)
int count_waitlsn    = 10;

interval_waitlsn (Interval in milliseconds to additional LSN check)
int interval_waitlsn = 100;

5) Optimize loop that set latches.

How to use it
==========
Master:
1) Make "wal_level = replica"
Slave:
2) Add  shared_preload_libraries = 'pg_waitlsn'
        hot_standby = on (in postgresql.conf)
3) Create extension pg_waitlsn;
4) And in hot_standby you can wait for LSN (pgsleep), when LSN will
replayed on slave pg_waitlsn will release

select pg_waitlsn(‘LSN’ [, timeout in ms]);
select pg_waitlsn_infinite(‘LSN’);
select pg_waitlsn_no_wait(‘LSN’);

#Wait until LSN 0/303EC60 will be replayed, or 10 second passed.
select pg_waitlsn(‘0/303EC60’, 10000);

#Or same without timeout.
select pg_waitlsn(‘0/303EC60’);
select pg_waitlsn_infinite('0/693FF800');

#To check if LSN is replayed can be used.
select pg_waitlsn_no_wait('0/693FF800');

Notice: select pg_waitlsn will release on PostmasterDeath or
Interruption events if they come earlier then target LSN or timeout.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_waitlsn10_v4.patch (10K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Masahiko Sawada
On Tue, Mar 7, 2017 at 8:48 PM, Ivan Kartyshov
<[hidden email]> wrote:
> Rebase done.

Thank you for updating the patch.

>
> Meanwhile I made some more changes.
>
> Changes
> =======
> 1) WAITLSN is now implemented as an extension called "pg_waitlsn"

I've read the discussion so far but I didn't see the reason why you've
changed it to as a contrib module. Could you tell me about that? I
guess this feature would be more useful if provided as a core feature
and we need to discuss about syntax as Thomas mentioned.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Thomas Munro-3
On Wed, Mar 8, 2017 at 1:58 AM, Masahiko Sawada <[hidden email]> wrote:

> On Tue, Mar 7, 2017 at 8:48 PM, Ivan Kartyshov
> <[hidden email]> wrote:
>> Rebase done.
>
> Thank you for updating the patch.
>
>>
>> Meanwhile I made some more changes.
>>
>> Changes
>> =======
>> 1) WAITLSN is now implemented as an extension called "pg_waitlsn"
>
> I've read the discussion so far but I didn't see the reason why you've
> changed it to as a contrib module. Could you tell me about that? I
> guess this feature would be more useful if provided as a core feature
> and we need to discuss about syntax as Thomas mentioned.

The problem with using functions like pg_waitlsn(‘LSN’ [, timeout in
ms]) instead of new syntax for transaction starting commands like
BEGIN TRANSACTION ... WAIT FOR ... is that it doesn't work for the
higher isolation levels.  In READ COMMITTED it's fine, because every
statement runs with its own snapshot, so when SELECT
pg_waitlsn(some_lsn) returns, the next statement will run with a
snapshot that can see the effects of some_lsn being applied.  But in
REPEATABLE READ and SERIALIZABLE, even though pg_waitlsn(some_lsn)
waits for the LSN to be applied, the next statement will run with the
snapshot from before and never see the transaction you were waiting
for.

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Kartyshov Ivan
In reply to this post by Masahiko Sawada
Hello

On 07.03.2017 15:58, Masahiko Sawada wrote:
> I've read the discussion so far but I didn't see the reason why you've
> changed it to as a contrib module. Could you tell me about that?

I did it on the initiative of our customer, who preferred the
implementation in the form of contrib. Contrib realization of WAITLSN
adds to core the only hook.

On 07.03.2017 15:58, Masahiko Sawada wrote:
 > I guess this feature would be more useful if provided as a core
 > feature and we need to discuss about syntax as Thomas mentioned.

Here I attached rebased patch waitlsn_10dev_v3 (core feature)
I will leave the choice of implementation (core/contrib) to the
discretion of the community.

Will be glad to hear your suggestion about syntax, code and patch.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

waitlsn_10dev_v3_rebased.patch (21K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Thomas Munro-3
On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
<[hidden email]> wrote:
> Here I attached rebased patch waitlsn_10dev_v3 (core feature)
> I will leave the choice of implementation (core/contrib) to the discretion
> of the community.
>
> Will be glad to hear your suggestion about syntax, code and patch.

Hi Ivan,

Here is some feedback based on a first read-through of the v4 patch.
I haven't tested it yet.

First, I'll restate my view of the syntax-vs-function question:  I
think an fmgr function is the wrong place to do this, because it
doesn't work for our 2 higher isolation levels as mentioned.  Most
people probably use READ COMMITTED most of the time so the extension
version you've proposed is certainly useful for many people and I like
it, but I will vote against inclusion in core of any feature that
doesn't consider all three of our isolation levels, especially if
there is no way to extend support to other levels later.  I don't want
PostgreSQL to keep adding features that eventually force everyone to
use READ COMMITTED because they want to use feature X, Y or Z.

Maybe someone can think of a clever way for an extension to insert a
wait for a user-supplied LSN *before* acquiring a snapshot so it can
work for the higher levels, or maybe the hooks should go into core
PostgreSQL so that the extension can exist as an external project not
requiring a patched PostgreSQL installation, or maybe this should be
done with new core syntax that extends transaction commands.  Do other
people have views on this?

+ * Portions Copyright (c) 2012-2017, PostgresPro Global Development Group

This should say PostgreSQL.

+wl_lsn_updated_hook(void)
+{
+    uint i;
+    /*
+     * After update lastReplayedEndRecPtr set Latches in SHMEM array
+     */
+    if (counter_waitlsn % count_waitlsn == 0
+        || TimestampDifferenceExceeds(time_waitlsn,GetCurrentTimestamp(),interval_waitlsn))
+    {

Doesn't this mean that if you are waiting for LSN 1234, and the
primary sends that LSN and then doesn't send anything for another
hour, a standby waiting in pg_waitlsn is quite likely to skip that
update (either because of count_waitlsn or interval_waitlsn), and then
finish up waiting for a very long time?

I'm not sure if this is a good idea, but it's an idea:  You could keep
your update skipping logic, but make sure you always catch the
important case where recovery hits the end of WAL, by invoking your
callback from WaitForWALToBecomeAvailable.  It could have a boolean
parameter that means 'don't skip this one!'.  In other words, it's OK
to skip updates, but not if you know there is no more WAL available to
apply (since you have no idea how long it will be for more to arrive).

Calling GetCurrentTimestamp() at high frequency (after every WAL
record is replayed) may be a bad idea.  It's a slow system call on
some operating systems.  Can we use an existing timestamp for free,
like recoveryLastXTime?  Remembering that the motivation for using
this feature is to wait for *whole transactions* to be replayed and
become visible, there is no sensible reason to wait for random WAL
positions inside a transaction, so if you used that then you would
effectively skip all non-COMMIT records and also skip some COMMIT
records that are coming down the pipe too fast.

+static void
+wl_own_latch(void)
+{
+    SpinLockAcquire(&state->l_arr[MyBackendId].slock);
+    OwnLatch(&state->l_arr[MyBackendId].latch);
+    is_latch_owned = true;
+
+    if (state->backend_maxid < MyBackendId)
+        state->backend_maxid = MyBackendId;
+
+    state->l_arr[MyBackendId].pid = MyProcPid;
+    SpinLockRelease(&state->l_arr[MyBackendId].slock);
+}

What is the point of using extra latches for this?  Why not just use
the standard latch?  Syncrep and many other things do that.  I'm not
actually sure if there is ever a reason to create more latches in
regular backends.  SIGUSR1 will be delivered and set the main latch
anyway.

There are cases of specialised latches in the system, like the wal
receiver latch, and I'm reliably informed that that solves problems
like delivering a wakeup message without having to know which backend
is currently the wal receiver (a problem we might consider solving
today with a condition variable?)  I don't think anything like that
applies here.

+        for (i = 0; i <= state->backend_maxid; i++)
+        {
+            SpinLockAcquire(&state->l_arr[i].slock);
+            if (state->l_arr[i].pid != 0)
+                SetLatch(&state->l_arr[i].latch);
+            SpinLockRelease(&state->l_arr[i].slock);
+        }

Once we get through the update-skipping logic above, we hit this loop
which acquires  spinlocks for every backend one after another and sets
the latches of every backend, no matter whether they are waiting for
the LSN that has been applied.  Assuming we go with this
scan-the-whole-array approach, why not include the LSN waited for in
the array slots, so that we can avoid disturbing processes that are
waiting for a later LSN?

Could you talk a bit about the trade-off between this approach and a
queue based approach?  I would like to understand whether this really
is the best way to do it.

One way to use a queue would be
ConditionVariableBroadcast(&state->lsn_moved), and then waiters would
use a condition variable wait loop.  That would make your code much
simpler (you wouldn't even need your array of spinlocks + latches) but
would still have the problem of processes waking up even though
they're actually waiting for a later LSN.  Another choice would be to
create a custom wait list which actually holds the LSNs waited for in
sorted order, so that we wake up exactly the right processes, cheaply,
or in arbitrary order which makes insertion cheaper but search more
expensive.

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

David Steele
Hi Ivan,

On 3/12/17 10:20 PM, Thomas Munro wrote:

> On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
> <[hidden email]> wrote:
>> Here I attached rebased patch waitlsn_10dev_v3 (core feature)
>> I will leave the choice of implementation (core/contrib) to the discretion
>> of the community.
>>
>> Will be glad to hear your suggestion about syntax, code and patch.
>
> Hi Ivan,
>
> Here is some feedback based on a first read-through of the v4 patch.
> I haven't tested it yet.

This thread has been idle for over a week.  Please respond and/or post a
new patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be
marked "Returned with Feedback".

Thanks,
--
-David
[hidden email]


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Robert Haas
In reply to this post by Thomas Munro-3
On Sun, Mar 12, 2017 at 10:20 PM, Thomas Munro
<[hidden email]> wrote:
> Maybe someone can think of a clever way for an extension to insert a
> wait for a user-supplied LSN *before* acquiring a snapshot so it can
> work for the higher levels, or maybe the hooks should go into core
> PostgreSQL so that the extension can exist as an external project not
> requiring a patched PostgreSQL installation, or maybe this should be
> done with new core syntax that extends transaction commands.  Do other
> people have views on this?

IMHO, trying to do this using a function-based interface is a really
bad idea for exactly the reasons you mention.  I don't see why we'd
resist the idea of core syntax here; transactions are a core part of
PostgreSQL.

There is, of course, the question of whether making LSNs such a
user-visible thing is a good idea in the first place, but that's a
separate question from issue of what syntax for such a thing is best.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

David Steele
In reply to this post by David Steele
Hi Ivan,

On 3/21/17 1:06 PM, David Steele wrote:

> Hi Ivan,
>
> On 3/12/17 10:20 PM, Thomas Munro wrote:
>> On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
>> <[hidden email]> wrote:
>>> Here I attached rebased patch waitlsn_10dev_v3 (core feature)
>>> I will leave the choice of implementation (core/contrib) to the
>>> discretion
>>> of the community.
>>>
>>> Will be glad to hear your suggestion about syntax, code and patch.
>>
>> Hi Ivan,
>>
>> Here is some feedback based on a first read-through of the v4 patch.
>> I haven't tested it yet.
>
> This thread has been idle for over a week.  Please respond and/or post a
> new patch by 2017-03-24 00:00 AoE (UTC-12) or this submission will be
> marked "Returned with Feedback".

This submission has been marked "Returned with Feedback".  Please feel
free to resubmit to a future commitfest.

Regards,
--
-David
[hidden email]


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Peter Eisentraut-6
In reply to this post by Kartyshov Ivan
On 3/9/17 07:49, Ivan Kartyshov wrote:
> Here I attached rebased patch waitlsn_10dev_v3 (core feature)
> I will leave the choice of implementation (core/contrib) to the
> discretion of the community.

This patch is registered in the upcoming commit fest, but it needs to be
rebased.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Craig Ringer-3
In reply to this post by Robert Haas
On 22 March 2017 at 01:17, Robert Haas <[hidden email]> wrote:
On Sun, Mar 12, 2017 at 10:20 PM, Thomas Munro
<[hidden email]> wrote:
> Maybe someone can think of a clever way for an extension to insert a
> wait for a user-supplied LSN *before* acquiring a snapshot so it can
> work for the higher levels, or maybe the hooks should go into core
> PostgreSQL so that the extension can exist as an external project not
> requiring a patched PostgreSQL installation, or maybe this should be
> done with new core syntax that extends transaction commands.  Do other
> people have views on this?

IMHO, trying to do this using a function-based interface is a really
bad idea for exactly the reasons you mention.  I don't see why we'd
resist the idea of core syntax here; transactions are a core part of
PostgreSQL.

There is, of course, the question of whether making LSNs such a
user-visible thing is a good idea in the first place, but that's a
separate question from issue of what syntax for such a thing is best.

(I know this is old, but):

That ship sailed a long time ago unfortunately, they're all over pg_stat_replication and pg_replication_slots and so on. They're already routinely used for monitoring replication lag in bytes, waiting for a peer to catch up, etc.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Reply | Threaded
Open this post in threaded view
|

Re: make async slave to wait for lsn to be replayed

Kartyshov Ivan
In reply to this post by Thomas Munro-3
Hello, thank you for your comments over main idea and code.

On 13.03.2017 05:20, Thomas Munro wrote:
1)

> First, I'll restate my view of the syntax-vs-function question:  I
> think an fmgr function is the wrong place to do this, because it
> doesn't work for our 2 higher isolation levels as mentioned.  Most
> people probably use READ COMMITTED most of the time so the extension
> version you've proposed is certainly useful for many people and I like
> it, but I will vote against inclusion in core of any feature that
> doesn't consider all three of our isolation levels, especially if
> there is no way to extend support to other levels later.  I don't want
> PostgreSQL to keep adding features that eventually force everyone to
> use READ COMMITTED because they want to use feature X, Y or Z.

Waiting for LSN is expected to be used on hot standby READ ONLY
replication.
Only READ COMMITTED and REPEATABLE READ, are allowed on hot standby.

> Maybe someone can think of a clever way for an extension to insert a
> wait for a user-supplied LSN *before* acquiring a snapshot so it can
> work for the higher levels, or maybe the hooks should go into core
> PostgreSQL so that the extension can exist as an external project not
> requiring a patched PostgreSQL installation, or maybe this should be
> done with new core syntax that extends transaction commands.  Do other
> people have views on this?

I think it is a good idea, but it is not clear for me, how to do it
properly.

2)

> +wl_lsn_updated_hook(void)
> +{
> +    uint i;
> +    /*
> +     * After update lastReplayedEndRecPtr set Latches in SHMEM array
> +     */
> +    if (counter_waitlsn % count_waitlsn == 0
> +        ||
> TimestampDifferenceExceeds(time_waitlsn,GetCurrentTimestamp(),interval_waitlsn))
> +    {
>
> Doesn't this mean that if you are waiting for LSN 1234, and the
> primary sends that LSN and then doesn't send anything for another
> hour, a standby waiting in pg_waitlsn is quite likely to skip that
> update (either because of count_waitlsn or interval_waitlsn), and then
> finish up waiting for a very long time?
>
> I'm not sure if this is a good idea, but it's an idea:  You could keep
> your update skipping logic, but make sure you always catch the
> important case where recovery hits the end of WAL, by invoking your
> callback from WaitForWALToBecomeAvailable.  It could have a boolean
> parameter that means 'don't skip this one!'.  In other words, it's OK
> to skip updates, but not if you know there is no more WAL available to
> apply (since you have no idea how long it will be for more to arrive).
>
> Calling GetCurrentTimestamp() at high frequency (after every WAL
> record is replayed) may be a bad idea.  It's a slow system call on
> some operating systems.  Can we use an existing timestamp for free,
> like recoveryLastXTime?  Remembering that the motivation for using
> this feature is to wait for *whole transactions* to be replayed and
> become visible, there is no sensible reason to wait for random WAL
> positions inside a transaction, so if you used that then you would
> effectively skip all non-COMMIT records and also skip some COMMIT
> records that are coming down the pipe too fast.

Yes, I applied this idea and prepared new patch.

3)

> +static void
> +wl_own_latch(void)
> +{
> +    SpinLockAcquire(&state->l_arr[MyBackendId].slock);
> +    OwnLatch(&state->l_arr[MyBackendId].latch);
> +    is_latch_owned = true;
> +
> +    if (state->backend_maxid < MyBackendId)
> +        state->backend_maxid = MyBackendId;
> +
> +    state->l_arr[MyBackendId].pid = MyProcPid;
> +    SpinLockRelease(&state->l_arr[MyBackendId].slock);
> +}
>
> What is the point of using extra latches for this?  Why not just use
> the standard latch?  Syncrep and many other things do that.  I'm not
> actually sure if there is ever a reason to create more latches in
> regular backends.  SIGUSR1 will be delivered and set the main latch
> anyway.
>
> There are cases of specialised latches in the system, like the wal
> receiver latch, and I'm reliably informed that that solves problems
> like delivering a wakeup message without having to know which backend
> is currently the wal receiver (a problem we might consider solving
> today with a condition variable?)  I don't think anything like that
> applies here.

In my case I create a bunch of shared latches for each backend, I`ll
think
how to use standard latches in an efficient way. And about specialized
latches on standby they are already in busy with wal replaying
functions.

4)

> +        for (i = 0; i <= state->backend_maxid; i++)
> +        {
> +            SpinLockAcquire(&state->l_arr[i].slock);
> +            if (state->l_arr[i].pid != 0)
> +                SetLatch(&state->l_arr[i].latch);
> +            SpinLockRelease(&state->l_arr[i].slock);
> +        }
>
> Once we get through the update-skipping logic above, we hit this loop
> which acquires  spinlocks for every backend one after another and sets
> the latches of every backend, no matter whether they are waiting for
> the LSN that has been applied.  Assuming we go with this
> scan-the-whole-array approach, why not include the LSN waited for in
> the array slots, so that we can avoid disturbing processes that are
> waiting for a later LSN?

Done.

> Could you talk a bit about the trade-off between this approach and a
> queue based approach?  I would like to understand whether this really
> is the best way to do it.
> One way to use a queue would be
> ConditionVariableBroadcast(&state->lsn_moved), and then waiters would
> use a condition variable wait loop.  That would make your code much
> simpler (you wouldn't even need your array of spinlocks + latches) but
> would still have the problem of processes waking up even though
> they're actually waiting for a later LSN.  Another choice would be to
> create a custom wait list which actually holds the LSNs waited for in
> sorted order, so that we wake up exactly the right processes, cheaply,
> or in arbitrary order which makes insertion cheaper but search more
> expensive.

I`ll think how to implemented waiting for lsn with queue in next patch.

Thank you for your review.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
123