POC and rebased patch for CSN based snapshots

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

POC and rebased patch for CSN based snapshots

movead.li@highgo.ca
Hello hackers,

I have read the community mail from 'postgrespro' which the link
below ①, a summary for the patch, it generals a CSN by timestamp
when a transaction is committed and assigns a special value as CSN
for abort transaction, and record them in CSN SLRU file. Now we can
judge if a xid available in a snapshot with a CSN value instead of by
xmin,xmax and xip array so that if we hold CSN as a snapshot which
can be export and import.


CSN may be a correct direction and an important part to implement
distributed of PostgreSQL because it delivers few data among cross-nodes
for snapshot, so the patch is meant to do some research.

We want to implement Clock-SI base on the patch.However the patch
is too old, and I rebase the infrastructure part of the patch to recently
commit(7dc37ccea85).

The origin patch does not support csn alive among database restart
because it will clean csnlog at every time the database restart, it works
well until a prepared transaction occurs due to the csn of prepare
transaction cleaned by a database restart. So I add wal support for
csnlog then csn can alive all the time, and move the csnlog clean work
to auto vacuum.

It comes to another issue, now it can't switch from a xid-base snapshot
to csn-base snapshot if a prepare transaction exists because it can not
find csn for the prepare transaction produced during xid-base snapshot.
To solve it, if the database restart with snapshot change to csn-base I
record an 'xmin_for_csn' where start to check with csn snapshot. 

Some issues known about the current patch:
1. The CSN-snapshot support repeatable read isolation level only, we
should try to support other isolation levels.

2. We can not switch fluently from xid-base->csn-base, if there be prepared
transaction in database.
 
What do you think about it, I want try to test and improve the patch step by step. 


-----------
Regards,
Highgo Software (Canada/China/Pakistan)
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca





0001-CSN-base-snapshot.patch (50K) Download Attachment
0002-Wal-for-csn.patch (27K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca
Hello hackers,

Currently, I do some changes based on the last version:
1. Catch up to the current  commit (c2bd1fec32ab54).
2. Add regression and document.
3. Add support to switch from xid-base snapshot to csn-base snapshot,
and the same with standby side.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca




0001-CSN-base-snapshot.patch (67K) Download Attachment
0002-Wal-for-csn.patch (37K) Download Attachment
0003-snapshot-switch.patch (36K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

akapila
On Fri, Jun 12, 2020 at 3:11 PM [hidden email] <[hidden email]> wrote:
>
> Hello hackers,
>
> Currently, I do some changes based on the last version:
> 1. Catch up to the current  commit (c2bd1fec32ab54).
> 2. Add regression and document.
> 3. Add support to switch from xid-base snapshot to csn-base snapshot,
> and the same with standby side.
>

AFAIU, this patch is to improve scalability and also will be helpful
for Global Snapshots stuff, is that right?  If so, how much
performance/scalability benefit this patch will have after Andres's
recent work on scalability [1]?

[1] - https://www.postgresql.org/message-id/20200301083601.ews6hz5dduc3w2se%40alap3.anarazel.de

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Fujii Masao-4
In reply to this post by movead.li@highgo.ca


On 2020/06/12 18:41, [hidden email] wrote:
> Hello hackers,
>
> Currently, I do some changes based on the last version:
> 1. Catch up to the current  commit (c2bd1fec32ab54).
> 2. Add regression and document.
> 3. Add support to switch from xid-base snapshot to csn-base snapshot,
> and the same with standby side.

Andrey also seems to be proposing the similar patch [1] that introduces CSN
into core. Could you tell me what the difference between his patch and yours?
If they are almost the same, we should focus on one together rather than
working separately?

Regards,

[1]
https://www.postgresql.org/message-id/9964cf46-9294-34b9-4858-971e9029f5c7@...

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Fujii Masao-4


On 2020/06/15 16:48, Fujii Masao wrote:

>
>
> On 2020/06/12 18:41, [hidden email] wrote:
>> Hello hackers,
>>
>> Currently, I do some changes based on the last version:
>> 1. Catch up to the current  commit (c2bd1fec32ab54).
>> 2. Add regression and document.
>> 3. Add support to switch from xid-base snapshot to csn-base snapshot,
>> and the same with standby side.

Probably it's not time to do the code review yet, but when I glanced the patch,
I came up with one question.

0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
(and inserts it into WAL buffers). Which means that new WAL record is generated
whenever CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
really necessary for CSN?

BTW, GenerateCSN() is called while holding ProcArrayLock. Also it inserts new
WAL record in WriteXidCsnXlogRec() while holding spinlock. Firstly this is not
acceptable because spinlocks are intended for *very* short-term locks.
Secondly, I don't think that WAL generation during ProcArrayLock is good
design because ProcArrayLock is likely to be bottleneck and its term should
be short for performance gain.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca

Thanks for reply.

>Probably it's not time to do the code review yet, but when I glanced the patch,
>I came up with one question.
>0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
>(and inserts it into WAL buffers). Which means that new WAL record is generated
>whenever CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
>really necessary for CSN?
This is designed for crash recovery, here we record our most new lsn in wal so it
will not use a history lsn after a restart. It will not write into wal every time, but with
gap which you can see CSNAddByNanosec() function.

>BTW, GenerateCSN() is called while holding ProcArrayLock. Also it inserts new
>WAL record in WriteXidCsnXlogRec() while holding spinlock. Firstly this is not
>acceptable because spinlocks are intended for *very* short-term locks.
>Secondly, I don't think that WAL generation during ProcArrayLock is good
>design because ProcArrayLock is likely to be bottleneck and its term should
>be short for performance gain.
Thanks for point out which may help me deeply, I will reconsider that.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca
In reply to this post by akapila

Thanks for reply.

>AFAIU, this patch is to improve scalability and also will be helpful
>for Global Snapshots stuff, is that right?  If so, how much
>performance/scalability benefit this patch will have after Andres's
>recent work on scalability [1]?
The patch focus on to be an infrastructure of sharding feature, according
to my test almost has the same performance with and without the patch.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Fujii Masao-4
In reply to this post by movead.li@highgo.ca


On 2020/06/19 12:12, [hidden email] wrote:

>
> Thanks for reply.
>
>  >Probably it's not time to do the code review yet, but when I glanced the patch,
>>I came up with one question.
>>0002 patch changes GenerateCSN() so that it generates CSN-related WAL records
>>(and inserts it into WAL buffers). Which means that new WAL record is generated
>>whenever CSN is assigned, e.g., in GetSnapshotData(). Is this WAL generation
>>really necessary for CSN?
> This is designed for crash recovery, here we record our most new lsn in wal so it
> will not use a history lsn after a restart. It will not write into wal every time, but with
> a gap which you can see CSNAddByNanosec() function.

You mean that the last generated CSN needs to be WAL-logged because any smaller
CSN than the last one should not be reused after crash recovery. Right?

If right, that WAL-logging seems not necessary because CSN mechanism assumes
CSN is increased monotonically. IOW, even without that WAL-logging, CSN afer
crash recovery must be larger than that before. No?


>>BTW, GenerateCSN() is called while holding ProcArrayLock. Also it inserts new
>>WAL record in WriteXidCsnXlogRec() while holding spinlock. Firstly this is not
>>acceptable because spinlocks are intended for *very* short-term locks.
>>Secondly, I don't think that WAL generation during ProcArrayLock is good
>>design because ProcArrayLock is likely to be bottleneck and its term should
>>be short for performance gain.
> Thanks for point out which may help me deeply, I will reconsider that.

Thanks for working on this!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca

>You mean that the last generated CSN needs to be WAL-logged because any smaller
>CSN than the last one should not be reused after crash recovery. Right?
Yes that's it. 

>If right, that WAL-logging seems not necessary because CSN mechanism assumes
>CSN is increased monotonically. IOW, even without that WAL-logging, CSN afer
>crash recovery must be larger than that before. No?

CSN collected based on time of  system in this patch, but time is not reliable all the
time. And it designed for Global CSN(for sharding) where it may rely on CSN from
other node , which generated from other machine. 

So monotonically is not reliable and it need to keep it's largest CSN in wal in case
of crash.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Fujii Masao-4


On 2020/06/19 13:36, [hidden email] wrote:

>
>  >You mean that the last generated CSN needs to be WAL-logged because any smaller
>>CSN than the last one should not be reused after crash recovery. Right?
> Yes that's it.
>
>>If right, that WAL-logging seems not necessary because CSN mechanism assumes
>>CSN is increased monotonically. IOW, even without that WAL-logging, CSN afer
>>crash recovery must be larger than that before. No?
>
> CSN collected based on time of  system in this patch, but time is not reliable all the
> time. And it designed for Global CSN(for sharding) where it may rely on CSN from
> other node , which generated from other machine.
>
> So monotonically is not reliable and it need to keep it's largest CSN in wal in case
> of crash.

Thanks for the explanaion! Understood.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Andrey Lepikhov
In reply to this post by movead.li@highgo.ca
On 6/12/20 2:41 PM, [hidden email] wrote:
> Hello hackers,
>
> Currently, I do some changes based on the last version:
> 1. Catch up to the current  commit (c2bd1fec32ab54).
> 2. Add regression and document.
> 3. Add support to switch from xid-base snapshot to csn-base snapshot,
> and the same with standby side.

Some remarks on your patch:
1. The variable last_max_csn can be an atomic variable.
2. GenerateCSN() routine: in the case than csn < csnState->last_max_csn
This is the case when someone changed the value of the system clock. I
think it is needed to write a WARNING to the log file. (May be we can do
synchronization with a time server.
3. That about global snapshot xmin? In the pgpro version of the patch we
had GlobalSnapshotMapXmin() routine to maintain circular buffer of
oldestXmins for several seconds in past. This buffer allows to shift
oldestXmin in the past when backend is importing global transaction.
Otherwise old versions of tuples that were needed for this transaction
can be recycled by other processes (vacuum, HOT, etc).
How do you implement protection from local pruning? I saw
SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
4. The current version of the patch is not applied clearly with current
master.

--
regards,
Andrey Lepikhov
Postgres Professional


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca
Thanks for the remarks,

>Some remarks on your patch:
>1. The variable last_max_csn can be an atomic variable.
Yes will consider.

>2. GenerateCSN() routine: in the case than csn < csnState->last_max_csn
>This is the case when someone changed the value of the system clock. I
>think it is needed to write a WARNING to the log file. (May be we can do
>synchronization with a time server.
Yes good point, I will work out a way to report the warning, it should exist a 
report gap rather than report every time it generates CSN.
If we really need a correct time? What's the inferiority if one node generate
csn by monotonically increasing?

>3. That about global snapshot xmin? In the pgpro version of the patch we
>had GlobalSnapshotMapXmin() routine to maintain circular buffer of
>oldestXmins for several seconds in past. This buffer allows to shift
>oldestXmin in the past when backend is importing global transaction.
>Otherwise old versions of tuples that were needed for this transaction
>can be recycled by other processes (vacuum, HOT, etc).
>How do you implement protection from local pruning? I saw
>SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
I have researched your patch which is so great, in the patch only data
out of 'global_snapshot_defer_time' can be vacuum, and it keep dead
tuple even if no snapshot import at all,right?

I am thanking about a way if we can start remain dead tuple just before
we import a csn snapshot.

Base on Clock-SI paper, we should get local CSN then send to shard nodes,
because we do not known if the shard nodes' csn bigger or smaller then
master node, so we should keep some dead tuple all the time to support
snapshot import anytime.

Then if we can do a small change to CLock-SI model, we do not use the 
local csn when transaction start, instead we touch every shard node for
require their csn, and shard nodes start keep dead tuple, and master node
choose the biggest csn to send to shard nodes.

By the new way, we do not need to keep dead tuple all the time and do
not need to manage a ring buf, we can give to ball to 'snapshot too old'
feature. But for trade off, almost all shard node need wait. 
I will send more detail explain in few days.


>4. The current version of the patch is not applied clearly with current
>master.
Maybe it's because of the release of PG13, it cause some conflict, I will
rebase it.

---
Regards,
Highgo Software (Canada/China/Pakistan)
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Andrey Lepikhov
On 7/2/20 7:31 PM, Movead Li wrote:

> Thanks for the remarks,
>
>  >Some remarks on your patch:
>  >1. The variable last_max_csn can be an atomic variable.
> Yes will consider.
>
>  >2. GenerateCSN() routine: in the case than csn < csnState->last_max_csn
>  >This is the case when someone changed the value of the system clock. I
>  >think it is needed to write a WARNING to the log file. (May be we can do
>  >synchronization with a time server.
> Yes good point, I will work out a way to report the warning, it should
> exist a
> report gap rather than report every time it generates CSN.
> If we really need a correct time? What's the inferiority if one node
> generate
> csn by monotonically increasing?
Changes in time values can lead to poor effects, such as old snapshot.
Adjusting the time can be a kind of defense.

>
>  >3. That about global snapshot xmin? In the pgpro version of the patch we
>  >had GlobalSnapshotMapXmin() routine to maintain circular buffer of
>  >oldestXmins for several seconds in past. This buffer allows to shift
>  >oldestXmin in the past when backend is importing global transaction.
>  >Otherwise old versions of tuples that were needed for this transaction
>  >can be recycled by other processes (vacuum, HOT, etc).
>  >How do you implement protection from local pruning? I saw
>  >SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
> I have researched your patch which is so great, in the patch only data
> out of 'global_snapshot_defer_time' can be vacuum, and it keep dead
> tuple even if no snapshot import at all,right?
>
> I am thanking about a way if we can start remain dead tuple just before
> we import a csn snapshot.
>
> Base on Clock-SI paper, we should get local CSN then send to shard nodes,
> because we do not known if the shard nodes' csn bigger or smaller then
> master node, so we should keep some dead tuple all the time to support
> snapshot import anytime.
>
> Then if we can do a small change to CLock-SI model, we do not use the
> local csn when transaction start, instead we touch every shard node for
> require their csn, and shard nodes start keep dead tuple, and master node
> choose the biggest csn to send to shard nodes.
>
> By the new way, we do not need to keep dead tuple all the time and do
> not need to manage a ring buf, we can give to ball to 'snapshot too old'
> feature. But for trade off, almost all shard node need wait.
> I will send more detail explain in few days.
I think, in the case of distributed system and many servers it can be
bottleneck.
Main idea of "deferred time" is to reduce interference between DML
queries in the case of intensive OLTP workload. This time can be reduced
if the bloationg of a database prevails over the frequency of
transaction aborts.
>
>
>  >4. The current version of the patch is not applied clearly with current
>  >master.
> Maybe it's because of the release of PG13, it cause some conflict, I will
> rebase it.
Ok
>
> ---
> Regards,
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca <http://www.highgo.ca/>
> EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
>
>


--
regards,
Andrey Lepikhov
Postgres Professional


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca
Hello Andrey

>> I have researched your patch which is so great, in the patch only data
>> out of 'global_snapshot_defer_time' can be vacuum, and it keep dead
>> tuple even if no snapshot import at all,right?
>>
>> I am thanking about a way if we can start remain dead tuple just before
>> we import a csn snapshot.
>>
>> Base on Clock-SI paper, we should get local CSN then send to shard nodes,
>> because we do not known if the shard nodes' csn bigger or smaller then
>> master node, so we should keep some dead tuple all the time to support
>> snapshot import anytime.
>>
>> Then if we can do a small change to CLock-SI model, we do not use the
>> local csn when transaction start, instead we touch every shard node for
>> require their csn, and shard nodes start keep dead tuple, and master node
>> choose the biggest csn to send to shard nodes.
>>
>> By the new way, we do not need to keep dead tuple all the time and do
>> not need to manage a ring buf, we can give to ball to 'snapshot too old'
>> feature. But for trade off, almost all shard node need wait.
>> I will send more detail explain in few days.
>I think, in the case of distributed system and many servers it can be
>bottleneck.
>Main idea of "deferred time" is to reduce interference between DML
>queries in the case of intensive OLTP workload. This time can be reduced
>if the bloationg of a database prevails over the frequency of
>transaction aborts.
OK there maybe a performance issue, and I have another question about Clock-SI.

For example we have three  nodes, shard1(as master), shard2, shard3, which
(time of node2) > (time of node2) > (time of node3), and you can see a picture:
http://movead.gitee.io/picture/blog_img_bad/csn/clock_si_question.png 

As far as I know about Clock-SI, left part of the blue line will setup as a snapshot
if master require a snapshot at time t1. But in fact data A should in snapshot but
not and data B should out of snapshot but not.

If this scene may appear in your origin patch? Or something my understand about
Clock-SI is wrong?


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

Andrey Lepikhov
On 7/4/20 7:56 PM, [hidden email] wrote:

>
>
>     As far as I know about Clock-SI, left part of the blue line will
>     setup as a snapshot
>
>     if master require a snapshot at time t1. But in fact data A should
>     in snapshot but
>
>     not and data B should out of snapshot but not.
>
>
>     If this scene may appear in your origin patch? Or something my
>     understand about
>
>     Clock-SI is wrong?
>
>
>

Sorry for late answer.

I have doubts that I fully understood your question, but still.
What real problems do you see here? Transaction t1 doesn't get state of
shard2 until time at node with shard2 won't reach start time of t1.
If transaction, that inserted B wants to know about it position in time
relatively to t1 it will generate CSN, attach to node1 and will see,
that t1 is not started yet.

Maybe you are saying about the case that someone who has a faster data
channel can use the knowledge from node1 to change the state at node2?
If so, i think it is not a problem, or you can explain your idea.

--
regards,
Andrey Lepikhov
Postgres Professional


Reply | Threaded
Open this post in threaded view
|

Re: POC and rebased patch for CSN based snapshots

movead.li@highgo.ca

>I have doubts that I fully understood your question, but still.
>What real problems do you see here? Transaction t1 doesn't get state of
>shard2 until time at node with shard2 won't reach start time of t1.
>If transaction, that inserted B wants to know about it position in time
>relatively to t1 it will generate CSN, attach to node1 and will see,
>that t1 is not started yet.
 
>Maybe you are saying about the case that someone who has a faster data
>channel can use the knowledge from node1 to change the state at node2?
>If so, i think it is not a problem, or you can explain your idea.
Sorry, I think this is my wrong understand about Clock-SI. At first I expect
we can get a absolutly snapshot, for example B should not include in the
snapshot because it happened after time t1. How ever Clock-SI can not guarantee
that and no design guarantee that at all.



Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca