zheap: a new storage format for PostgreSQL

zheap: a new storage format for PostgreSQL

akapila
Some time back, Robert proposed a solution to reduce bloat in PostgreSQL [1], which has some other advantages of its own as well.  To recap: in the existing heap, an update always creates a new version of the tuple, which must eventually be removed by periodic vacuuming or by HOT-pruning; even so, in many cases space is never completely reclaimed.  A similar problem occurs for tuples that are deleted.  This leads to bloat in the database.

At EnterpriseDB, we (me and some of my colleagues) have been working for more than a year on a new storage format in which only the latest version of the data is kept in main storage and the old versions are moved to an undo log.  We call this new storage format "zheap".  To be clear, this proposal is for PG-12.  The purpose of posting this at this stage is that it can serve as an example to be integrated with the pluggable storage API patch, and to get some early feedback on the design.  The purpose of this email is to introduce the overall project; however, I think going forward we need to discuss some of the subsystems (like indexing, tuple locking, vacuum for non-delete-marked indexes, undo log storage, undo workers, etc.) in separate threads.

The three main advantages of this new format are:
1. Provide better control over bloat (a) by allowing in-place updates in common cases and (b) by reusing space as soon as a transaction that has performed a delete or non-in-place-update has committed.  In short, with this new storage, whenever possible, we’ll avoid creating bloat in the first place.

2. Reduce write amplification both by avoiding rewrites of heap pages (for setting hint-bits, freezing, etc.) and by making it possible to do an update that touches indexed columns without updating every index.

3. Reduce the tuple size by (a) shrinking the tuple header and (b) eliminating most alignment padding.

You can check README.md in the project folder [1] to understand how to use it and what the open issues are.  The detailed design of the project is present in src/backend/access/zheap/README.  The code for this project is being developed in a GitHub repository [1].  You can also read about this project in Robert's recent blog [2].  I have also added a few notes on integration with the pluggable API on the zheap wiki page [3].

Preliminary performance results
-------------------------------------------

We’ve measured the performance improvement of zheap over heap in a few different pgbench scenarios.  All of these tests were run with data that fits in shared_buffers (32GB), and 16 transaction slots per zheap page.  Scenarios 1 and 2 used synchronous_commit = off; Scenarios 3 and 4 used synchronous_commit = on.


Scenario 1: A 15-minute simple-update pgbench test with scale factor 100 shows a 5.13% TPS improvement with 64 clients. The performance improvement increases with the scale factor; at scale factor 1000, it reaches 11.5% with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            4.08 GB   3 GB              -26.47%
              1000           15 GB     12.6 GB           -16%

* The size of the zheap tables increases because of insertions into the pgbench_history table.


Scenario 2: To show the effect of bloat, we’ve performed another test similar to the previous scenario, but with a transaction kept open for the first 15 minutes of a 30-minute test. This restricts HOT-pruning for heap and undo discarding for zheap for the first half of the test.

Scale factor 1000: 75.86% TPS improvement for zheap at 64 clients.
Scale factor 3000: 98.18% TPS improvement for zheap at 64 clients.


             Scale Factor   HEAP    ZHEAP (tables)*   Improvement
After test   1000           19 GB   14 GB             -26.3%
             3000           45 GB   37 GB             -17.7%

* The size of the zheap tables increases because of insertions into the pgbench_history table.


The reason for this huge performance improvement is that when the long-running transaction is committed after 900 seconds, autovacuum workers start working and degrade the performance of heap for a long time. In addition, the heap tables are bloated by a significant amount. On the other hand, the undo worker discards the undo very quickly, and there is no bloat in the zheap relations. In brief, zheap confines the bloat to undo segments; we just need to determine how much undo can be discarded and remove it, which is cheap.


Scenario 3: A 15-minute simple-update pgbench test with scale factor 100 shows a 6% TPS improvement with 64 clients. The performance improvement increases with the scale factor, reaching 11.8% at scale factor 1000 with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            2.88 GB   2.20 GB           -23.61%
              1000           13.9 GB   11.7 GB           -15.8%

* The size of the zheap tables increases because of insertions into the pgbench_history table.


Scenario 4: To amplify the effect of bloat in Scenario 3, we’ve performed another test similar to that scenario, but a transaction is kept open for the first 15 minutes of a 30-minute test. This restricts HOT-pruning for heap and undo discarding for zheap for the first half of the test.


             Scale Factor   HEAP      ZHEAP (tables)   Improvement
After test   1000           15.5 GB   12.4 GB          -20%
             3000           40.2 GB   35 GB            -12.9%



Pros
--------
1. Zheap has better performance characteristics: it is smaller in size, and it has an efficient mechanism to discard undo in the background, which is cheaper than HOT-pruning.
2. The performance improvement is huge in cases where heap bloats; zheap instead shifts the bloat to the undo, which is discarded quickly.
3. We will also see a good performance boost in cases where an UPDATE statement updates only a few of the indexed columns.
4. System slowdowns due to vacuum (or autovacuum) would be reduced to a great extent.
5. Due to fewer rewrites of the heap (no freezing, HOT-pruning, hint-bit setting, etc.), the overall writes and the WAL volume will be lower.

Cons
-----------
1. Deletes can be somewhat expensive.
2. Transaction aborts will be expensive.
3. Updates that update most of the indexed columns can be somewhat expensive.

Credits
------------
Robert did much of the basic design work.  The design and development of various subsystems of zheap have been done by a team comprising me, Dilip Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena Emerson, and Amit Khandekar.  Thomas Munro wrote the undo storage system.  Marc Linster has provided unfailing management support, and Andres Freund has provided some design input (and criticism).  Neha Sharma and Tushar Ahuja are helping with the testing of this project.

[1] - https://github.com/EnterpriseDB/zheap
[2] - http://rhaas.blogspot.in/2018/01/do-or-undo-there-is-no-vacuum.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: zheap: a new storage format for PostgreSQL

akapila
On Thu, Mar 1, 2018 at 7:39 PM, Amit Kapila <[hidden email]> wrote:
>
> Preliminary performance results
> -------------------------------------------
>

I did not use plain-text mode in my previous email, due to which the
performance results might not be clear in some email clients, so I am
attaching them again as a PDF.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: zheap_perf_data.pdf (98K)

Re: zheap: a new storage format for PostgreSQL

Satyanarayana Narlapuram


>> Cons

>> -----------
>> 1. Deletes can be somewhat expensive.
>> 2. Transaction aborts will be expensive.
>> 3. Updates that update most of the indexed columns can be somewhat expensive.

Given transaction aborts are expensive, is there any impact on the crash recovery? Did you perform any tests on the recovery duration?

Thanks,
Satya




From: Amit Kapila <[hidden email]>
Sent: Thursday, March 1, 2018 7:05:12 AM
To: PostgreSQL Hackers
Subject: Re: zheap: a new storage format for PostgreSQL
 
On Thu, Mar 1, 2018 at 7:39 PM, Amit Kapila <[hidden email]> wrote:
>
> Preliminary performance results
> -------------------------------------------
>

I did not use plain-text mode in my previous email, due to which the
performance results might not be clear in some email clients, so I am
attaching them again as a PDF.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: zheap: a new storage format for PostgreSQL

Hartmut Holzgraefe
On 01.03.2018 16:30, Satyanarayana Narlapuram wrote:
> Given transaction aborts are expensive, is there any impact on the crash
> recovery?

In InnoDB/XtraDB, which has used the "move old row versions to the UNDO
log" approach since the very beginning, rollbacks are indeed costly, and
especially so on recovery, when the UNDO log pages are not yet cached in RAM.

There is a cost trade-off between this kind of "optimistic MVCC" and
rollback/recovery cost that one has to be aware of.

We get support issues about this at MariaDB every once in a while, but
it is not happening that often.

I can dig up some more info on this from the InnoDB side if you are
interested ...

--
hartmut


Re: zheap: a new storage format for PostgreSQL

akapila
In reply to this post by Satyanarayana Narlapuram
On Thu, Mar 1, 2018 at 9:00 PM, Satyanarayana Narlapuram
<[hidden email]> wrote:

>
>>> Cons
>
>>> -----------
>>> 1. Deletes can be somewhat expensive.
>>> 2. Transaction aborts will be expensive.
>>> 3. Updates that update most of the indexed columns can be somewhat
>>> expensive.
>
> Given transaction aborts are expensive, is there any impact on the crash
> recovery?

I don't think there should be any direct impact of aborts on recovery
time, as we start processing the undo records after recovery is done.
Basically, we invoke an undo worker after recovery, which performs the
aborts in the background.

>Did you perform any tests on the recovery duration?
>

Not yet, but I think we will do it after making some more progress.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: zheap: a new storage format for PostgreSQL

Alexander Korotkov
In reply to this post by akapila
On Thu, Mar 1, 2018 at 5:09 PM, Amit Kapila <[hidden email]> wrote:
Preliminary performance results
-------------------------------------------

We’ve measured the performance improvement of zheap over heap in a few different pgbench scenarios.  All of these tests were run with data that fits in shared_buffers (32GB), and 16 transaction slots per zheap page.  Scenarios 1 and 2 used synchronous_commit = off; Scenarios 3 and 4 used synchronous_commit = on.


What hardware did you use for benchmarks?
Also, I note that you have 4 transaction slots per zheap page in the GitHub code, while you use 16 in the benchmarks.

#define MAX_PAGE_TRANS_INFO_SLOTS 4

I would also note that in the code you preserve only 3 bits for the transaction slot number.  So, one has to redefine 3 macros to change the number of transaction slots to the value you used in the benchmarks.

#define ZHEAP_XACT_SLOT 0x3800 /* 3 bits (12, 13 and 14) for transaction slot */
#define ZHEAP_XACT_SLOT_MASK 0x000B /* 11 - mask to retrieve transaction slot */

I'm only starting to review this, but it makes me think that we need the number of transaction slots to be tunable (or even auto-tunable).

BTW, the last two macros don't look properly named to me.  I would rather rename them in the following way:
ZHEAP_XACT_SLOT_MASK => ZHEAP_XACT_SLOT_OFFSET
ZHEAP_XACT_SLOT => ZHEAP_XACT_SLOT_MASK

Scenario 1: A 15-minute simple-update pgbench test with scale factor 100 shows a 5.13% TPS improvement with 64 clients. The performance improvement increases with the scale factor; at scale factor 1000, it reaches 11.5% with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            4.08 GB   3 GB              -26.47%
              1000           15 GB     12.6 GB           -16%

* The size of the zheap tables increases because of insertions into the pgbench_history table.


I think the representation of the results should be improved.  You show the total size of the database, but it's hard to understand how much the bloat degree was really decreased, given that there are both update and append-only tables.  So, I propose to show the results in a per-table manner.

What is the total number of transactions processed in both cases?  It would also be fairer to compare sizes for the same number of processed transactions.

Also, what are the index sizes?  What are the undo log sizes for zheap?
 
I also suggest using a Zipfian distribution in testing.  It's closer to real-world workloads, and it would be a good stress test for both HOT and transaction slots.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: zheap: a new storage format for PostgreSQL

akapila
On Fri, Mar 2, 2018 at 2:42 AM, Alexander Korotkov <[hidden email]> wrote:
On Thu, Mar 1, 2018 at 5:09 PM, Amit Kapila <[hidden email]> wrote:
Preliminary performance results
-------------------------------------------

We’ve measured the performance improvement of zheap over heap in a few different pgbench scenarios.  All of these tests were run with data that fits in shared_buffers (32GB), and 16 transaction slots per zheap page.  Scenarios 1 and 2 used synchronous_commit = off; Scenarios 3 and 4 used synchronous_commit = on.


What hardware did you use for benchmarks?
Also, I note that you have 4 transaction slots per zheap page in the GitHub code, while you use 16 in the benchmarks.

#define MAX_PAGE_TRANS_INFO_SLOTS 4

I would also note that in the code you preserve only 3 bits for the transaction slot number.  So, one has to redefine 3 macros to change the number of transaction slots to the value you used in the benchmarks.

#define ZHEAP_XACT_SLOT 0x3800 /* 3 bits (12, 13 and 14) for transaction slot */
#define ZHEAP_XACT_SLOT_MASK 0x000B /* 11 - mask to retrieve transaction slot */

I'm only starting to review this, but it makes me think that we need the number of transaction slots to be tunable (or even auto-tunable).


Yeah, that is the plan.  The idea is that for now we will provide a compile-time option to configure the number of slots (the patch for this is ready; currently we are testing it); later we can give the user an option at the relation level, or whatever we decide.  The reason an option at the relation level makes sense is that larger relations can do with very few transaction slots, considering that the chances of many transactions operating on the same page are low; it is only for smaller relations that we need a larger number of slots.  OTOH, there could be workloads where we can expect many concurrent transactions on the same page.  For now, if you want to test, the patch to increase the number of transaction slots is attached; you need to change the values of a few macros according to the number of slots you want.
 
BTW, last two macros don't look properly named for me.  I would rather rename them in a following way:
ZHEAP_XACT_SLOT_MASK => ZHEAP_XACT_SLOT_OFFSET

How about ZHEAP_XACT_SLOT_SHIFT?  I see similar things named with a *_SHIFT suffix in the code.
 
ZHEAP_XACT_SLOT => ZHEAP_XACT_SLOT_MASK


Makes sense.  I will change it.
 

Scenario 1: A 15-minute simple-update pgbench test with scale factor 100 shows a 5.13% TPS improvement with 64 clients. The performance improvement increases with the scale factor; at scale factor 1000, it reaches 11.5% with 64 clients.


              Scale Factor   HEAP      ZHEAP (tables)*   Improvement
Before test   100            1281 MB   1149 MB           -10.3%
              1000           13 GB     11 GB             -15.38%
After test    100            4.08 GB   3 GB              -26.47%
              1000           15 GB     12.6 GB           -16%

* The size of the zheap tables increases because of insertions into the pgbench_history table.


I think the representation of the results should be improved.  You show the total size of the database, but it's hard to understand how much the bloat degree was really decreased, given that there are both update and append-only tables.  So, I propose to show the results in a per-table manner.


Fair enough.  Kuntal has done this testing; he will share the results you requested.
 
What is total number of transactions processed in both cases?  It would be also more fair to compare sizes for the same number of processed transactions.

Also, what are index sizes?  What are undo log sizes for zheap?
 

There shouldn't be any change in the index sizes, and by the end of the tests the undo is completely discarded.  I think to see the impact of undo size, we need some different tests where we keep a transaction open till the end of the test, or some such.
 
I also suggest to use Zipfian distribution in testing.  It's more close to real world workloads.  And it would be a good stress test for both HOT and transaction slots.


Yeah, we can do such tests, but keep in mind that this code is still a work in progress: a lot of things are going to change, and so far we have not done much optimization of the code to improve the performance numbers.

Thanks a lot for showing interest in this work!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: increase_slots_15.patch (1K)

Re: zheap: a new storage format for PostgreSQL

Álvaro Herrera
In reply to this post by akapila
I think it was impolite to post this on the very same day the commitfest
started.  We have enough patches as it is ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: zheap: a new storage format for PostgreSQL

Mark Kirkwood-2
On 02/03/18 16:53, Alvaro Herrera wrote:

> I think it was impolite to post this on the very same day the commitfest
> started.  We have enough patches as it is ...
>

To be fair - he did say things like "wanting feedback..." and "shows an
example of using pluggable storage..", and that this is for PG 12. If he
had held onto the patches and waited, he'd get criticism of the form "you
should have given a heads-up earlier...".

This is earlier :-)

Best wishes

Mark

P.s: awesome work.


Re: zheap: a new storage format for PostgreSQL

Fabien COELHO-3
In reply to this post by akapila

Hello Amit,

> At EnterpriseDB, we (me and some of my colleagues) are working from more
> than a year on the new storage format in which only the latest version of
> the data is kept in main storage and the old versions are moved to an undo
> log.  [...]

This looks more than great!

> *We’ve shown the performance improvement of zheap over heap in a few
> different pgbench scenarios. [...]

> 2. Transaction aborts will be expensive.

ISTM that some scenarios should also test the performance impact where
zheap storage is expected to be worse than heap storage, i.e. with
some rollbacks, which will exercise the undo machinery. There does not
seem to be any such test in your report; I apologise if I misread it.

I would suggest that you can use pgbench scripts such as:

   -- commit.sql
   \set aid random(1, 100000 * :scale)
   BEGIN;
   UPDATE pgbench_accounts
     SET abalance = abalance + 1
     WHERE aid = :aid;
   COMMIT;

and

   -- rollback.sql
   \set aid random(1, 100000 * :scale)
   BEGIN;
   UPDATE pgbench_accounts
     SET abalance = abalance + 1
     WHERE aid = :aid;
   ROLLBACK;

that can run with various weights to change how much rollback is injected,
eg 1% rollback rate is achieved with:

   pgbench -T 10 -P 1 -M prepared -r \
     -f SQL/commit.sql@99 -f SQL/rollback.sql@1

Also, I would be wary of doing only max-speed tests; consider also more
realistic --rate tests where the tps is fixed.

--
Fabien.

RE: zheap: a new storage format for PostgreSQL

Tsunakawa, Takayuki
In reply to this post by akapila
From: Amit Kapila [mailto:[hidden email]]
> At EnterpriseDB, we (me and some of my colleagues) are working from more
> than a year on the new storage format in which only the latest version of
> the data is kept in main storage and the old versions are moved to an undo
> log.  We call this new storage format "zheap".  To be clear, this proposal
> is for PG-12.

Wonderful!  BTW, what does "z" stand for?  Ultimate?


> Credits
> ------------
> Robert did much of the basic design work.  The design and development of
> various subsystems of zheap have been done by a team comprising of me, Dilip
> Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena Emerson,
> and Amit Khandekar.  Thomas Munro wrote the undo storage system.  Marc
> Linster has provided unfailing management support, and Andres Freund has
> provided some design input (and criticism).  Neha Sharma and Tushar Ahuja
> are helping with the testing of this project.

What a gorgeous star team!


Below are my first questions and comments.

(1)
This is a simple question from the user's perspective.  What kinds of workloads would you recommend zheap and heap for, respectively?  Are you going to recommend zheap for all use cases, and will heap be deprecated?  I think we need to be clear on this in the manual, at least before the final release.

I felt zheap would be better for update-intensive workloads.  Then, how about insert-and-read-mostly databases like a data warehouse?  zheap seems better for that too, since the database size is reduced.  Although data loading may generate more transaction log for undo, that increase is offset by the reduction of the tuple header in WAL.

zheap allows us to run long-running analytics and reporting queries simultaneously with updates without concern about database bloat, so zheap is a step toward HTAP, right?


(2)
Can zheap be used for system catalogs?  If yes, we won't be bothered by system catalog bloat, e.g. as a result of repeated creation and deletion of temporary tables.


(3)
> Scenario 1: A 15 minutes simple-update pgbench test with scale factor 100
> shows 5.13% TPS improvement with 64 clients. The performance improvement
> increases as we increase the scale factor; at scale factor 1000, it
> reaches11.5% with 64 clients.

What was the fillfactor?  What would be the comparison when HOT works effectively for heap?


(4)
"Undo logs are not yet crash-safe. Fsync and some recovery details are yet to be implemented."

"We also want to make FSM crash-safe, since we can’t count on
VACUUM to recover free space that we neglect to record."

Would these directly affect the response time of each transaction?  Do you predict that the performance difference will get smaller when these are implemented?


(5)
"The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with alignment):
2 bytes each for infomask and infomask2, and one byte for t_hoff.  I think we
might be able to squeeze some space from t_infomask, but for now, I have kept
it as two bytes.  All transactional information is stored in undo, so fields
that store such information are not needed here."

"To check the visibility of a
tuple, we fetch the transaction slot number stored in the tuple header, and
then get the transaction id and undo record pointer from transaction slot."

Where in the tuple header is the transaction slot number stored?


(6)
"As of now, we have four transaction slots per
page, but this can be changed.  Currently, this is a compile-time option;  we
can decide later whether such an option is desirable in general for users."

"The one known problem with the fixed number of slots is that
it can lead to deadlock, so we are planning to add  a mechanism to allow the
array of transactions slots to be continued on a separate overflow page.   We
also need such a mechanism to support cases where a large number of
transactions acquire SHARE or KEY SHARE locks on a single page."

I wish for this.  I was bothered by deadlocks with Oracle and had to tune INITRANS in CREATE TABLE.  A fixed number of slots introduces a new configuration parameter, which gives the DBA one more thing to worry about and one more statistics figure to monitor for tuning.


(7)
What index AMs does "indexes which lack delete-marking support" apply to?

Can we be freed from vacuum in a typical use case where only zheap and B-tree indexes are used?


(8)
How does rollback after subtransaction rollback work?  Does the undo of a whole transaction skip the undo of the subtransaction?



(9)
Will the prepare step of 2PC transactions be slower, as they have to persist the undo log safely?

Regards
Takayuki Tsunakawa




Re: zheap: a new storage format for PostgreSQL

akapila
In reply to this post by Álvaro Herrera
On Fri, Mar 2, 2018 at 9:23 AM, Alvaro Herrera <[hidden email]> wrote:
> I think it was impolite to post this on the very same day the commitfest
> started.  We have enough patches as it is ...
>

I can understand your concern, but honestly, I have no intention to
hinder the current commitfest work.  We have been preparing to post
this for more than a month, but it took some time to finish the
documentation and to fix some other issues.  I could have posted this
after the CF as well, but I was not sure there was any benefit in
delaying because, at this stage, we are not expecting much code
review, but rather some feedback on the high-level design, and I think
it can certainly help the pluggable API project.  I think the chances
of getting the pluggable API into this release are remote, but maybe
we can get some small portion of it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: zheap: a new storage format for PostgreSQL

akapila
In reply to this post by Mark Kirkwood-2
On Fri, Mar 2, 2018 at 9:29 AM, Mark Kirkwood
<[hidden email]> wrote:

> On 02/03/18 16:53, Alvaro Herrera wrote:
>
>> I think it was impolite to post this on the very same day the commitfest
>> started.  We have enough patches as it is ...
>>
>
> To be fair - he did say things like "wanting feedback..." and "shows an
> example of using pluggable storage.." and for PG 12. If he held onto the
> patches and waited - he'd get criticism of the form "you should have given a
> heads up earlier...".
>
>
> P.s: awesome work.
>

Thanks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: zheap: a new storage format for PostgreSQL

akapila
In reply to this post by Fabien COELHO-3
On Fri, Mar 2, 2018 at 1:35 PM, Fabien COELHO <[hidden email]> wrote:

>
> Hello Amit,
>
>> At EnterpriseDB, we (me and some of my colleagues) are working from more
>> than a year on the new storage format in which only the latest version of
>> the data is kept in main storage and the old versions are moved to an undo
>> log.  [...]
>
>
> This looks more than great!
>

Thanks.

>> *We’ve shown the performance improvement of zheap over heap in a few
>> different pgbench scenarios. [...]
>
>
>> 2. Transaction aborts will be expensive.
>
>
> ISTM that some scenarii should also test the performance impact when the
> zheap storage is expected to be worse than the heap storage, i.e. with some
> rollback which will exercise the undo stuff. There does not seem to be any
> in your report, I apologise if I misread it.
>

No, there isn't any.  One idea we have to mitigate this cost is to
allow rollbacks to happen in the background; a patch for that is
currently being worked on.

> I would suggest that you can use pgbench scripts such as:
>
>   -- commit.sql
>   \set aid random(1, 100000 * :scale)
>   BEGIN;
>   UPDATE pgbench_accounts
>     SET abalance = abalance + 1
>     WHERE aid = :aid;
>   COMMIT;
>
> and
>
>   -- rollback.sql
>   \set aid random(1, 100000 * :scale)
>   BEGIN;
>   UPDATE pgbench_accounts
>     SET abalance = abalance + 1
>     WHERE aid = :aid;
>   ROLLBACK;
>
> that can run with various weights to change how much rollback is injected,
> eg 1% rollback rate is achieved with:
>
>   pgbench -T 10 -P 1 -M prepared -r \
>     -f SQL/commit.sql@99 -f SQL/rollback.sql@1
>
> Also, I would be wary of doing only max speed test, and consider more
> realistic --rate tests where the tps is fixed.
>

Your suggestions are good; we will try to do some tests based on these
ideas after making some more progress on rollbacks (there is some
pending work on rollbacks, as mentioned in README.md).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: zheap: a new storage format for PostgreSQL

Kuntal Ghosh
In reply to this post by Alexander Korotkov
On Fri, Mar 2, 2018 at 2:42 AM, Alexander Korotkov
<[hidden email]> wrote:
>
> I think results representation should be improved.  You show total size of the database, but it's hard to understand how bloat degree was really decreased, assuming that there are both update and append-only tables.  So, I propose to show the results in per table manner.
>
> What is total number of transactions processed in both cases?  It would be also more fair to compare sizes for the same number of processed transactions.
>
> Also, what are index sizes?  What are undo log sizes for zheap?
>
I've added the table sizes and TPS to the performance results.  As of
now, we've just performed stress testing using pgbench.  We have plans
for other tests, including:
1. Introducing a random delay in the transactions instead of keeping a
transaction open for 15 minutes.
2. A combination of ROLLBACK and COMMIT (as suggested by Fabien).
3. pgbench tests for a fixed number of transactions.
4. Modifying the distribution (as suggested by Alexander Korotkov).

Do let me know if any other tests are required.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment: zheap_perf_data_1.pdf (102K)

Re: zheap: a new storage format for PostgreSQL

akapila
In reply to this post by Tsunakawa, Takayuki
On Fri, Mar 2, 2018 at 1:50 PM, Tsunakawa, Takayuki
<[hidden email]> wrote:
> From: Amit Kapila [mailto:[hidden email]]
>> At EnterpriseDB, we (me and some of my colleagues) are working from more
>> than a year on the new storage format in which only the latest version of
>> the data is kept in main storage and the old versions are moved to an undo
>> log.  We call this new storage format "zheap".  To be clear, this proposal
>> is for PG-12.
>
> Wonderful!  BTW, what "z" stand for?  Ultimate?
>

There is no special meaning to 'z'.  We discussed quite a few names
(like newheap, nheap, zheap, and some more along those lines), but
zheap sounded better.  IIRC, one of Robert or Thomas came up with the
name.

>
>
> Below are my first questions and comments.
>
> (1)
> This is a pure simple question from the user's perspective.  What kind of workloads would you recommend zheap and heap respectively?
>

I think you have already mentioned some of the important use cases for
zheap, namely, update-intensive workloads and probably cases where
users run long-running queries alongside updates.

>  Are you going to recommend zheap for all use cases, and will heap be deprecated?
>

Oh, no. I don't think so.  We have not yet measured zheap's
performance in very many scenarios, so it is difficult to speak for
all cases, but I think deletes, updates that modify most of the
indexed columns, and rollbacks will eventually be somewhat costlier in
zheap.  At this stage we can't measure everything because (a) a few
things are not implemented yet and (b) we have not done much
performance optimization of the code.


> I felt zheap would be better for update-intensive workloads.  Then, how about insert-and-read-mostly databases like a data warehouse?  zheap seems better for that, since the database size is reduced.  Although data loading may generate more transaction logs for undo, that increase is offset by the reduction of the tuple header in WAL.
>

We have implemented an optimization where we don't need to WAL-log the
complete undo data, as it can be regenerated from the page during
recovery if full_page_writes is enabled.

> zheap allows us to run long-running analytics and reporting queries simultaneously with updates without the concern on database bloat, so zheap is a way toward HTAP, right?
>

I think so.

>
> (2)
> Can zheap be used for system catalogs?
>

As of now, we are not planning to support it for system catalogs, as
that involves much more work, but I think we could do it if we wanted to.

>
> (3)
>> Scenario 1: A 15 minutes simple-update pgbench test with scale factor 100
>> shows 5.13% TPS improvement with 64 clients. The performance improvement
>> increases as we increase the scale factor; at scale factor 1000, it
>> reaches 11.5% with 64 clients.
>
> What was the fillfactor?
>

Default.

>  What would be the comparison when HOT works effectively for heap?
>

I guess this is the case where HOT works effectively.

>
> (4)
> "Undo logs are not yet crash-safe. Fsync and some recovery details are yet to be implemented."
>
> "We also want to make FSM crash-safe, since we can’t count on
> VACUUM to recover free space that we neglect to record."
>
> Would these directly affect the response time of each transaction?
>

Not the first one; the second might depend on the actual
implementation, but I think it is difficult to predict much at this
stage.

>
> (5)
> "The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with alignment):
> 2 bytes each for infomask and infomask2, and one byte for t_hoff.  I think we
> might be able to squeeze some space from t_infomask, but for now, I have kept
> it as two bytes.  All transactional information is stored in undo, so fields
> that store such information are not needed here."
>
> "To check the visibility of a
> tuple, we fetch the transaction slot number stored in the tuple header, and
> then get the transaction id and undo record pointer from transaction slot."
>
> Where in the tuple header is the transaction slot number stored?
>

In t_infomask2; refer to zhtup.h.
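To make that concrete, here is a toy sketch of packing a transaction slot number into a tuple's infomask2 word. The actual bit layout lives in zhtup.h; the mask and shift values below are purely illustrative assumptions, not the real ones.

```python
# Illustrative only: assume 3 bits of t_infomask2 (bits 8-10) hold the
# transaction slot number, leaving the remaining bits as flag bits.
ZHEAP_XACT_SLOT_MASK = 0x0700   # assumed slot-number bits
ZHEAP_XACT_SLOT_SHIFT = 8

def get_trans_slot(t_infomask2: int) -> int:
    """Extract the transaction slot number from the tuple's infomask2."""
    return (t_infomask2 & ZHEAP_XACT_SLOT_MASK) >> ZHEAP_XACT_SLOT_SHIFT

def set_trans_slot(t_infomask2: int, slot: int) -> int:
    """Store a slot number, leaving the other flag bits untouched."""
    return (t_infomask2 & ~ZHEAP_XACT_SLOT_MASK) | (slot << ZHEAP_XACT_SLOT_SHIFT)

mask = set_trans_slot(0x0021, 3)   # some unrelated flag bits already set
assert get_trans_slot(mask) == 3   # slot number round-trips
assert mask & 0x0021 == 0x0021     # flag bits preserved
```

Visibility checking then only needs this small index to find the per-page slot holding the xid and undo record pointer, which is how the header shrinks to 5 bytes.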

>
> (6)
> "As of now, we have four transaction slots per
> page, but this can be changed.  Currently, this is a compile-time option;  we
> can decide later whether such an option is desirable in general for users."
>
> "The one known problem with the fixed number of slots is that
> it can lead to deadlock, so we are planning to add a mechanism to allow the
> array of transaction slots to be continued on a separate overflow page.  We
> also need such a mechanism to support cases where a large number of
> transactions acquire SHARE or KEY SHARE locks on a single page."
>
> I wish for this.  I was bothered with deadlocks with Oracle and had to tune INITRANS with CREATE TABLE.  The fixed number of slots introduces a new configuration parameter, which adds something the DBA has to be worried about and monitor a statistics figure for tuning.
>

Yeah.
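For readers unfamiliar with the slot scheme being discussed, here is a toy model of a page with a fixed slot array. Everything here (names, structures) is a made-up illustration of the idea, not the actual zheap code; it just shows why a fixed slot count can run out and why an overflow mechanism is wanted.

```python
# Toy model: each page carries a small fixed array of transaction slots;
# a tuple header stores only a slot index, and the slot holds the xid and
# undo record pointer needed for visibility checks.
from dataclasses import dataclass
from typing import Optional

N_SLOTS = 4  # the compile-time default mentioned in the thread

@dataclass
class TransSlot:
    xid: int
    undo_ptr: int

@dataclass
class ZPage:
    slots: list  # fixed-size list of Optional[TransSlot]

def reserve_slot(page: ZPage, xid: int, undo_ptr: int) -> Optional[int]:
    """Claim a free slot, or return None when the page is slot-full --
    the situation the planned overflow pages are meant to resolve."""
    for i, s in enumerate(page.slots):
        if s is None:
            page.slots[i] = TransSlot(xid, undo_ptr)
            return i
    return None

page = ZPage(slots=[None] * N_SLOTS)
assert reserve_slot(page, xid=100, undo_ptr=0xA0) == 0
for xid in (101, 102, 103):
    reserve_slot(page, xid, 0)
# A fifth concurrent transaction finds no free slot on this page:
assert reserve_slot(page, xid=104, undo_ptr=0) is None
```

This is essentially the same pressure Oracle's INITRANS tunes for; the overflow-page design would remove the hard ceiling instead of asking the DBA to size it.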

>
> (7)
> What index AMs does "indexes which lack delete-marking support" apply to?
>

Currently, delete-marking is not supported for any of the indexes, but
we are planning to do it for B-tree.

> Can we be freed from vacuum in a typical use case where only zheap and B-tree indexes are used?
>

It depends on what you mean by typical workloads.  For some workloads,
such as inserting monotonically increasing values and then deleting
the initial values from the index (say someone inserts
11111111111111...2222222222....333333... and then deletes all 1's),
we might not immediately reclaim space in the index.  I don't think we
need vacuum per se for such cases, but we will eventually need some
way to clear that bloat.  However, I think we are still far from
there.

>
> (8)
> How does rollback after subtransaction rollback work?  Does the undo of a whole transaction skip the undo of the subtransaction?
>

We rewind the undo pointer after rolling back a subtransaction, so we
only need to roll back the remaining part.
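The rewind idea can be sketched with the undo log modeled as an append-only list per transaction. This is an illustrative model only, not the real undo machinery: rolling back a subtransaction applies its records in reverse and rewinds the insertion pointer, so a later full rollback replays only what remains below the savepoint.

```python
# Toy model of per-transaction undo with savepoint rewind.
undo_log = []

def savepoint() -> int:
    return len(undo_log)            # remember the current undo insertion point

def rollback_to(sp: int) -> None:
    while len(undo_log) > sp:       # apply and discard the newest records first
        undo_log.pop()

undo_log.extend(["undo-A", "undo-B"])   # main transaction's changes
sp = savepoint()
undo_log.extend(["undo-C", "undo-D"])   # subtransaction's changes
rollback_to(sp)                         # subxact aborts: rewind to the savepoint
assert undo_log == ["undo-A", "undo-B"] # a full rollback now replays only these
```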

>
> (9)
> Will the prepare of 2pc transactions be slower, as they have to safely save undo log?
>

I don't think so; for prepared transactions, we only need to save the
'from' and 'to' undo record pointers.  OTOH, we have not yet measured
the performance of this case.



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: zheap: a new storage format for PostgreSQL

Alexander Korotkov
On Fri, Mar 2, 2018 at 1:31 PM, Amit Kapila <[hidden email]> wrote:
On Fri, Mar 2, 2018 at 1:50 PM, Tsunakawa, Takayuki
<[hidden email]> wrote:
> From: Amit Kapila [mailto:[hidden email]]
>> At EnterpriseDB, we (me and some of my colleagues) are working from more
>> than a year on the new storage format in which only the latest version of
>> the data is kept in main storage and the old versions are moved to an undo
>> log.  We call this new storage format "zheap".  To be clear, this proposal
>> is for PG-12.
>
> Wonderful!  BTW, what "z" stand for?  Ultimate?
>

There is no special meaning to 'z'.  We have discussed quite a few
names (like newheap, nheap, zheap and some more on those lines), but
zheap sounds better.  IIRC, one among Robert or Thomas has come up
with this name.

I would propose "zero-bloat heap" as a disambiguation of zheap.  Seems like a fair enough explanation to me, without any need to rename :)

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: zheap: a new storage format for PostgreSQL

Thomas Munro-3
On Fri, Mar 2, 2018 at 11:35 PM, Alexander Korotkov
<[hidden email]> wrote:

> On Fri, Mar 2, 2018 at 1:31 PM, Amit Kapila <[hidden email]> wrote:
>> On Fri, Mar 2, 2018 at 1:50 PM, Tsunakawa, Takayuki
>> <[hidden email]> wrote:
>> > Wonderful!  BTW, what "z" stand for?  Ultimate?
>>
>> There is no special meaning to 'z'.  We have discussed quite a few
>> names (like newheap, nheap, zheap and some more on those lines), but
>> zheap sounds better.  IIRC, one among Robert or Thomas has come up
>> with this name.
>
> I would propose "zero-bloat heap" disambiguation of zheap.  Seems like fair
> enough explanation for me without need to rename :)

Nice.

A weird idea I had is that it adds a Z dimension to your tables.
That's a bit... far-fetched, I admit.

--
Thomas Munro
http://www.enterprisedb.com


Re: zheap: a new storage format for PostgreSQL

Aleksander Alekseeev
In reply to this post by akapila
Hello Amit,

> Sometime back Robert has proposed a solution to reduce the bloat in
> PostgreSQL [1] which has some other advantages of its own as well.  To
> recap, in the existing heap, we always create a new version of a tuple on
> an update which must eventually be removed by periodic vacuuming or by
> HOT-pruning, but still in many cases space is never reclaimed completely.
> A similar problem occurs for tuples that are deleted.  This leads to bloat
> in the database.

This is an impressive work!

Personally, I would like to note that performance is probably not a
priority at this stage. The most important parts, in my humble opinion
at least, are correctness, maintainability (tests, documentation, how
readable the code is), extensibility (e.g. the ability to add
point-in-time recovery in the future), interfaces, and the heap
format. There is a saying about premature optimization... I don't
remember the exact words or who said it.

--
Best regards,
Aleksander Alekseev


Re: zheap: a new storage format for PostgreSQL

Robert Haas
In reply to this post by Alexander Korotkov
On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
<[hidden email]> wrote:
> I would propose "zero-bloat heap" disambiguation of zheap.  Seems like fair
> enough explanation for me without need to rename :)

It will be possible to bloat a zheap table in certain usage patterns.
For example, if you bulk-load the table with a ton of data, commit the
transaction, delete every other row, and then never insert any more
rows ever again, the table is bloated: it's twice as large as it
really needs to be, and we have no provision for shrinking it.  In
general, I think it's very hard to keep bulk deletes from leaving
bloat in the table, and to the extent that it *is* possible, we're not
doing it.  One could imagine, for example, an index-organized table
that automatically combines adjacent pages when they're empty enough,
and that also relocates data to physically lower-numbered pages
whenever possible.  Such a storage engine might automatically shrink
the on-disk footprint after a large delete, but we have no plans to go
in that direction.
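The bulk-delete example above is easy to put numbers on. The figures below are illustrative, not measurements: deleting every other row frees space inside each page, but the file keeps all its pages, so the on-disk footprint stays about twice the live data.

```python
# Back-of-envelope for the bulk-load-then-delete-half scenario.
tuples_per_page = 100   # illustrative density
n_pages = 1000          # pages after the initial bulk load

live_tuples_after_delete = (tuples_per_page // 2) * n_pages  # every other row gone
pages_needed_if_compacted = live_tuples_after_delete // tuples_per_page

assert live_tuples_after_delete == 50_000
assert pages_needed_if_compacted == 500
# The table still occupies n_pages, i.e. 2x what the live data needs:
assert n_pages / pages_needed_if_compacted == 2.0
```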

Rather, our assumption is that the bloat most people care about comes
from updates.  By performing updates in-place as often as possible, we
hope to avoid bloating both the heap (because we're not adding new row
versions to it which then have to be removed) and the indexes (because
if we don't add new row versions at some other TID, then we don't need
to add index pointers to that new TID either, or remove the old index
pointers to the old TID).  Without delete-marking, we can basically
optimize the case that is currently handled via HOT updates: no
indexed columns have changed.  However, the in-place update has a
major advantage that it still works even when the page is completely
full, provided that the row does not expand.  As Amit's results show,
that can hugely reduce bloat and increase performance in the face of
long-running concurrent transactions.  With delete-marking, we can
also optimize the case where indexed columns have been changed.  We
don't know exactly how well this will work yet because the code isn't
written and therefore can't be benchmarked, but I am hopeful that
in-place updates will be a big win here too.
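The contrast between the two update strategies can be shown with a deliberately tiny model. This is not either engine's real code: it just mimics the bookkeeping described above, where a non-HOT heap update appends a new row version and a new index pointer, while a zheap update overwrites in place and pushes the old version to undo.

```python
# Toy model: 'table' is a list of row versions indexed by TID,
# 'indexes' are dicts from TID to value, 'undo' is the undo log.
def heap_update(table, indexes, tid, new_val):
    new_tid = len(table)
    table.append(new_val)          # new row version appended -> table grows
    for idx in indexes:
        idx[new_tid] = new_val     # extra index pointer for the new TID
    return new_tid

def zheap_update(table, undo, tid, new_val):
    undo.append((tid, table[tid])) # old version goes to the undo log
    table[tid] = new_val           # overwrite in place -> table size unchanged
    return tid

heap, idx = ["v1"], {0: "v1"}
heap_update(heap, [idx], 0, "v2")
assert len(heap) == 2 and len(idx) == 2        # heap and index both grew

ztab, undo = ["v1"], []
zheap_update(ztab, undo, 0, "v2")
assert len(ztab) == 1 and undo == [(0, "v1")]  # only the undo log grew
```

Once the aborting or overwriting transaction's undo is discarded, the zheap side has produced no lasting growth at all, which is the bloat-avoidance claim in a nutshell.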

So, I would not describe a zheap table as zero-bloat, but it should
involve a lot less bloat than our standard heap.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
