Move unused buffers to freelist


Move unused buffers to freelist

amitkapila

As discussed and concluded in the mail thread (http://www.postgresql.org/message-id/006f01ce34f0$d6fa8220$84ef8660$@kapila@...) about moving unused buffers to the end of the freelist,

I have implemented the idea and taken some performance data.

 

 

In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0 && refcount = 0) buffers to the end of the freelist. I have implemented a new API, StrategyMoveBufferToFreeListEnd(), to move buffers to the end of the freelist.
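A minimal illustrative sketch of what such an API could look like is below. This is not the actual patch (see the attachment for that); it assumes the freelist.c context (StrategyControl, BufferDescriptors, BufFreelistLock, the FREENEXT_* markers, and lastFreeBuffer tracking the list tail, as in the existing code) and is modelled on the existing StrategyFreeBuffer():

/*
 * Illustrative sketch only: append a buffer to the tail of the freelist
 * instead of the head, which is what StrategyFreeBuffer() does.
 */
void
StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
{
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

    /* Add the buffer only if it is not already in the freelist. */
    if (buf->freeNext == FREENEXT_NOT_IN_LIST)
    {
        buf->freeNext = FREENEXT_END_OF_LIST;
        if (StrategyControl->firstFreeBuffer < 0)
        {
            /* List was empty: this buffer becomes both head and tail. */
            StrategyControl->firstFreeBuffer = buf->buf_id;
        }
        else
        {
            /* Link the current tail to this buffer. */
            BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext =
                buf->buf_id;
        }
        StrategyControl->lastFreeBuffer = buf->buf_id;
    }

    LWLockRelease(BufFreelistLock);
}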

 

Performance Data:

 

Configuration Details

O/S – Suse-11

RAM – 24GB

Number of Cores – 8

Server Conf – checkpoint_segments = 256; checkpoint_timeout = 25 min, synchronous_commit = off, shared_buffers = 5GB

Pgbench – Select-only

Scalefactor – 1200

Time – Each run is of 20 mins

 

Below data is the average of 3 runs of 20 minutes each:

 

                        8C-8T                16C-16T            32C-32T            64C-64T

HEAD               11997               8455                 4989                 2757

After Patch        19807               13296               8388                 2821

 

Detailed data for each run is attached with this mail.

 

This is just the initial data; I will collect more data for different shared_buffers settings and other configurations.

 

Feedback/Suggestions?

 

With Regards,

Amit Kapila.




Attachments: move_unused_buffers_to_freelist.htm (25K), move_unsed_buffers_to_freelist.patch (7K)

Re: Move unused buffers to freelist

Greg Smith-21
On 5/14/13 9:42 AM, Amit Kapila wrote:
> In the attached patch, bgwriter/checkpointer moves unused (usage_count
> =0 && refcount = 0) buffer’s to end of freelist. I have implemented a
> new API StrategyMoveBufferToFreeListEnd() to

There's a comment in the new function:

It is possible that we are told to put something in the freelist that
is already in it; don't screw up the list if so.

I don't see where the code does anything to handle that though.  What
was your intention here?

This area has always been the tricky part of the change.  If you do
something complicated when adding new entries, like scanning the
freelist for duplicates, you run the risk of holding BufFreelistLock for
too long.  To try and see that in benchmarks, I would use a small
database scale (I typically use 100 for this type of test) and a large
number of clients.  "-M prepared" would help get a higher transaction
rate out of the hardware too.  It might take a server with a large core
count to notice any issues with holding the lock for too long though.

Instead you might just invalidate buffers before they go onto the list.
  Doing that will then throw away usefully cached data though.

To try and optimize both insertion speed and retaining cached data, I
was thinking about using a hash table for the free buffers, instead of
the simple linked list approach used in the code now.

Also:  check the formatting on the additions to bufmgr.c; I noticed a
spaces vs. tabs difference on lines 35/36 of your patch.

--
Greg Smith   2ndQuadrant US    [hidden email]   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: Move unused buffers to freelist

amitkapila
On Wednesday, May 15, 2013 12:44 AM Greg Smith wrote:

> On 5/14/13 9:42 AM, Amit Kapila wrote:
> > In the attached patch, bgwriter/checkpointer moves unused
> (usage_count
> > =0 && refcount = 0) buffer's to end of freelist. I have implemented a
> > new API StrategyMoveBufferToFreeListEnd() to
>
> There's a comment in the new function:
>
> It is possible that we are told to put something in the freelist that
> is already in it; don't screw up the list if so.
>
> I don't see where the code does anything to handle that though.  What
> was your intention here?

The intention is to put the entry in the freelist only if it is not already in
the freelist, which is accomplished by the check
if (buf->freeNext == FREENEXT_NOT_IN_LIST). Whenever an entry is removed from the
freelist, buf->freeNext is marked as FREENEXT_NOT_IN_LIST.
Code Reference (last line):
StrategyGetBuffer()
{
..
..
while (StrategyControl->firstFreeBuffer >= 0)
        {
                buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
                Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

                /* Unconditionally remove buffer from freelist */
                StrategyControl->firstFreeBuffer = buf->freeNext;
                buf->freeNext = FREENEXT_NOT_IN_LIST;

...
}

Also the same check exists in StrategyFreeBuffer().

> This area has always been the tricky part of the change.  If you do
> something complicated when adding new entries, like scanning the
> freelist for duplicates, you run the risk of holding BufFreelistLock
> for
> too long.

Yes, this is true, and I have tried to hold this lock for a minimal time.
In this patch, BufFreelistLock is held only to put the unused buffer at the end
of the freelist.

> To try and see that in benchmarks, I would use a small
> database scale (I typically use 100 for this type of test) and a large
> number of clients.  

>"-M prepared" would help get a higher transaction
> rate out of the hardware too.  It might take a server with a large core
> count to notice any issues with holding the lock for too long though.

This is a good idea; I shall take another set of readings with "-M prepared".
 
> Instead you might just invalidate buffers before they go onto the list.
>   Doing that will then throw away usefully cached data though.

Yes, if we invalidate buffers, it might throw away usefully cached data,
especially when the working set is just a tiny bit smaller than shared_buffers.
Robert pointed this out in his mail:
http://www.postgresql.org/message-id/CA+TgmoYhWsz__KtSxm6BuBirE7VR6Qqc_COkbE
[hidden email]


> To try and optimize both insertion speed and retaining cached data,

I think the method proposed by the patch takes care of both, because it
directly puts the free buffer at the end of the freelist, and
because it doesn't invalidate the buffers it can retain cached data for a
longer period.
Do you see any flaw in the current approach?

> I
> was thinking about using a hash table for the free buffers, instead of
> the simple linked list approach used in the code now.

Okay, we can try different methods for maintaining free buffers if we find
the current approach doesn't turn out to be good.
 
> Also:  check the formatting on the additions to in bufmgr.c, I noticed
> a
> spaces vs. tabs difference on lines 35/36 of your patch.

Thanks for pointing it out; I shall send an updated patch along with the next set of
performance data.


With Regards,
Amit Kapila.




Re: Move unused buffers to freelist

amitkapila

On Wednesday, May 15, 2013 8:38 AM Amit Kapila wrote:
> On Wednesday, May 15, 2013 12:44 AM Greg Smith wrote:
> > On 5/14/13 9:42 AM, Amit Kapila wrote:
> > > In the attached patch, bgwriter/checkpointer moves unused
> > > (usage_count = 0 && refcount = 0) buffers to the end of the freelist.
> > > I have implemented a new API, StrategyMoveBufferToFreeListEnd(), to
> >
> > There's a comment in the new function:
> >
> > It is possible that we are told to put something in the freelist that
> > is already in it; don't screw up the list if so.
> >
> > I don't see where the code does anything to handle that though.  What
> > was your intention here?
>
> The intention is to put the entry in the freelist only if it is not already
> in the freelist, which is accomplished by the check
> if (buf->freeNext == FREENEXT_NOT_IN_LIST). Whenever an entry is removed
> from the freelist, buf->freeNext is marked as FREENEXT_NOT_IN_LIST.
> Code Reference (last line):
> StrategyGetBuffer()
> {
> ..
> ..
> while (StrategyControl->firstFreeBuffer >= 0)
>         {
>                 buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
>                 Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
>
>                 /* Unconditionally remove buffer from freelist */
>                 StrategyControl->firstFreeBuffer = buf->freeNext;
>                 buf->freeNext = FREENEXT_NOT_IN_LIST;
>
> ...
> }
>
> Also the same check exists in StrategyFreeBuffer().
>
> > This area has always been the tricky part of the change.  If you do
> > something complicated when adding new entries, like scanning the
> > freelist for duplicates, you run the risk of holding BufFreelistLock
> > for too long.
>
> Yes, this is true, and I have tried to hold this lock for a minimal time.
> In this patch, BufFreelistLock is held only to put the unused buffer at
> the end of the freelist.
>
> > To try and see that in benchmarks, I would use a small
> > database scale (I typically use 100 for this type of test) and a large
> > number of clients.

I shall try this test. Do you have any suggestions for shared buffers and the number of clients for the 100 scale factor?

> > "-M prepared" would help get a higher transaction
> > rate out of the hardware too.  It might take a server with a large core
> > count to notice any issues with holding the lock for too long though.
>
> This is a good idea; I shall take another set of readings with "-M prepared".
>
> > Instead you might just invalidate buffers before they go onto the list.
> >   Doing that will then throw away usefully cached data though.
>
> Yes, if we invalidate buffers, it might throw away usefully cached data,
> especially when the working set is just a tiny bit smaller than
> shared_buffers.
> Robert pointed this out in his mail:
> http://www.postgresql.org/message-id/CA+TgmoYhWsz__KtSxm6BuBirE7VR6Qqc_COkbE
> [hidden email]
>
> > To try and optimize both insertion speed and retaining cached data,
>
> I think the method proposed by the patch takes care of both, because it
> directly puts the free buffer at the end of the freelist, and because it
> doesn't invalidate the buffers it can retain cached data for a longer
> period.
> Do you see any flaw in the current approach?
>
> > I was thinking about using a hash table for the free buffers, instead of
> > the simple linked list approach used in the code now.
>
> Okay, we can try different methods for maintaining free buffers if we find
> the current approach doesn't turn out to be good.
>
> > Also:  check the formatting on the additions to bufmgr.c, I noticed a
> > spaces vs. tabs difference on lines 35/36 of your patch.
>
> Thanks for pointing it out; I shall send an updated patch along with the
> next set of performance data.

 

 

Further Performance Data:

 

Below data is the average of 3 runs of 20 minutes each:

Scale Factor   - 1200

Shared Buffers - 7G

 

 

                   8C-8T                16C-16T               32C-32T            64C-64T

HEAD               1739                   1461                 1578                 1609

After Patch        4029                   1924                 1743                 1706

 

 

Scale Factor   - 1200

Shared Buffers – 10G

 

                   8C-8T                16C-16T               32C-32T            64C-64T

HEAD               2004                   2270                 2195                 2173

After Patch        2298                   2172                 2111                 2044

 

 

Detailed data of 3 runs is attached with mail.

 

Observations:

1. For scale factor 1200, with 5G and 7G shared buffers:

   a. There is a reasonably good performance gain after the patch (>50%).

   b. However, the performance increase is not as good as the number of clients/threads increases.
      The reason could be that at higher client/thread counts there are other blocking factors (other LWLocks, I/O) that limit the benefit of moving buffers to the freelist.

2. For scale factor 1200, with 10G shared buffers:

   a. A performance increase is observed for the 8 clients / 8 threads reading.

   b. There is a performance dip (3~6%) from 16 clients onwards. The reasons could be:

      i.  With such a long buffer list, the bgwriter frequently taking BufFreelistLock (bgwriter_delay = 200ms) can add concurrency overhead that outweighs the benefit of getting a buffer from the freelist.

      ii. Sometimes it goes to free a buffer which is already in the freelist. This can also add a small overhead, as currently we need to take BufFreelistLock to check whether a buffer is in the freelist.

 

I will try to find more reasons for 2.b and work to resolve that performance dip.

 

Any suggestions will be really helpful to proceed and crack this problem.

 

With Regards,

Amit Kapila.

 

 

 

 




Attachment: move_unused_buffers_to_freelist_1.htm (42K)

Re: Move unused buffers to freelist

Robert Haas
In reply to this post by Greg Smith-21
On Thu, May 16, 2013 at 10:18 AM, Amit Kapila <[hidden email]> wrote:
> Further Performance Data:
>
> Below data is for average 3 runs of 20 minutes
>
> Scale Factor   - 1200
> Shared Buffers - 7G

These results are good but I don't get similar results in my own
testing.  I ran pgbench tests at a variety of client counts and scale
factors, using 30-minute test runs and the following non-default
configuration parameters.

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Here are the results.  The first field in each line is the number of
clients.  The second number is the scale factor.  The numbers after
"master" and "patched" are the median of three runs.

01 100 master 1433.297699 patched 1420.306088
01 300 master 1371.286876 patched 1368.910732
01 1000 master 1056.891901 patched 1067.341658
01 3000 master 637.312651 patched 685.205011
08 100 master 10575.017704 patched 11456.043638
08 300 master 9262.601107 patched 9120.925071
08 1000 master 1721.807658 patched 1800.733257
08 3000 master 819.694049 patched 854.333830
32 100 master 26981.677368 patched 27024.507600
32 300 master 14554.870871 patched 14778.285400
32 1000 master 1941.733251 patched 1990.248137
32 3000 master 846.654654 patched 892.554222

And here's the same results for 5-minute, read-only tests:

01 100 master 9361.073952 patched 9049.553997
01 300 master 8640.235680 patched 8646.590739
01 1000 master 8339.364026 patched 8342.799468
01 3000 master 7968.428287 patched 7882.121547
08 100 master 71311.491773 patched 71812.899492
08 300 master 69238.839225 patched 70063.632081
08 1000 master 34794.778567 patched 65998.468775
08 3000 master 60834.509571 patched 61165.998080
32 100 master 203168.264456 patched 205258.283852
32 300 master 199137.276025 patched 200391.633074
32 1000 master 177996.853496 patched 176365.732087
32 3000 master 149891.147442 patched 148683.269107

Something appears to have screwed up my results for 8 clients @ scale
factor 300 on master, but overall, on both the read-only and
read-write tests, I'm not seeing anything that resembles the big gains
you reported.

Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Move unused buffers to freelist

amitkapila
On Monday, May 20, 2013 6:54 PM Robert Haas wrote:

> On Thu, May 16, 2013 at 10:18 AM, Amit Kapila <[hidden email]>
> wrote:
> > Further Performance Data:
> >
> > Below data is for average 3 runs of 20 minutes
> >
> > Scale Factor   - 1200
> > Shared Buffers - 7G
>
> These results are good but I don't get similar results in my own
> testing.  

Thanks for running detailed tests

> I ran pgbench tests at a variety of client counts and scale
> factors, using 30-minute test runs and the following non-default
> configuration parameters.
>
> shared_buffers = 8GB
> maintenance_work_mem = 1GB
> synchronous_commit = off
> checkpoint_segments = 300
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
> log_line_prefix = '%t [%p] '
>
> Here are the results.  The first field in each line is the number of
> clients. The second number is the scale factor.  The numbers after
> "master" and "patched" are the median of three runs.
>
> 01 100 master 1433.297699 patched 1420.306088
> 01 300 master 1371.286876 patched 1368.910732
> 01 1000 master 1056.891901 patched 1067.341658
> 01 3000 master 637.312651 patched 685.205011
> 08 100 master 10575.017704 patched 11456.043638
> 08 300 master 9262.601107 patched 9120.925071
> 08 1000 master 1721.807658 patched 1800.733257
> 08 3000 master 819.694049 patched 854.333830
> 32 100 master 26981.677368 patched 27024.507600
> 32 300 master 14554.870871 patched 14778.285400
> 32 1000 master 1941.733251 patched 1990.248137
> 32 3000 master 846.654654 patched 892.554222


Is the above test for tpc-b?
In the above tests, there is a performance increase of 1~8% and a decrease
of 0.2~1.5%.

> And here's the same results for 5-minute, read-only tests:
>
> 01 100 master 9361.073952 patched 9049.553997
> 01 300 master 8640.235680 patched 8646.590739
> 01 1000 master 8339.364026 patched 8342.799468
> 01 3000 master 7968.428287 patched 7882.121547
> 08 100 master 71311.491773 patched 71812.899492
> 08 300 master 69238.839225 patched 70063.632081
> 08 1000 master 34794.778567 patched 65998.468775
> 08 3000 master 60834.509571 patched 61165.998080
> 32 100 master 203168.264456 patched 205258.283852
> 32 300 master 199137.276025 patched 200391.633074
> 32 1000 master 177996.853496 patched 176365.732087
> 32 3000 master 149891.147442 patched 148683.269107
>
> Something appears to have screwed up my results for 8 clients @ scale
> factor 300 on master,

  Do you mean the reading for the 1000 scale factor?
 
>but overall, on both the read-only and
> read-write tests, I'm not seeing anything that resembles the big gains
> you reported.

I have not generated numbers for read-write tests; I will check that once.
For read-only tests, the performance increase is minor and different from
what I saw.
A few points I could think of for the difference in data:

1. In my tests I always observed the best numbers when the number of clients/threads
equals the number of cores, which in your case should be 16.
2. I think for scale factors 100 and 300 there should not be much
performance increase, as for them backends should mostly get a buffer from
the freelist regardless of whether the bgwriter adds to the freelist or not.
3. In my tests the variation is in shared buffers; the database size is always less
than RAM (scale factor 1200, approx db size 16~17GB, RAM 24GB), but the
variation in shared buffers can lead to I/O.
4. Each run is of 20 minutes; not sure if this makes any difference.
 
> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
> PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.

To think about the difference between your runs and mine, could you please tell me
about the points below:
1. How much RAM is in the machine?
2. Is the number of threads equal to the number of clients?
3. Before starting the tests I have always pre-warmed the buffers (using
pg_prewarm, written by you last year); is that the same for the above read-only tests?
4. Could you please run once more only the test where you saw the variation (8
clients @ scale factor 1000 on master)? I have also seen that the
performance difference is very good for certain
configurations (scale factor, RAM, shared buffers).

Apart from the above, I had one more observation during my investigation into
why, in some cases, there is a small dip:
1. Many times, the buffer found in the freelist is not usable, meaning its
refcount or usage count is not zero, due to which more time is spent
under BufFreelistLock.
   I have not done any further experiments on this finding, e.g. whether it
really adds any overhead.

Currently I am trying to find the reasons for the small dip in performance and see
if I can do something to avoid it. I will also run tests with various
configurations.

Any other suggestions?

With Regards,
Amit Kapila.




Re: Move unused buffers to freelist

amitkapila
On Tuesday, May 21, 2013 12:36 PM Amit Kapila wrote:

> On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
> > On Thu, May 16, 2013 at 10:18 AM, Amit Kapila
> <[hidden email]>
> > wrote:
> > > Further Performance Data:
> > >
> > > Below data is for average 3 runs of 20 minutes
> > >
> > > Scale Factor   - 1200
> > > Shared Buffers - 7G
> >
> > These results are good but I don't get similar results in my own
> > testing.
>
> Thanks for running detailed tests
>
> > I ran pgbench tests at a variety of client counts and scale
> > factors, using 30-minute test runs and the following non-default
> > configuration parameters.
> >
> > shared_buffers = 8GB
> > maintenance_work_mem = 1GB
> > synchronous_commit = off
> > checkpoint_segments = 300
> > checkpoint_timeout = 15min
> > checkpoint_completion_target = 0.9
> > log_line_prefix = '%t [%p] '
> >
> > Here are the results.  The first field in each line is the number of
> > clients. The second number is the scale factor.  The numbers after
> > "master" and "patched" are the median of three runs.
> >
> > 01 100 master 1433.297699 patched 1420.306088
> > 01 300 master 1371.286876 patched 1368.910732
> > 01 1000 master 1056.891901 patched 1067.341658
> > 01 3000 master 637.312651 patched 685.205011
> > 08 100 master 10575.017704 patched 11456.043638
> > 08 300 master 9262.601107 patched 9120.925071
> > 08 1000 master 1721.807658 patched 1800.733257
> > 08 3000 master 819.694049 patched 854.333830
> > 32 100 master 26981.677368 patched 27024.507600
> > 32 300 master 14554.870871 patched 14778.285400
> > 32 1000 master 1941.733251 patched 1990.248137
> > 32 3000 master 846.654654 patched 892.554222
>
>
> Is the above test for tpc-b?
> In the above tests, there is performance increase from 1~8% and
> decrease
> from 0.2~1.5%
>
> > And here's the same results for 5-minute, read-only tests:
> >
> > 01 100 master 9361.073952 patched 9049.553997
> > 01 300 master 8640.235680 patched 8646.590739
> > 01 1000 master 8339.364026 patched 8342.799468
> > 01 3000 master 7968.428287 patched 7882.121547
> > 08 100 master 71311.491773 patched 71812.899492
> > 08 300 master 69238.839225 patched 70063.632081
> > 08 1000 master 34794.778567 patched 65998.468775
> > 08 3000 master 60834.509571 patched 61165.998080
> > 32 100 master 203168.264456 patched 205258.283852
> > 32 300 master 199137.276025 patched 200391.633074
> > 32 1000 master 177996.853496 patched 176365.732087
> > 32 3000 master 149891.147442 patched 148683.269107
> >
> > Something appears to have screwed up my results for 8 clients @ scale
> > factor 300 on master,
>
>   Do you want to say the reading of 1000 scale factor?
>
> >but overall, on both the read-only and
> > read-write tests, I'm not seeing anything that resembles the big
> gains
> > you reported.
>
> I have not generated numbers for read-write tests, I will check that
> once.
> For read-only tests, the performance increase is minor and different
> from
> what I saw.
> Few points which I could think of for difference in data:
>
> 1. In my test's I always observed best data when number of
> clients/threads
> are equal to number of cores which in your case should be at 16.
> 2. I think for scale factor 100 and 300, there should not be much
> performance increase, as for them they should mostly get buffer from
> freelist inspite of even bgwriter adds to freelist or not.
> 3. In my tests variance is for shared buffers, database size is always
> less
> than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB), but
> due
> to variance in shared buffers, it can lead to I/O.
> 4. Each run is of 20 minutes, not sure if this has any difference.
>
> > Tests were run on a 16-core, 64-hwthread PPC64 machine provided to
> the
> > PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.
>
> To think about the difference in your and my runs, could you please
> tell me
> about below points
> 1. What is RAM in machine.
> 2. Are number of threads equal to number of clients.
> 3. Before starting tests I have always done pre-warming of buffers
> (used
> pg_prewarm written by you last year), is it same for above read-only
> tests.
> 4. Can you please once again run only the test where you saw
> variation(8
> clients @ scale> factor 1000 on master), because I have also seen that
> performance difference is very good for certain
>    configurations(Scale Factor, RAM, Shared Buffers)

On looking more closely at the data you posted, I believe there is some
problem with that reading (8 clients @ scale factor 1000 on master), as in all
other cases the data for scale factor 1000 is better than for 3000, except
this one.
So I think there is no need to run it again.

> Apart from above, I had one more observation during my investigation to
> find
> why in some cases, there is a small dip:
> 1. Many times, it finds the buffer in free list is not usable, means
> it's
> refcount or usage count is not zero, due to which it had to spend more
> time
> under BufFreelistLock.
>    I had not any further experiments related to this finding like if it
> really adds any overhead.
>
> Currently I am trying to find reasons for small dip of performance and
> see
> if I could do something to avoid it. Also I will run tests with various
> configurations.
>
> Any other suggestions?




Re: Move unused buffers to freelist

Robert Haas
In reply to this post by Robert Haas
On Tue, May 21, 2013 at 3:06 AM, Amit Kapila <[hidden email]> wrote:

>> Here are the results.  The first field in each line is the number of
>> clients. The second number is the scale factor.  The numbers after
>> "master" and "patched" are the median of three runs.
>>
>> 01 100 master 1433.297699 patched 1420.306088
>> 01 300 master 1371.286876 patched 1368.910732
>> 01 1000 master 1056.891901 patched 1067.341658
>> 01 3000 master 637.312651 patched 685.205011
>> 08 100 master 10575.017704 patched 11456.043638
>> 08 300 master 9262.601107 patched 9120.925071
>> 08 1000 master 1721.807658 patched 1800.733257
>> 08 3000 master 819.694049 patched 854.333830
>> 32 100 master 26981.677368 patched 27024.507600
>> 32 300 master 14554.870871 patched 14778.285400
>> 32 1000 master 1941.733251 patched 1990.248137
>> 32 3000 master 846.654654 patched 892.554222
>
> Is the above test for tpc-b?
> In the above tests, there is performance increase from 1~8% and decrease
> from 0.2~1.5%

It's just the default pgbench workload.

>> And here's the same results for 5-minute, read-only tests:
>>
>> 01 100 master 9361.073952 patched 9049.553997
>> 01 300 master 8640.235680 patched 8646.590739
>> 01 1000 master 8339.364026 patched 8342.799468
>> 01 3000 master 7968.428287 patched 7882.121547
>> 08 100 master 71311.491773 patched 71812.899492
>> 08 300 master 69238.839225 patched 70063.632081
>> 08 1000 master 34794.778567 patched 65998.468775
>> 08 3000 master 60834.509571 patched 61165.998080
>> 32 100 master 203168.264456 patched 205258.283852
>> 32 300 master 199137.276025 patched 200391.633074
>> 32 1000 master 177996.853496 patched 176365.732087
>> 32 3000 master 149891.147442 patched 148683.269107
>>
>> Something appears to have screwed up my results for 8 clients @ scale
>> factor 300 on master,
>
>   Do you want to say the reading of 1000 scale factor?

Yes.

>>but overall, on both the read-only and
>> read-write tests, I'm not seeing anything that resembles the big gains
>> you reported.
>
> I have not generated numbers for read-write tests, I will check that once.
> For read-only tests, the performance increase is minor and different from
> what I saw.
> Few points which I could think of for difference in data:
>
> 1. In my test's I always observed best data when number of clients/threads
> are equal to number of cores which in your case should be at 16.

Sure, but you also showed substantial performance increases across a
variety of connection counts, whereas I'm seeing basically no change
at any connection count.

> 2. I think for scale factor 100 and 300, there should not be much
> performance increase, as for them they should mostly get buffer from
> freelist inspite of even bgwriter adds to freelist or not.

I agree.

> 3. In my tests variance is for shared buffers, database size is always less
> than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB), but due
> to variance in shared buffers, it can lead to I/O.

Not sure I understand this.

> 4. Each run is of 20 minutes, not sure if this has any difference.

I've found that 5-minute tests are normally adequate to identify
performance changes on the pgbench SELECT-only workload.

>> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
>> PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.
>
> To think about the difference in your and my runs, could you please tell me
> about below points
> 1. What is RAM in machine.

64GB

> 2. Are number of threads equal to number of clients.

Yes.

> 3. Before starting tests I have always done pre-warming of buffers (used
> pg_prewarm written by you last year), is it same for above read-only tests.

No, I did not use pg_prewarm.  But I don't think that should matter
very much.  First, the data was all in the OS cache.  Second, on the
small scale factors, everything should end up in cache pretty quickly
anyway.  And on the large scale factors, well, you're going to be
churning shared_buffers anyway, so pg_prewarm is only going to affect
the very beginning of the test.

> 4. Can you please once again run only the test where you saw variation(8
> clients @ scale> factor 1000 on master), because I have also seen that
> performance difference is very good for certain
>    configurations(Scale Factor, RAM, Shared Buffers)

I can do this if I get a chance, but I don't really see where that's
going to get us.  It seems pretty clear to me that there's no benefit
on these tests from this patch.  So either one of us is doing the
benchmarking incorrectly, or there's some difference in our test
environments that is significant, but none of the proposals you've
made so far seem to me to explain the difference.

> Apart from above, I had one more observation during my investigation to find
> why in some cases, there is a small dip:
> 1. Many times, it finds the buffer in free list is not usable, means it's
> refcount or usage count is not zero, due to which it had to spend more time
> under BufFreelistLock.
>    I had not any further experiments related to this finding like if it
> really adds any overhead.
>
> Currently I am trying to find reasons for small dip of performance and see
> if I could do something to avoid it. Also I will run tests with various
> configurations.
>
> Any other suggestions?

Well, I think that the code in SyncOneBuffer is not really optimal.
In some cases you actually lock and unlock the buffer header an extra
time, which seems like a whole lotta extra overhead.  In fact, I don't
think you should be modifying SyncOneBuffer() at all, because that
affects not only the background writer but also checkpoints.
Presumably it is not right to put every unused buffer on the free list
when we checkpoint.

Instead, I suggest modifying BgBufferSync, specifically this part right here:

        else if (buffer_state & BUF_REUSABLE)
            reusable_buffers++;

What I would suggest is that if the BUF_REUSABLE flag is set here, use
that as the trigger to do StrategyMoveBufferToFreeListEnd().  That's
much simpler than the logic that you have now, and I think it's also
more efficient and more correct.
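For illustration, a rough sketch of that suggestion (the surrounding branch is paraphrased from the scan loop in BgBufferSync() in bufmgr.c; buf_id is a hypothetical local holding the id of the buffer just passed to SyncOneBuffer(), and the StrategyMoveBufferToFreeListEnd() call is the proposed addition, not existing code):

if (buffer_state & BUF_WRITTEN)
{
    reusable_buffers++;
    if (++num_written >= bgwriter_lru_maxpages)
    {
        BgWriterStats.m_maxwritten_clean++;
        break;
    }
}
else if (buffer_state & BUF_REUSABLE)
{
    /*
     * Proposed change: the buffer was just seen clean with zero refcount
     * and usage_count, so also push it onto the tail of the freelist.
     */
    StrategyMoveBufferToFreeListEnd(&BufferDescriptors[buf_id]);
    reusable_buffers++;
}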

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Move unused buffers to freelist

Jim Nasby-2
In reply to this post by Greg Smith-21
On 5/14/13 2:13 PM, Greg Smith wrote:
> It is possible that we are told to put something in the freelist that
> is already in it; don't screw up the list if so.
>
> I don't see where the code does anything to handle that though.  What was your intention here?

IIRC, the code that pulls from the freelist already deals with the possibility that a block was on the freelist but has since been put to use. If that's the case then there shouldn't be much penalty to adding a block multiple times (at least within reason...)
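For reference, the check being referred to sits just after the part of StrategyGetBuffer() quoted earlier in the thread; approximately (from freelist.c of that era), once the buffer has been unconditionally removed from the freelist:

/*
 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 * it; discard it and retry with the next freelist entry.
 */
LockBufHdr(buf);
if (buf->refcount == 0 && buf->usage_count == 0)
{
    if (strategy != NULL)
        AddBufferToRing(strategy, buf);
    return buf;
}
UnlockBufHdr(buf);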
--
Jim C. Nasby, Data Architect                       [hidden email]
512.569.9461 (cell)                         http://jim.nasby.net



Re: Move unused buffers to freelist

amitkapila
In reply to this post by Robert Haas
On Thursday, May 23, 2013 8:45 PM Robert Haas wrote:

> On Tue, May 21, 2013 at 3:06 AM, Amit Kapila <[hidden email]>
> wrote:
> >> Here are the results.  The first field in each line is the number of
> >> clients. The second number is the scale factor.  The numbers after
> >> "master" and "patched" are the median of three runs.
>
> >>but overall, on both the read-only and
> >> read-write tests, I'm not seeing anything that resembles the big
> gains
> >> you reported.
> >
> > I have not generated numbers for read-write tests, I will check that
> once.
> > For read-only tests, the performance increase is minor and different
> from
> > what I saw.
> > Few points which I could think of for difference in data:
> >
> > 1. In my test's I always observed best data when number of
> clients/threads
> > are equal to number of cores which in your case should be at 16.
>
> Sure, but you also showed substantial performance increases across a
> variety of connection counts, whereas I'm seeing basically no change
> at any connection count.
> > 2. I think for scale factor 100 and 300, there should not be much
> > performance increase, as for them they should mostly get buffer from
> > freelist inspite of even bgwriter adds to freelist or not.
>
> I agree.
>
> > 3. In my tests variance is for shared buffers, database size is
> always less
> > than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB),
> but due
> > to variance in shared buffers, it can lead to I/O.
>
> Not sure I understand this.

What I wanted to say is that all your tests were on the same shared_buffers
configuration (8GB), whereas in my tests I was also trying to vary shared buffers.
However, this is not an important point, as the patch should show a performance gain
on the configuration you ran if it has any real benefit.
 

> > 4. Each run is of 20 minutes, not sure if this has any difference.
>
> I've found that 5-minute tests are normally adequate to identify
> performance changes on the pgbench SELECT-only workload.
>
> >> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to
> the
> >> PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel
> 3.2.6.
> >
> > To think about the difference in your and my runs, could you please
> tell me
> > about below points
> > 1. What is RAM in machine.
>
> 64GB
>
> > 2. Are number of threads equal to number of clients.
>
> Yes.
>
> > 3. Before starting tests I have always done pre-warming of buffers
> (used
> > pg_prewarm written by you last year), is it same for above read-only
> tests.
>
> No, I did not use pg_prewarm.  But I don't think that should matter
> very much.  First, the data was all in the OS cache.  Second, on the
> small scale factors, everything should end up in cache pretty quickly
> anyway.  And on the large scale factors, well, you're going to be
> churning shared_buffers anyway, so pg_prewarm is only going to affect
> the very beginning of the test.
>
> > 4. Can you please once again run only the test where you saw
> variation(8
> > clients @ scale> factor 1000 on master), because I have also seen
> that
> > performance difference is very good for certain
> >    configurations(Scale Factor, RAM, Shared Buffers)
>
> I can do this if I get a chance, but I don't really see where that's
> going to get us.  It seems pretty clear to me that there's no benefit
> on these tests from this patch.  So either one of us is doing the
> benchmarking incorrectly, or there's some difference in our test
> environments that is significant, but none of the proposals you've
> made so far seem to me to explain the difference.

Sorry for asking you to run again without any concrete point.
After reading the data you posted more carefully, I realized that the reading was
probably just some machine problem or something else, and that there is actually no gain.
After your post, I tried various configurations on a different machine,
but so far I am not able to see the performance gain shown in my
initial mail.
In fact I tried on the same machine as well; it sometimes gives good data. I will
update you if I find a concrete reason and results.

> > Apart from above, I had one more observation during my investigation
> to find
> > why in some cases, there is a small dip:
> > 1. Many times, it finds the buffer in free list is not usable, means
> it's
> > refcount or usage count is not zero, due to which it had to spend
> more time
> > under BufFreelistLock.
> >    I had not any further experiments related to this finding like if
> it
> > really adds any overhead.
> >
> > Currently I am trying to find reasons for small dip of performance
> and see
> > if I could do something to avoid it. Also I will run tests with
> various
> > configurations.
> >
> > Any other suggestions?
>
> Well, I think that the code in SyncOneBuffer is not really optimal.
> In some cases you actually lock and unlock the buffer header an extra
> time, which seems like a whole lotta extra overhead.  In fact, I don't
> think you should be modifying SyncOneBuffer() at all, because that
> affects not only the background writer but also checkpoints.
> Presumably it is not right to put every unused buffer on the free list
> when we checkpoint.
>
> Instead, I suggest modifying BgBufferSync, specifically this part right
> here:
>
>         else if (buffer_state & BUF_REUSABLE)
>             reusable_buffers++;
>
> What I would suggest is that if the BUF_REUSABLE flag is set here, use
> that as the trigger to do StrategyMoveBufferToFreeListEnd().  

I think at this point we also need to lock the buffer header to check refcount
and usage_count before moving the buffer to the freelist, or do you think it is
not required?

> That's
> much simpler than the logic that you have now, and I think it's also
> more efficient and more correct.

Sure, I will try the logic suggested by you.

With Regards,
Amit Kapila.




Re: Move unused buffers to freelist

amitkapila
In reply to this post by Jim Nasby-2
On Friday, May 24, 2013 2:47 AM Jim Nasby wrote:

> On 5/14/13 2:13 PM, Greg Smith wrote:
> > It is possible that we are told to put something in the freelist that
> > is already in it; don't screw up the list if so.
> >
> > I don't see where the code does anything to handle that though.  What
> was your intention here?
>
> IIRC, the code that pulls from the freelist already deals with the
> possibility that a block was on the freelist but has since been put to
> use.

You are right, the check exists in StrategyGetBuffer()

>If that's the case then there shouldn't be much penalty to adding
> a block multiple times (at least within reason...)

There is a check in StrategyFreeBuffer() which will not allow a buffer to be put in
multiple times;
I have just used the same check in the new function.

With Regards,
Amit Kapila.




Re: Move unused buffers to freelist

Jim Nasby-2
In reply to this post by amitkapila
On 5/14/13 8:42 AM, Amit Kapila wrote:
> In the attached patch, bgwriter/checkpointer moves unused (usage_count =0 && refcount = 0) buffer’s to end of freelist. I have implemented a new API StrategyMoveBufferToFreeListEnd() to
>
> move buffer’s to end of freelist.
>

Instead of a separate function, would it be better to add an argument to StrategyFreeBuffer? ISTM this is similar to the other strategy stuff in the buffer manager, so perhaps it should mirror that...
--
Jim C. Nasby, Data Architect                       [hidden email]
512.569.9461 (cell)                         http://jim.nasby.net



Re: Move unused buffers to freelist

amitkapila
On Friday, May 24, 2013 8:22 PM Jim Nasby wrote:
On 5/14/13 8:42 AM, Amit Kapila wrote:
>> In the attached patch, bgwriter/checkpointer moves unused (usage_count =0 && refcount = 0) buffer’s to end of freelist. I have implemented a new API StrategyMoveBufferToFreeListEnd() to
>>
>> move buffer’s to end of freelist.
>>

> Instead of a separate function, would it be better to add an argument to StrategyFreeBuffer?

  Yes, it could be done with a parameter deciding whether to put the buffer at the head or the tail of the freelist.
  However, currently the main focus is to check in which cases this optimization can give a benefit.
  Robert ran tests for quite a number of cases where it doesn't show any significant gain.
  I am also trying various configurations to see if it gives any benefit.
  Robert has given some suggestions to change the way the new function is currently getting called;
  I will try them and update with the results.

  I am not very sure that default pgbench is a good scenario to test this optimization.
  If you have any suggestions for tests where it can show a benefit, that would be great input.

> ISTM this is similar to the other strategy stuff in the buffer manager, so perhaps it should mirror that...

With Regards,
Amit Kapila.


Re: Move unused buffers to freelist

Robert Haas
In reply to this post by Robert Haas
>> Instead, I suggest modifying BgBufferSync, specifically this part right
>> here:
>>
>>         else if (buffer_state & BUF_REUSABLE)
>>             reusable_buffers++;
>>
>> What I would suggest is that if the BUF_REUSABLE flag is set here, use
>> that as the trigger to do StrategyMoveBufferToFreeListEnd().
>
> I think at this point also we need to lock buffer header to check refcount
> and usage_count before moving to freelist, or do you think it is not
> required?

If BUF_REUSABLE is set, that means we just did exactly what you're
saying.  Why do it twice?
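For reference, BUF_REUSABLE is set in SyncOneBuffer() only after that exact check has been made under the buffer header spinlock; approximately (from bufmgr.c of that era):

LockBufHdr(bufHdr);

if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
    result |= BUF_REUSABLE;
else if (skip_recently_used)
{
    /* Caller told us not to write recently-used buffers */
    UnlockBufHdr(bufHdr);
    return result;
}

if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
{
    /* It's clean, so nothing to do */
    UnlockBufHdr(bufHdr);
    return result;
}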

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Move unused buffers to freelist

amitkapila
On Tuesday, May 28, 2013 6:54 PM Robert Haas wrote:

> >> Instead, I suggest modifying BgBufferSync, specifically this part
> right
> >> here:
> >>
> >>         else if (buffer_state & BUF_REUSABLE)
> >>             reusable_buffers++;
> >>
> >> What I would suggest is that if the BUF_REUSABLE flag is set here,
> use
> >> that as the trigger to do StrategyMoveBufferToFreeListEnd().
> >
> > I think at this point also we need to lock buffer header to check
> refcount
> > and usage_count before moving to freelist, or do you think it is not
> > required?
>
> If BUF_REUSABLE is set, that means we just did exactly what you're
> saying.  Why do it twice?
Even though we just did it, we have since released the buffer header lock, so
theoretically there is a chance that a backend can increase the count; however,
it will still be protected by the check in StrategyGetBuffer(). As the chance of
this is very rare, doing it without the buffer header lock might not cause
any harm.
A modified patch addressing this is attached with this mail.

Performance Data
-------------------

As far as I have noticed, performance data for this patch depends on 3
factors:
1. Pre-loading of data into buffers, so that buffers holding pages have
some usage count before running pgbench.
   The reason is that this can make a difference to the performance of the clock sweep.
2. Clearing of pages in the OS cache before running pgbench with each
patch; this can make a difference because when we run pgbench with or without
the patch, it can access pages already cached from previous runs, which causes
variation in performance.
3. Scale factor and shared buffer configuration.

To avoid the above 3 factors in the test readings, I used the steps below:
1. Initialize the database with a scale factor such that database size +
shared_buffers = RAM (shared_buffers = 1/4 of RAM).
   For example:
   Example 1:
                if RAM = 128GB, then initialize the db with scale factor = 6700
and shared_buffers = 32GB.
                Database size (98GB) + shared_buffers (32GB) = 130GB (which
is approximately equal to total RAM).
   Example 2 (this is based on your test machine):
                if RAM = 64GB, then initialize the db with scale factor = 3400
and shared_buffers = 16GB.
2. Reboot the machine.
3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
I loaded 3 times, so that the usage count of buffers will be approximately 3.
   Used the file load_all_buffers.sql attached with this mail.
4. Run the pgbench select-only case 3 times for 10 or 15 minutes without the patch.
5. Reboot the machine.
6. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
I loaded 3 times, so that the usage count of buffers will be approximately 3.
   Used the file load_all_buffers.sql attached with this mail.
7. Run the pgbench select-only case 3 times for 10 or 15 minutes with the patch.

Using the above steps, I took performance data on 2 different machines.

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
synchronous_commit = off, shared_buffers = 32GB, AutoVacuum=off
Pgbench - Select-only
Scalefactor - 1200
Time - Each run is of 15 mins

Below data is for average of 3 runs

                   16C-16T            32C-32T            64C-64T
HEAD               4391                3971                 3464
After Patch        6147                5093                 3944

Detailed data of each run is attached with mail in file
move_unused_buffers_to_freelist_v2.htm

Below data is for 1 run of half an hour on the same configuration:

                   16C-16T            32C-32T            64C-64T
HEAD               4377                3861                 3295
After Patch        6542                4770                 3504


Configuration Details
O/S - Suse-11
RAM - 24GB
Number of Cores - 8
Server Conf - checkpoint_segments = 256; checkpoint_timeout = 25 min,
synchronous_commit = off, shared_buffers = 5GB
Pgbench - Select-only
Scalefactor - 1200
Time - Each run is of 10 mins

Below data is the average of 3 runs of 10 minutes each:

                   8C-8T      16C-16T    32C-32T    64C-64T    128C-128T   256C-256T
HEAD               58837      56740      19390      5681       3191        2160
After Patch        59482      56936      25070      7655       4166        2704

Detailed data of each run is attached with mail in file
move_unused_buffers_to_freelist_v2.htm


Below data is for 1 run of half an hour on the same configuration:

                   32C-32T
HEAD               17703
After Patch        20586

I have run these tests multiple times to ensure correctness. I think the reason
it didn't show a performance improvement in your runs last time is
that the way we are each running pgbench is different. This time, I have
detailed the steps I used to collect the performance data.


With Regards,
Amit Kapila.



Attachments: move_unused_buffers_to_freelist_v2.patch (4K), move_unused_buffers_to_freelist_v2.htm (41K), load_all_buffers.sql (1K)

Re: Move unused buffers to freelist

Robert Haas
In reply to this post by Robert Haas
On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila <[hidden email]> wrote:

> To avoid above 3 factors in test readings, I used below steps:
> 1. Initialize the database with scale factor such that database size +
> shared_buffers = RAM (shared_buffers = 1/4 of RAM).
>    For example:
>    Example -1
>                 if RAM = 128G, then initialize db with scale factor = 6700
> and shared_buffers = 32GB.
>                 Database size (98 GB) + shared_buffers (32GB) = 130 (which
> is approximately equal to total RAM)
>    Example -2 (this is based on your test m/c)
>                 If RAM = 64GB, then initialize db with scale factor = 3400
> and shared_buffers = 16GB.
> 2. reboot m/c
> 3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
> I had loaded 3 times, so that usage count of buffers will be approximately
> 3.

Hmm.  I don't think the usage count will actually end up being 3,
though, because the amount of data you're loading is sized to 3/4 of
RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
of pg_prewarm will end up turning over the entire cache and you'll
never get any usage counts more than 1 this way.  Am I confused?

I wonder if it would be beneficial to test the case where the database
size is just a little more than shared_buffers.  I think that would
lead to a situation where the usage counts are high most of the time,
which - now that you mention it - seems like the sweet spot for this
patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Move unused buffers to freelist

amitkapila
On Monday, June 24, 2013 11:00 PM Robert Haas wrote:

> On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila <[hidden email]>
> wrote:
> > To avoid above 3 factors in test readings, I used below steps:
> > 1. Initialize the database with scale factor such that database size
> +
> > shared_buffers = RAM (shared_buffers = 1/4 of RAM).
> >    For example:
> >    Example -1
> >                 if RAM = 128G, then initialize db with scale factor =
> 6700
> > and shared_buffers = 32GB.
> >                 Database size (98 GB) + shared_buffers (32GB) = 130
> (which
> > is approximately equal to total RAM)
> >    Example -2 (this is based on your test m/c)
> >                 If RAM = 64GB, then initialize db with scale factor =
> 3400
> > and shared_buffers = 16GB.
> > 2. reboot m/c
> > 3. Load all buffers with data (tables/indexes of pgbench) using
> pg_prewarm.
> > I had loaded 3 times, so that usage count of buffers will be
> approximately
> > 3.
>
> Hmm.  I don't think the usage count will actually end up being 3,
> though, because the amount of data you're loading is sized to 3/4 of
> RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
> of pg_prewarm will end up turning over the entire cache and you'll
> never get any usage counts more than 1 this way.  Am I confused?

The way I am pre-warming is by loading the data of the relations (tables/indexes)
3 times in a row, so the buffers will mostly contain the data of the
relations loaded last,
which are the indexes, and those also get accessed more during scans. So the usage
count should be 3.
Could you please take a look at load_all_buffers.sql once; maybe my understanding has
some gap.

Now, about the question of why to load all the relations at all:
apart from PostgreSQL shared buffers, loading data this way can also
make sure the OS buffers have the data with a higher usage count, which can
lead to better OS scheduling.

> I wonder if it would be beneficial to test the case where the database
> size is just a little more than shared_buffers.  I think that would
> lead to a situation where the usage counts are high most of the time,
> which - now that you mention it - seems like the sweet spot for this
> patch.

I will check this case and take readings for it. Thanks for your
suggestions.

With Regards,
Amit Kapila.




Re: Move unused buffers to freelist

amitkapila
On Tuesday, June 25, 2013 10:25 AM Amit Kapila wrote:

> On Monday, June 24, 2013 11:00 PM Robert Haas wrote:
> > On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila <[hidden email]>
> > wrote:
> > > To avoid above 3 factors in test readings, I used below steps:
> > > 1. Initialize the database with scale factor such that database
> size
> > +
> > > shared_buffers = RAM (shared_buffers = 1/4 of RAM).
> > >    For example:
> > >    Example -1
> > >                 if RAM = 128G, then initialize db with scale factor
> =
> > 6700
> > > and shared_buffers = 32GB.
> > >                 Database size (98 GB) + shared_buffers (32GB) = 130
> > (which
> > > is approximately equal to total RAM)
> > >    Example -2 (this is based on your test m/c)
> > >                 If RAM = 64GB, then initialize db with scale factor
> =
> > 3400
> > > and shared_buffers = 16GB.
> > > 2. reboot m/c
> > > 3. Load all buffers with data (tables/indexes of pgbench) using
> > pg_prewarm.
> > > I had loaded 3 times, so that usage count of buffers will be
> > approximately
> > > 3.
> >
> > Hmm.  I don't think the usage count will actually end up being 3,
> > though, because the amount of data you're loading is sized to 3/4 of
> > RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
> > of pg_prewarm will end up turning over the entire cache and you'll
> > never get any usage counts more than 1 this way.  Am I confused?
>
> The way I am pre-warming is that loading the data of relation
> (table/index)
> continuously 3 times, so mostly the buffers will contain the data of
> relations loaded in last
> which are indexes and also they got accessed more during scans. So
> usage
> count should be 3.
> Can you please once see load_all_buffers.sql, may be my understanding
> has
> some gap.
>
> Now about the question why then load all the relations.
> Apart from PostgreSQL shared buffers, loading data this way can also
> make sure OS buffers will have the data with higher usage count which
> can
> lead to better OS scheduling.
>
> > I wonder if it would be beneficial to test the case where the
> database
> > size is just a little more than shared_buffers.  I think that would
> > lead to a situation where the usage counts are high most of the time,
> > which - now that you mention it - seems like the sweet spot for this
> > patch.
>
> I will check this case and take readings for it. Thanks for your
> suggestions.

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
synchronous_commit = OFF, shared_buffers = 14GB, AutoVacuum = off
Pgbench - Select-only
Scalefactor - 1200
Time - 30 mins

          8C-8T                16C-16T        32C-32T        64C-64T
Head       62403                101810         99516          94707
Patch      62827                101404         99109          94744

On 128GB RAM, if we use scalefactor=1200 (database = approx. 17GB) and 14GB
shared buffers, there is no major difference.
One of the reasons could be that there is not much swapping in shared
buffers, as most of the data already fits in shared buffers.


I think more readings are needed for combinations based on the setting below:
a scale factor such that database size + shared_buffers = RAM (shared_buffers
= 1/4 of RAM).

I can also try varying the shared_buffers size.

Kindly let me know your suggestions.

With Regards,
Amit Kapila.



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Move unused buffers to freelist

Robert Haas
In reply to this post by Robert Haas
On Wed, Jun 26, 2013 at 8:09 AM, Amit Kapila <[hidden email]> wrote:

> Configuration Details
> O/S - Suse-11
> RAM - 128GB
> Number of Cores - 16
> Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
> synchronous_commit = OFF, shared_buffers = 14GB, AutoVacuum = off
> Pgbench - Select-only
> Scalefactor - 1200
> Time - 30 mins
>
>              8C-8T                16C-16T        32C-32T        64C-64T
> Head       62403                101810         99516          94707
> Patch      62827                101404         99109          94744
>
> On 128GB RAM, if we use scalefactor=1200 (database = approx. 17GB) and 14GB
> shared buffers, there is no major difference.
> One of the reasons could be that there is not much swapping in shared
> buffers, as most of the data already fits in shared buffers.

I'd like to just back up a minute here and talk about the broader
picture here.  What are we trying to accomplish with this patch?  Last
year, I did some benchmarking on a big IBM POWER7 machine (16 cores,
64 hardware threads).  Here are the results:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

Now, if you look at these results, you see something interesting.
When there aren't too many concurrent connections, the higher scale
factors are only modestly slower than the lower scale factors.  But as
the number of connections increases, the performance continues to rise
at the lower scale factors, and at the higher scale factors, this
performance stops rising and in fact drops off.  So in other words,
there's no huge *performance* problem for a working set larger than
shared_buffers, but there is a huge *scalability* problem.  Now why is
that?

As far as I can tell, the answer is that we've got a scalability
problem around BufFreelistLock.  Contention on the buffer mapping
locks may also be a problem, but all of my previous benchmarking (with
LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
in the room.  My interest in having the background writer add buffers
to the free list is basically around solving that problem.  It's a
pretty dramatic problem, as the graph above shows, and this patch
doesn't solve it.  There may be corner cases where this patch improves
things (or, equally, makes them worse) but as a general point, the
difficulty I've had reproducing your test results and the specificity
of your instructions for reproducing them suggests to me that what we
have here is not a clear improvement on general workloads.  Yet such
an improvement should exist, because there are other products in the
world that have scalable buffer managers; we currently don't.  Instead
of spending a lot of time trying to figure out whether there's a small
win in narrow cases here (and there may well be), I think we should
back up and ask why this isn't a great big win, and what we'd need to
do to *get* a great big win.  I don't see much point in tinkering
around the edges here if things are broken in the middle; things that
seem like small wins or losses now may turn out otherwise in the face
of a more comprehensive solution.

One thing that occurred to me while writing this note is that the
background writer doesn't have any compelling reason to run on a
read-only workload.  It will still run at a certain minimum rate, so
that it cycles the buffer pool every 2 minutes, if I remember
correctly.  But it won't run anywhere near fast enough to keep up with
the buffer allocation demands of 8, or 32, or 64 sessions all reading,
at top speed, data that doesn't all fit in shared_buffers.  In fact,
we've had reports that the background writer isn't too effective even
on read-write workloads.  The point is - if the background writer
isn't waking up and running frequently enough, what it does when it
does wake up isn't going to matter very much.  I think we need to
spend some energy poking at that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Move unused buffers to freelist

Andres Freund-3
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
> I'd like to just back up a minute here and talk about the broader
> picture here.

Sounds like a very good plan.

> So in other words,
> there's no huge *performance* problem for a working set larger than
> shared_buffers, but there is a huge *scalability* problem.  Now why is
> that?

> As far as I can tell, the answer is that we've got a scalability
> problem around BufFreelistLock.

Part of the problem is its name ;)

> Contention on the buffer mapping
> locks may also be a problem, but all of my previous benchmarking (with
> LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
> in the room.

Contention-wise I agree. What I have seen is that we have a huge
amount of cacheline bouncing around the buffer header spinlocks.

> My interest in having the background writer add buffers
> to the free list is basically around solving that problem.  It's a
> pretty dramatic problem, as the graph above shows, and this patch
> doesn't solve it.

> One thing that occurred to me while writing this note is that the
> background writer doesn't have any compelling reason to run on a
> read-only workload.  It will still run at a certain minimum rate, so
> that it cycles the buffer pool every 2 minutes, if I remember
> correctly.

I have previously added some ad hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the buffer manager could
actually write out. I don't think I found any workload where the
bgwriter actually wrote out a relevant percentage of the necessary
pages.
That would explain why the patch doesn't have a big benefit: the
freelist is empty most of the time, so we don't benefit from the reduced
work done under the lock.
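
For what it's worth, instrumentation of that sort can be as small as one
extra debug message per cycle. A minimal sketch, assuming it sits at the
end of BgBufferSync() in bufmgr.c and assuming the recent_alloc /
num_written bookkeeping that function keeps (variable names recalled from
memory; they may differ by version):

    /*
     * Hypothetical debug output per bgwriter cycle: recent_alloc is the
     * number of buffers backends allocated since the previous cycle (as
     * reported by StrategySyncStart()), num_written is how many buffers
     * this cycle's cleaning scan actually wrote out.
     */
    elog(DEBUG1,
         "bgwriter cycle: backends allocated %u buffers, bgwriter wrote %d",
         recent_alloc, num_written);

Comparing those two numbers per cycle shows directly whether the bgwriter
keeps up with allocation demand at all.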

I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are actually able to measure the
benefit from this patch. I personally don't think there's much to
salvage from the current code.

Problems with the current code:

* doesn't manipulate the usage_count and never does anything to used
  pages, which means it will just about never find a victim buffer in a
  busy database.
* by far not aggressive enough; touches only a few buffers ahead of the
  clock sweep.
* does not advance the clock sweep, so the individual backends will
  touch the same buffers again and bounce all the buffer spinlock
  cachelines around.
* the adaptation logic it has is so slow that it takes several minutes
  to adapt.
* ...


There's another thing we could do to noticeably improve scalability of
buffer acquisition. Currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
...
    /* check the freelist first; it will usually be empty */
...
    /* clock sweep; note that BufFreelistLock stays held across this whole loop */
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

        ++StrategyControl->nextVictimBuffer;   /* (wraparound handling elided) */

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                buf->usage_count--;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        UnlockBufHdr(buf);
    }

So, we perform the entire clock sweep, until we find a single buffer we
can use, inside a *global* lock. At times we need to iterate over the
whole of shared buffers BM_MAX_USAGE_COUNT (5) times until we have pushed
all the usage counts down enough (and if the database is busy it can take
even longer...).
In a busy database, where usually all the usage counts are high, the next
backend will touch a lot of those buffers again, which causes massive
cache eviction & bouncing.

It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.
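
To make that concrete, here is a small self-contained toy model (plain C11,
not PostgreSQL code; every name in it is made up, and it uses an atomic
fetch-and-add where the above says spinlock, which comes to the same thing
for a single counter). The only shared state is the clock hand; all the
refcount/usage_count checks happen under the per-buffer header lock:

    #include <stdatomic.h>
    #include <stdio.h>

    #define NBUFFERS        1024
    #define MAX_USAGE_COUNT 5

    typedef struct ToyBufferDesc
    {
        atomic_int  hdr_lock;       /* stand-in for the buffer header spinlock */
        int         refcount;
        int         usage_count;
    } ToyBufferDesc;

    static ToyBufferDesc buffers[NBUFFERS];
    static atomic_uint next_victim;     /* the clock hand */

    static void
    lock_hdr(ToyBufferDesc *buf)
    {
        while (atomic_exchange(&buf->hdr_lock, 1))
            ;                           /* spin until we hold the header lock */
    }

    static void
    unlock_hdr(ToyBufferDesc *buf)
    {
        atomic_store(&buf->hdr_lock, 0);
    }

    /* Clock sweep with no global lock held across the loop. */
    static ToyBufferDesc *
    get_victim_buffer(void)
    {
        for (;;)
        {
            unsigned    v = atomic_fetch_add(&next_victim, 1) % NBUFFERS;
            ToyBufferDesc *buf = &buffers[v];

            lock_hdr(buf);
            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                    buf->usage_count--;     /* age it and keep sweeping */
                else
                    return buf;             /* usable; caller unlocks it */
            }
            unlock_hdr(buf);
        }
    }

    int
    main(void)
    {
        /* spread of usage counts so the sweep has some aging to do */
        for (int i = 0; i < NBUFFERS; i++)
            buffers[i].usage_count = i % (MAX_USAGE_COUNT + 1);

        ToyBufferDesc *victim = get_victim_buffer();

        printf("victim buffer index: %ld\n", (long) (victim - buffers));
        unlock_hdr(victim);
        return 0;
    }

With a shape like that, a buffer allocation in the common case touches the
clock-hand cache line plus one buffer header per probe, instead of
serializing every allocation behind BufFreelistLock.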

I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense, because then backends don't need to deal with dirty pages.

Greetings,

Andres Freund

--
 Andres Freund                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
12