track_planning causing performance regression


track_planning causing performance regression

Tharakan, Robins
Hi,

During a fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
a ~45% performance drop [2] at high DB connection counts (when compared with v12.3).

Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier, possibly because the regression is trivial at low connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as not to trip users upgrading from v12 with pg_stat_statements
enabled (but otherwise not particularly interested in track_planning).

These are some details around the above test:

pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 10000
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB


Reference:
1) https://www.postgresql.org/message-id/1554150919882-0.post%40n3.nabble.com

2) Fully-cached SELECT-only TPS drops at >= 128 connections.

Conn      v12.3          v13Beta1        v13Beta1 (track_planning=off)
1         6,764          6,734            6,905
2         14,978         14,961           15,316
4         31,641         32,012           36,961
8         71,989         68,848           69,204
16        129,056        131,157          132,773
32        231,910        226,718          253,316
64        381,778        371,782          385,402
128       534,661  ====> 353,944          539,231
256       636,794  ====> 248,825          643,631
512       574,447  ====> 213,033          555,099
768       493,912  ====> 214,801          502,014
1024      484,993  ====> 222,492          490,716
1280      480,571  ====> 223,296          483,843
1536      475,030  ====> 228,137          477,153
1792      472,145  ====> 229,027          474,423
2048      471,385  ====> 228,665          470,238


3) perf - v13Beta1

-   88.38%     0.17%  postgres      postgres               [.] PostgresMain
   - 88.21% PostgresMain                    
      - 80.09% exec_simple_query            
         - 25.34% pg_plan_queries          
            - 25.28% pg_plan_query          
               - 25.21% pgss_planner        
                  - 14.36% pgss_store      
                     + 13.54% s_lock        
                  + 10.71% standard_planner
         + 18.29% PortalRun                
         - 15.12% PortalDrop                
            - 14.73% PortalCleanup          
               - 13.78% pgss_ExecutorEnd    
                  - 13.72% pgss_store      
                     + 12.83% s_lock        
                 0.72% standard_ExecutorEnd
         + 6.18% PortalStart                
         + 4.86% pg_analyze_and_rewrite    
         + 3.52% GetTransactionSnapshot    
         + 2.56% pg_parse_query            
         + 1.83% finish_xact_command        
           0.51% start_xact_command        
      + 3.93% pq_getbyte                    
      + 3.40% ReadyForQuery                



4) perf - v12.3

-   84.32%     0.21%  postgres      postgres               [.] PostgresMain
   - 84.11% PostgresMain                    
      - 72.56% exec_simple_query            
         + 26.71% PortalRun                
         - 15.33% pg_plan_queries          
            - 15.29% pg_plan_query          
               + 15.21% standard_planner    
         + 7.81% PortalStart                
         + 6.76% pg_analyze_and_rewrite    
         + 4.37% GetTransactionSnapshot    
         + 3.69% pg_parse_query            
         - 2.96% PortalDrop                
            - 2.42% PortalCleanup          
               - 1.35% pgss_ExecutorEnd    
                  - 1.22% pgss_store        
                       0.57% s_lock        
                 0.77% standard_ExecutorEnd
         + 2.16% finish_xact_command        
         + 0.78% start_xact_command        
         + 0.59% pg_rewrite_query          
      + 5.67% pq_getbyte                    
      + 4.73% ReadyForQuery

-
robins



Re: track_planning causing performance regression

Julien Rouhaud
Hi,

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <[hidden email]> wrote:

>
> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
>
> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> brings the TPS numbers up to v12.3 levels.
>
> The inflection point (in this test-case) is 128 Connections, beyond which the
> TPS numbers are consistently low. Looking at the mailing list [1], this issue
> didn't surface earlier possibly since the regression is trivial at low connection counts.
>
> It would be great if this could be optimized further, or track_planning
> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
> enabled (but otherwise not particularly interested in track_planning).
>
> These are some details around the above test:
>
> pgbench: scale - 100 / threads - 16
> test-duration - 30s each
> server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
> client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
> v12 - REL_12_STABLE (v12.3)
> v13Beta1 - REL_13_STABLE (v13Beta1)
> max_connections = 10000
> shared_preload_libraries = 'pg_stat_statements'
> shared_buffers 128MB

I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.

I disagree with the conclusion though.  It seems to me that if you
really have a workload that consists of these few queries and want
to get better performance, you'll use a connection pooler
and/or prepared statements anyway, which will make this overhead
disappear entirely, and will also yield an even bigger performance
improvement.  A quick test using pgbench -M prepared, with
track_planning enabled and still way too many connections, already
shows a 25% improvement over -M simple without track_planning.



Re: track_planning causing performance regression

Fujii Masao-4


On 2020/06/29 16:05, Julien Rouhaud wrote:
> Hi,
>
> On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <[hidden email]> wrote:
>>
>> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows

Thanks for the benchmark!


>> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)

That's bad :(


>>
>> Disabling pg_stat_statements.track_planning (which is 'On' by default)
>> brings the TPS numbers up to v12.3 levels.
>>
>> The inflection point (in this test-case) is 128 Connections, beyond which the
>> TPS numbers are consistently low. Looking at the mailing list [1], this issue
>> didn't surface earlier possibly since the regression is trivial at low connection counts.
>>
>> It would be great if this could be optimized further, or track_planning
>> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
>> enabled (but otherwise not particularly interested in track_planning).

Your benchmark result seems to suggest that the cause of the problem is
contention on the per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate the per-query spinlock
into two: one for planning, and the other for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, every pgss counter like shared_blks_hit also
needs to be separated into two, i.e., for planning and execution.
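
A rough sketch of how the entry could be laid out under that idea (illustrative
only; the field and enum names below are assumptions, not taken from an actual
patch):

typedef enum pgssStoreKind
{
    PGSS_PLAN = 0,              /* pgss_store() called from the planner hook */
    PGSS_EXEC,                  /* pgss_store() called from the executor hooks */
    PGSS_NUMKIND
} pgssStoreKind;

typedef struct Counters
{
    int64       calls;          /* # of times planned or executed */
    double      total_time;     /* total planning or execution time, in msec */
    int64       shared_blks_hit;    /* ... and every other counter, duplicated
                                     * per kind */
} Counters;

typedef struct pgssEntry
{
    pgssHashKey key;            /* hash key of entry - MUST BE FIRST */
    Counters    counters[PGSS_NUMKIND]; /* one set of counters per kind */
    slock_t     mutex[PGSS_NUMKIND];    /* one spinlock per kind, so planner
                                         * and executor callers no longer
                                         * contend on the same lock */
    /* query_offset, query_len, encoding as today */
} pgssEntry;

    /* in pgss_store(), with "kind" passed in by the calling hook */
    SpinLockAcquire(&entry->mutex[kind]);
    entry->counters[kind].calls += 1;
    entry->counters[kind].total_time += total_time;
    /* ... */
    SpinLockRelease(&entry->mutex[kind]);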


>> These are some details around the above test:
>>
>> pgbench: scale - 100 / threads - 16
>> test-duration - 30s each
>> server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
>> client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB server - Same AZ)
>> v12 - REL_12_STABLE (v12.3)
>> v13Beta1 - REL_13_STABLE (v13Beta1)
>> max_connections = 10000
>> shared_preload_libraries = 'pg_stat_statements'
>> shared_buffers 128MB
>
> I can't reproduce this on my laptop, but I can certainly believe that
> running the same 3 queries using more connections than available cores
> will lead to extra overhead.
>
> I disagree with the conclusion though.  It seems to me that if you
> really have this workload that consists in these few queries and want
> to get better performance, you'll anyway use a connection pooler
> and/or use prepared statements, which will make this overhead
> disappear entirely, and will also yield an even bigger performance
> improvement.  A quick test using pgbench -M prepared, with
> track_planning enabled, with still way too many connections already
> shows a 25% improvement over the -M simple without track_planning.

I understand your point. But IMO the default setting should basically
be the safer value, i.e., off, at least until the problem disappears.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: track_planning causing performance regression

Julien Rouhaud
On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
<[hidden email]> wrote:

>
> On 2020/06/29 16:05, Julien Rouhaud wrote:
> > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <[hidden email]> wrote:
> >>
> >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
>
> Thanks for the benchmark!
>
>
> >> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
>
> That's bad :(
>
>
> >>
> >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> >> brings the TPS numbers up to v12.3 levels.
> >>
> >> The inflection point (in this test-case) is 128 Connections, beyond which the
> >> TPS numbers are consistently low. Looking at the mailing list [1], this issue
> >> didn't surface earlier possibly since the regression is trivial at low connection counts.
> >>
> >> It would be great if this could be optimized further, or track_planning
> >> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
> >> enabled (but otherwise not particularly interested in track_planning).
>
> Your benchmark result seems to suggest that the cause of the problem is
> the contention of per-query spinlock in pgss_store(). Right?
> This lock contention is likely to happen when multiple sessions run
> the same queries.
>
> One idea to reduce that lock contention is to separate per-query spinlock
> into two; one is for planning, and the other is for execution. pgss_store()
> determines which lock to use based on the given "kind" argument.
> To make this idea work, also every pgss counters like shared_blks_hit
> need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the small number of different queries and the very low execution runtime?
It'll also considerably increase the shared memory consumption.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
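
For illustration, the integer counters could be bumped with the existing
atomics API along these lines (a sketch only; the double-typed totals have no
pg_atomic_* equivalent and would need separate treatment, and each field would
have to be set up with pg_atomic_init_u64() when the entry is created):

#include "port/atomics.h"

typedef struct Counters
{
    pg_atomic_uint64 calls;
    pg_atomic_uint64 rows;
    pg_atomic_uint64 shared_blks_hit;
    /* ... doubles such as total_time would still need a lock or a
     * fixed-point integer representation ... */
} Counters;

/* pgss_store(): no per-entry lock needed for these updates */
pg_atomic_fetch_add_u64(&entry->counters.calls, 1);
pg_atomic_fetch_add_u64(&entry->counters.rows, rows);
pg_atomic_fetch_add_u64(&entry->counters.shared_blks_hit,
                        bufusage->shared_blks_hit);

/* pg_stat_statements(): readers load each field individually, so the values
 * in one output row may come from different in-flight updates; this is the
 * per-row consistency that would be lost */
uint64      calls = pg_atomic_read_u64(&entry->counters.calls);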



Re: track_planning causing performance regression

Fujii Masao-4


On 2020/06/29 18:17, Julien Rouhaud wrote:

> On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
> <[hidden email]> wrote:
>>
>> On 2020/06/29 16:05, Julien Rouhaud wrote:
>>> On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <[hidden email]> wrote:
>>>>
>>>> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
>>
>> Thanks for the benchmark!
>>
>>
>>>> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
>>
>> That's bad :(
>>
>>
>>>>
>>>> Disabling pg_stat_statements.track_planning (which is 'On' by default)
>>>> brings the TPS numbers up to v12.3 levels.
>>>>
>>>> The inflection point (in this test-case) is 128 Connections, beyond which the
>>>> TPS numbers are consistently low. Looking at the mailing list [1], this issue
>>>> didn't surface earlier possibly since the regression is trivial at low connection counts.
>>>>
>>>> It would be great if this could be optimized further, or track_planning
>>>> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
>>>> enabled (but otherwise not particularly interested in track_planning).
>>
>> Your benchmark result seems to suggest that the cause of the problem is
>> the contention of per-query spinlock in pgss_store(). Right?
>> This lock contention is likely to happen when multiple sessions run
>> the same queries.
>>
>> One idea to reduce that lock contention is to separate per-query spinlock
>> into two; one is for planning, and the other is for execution. pgss_store()
>> determines which lock to use based on the given "kind" argument.
>> To make this idea work, also every pgss counters like shared_blks_hit
>> need to be separated into two, i.e., for planning and execution.
>
> This can probably remove some overhead, but won't it eventually hit
> the same issue when multiple connections try to plan the same query,
> given the number of different queries and very low execution runtime?

Yes. But maybe we can expect that the idea would improve
the performance to nearly the same level as v12?


> It'll also quite increase the shared memory consumption.

Yes.


> I'm wondering if we could instead use atomics to store the counters.
> The only downside is that we won't guarantee per-row consistency
> anymore, which may be problematic.

Yeah, we can consider more improvements against this issue.
But I'm afraid these (maybe including my idea) basically should
be items for v14...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: track_planning causing performance regression

Julien Rouhaud
On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
<[hidden email]> wrote:

>
> >> Your benchmark result seems to suggest that the cause of the problem is
> >> the contention of per-query spinlock in pgss_store(). Right?
> >> This lock contention is likely to happen when multiple sessions run
> >> the same queries.
> >>
> >> One idea to reduce that lock contention is to separate per-query spinlock
> >> into two; one is for planning, and the other is for execution. pgss_store()
> >> determines which lock to use based on the given "kind" argument.
> >> To make this idea work, also every pgss counters like shared_blks_hit
> >> need to be separated into two, i.e., for planning and execution.
> >
> > This can probably remove some overhead, but won't it eventually hit
> > the same issue when multiple connections try to plan the same query,
> > given the number of different queries and very low execution runtime?
>
> Yes. But maybe we can expect that the idea would improve
> the performance to the near same level as v12?

A POC patch should be easy to write, to see how much it alleviates this
problem.  However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default rather than going this way, so that users with a sane usage won't
have to suffer from a memory increase.

> > I'm wondering if we could instead use atomics to store the counters.
> > The only downside is that we won't guarantee per-row consistency
> > anymore, which may be problematic.
>
> Yeah, we can consider more improvements against this issue.
> But I'm afraid these (maybe including my idea) basically should
> be items for v14...

Yes, that's clearly not something I'd vote to push in v13 at this point.



Re: track_planning causing performance regression

Fujii Masao-4


On 2020/06/29 18:53, Julien Rouhaud wrote:

> On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
> <[hidden email]> wrote:
>>
>>>> Your benchmark result seems to suggest that the cause of the problem is
>>>> the contention of per-query spinlock in pgss_store(). Right?
>>>> This lock contention is likely to happen when multiple sessions run
>>>> the same queries.
>>>>
>>>> One idea to reduce that lock contention is to separate per-query spinlock
>>>> into two; one is for planning, and the other is for execution. pgss_store()
>>>> determines which lock to use based on the given "kind" argument.
>>>> To make this idea work, also every pgss counters like shared_blks_hit
>>>> need to be separated into two, i.e., for planning and execution.
>>>
>>> This can probably remove some overhead, but won't it eventually hit
>>> the same issue when multiple connections try to plan the same query,
>>> given the number of different queries and very low execution runtime?
>>
>> Yes. But maybe we can expect that the idea would improve
>> the performance to the near same level as v12?
>
> A POC patch should be easy to do and see how much it solves this
> problem.  However I'm not able to reproduce the issue, and IMHO unless
> we specifically want to be able to distinguish planner-time counters
> from execution-time counters, I'd prefer to disable track_planning by
> default than going this way, so that users with a sane usage won't
> have to suffer from a memory increase.

Agreed. +1 to change that default to off.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: track_planning causing performance regression

Fujii Masao-4


On 2020/06/29 18:56, Fujii Masao wrote:

>
>
> On 2020/06/29 18:53, Julien Rouhaud wrote:
>> On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
>> <[hidden email]> wrote:
>>>
>>>>> Your benchmark result seems to suggest that the cause of the problem is
>>>>> the contention of per-query spinlock in pgss_store(). Right?
>>>>> This lock contention is likely to happen when multiple sessions run
>>>>> the same queries.
>>>>>
>>>>> One idea to reduce that lock contention is to separate per-query spinlock
>>>>> into two; one is for planning, and the other is for execution. pgss_store()
>>>>> determines which lock to use based on the given "kind" argument.
>>>>> To make this idea work, also every pgss counters like shared_blks_hit
>>>>> need to be separated into two, i.e., for planning and execution.
>>>>
>>>> This can probably remove some overhead, but won't it eventually hit
>>>> the same issue when multiple connections try to plan the same query,
>>>> given the number of different queries and very low execution runtime?
>>>
>>> Yes. But maybe we can expect that the idea would improve
>>> the performance to the near same level as v12?
>>
>> A POC patch should be easy to do and see how much it solves this
>> problem.  However I'm not able to reproduce the issue, and IMHO unless
>> we specifically want to be able to distinguish planner-time counters
>> from execution-time counters, I'd prefer to disable track_planning by
>> default than going this way, so that users with a sane usage won't
>> have to suffer from a memory increase.
>
> Agreed. +1 to change that default to off.
The attached patch does this.
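
(For readers following along, the substance of the change is just the boot
value of the GUC in pg_stat_statements.c; roughly as below, paraphrased here
rather than copied from the attached patch:)

    DefineCustomBoolVariable("pg_stat_statements.track_planning",
                             "Selects whether planning duration is tracked by pg_stat_statements.",
                             NULL,
                             &pgss_track_planning,
                             false,     /* boot value: was true in v13 beta */
                             PGC_SUSET,
                             0,
                             NULL,
                             NULL,
                             NULL);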

I also added the following to the description of each *_plan_time column
in the docs. IMO this is helpful for users when they see those columns
report zero by default and try to understand why.

(if <varname>pg_stat_statements.track_planning</varname> is enabled, otherwise zero)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment: disable_pgss_planning_default_v1.patch (5K)

Re: track_planning causing performance regression

Julien Rouhaud
On Mon, Jun 29, 2020 at 1:14 PM Fujii Masao <[hidden email]> wrote:

>
> On 2020/06/29 18:56, Fujii Masao wrote:
> >
> >
> > On 2020/06/29 18:53, Julien Rouhaud wrote:
> >> On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
> >> <[hidden email]> wrote:
> >>>
> >>>>> Your benchmark result seems to suggest that the cause of the problem is
> >>>>> the contention of per-query spinlock in pgss_store(). Right?
> >>>>> This lock contention is likely to happen when multiple sessions run
> >>>>> the same queries.
> >>>>>
> >>>>> One idea to reduce that lock contention is to separate per-query spinlock
> >>>>> into two; one is for planning, and the other is for execution. pgss_store()
> >>>>> determines which lock to use based on the given "kind" argument.
> >>>>> To make this idea work, also every pgss counters like shared_blks_hit
> >>>>> need to be separated into two, i.e., for planning and execution.
> >>>>
> >>>> This can probably remove some overhead, but won't it eventually hit
> >>>> the same issue when multiple connections try to plan the same query,
> >>>> given the number of different queries and very low execution runtime?
> >>>
> >>> Yes. But maybe we can expect that the idea would improve
> >>> the performance to the near same level as v12?
> >>
> >> A POC patch should be easy to do and see how much it solves this
> >> problem.  However I'm not able to reproduce the issue, and IMHO unless
> >> we specifically want to be able to distinguish planner-time counters
> >> from execution-time counters, I'd prefer to disable track_planning by
> >> default than going this way, so that users with a sane usage won't
> >> have to suffer from a memory increase.
> >
> > Agreed. +1 to change that default to off.
>
> Attached patch does this.

Patch looks good to me.

> I also add the following into the description about each *_plan_time column
> in the docs. IMO this is helpful for users when they see that those columns
> report zero by default and try to understand why.
>
> (if <varname>pg_stat_statements.track_planning</varname> is enabled, otherwise zero)

+1

Do you intend to wait for other input before pushing?  FWIW I'm still
not convinced that the exposed problem is representative of any
realistic workload.  I of course entirely agree with the other
documentation changes.



Re: track_planning causing performance regression

Ants Aasma-2
In reply to this post by Julien Rouhaud
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <[hidden email]> wrote:
> On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
> <[hidden email]> wrote:
> >
> > On 2020/06/29 16:05, Julien Rouhaud wrote:
> > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <[hidden email]> wrote:
> > >>
> > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
> >
> > Thanks for the benchmark!
> >
> >
> > >> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
> >
> > That's bad :(
> >
> >
> > >>
> > >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> > >> brings the TPS numbers up to v12.3 levels.
> > >>
> > >> The inflection point (in this test-case) is 128 Connections, beyond which the
> > >> TPS numbers are consistently low. Looking at the mailing list [1], this issue
> > >> didn't surface earlier possibly since the regression is trivial at low connection counts.
> > >>
> > >> It would be great if this could be optimized further, or track_planning
> > >> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
> > >> enabled (but otherwise not particularly interested in track_planning).
> >
> > Your benchmark result seems to suggest that the cause of the problem is
> > the contention of per-query spinlock in pgss_store(). Right?
> > This lock contention is likely to happen when multiple sessions run
> > the same queries.
> >
> > One idea to reduce that lock contention is to separate per-query spinlock
> > into two; one is for planning, and the other is for execution. pgss_store()
> > determines which lock to use based on the given "kind" argument.
> > To make this idea work, also every pgss counters like shared_blks_hit
> > need to be separated into two, i.e., for planning and execution.
>
> This can probably remove some overhead, but won't it eventually hit
> the same issue when multiple connections try to plan the same query,
> given the number of different queries and very low execution runtime?
> It'll also quite increase the shared memory consumption.
>
> I'm wondering if we could instead use atomics to store the counters.
> The only downside is that we won't guarantee per-row consistency
> anymore, which may be problematic.


The problem looks to be that spinlocks are terrible with an overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out, leading to excessive spinning by everybody else. I think a simple thing to try would be to replace the spinlock with an LWLock.

I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered. We have done a great job of eliminating spinlocks from contended code paths. Robins, perhaps you could try it to see if it reduces the regression you are observing. The patch is against the v13 stable branch.

-- 
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Attachment: futex-prototype.patch (6K)

Re: track_planning causing performance regression

Peter Geoghegan-4
In reply to this post by Fujii Masao-4
On Mon, Jun 29, 2020 at 1:55 AM Fujii Masao <[hidden email]> wrote:

> > I disagree with the conclusion though.  It seems to me that if you
> > really have this workload that consists in these few queries and want
> > to get better performance, you'll anyway use a connection pooler
> > and/or use prepared statements, which will make this overhead
> > disappear entirely, and will also yield an even bigger performance
> > improvement.  A quick test using pgbench -M prepared, with
> > track_planning enabled, with still way too many connections already
> > shows a 25% improvement over the -M simple without track_planning.
>
> I understand your point. But IMO the default setting basically should
> be safer value, i.e., off at least until the problem disappears.

+1 -- this regression seems unacceptable to me.

--
Peter Geoghegan



Re: track_planning causing performance regression

Peter Geoghegan-4
On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan <[hidden email]> wrote:
> +1 -- this regression seems unacceptable to me.

I added an open item to track this.

Thanks
--
Peter Geoghegan



Re: track_planning causing performance regression

Andres Freund
In reply to this post by Julien Rouhaud
Hi,

On 2020-06-29 09:05:18 +0200, Julien Rouhaud wrote:
> I can't reproduce this on my laptop, but I can certainly believe that
> running the same 3 queries using more connections than available cores
> will lead to extra overhead.

> I disagree with the conclusion though.  It seems to me that if you
> really have this workload that consists in these few queries and want
> to get better performance, you'll anyway use a connection pooler
> and/or use prepared statements, which will make this overhead
> disappear entirely, and will also yield an even bigger performance
> improvement.

It's extremely common to have times where there are more active
queries than CPUs. And a pooler won't avoid that fully, at least not
without drastically reducing overall throughput.

Greetings,

Andres Freund



Re: track_planning causing performance regression

Andres Freund
In reply to this post by Fujii Masao-4
Hi,

On 2020-06-29 17:55:28 +0900, Fujii Masao wrote:
> One idea to reduce that lock contention is to separate per-query spinlock
> into two; one is for planning, and the other is for execution. pgss_store()
> determines which lock to use based on the given "kind" argument.
> To make this idea work, also every pgss counters like shared_blks_hit
> need to be separated into two, i.e., for planning and execution.

I suspect that the best thing would be to just turn the spinlock into an
lwlock. Spinlocks deal terribly with contention. I suspect it'd solve
the performance issue entirely. And it might even be possible, further
down the line, to just use a shared lock, and use atomics for the
counters.

Greetings,

Andres Freund



Re: track_planning causing performance regression

Fujii Masao-4
In reply to this post by Ants Aasma-2


On 2020/06/29 22:23, Ants Aasma wrote:

> On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <[hidden email]> wrote:
>
>     On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
>     <[hidden email]> wrote:
>      >
>      > On 2020/06/29 16:05, Julien Rouhaud wrote:
>      > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <[hidden email]> wrote:
>      > >>
>      > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
>      >
>      > Thanks for the benchmark!
>      >
>      >
>      > >> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
>      >
>      > That's bad :(
>      >
>      >
>      > >>
>      > >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
>      > >> brings the TPS numbers up to v12.3 levels.
>      > >>
>      > >> The inflection point (in this test-case) is 128 Connections, beyond which the
>      > >> TPS numbers are consistently low. Looking at the mailing list [1], this issue
>      > >> didn't surface earlier possibly since the regression is trivial at low connection counts.
>      > >>
>      > >> It would be great if this could be optimized further, or track_planning
>      > >> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
>      > >> enabled (but otherwise not particularly interested in track_planning).
>      >
>      > Your benchmark result seems to suggest that the cause of the problem is
>      > the contention of per-query spinlock in pgss_store(). Right?
>      > This lock contention is likely to happen when multiple sessions run
>      > the same queries.
>      >
>      > One idea to reduce that lock contention is to separate per-query spinlock
>      > into two; one is for planning, and the other is for execution. pgss_store()
>      > determines which lock to use based on the given "kind" argument.
>      > To make this idea work, also every pgss counters like shared_blks_hit
>      > need to be separated into two, i.e., for planning and execution.
>
>     This can probably remove some overhead, but won't it eventually hit
>     the same issue when multiple connections try to plan the same query,
>     given the number of different queries and very low execution runtime?
>     It'll also quite increase the shared memory consumption.
>
>     I'm wondering if we could instead use atomics to store the counters.
>     The only downside is that we won't guarantee per-row consistency
>     anymore, which may be problematic.
>
>
>
> The problem looks to be that spinlocks are terrible with overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out leading to excessive spinning by everybody. I think a simple thing to try would be to replace the spinlock with LWLock.
Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.

>
> I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.

I'm not familiar with futex, but could you tell me why you used futex instead
of LWLock that we already have? Is futex portable?

> We have done a great job at eliminating spinlocks from contended code paths. Robins, perhaps you could try it to see if it reduces the regression you are observing.

Yes. Also we need to check that this change doesn't increase performance
overhead in other workloads.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment: pgss_lwlock_v1.patch (6K)

Re: track_planning causing performance regression

Ants Aasma-2
On Tue, 30 Jun 2020 at 08:43, Fujii Masao <[hidden email]> wrote:
> > The problem looks to be that spinlocks are terrible with overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out leading to excessive spinning by everybody. I think a simple thing to try would be to replace the spinlock with LWLock.
>
> Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.

Great. I think this is the one that should get considered for testing.

> > I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
>
> I'm not familiar with futex, but could you tell me why you used futex instead
> of LWLock that we already have? Is futex portable?

Futex is a Linux kernel call that allows building a lock whose uncontended cases work fully in user space, almost exactly like a spinlock, while falling back to syscalls that wait for a wakeup in case of contention. It's not portable, but something similar could probably be implemented for other operating systems. I did not pursue this further because it became apparent that every performance-critical spinlock had already been removed.
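
(For anyone who wants the flavor of it, the classic futex-based mutex from
Ulrich Drepper's "Futexes Are Tricky" paper looks roughly like the sketch
below; this is a generic illustration, not the attached prototype patch:)

#include <linux/futex.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* lock word: 0 = unlocked, 1 = locked, 2 = locked with waiters */

static int
cmpxchg(int *p, int oldval, int newval)
{
    __atomic_compare_exchange_n(p, &oldval, newval, false,
                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    return oldval;              /* value previously found in *p */
}

static void
futex_mutex_lock(int *m)
{
    int         c = cmpxchg(m, 0, 1);

    /* uncontended case: one atomic op, no syscall at all */
    if (c != 0)
    {
        if (c != 2)
            c = __atomic_exchange_n(m, 2, __ATOMIC_ACQ_REL);
        while (c != 0)
        {
            /* contended: sleep in the kernel until somebody wakes us */
            syscall(SYS_futex, m, FUTEX_WAIT, 2, NULL, NULL, 0);
            c = __atomic_exchange_n(m, 2, __ATOMIC_ACQ_REL);
        }
    }
}

static void
futex_mutex_unlock(int *m)
{
    /* uncontended case: one atomic op, no syscall at all */
    if (__atomic_fetch_sub(m, 1, __ATOMIC_RELEASE) != 1)
    {
        /* there may be waiters: reset the lock word and wake one of them */
        __atomic_store_n(m, 0, __ATOMIC_RELEASE);
        syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}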

To be clear, I am not advocating for this patch to get included. I just had the patch immediately available and it could have confirmed that using a better lock fixes things.
-- 
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

Re: track_planning causing performance regression

Fujii Masao-4


On 2020/06/30 20:30, Ants Aasma wrote:

> On Tue, 30 Jun 2020 at 08:43, Fujii Masao <[hidden email]> wrote:
>
>      > The problem looks to be that spinlocks are terrible with overloaded CPU and a contended spinlock. A process holding the spinlock might easily get scheduled out leading to excessive spinning by everybody. I think a simple thing to try would be to replace the spinlock with LWLock.
>
>     Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.
>
>
> Great. I think this is the one that should get considered for testing.
>
>      > I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
>
>     I'm not familiar with futex, but could you tell me why you used futex instead
>     of LWLock that we already have? Is futex portable?
>
>
> Futex is a Linux kernel call that allows to build a lock that has uncontended cases work fully in user space almost exactly like a spinlock, while falling back to syscalls that wait for wakeup in case of contention. It's not portable, but probably something similar could be implemented for other operating systems. I did not pursue this further because it became apparent that every performance critical spinlock had already been removed.
>
> To be clear, I am not advocating for this patch to get included. I just had the patch immediately available and it could have confirmed that using a better lock fixes things.

Understood. Thanks for the explanation!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: track_planning causing performance regression

Fujii Masao-4
In reply to this post by Peter Geoghegan-4


On 2020/06/30 7:29, Peter Geoghegan wrote:
> On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan <[hidden email]> wrote:
>> +1 -- this regression seems unacceptable to me.
>
> I added an open item to track this.

Thanks!
I'm thinking of changing the default value of track_planning to off for v13.

Ants and Andres suggested replacing the spinlock used in pgss_store() with
an LWLock. I agreed with them and posted a POC patch doing that. But I think
that patch is an item for v14. It may address the reported performance
issue, but it may cause other performance issues in other workloads. We would
need to measure how the patch affects performance in various workloads.
It seems too late to do that at this stage of v13. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: track_planning causing performance regression

Andres Freund
In reply to this post by Fujii Masao-4
Hi,

On 2020-06-30 14:43:39 +0900, Fujii Masao wrote:
> > I did a prototype patch that replaces spinlocks with futexes, but was not able to find a workload where it mattered.
>
> I'm not familiar with futex, but could you tell me why you used futex instead
> of LWLock that we already have? Is futex portable?

We can't rely on futexes; they're Linux-only. I also don't see much of a
reason to use spinlocks (rather than lwlocks) here in the first place.


> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index cef8bb5a49..aa506f6c11 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -39,7 +39,7 @@
>   * in an entry except the counters requires the same.  To look up an entry,
>   * one must hold the lock shared.  To read or update the counters within
>   * an entry, one must hold the lock shared or exclusive (so the entry doesn't
> - * disappear!) and also take the entry's mutex spinlock.
> + * disappear!) and also take the entry's partition lock.
>   * The shared state variable pgss->extent (the next free spot in the external
>   * query-text file) should be accessed only while holding either the
>   * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
> @@ -115,6 +115,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
>  
>  #define JUMBLE_SIZE 1024 /* query serialization buffer size */
>  
> +#define PGSS_NUM_LOCK_PARTITIONS() (pgss_max)
> +#define PGSS_HASH_PARTITION_LOCK(key) \
> + (&(pgss->base + \
> +   (get_hash_value(pgss_hash, key) % PGSS_NUM_LOCK_PARTITIONS()))->lock)
> +
>  /*
>   * Extension version number, for supporting older extension versions' objects
>   */
> @@ -207,7 +212,7 @@ typedef struct pgssEntry
>   Size query_offset; /* query text offset in external file */
>   int query_len; /* # of valid bytes in query string, or -1 */
>   int encoding; /* query text encoding */
> - slock_t mutex; /* protects the counters only */
> + LWLock   *lock; /* protects the counters only */
>  } pgssEntry;

Why did you add the hashing here? It seems a lot better to just add an
lwlock in-place instead of the spinlock? The added size is negligible
compared to the size of pgssEntry.
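
(Roughly what that in-place variant would look like; a sketch only, assuming a
new tranche id, say pgss_lock_tranche_id, obtained with LWLockNewTrancheId()
at shared-memory init:)

typedef struct pgssEntry
{
    pgssHashKey key;            /* hash key of entry - MUST BE FIRST */
    Counters    counters;       /* the statistics for this query */
    Size        query_offset;   /* query text offset in external file */
    int         query_len;      /* # of valid bytes in query string, or -1 */
    int         encoding;       /* query text encoding */
    LWLock      lock;           /* protects the counters only */
} pgssEntry;

/* entry_alloc(): set up the embedded lock when the entry is created */
LWLockInitialize(&entry->lock, pgss_lock_tranche_id);

/* pgss_store(): counter update */
LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
/* ... bump the counters ... */
LWLockRelease(&entry->lock);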

Greetings,

Andres Freund



Re: track_planning causing performance regression

Andres Freund
In reply to this post by Ants Aasma-2
Hi,

On 2020-06-30 14:30:03 +0300, Ants Aasma wrote:
> Futex is a Linux kernel call that allows to build a lock that has
> uncontended cases work fully in user space almost exactly like a spinlock,
> while falling back to syscalls that wait for wakeup in case of contention.
> It's not portable, but probably something similar could be implemented for
> other operating systems. I did not pursue this further because it became
> apparent that every performance critical spinlock had already been removed.

Our lwlock implementation does have that property already, though. While
the kernel wait is implemented using semaphores, those are implemented
using futexes internally (posix ones, not sysv ones, so only after
whatever version we switched the default to posix semas on linux).

I'd rather move towards removing spinlocks from postgres than making
their implementation more complicated...

Greetings,

Andres Freund

