Keep elog(ERROR) and ereport(ERROR) calls in the cold path

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Keep elog(ERROR) and ereport(ERROR) calls in the cold path

David Rowley
Back in [1] I experimented with a patch to coax compilers to build all
elog/ereport calls that were >= ERROR into a cold path away from the
function rasing the error. At the time, I really just wanted to test
how much of a speedup we could get by doing this and ended up just
writing up a patch that basically changed all elog(ERROR) calls from:

if (<error situation check>)
{
   <do stuff>;
    elog(ERROR, "...");
}

to add an unlikely() and become;

if (unlikely(<error situation check>)
{
   <do stuff>;
    elog(ERROR, "...");
}

Per the report in [1] I did see some pretty good gains in performance
from doing this.  The problem was, that at the time I couldn't figure
out a way to do this without an invasive patch that changed the code
in the thousands of elog/ereport calls.

In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path.  To make the
compiler properly mark just >= ERROR calls as cold, and only when the
elevel is a constant, I modified the ereport_domain macro to do:

if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \

I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and correctly select the correct version of
the function for the job.  (I also tried using __builtin_choose_expr()
but GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)

I sampled a few .s files to inspect what code had changed.  Looking at
mcxt.s, fmgr.s and xlog.s, the first two of these because I found in
[1] that elogs were being done from those files quite often and xlog.s
because it's pretty big.  As far as I could see, GCC correctly moved
all the error reporting stuff where the elevel was a constant and >=
ERROR into the cold path and left the lower-level or non-consts elevel
calls alone.

For clang, I didn't see any changes in the .s files. I suspect that
they might have a few smarts in there and see the
__builtin_unreachable() call and assume the path is cold already based
on that. That was with clang 10. Perhaps older versions are not as
smart.

Benchmarking:

For benchmarking, I've not done a huge amount to test the impacts of
this change.  However, I can say that I am seeing some fairly good
improvements.  There seems to be some small improvements to execution
speed using TPCH-Q1 and also some good results from a pgbench -S test.

For TPCH-Q1:

Master:
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5272.630 ms
latency average = 5258.610 ms
latency average = 5250.871 ms

Master + elog_ereport_attribute_cold.patch
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5182.761 ms
latency average = 5194.851 ms
latency average = 5183.128 ms

Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.

For pgbench -S:

My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:

Master:
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)

Master + elog_ereport_attribute_cold.patch
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)

It would be useful if someone with some server-grade Intel hardware
could run a few tests on this.  The above results are all from AMD
hardware.

I've attached the patch for this. I'll add it to the July 'fest.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@...

elog_ereport_attribute_cold.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

Robert Haas
On Wed, Jun 24, 2020 at 9:51 PM David Rowley <[hidden email]> wrote:

> $ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
>
> Which is about a 1.42% increase in performance. That's not exactly
> groundbreaking, but pretty useful to have if that happens to apply
> across the board for execution performance.
>
> For pgbench -S:
>
> My results were a bit noisier than the TPCH test, but the results I
> obtained did show about a 10% increase in performance:

This is pretty cool, particularly because it affects single-client
performance. It seems like a lot of ideas people have had about
speeding up pgbench performance - including me - have improved
performance under concurrency at the cost of very slightly degrading
single-client performance. It would be nice to claw some of that back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

Andres Freund
In reply to this post by David Rowley
Hi,

Thanks for picking this up again!

On 2020-06-25 13:50:37 +1200, David Rowley wrote:
> In the attached, I've come up with a way that works. Basically I've
> just added a function named errstart_cold() that is attributed with
> __attribute__((cold)), which will hint to the compiler to keep
> branches which call this function in a cold path.

I recall you trying this before? Has that gotten easier because we
evolved ereport()/elog(), or has gcc become smarter, or ...?


> To make the compiler properly mark just >= ERROR calls as cold, and
> only when the elevel is a constant, I modified the ereport_domain
> macro to do:
>
> if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> errstart_cold(elevel, domain) : \
> errstart(elevel, domain)) \

I think it'd be good to not just do this for ERROR, but also for <=
DEBUG1. I recall seing quite a few debug elogs that made the code worse
just by "being there".

I suspect that doing this for DEBUG* could also improve the code for
clang, because we obviously don't have __builtin_unreachable after those.


> I see no reason why the compiler shouldn't always fold that constant
> expression at compile-time and correctly select the correct version of
> the function for the job.  (I also tried using __builtin_choose_expr()
> but GCC complained when the elevel was not constant, even with the
> __builtin_constant_p() test in the condition)

I think it has to be constant in all paths for that...


> Master:
> drowley@amd3990x:~$ pgbench -S -T 120 postgres
> tps = 25245.903255 (excluding connections establishing)
> tps = 26144.454208 (excluding connections establishing)
> tps = 25931.850518 (excluding connections establishing)
>
> Master + elog_ereport_attribute_cold.patch
> drowley@amd3990x:~$ pgbench -S -T 120 postgres
> tps = 28351.480631 (excluding connections establishing)
> tps = 27763.656557 (excluding connections establishing)
> tps = 28896.427929 (excluding connections establishing)

That is pretty damn cool.


> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> index e976201030..8076e8af24 100644
> --- a/src/backend/utils/error/elog.c
> +++ b/src/backend/utils/error/elog.c
> @@ -219,6 +219,19 @@ err_gettext(const char *str)
>  #endif
>  }
>  
> +#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && defined(HAVE__BUILTIN_CONSTANT_P)
> +/*
> + * errstart_cold
> + * A simple wrapper around errstart, but hinted to be cold so that the
> + * compiler is more likely to put error code in a cold area away from the
> + * main function body.
> + */
> +bool
> +pg_attribute_cold errstart_cold(int elevel, const char *domain)
> +{
> + return errstart(elevel, domain);
> +}
> +#endif

Hm. Would it make sense to have this be a static inline?


>  /*
>   * errstart --- begin an error-reporting cycle
> diff --git a/src/include/c.h b/src/include/c.h
> index d72b23afe4..087b8af6cb 100644
> --- a/src/include/c.h
> +++ b/src/include/c.h
> @@ -178,6 +178,21 @@
>  #define pg_noinline
>  #endif
>  
> +/*
> + * Marking certain functions as "hot" or "cold" can be useful to assist the
> + * compiler in arranging the assembly code in a more efficient way.
> + * These are supported from GCC >= 4.3 and clang >= 3.2
> + */
> +#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
> + (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 2)))
> +#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
> +#define pg_attribute_hot __attribute__((hot))
> +#define pg_attribute_cold __attribute__((cold))
> +#else
> +#define pg_attribute_hot
> +#define pg_attribute_cold
> +#endif

Wonder if we should start using __has_attribute() for things like this.

https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html

I.e. we could do something like
#ifndef __has_attribute
#define __has_attribute(attribute) 0
#endif

#if __has_attribute(hot)
#define pg_attribute_hot __attribute__((hot))
#else
#define pg_attribute_hot
#endif

clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
I don't think we'd loose too much.




>  #ifdef HAVE__BUILTIN_CONSTANT_P
> +#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
> +#define ereport_domain(elevel, domain, ...) \
> + do { \
> + pg_prevent_errno_in_scope(); \
> + if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> + errstart_cold(elevel, domain) : \
> + errstart(elevel, domain)) \
> + __VA_ARGS__, errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
> + if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
> + pg_unreachable(); \
> + } while(0)
> +#else /* !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
>  #define ereport_domain(elevel, domain, ...) \
>   do { \
>   pg_prevent_errno_in_scope(); \
> @@ -129,6 +141,7 @@
>   if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
>   pg_unreachable(); \
>   } while(0)
> +#endif /* HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
>  #else /* !HAVE__BUILTIN_CONSTANT_P */
>  #define ereport_domain(elevel, domain, ...) \
>   do { \
> @@ -146,6 +159,9 @@

Could we do this without another copy? Feels like we should be able to
just do that in the existing #ifdef HAVE__BUILTIN_CONSTANT_P if we just
add errstart_cold() independent HAVE_PG_ATTRIBUTE_HOT_AND_COLD.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

David Rowley
On Fri, 26 Jun 2020 at 04:35, Andres Freund <[hidden email]> wrote:
> On 2020-06-25 13:50:37 +1200, David Rowley wrote:
> > In the attached, I've come up with a way that works. Basically I've
> > just added a function named errstart_cold() that is attributed with
> > __attribute__((cold)), which will hint to the compiler to keep
> > branches which call this function in a cold path.
>
> I recall you trying this before? Has that gotten easier because we
> evolved ereport()/elog(), or has gcc become smarter, or ...?

Yeah, I appear to have tried it and found it not to work in [1]. I can
only assume GCC got smarter in regards to how it marks a path as cold.
Previously it seemed not do due to the do/while(0). I'm pretty sure
back when I tested last that ditching the do while made it work, just
we can't really get rid of it.

> > To make the compiler properly mark just >= ERROR calls as cold, and
> > only when the elevel is a constant, I modified the ereport_domain
> > macro to do:
> >
> > if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> > errstart_cold(elevel, domain) : \
> > errstart(elevel, domain)) \
>
> I think it'd be good to not just do this for ERROR, but also for <=
> DEBUG1. I recall seing quite a few debug elogs that made the code worse
> just by "being there".

I think that case is different. We don't want to move the entire elog
path into the cold path for that. We'd only want to hint that errstart
is unlikely to return true if elevel <= DEBUG1

> > diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> > index e976201030..8076e8af24 100644
> > --- a/src/backend/utils/error/elog.c
> > +++ b/src/backend/utils/error/elog.c
> > @@ -219,6 +219,19 @@ err_gettext(const char *str)
> >  #endif
> >  }
> >
> > +#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && defined(HAVE__BUILTIN_CONSTANT_P)
> > +/*
> > + * errstart_cold
> > + *           A simple wrapper around errstart, but hinted to be cold so that the
> > + *           compiler is more likely to put error code in a cold area away from the
> > + *           main function body.
> > + */
> > +bool
> > +pg_attribute_cold errstart_cold(int elevel, const char *domain)
> > +{
> > +     return errstart(elevel, domain);
> > +}
> > +#endif
>
> Hm. Would it make sense to have this be a static inline?

I thought about that but didn't try it to ensure it still worked ok. I
didn't think it was that important to make sure we don't get the extra
function hop for ERRORs. It seemed like a case we'd not want to really
optimise for.

> >  /*
> >   * errstart --- begin an error-reporting cycle
> > diff --git a/src/include/c.h b/src/include/c.h
> > index d72b23afe4..087b8af6cb 100644
> > --- a/src/include/c.h
> > +++ b/src/include/c.h
> > @@ -178,6 +178,21 @@
> >  #define pg_noinline
> >  #endif
> >
> > +/*
> > + * Marking certain functions as "hot" or "cold" can be useful to assist the
> > + * compiler in arranging the assembly code in a more efficient way.
> > + * These are supported from GCC >= 4.3 and clang >= 3.2
> > + */
> > +#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
> > +     (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 2)))
> > +#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
> > +#define pg_attribute_hot __attribute__((hot))
> > +#define pg_attribute_cold __attribute__((cold))
> > +#else
> > +#define pg_attribute_hot
> > +#define pg_attribute_cold
> > +#endif
>
> Wonder if we should start using __has_attribute() for things like this.
>
> https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html
>
> I.e. we could do something like
> #ifndef __has_attribute
> #define __has_attribute(attribute) 0
> #endif
>
> #if __has_attribute(hot)
> #define pg_attribute_hot __attribute__((hot))
> #else
> #define pg_attribute_hot
> #endif
>
> clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
> I don't think we'd loose too much.

Thanks for pointing that out. Seems like a good idea to me. I don't
think we'll upset too many people running GCC 4.4 to 5.0. I can't
imagine many people serious about performance will be using a
PostgreSQL version that'll be released in 2021 with a pre 2014
compiler.

> >  #ifdef HAVE__BUILTIN_CONSTANT_P
> > +#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
> > +#define ereport_domain(elevel, domain, ...)  \
> > +     do { \
> > +             pg_prevent_errno_in_scope(); \
> > +             if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> > +                      errstart_cold(elevel, domain) : \
> > +                      errstart(elevel, domain)) \
> > +                     __VA_ARGS__, errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
> > +             if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
> > +                     pg_unreachable(); \
> > +     } while(0)
> > +#else                                                        /* !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
> >  #define ereport_domain(elevel, domain, ...)  \
> >       do { \
> >               pg_prevent_errno_in_scope(); \
> > @@ -129,6 +141,7 @@
> >               if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
> >                       pg_unreachable(); \
> >       } while(0)
> > +#endif                                                       /* HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
> >  #else                                                        /* !HAVE__BUILTIN_CONSTANT_P */
> >  #define ereport_domain(elevel, domain, ...)  \
> >       do { \
> > @@ -146,6 +159,9 @@
>
> Could we do this without another copy? Feels like we should be able to
> just do that in the existing #ifdef HAVE__BUILTIN_CONSTANT_P if we just
> add errstart_cold() independent HAVE_PG_ATTRIBUTE_HOT_AND_COLD.

Yeah. I just did it that way so we didn't get the extra function hop
in compilers that don't support __attribute((cold)). If I can inline
errstart_cold() and have the compiler still properly determine that
it's a cold function, then it seems wise to do it that way. If not,
then I'll need to keep a separate macro.

David

[1] https://www.postgresql.org/message-id/20171030094449.ffqhvt5n623zvyja%40alap3.anarazel.de


Reply | Threaded
Open this post in threaded view
|

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

David Rowley
On Fri, 26 Jun 2020 at 13:21, David Rowley <[hidden email]> wrote:

>
> On Fri, 26 Jun 2020 at 04:35, Andres Freund <[hidden email]> wrote:
> > On 2020-06-25 13:50:37 +1200, David Rowley wrote:
> > > In the attached, I've come up with a way that works. Basically I've
> > > just added a function named errstart_cold() that is attributed with
> > > __attribute__((cold)), which will hint to the compiler to keep
> > > branches which call this function in a cold path.
> >
> > I recall you trying this before? Has that gotten easier because we
> > evolved ereport()/elog(), or has gcc become smarter, or ...?
>
> Yeah, I appear to have tried it and found it not to work in [1]. I can
> only assume GCC got smarter in regards to how it marks a path as cold.
> Previously it seemed not do due to the do/while(0). I'm pretty sure
> back when I tested last that ditching the do while made it work, just
> we can't really get rid of it.
>
> > > To make the compiler properly mark just >= ERROR calls as cold, and
> > > only when the elevel is a constant, I modified the ereport_domain
> > > macro to do:
> > >
> > > if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> > > errstart_cold(elevel, domain) : \
> > > errstart(elevel, domain)) \
> >
> > I think it'd be good to not just do this for ERROR, but also for <=
> > DEBUG1. I recall seing quite a few debug elogs that made the code worse
> > just by "being there".
>
> I think that case is different. We don't want to move the entire elog
> path into the cold path for that. We'd only want to hint that errstart
> is unlikely to return true if elevel <= DEBUG1
I played around with this trying to find if there was a way to make this work.

v2 patch includes the change you mentioned about using __has_attribute
(cold) and removes the additional ereport_domain macro
v3 is v2 plus an additional change to mark the branch within
ereport_domain as unlikely when elevel <= DEBUG1
v4 is v2 plus it marks the errstart call as unlikely regardless of elevel.

I tried v4 as I was having trouble as v3 was showing worse performance
than v2.  v4 appears better on the AMD system, but that system is
producing noisy results (very obvious if looking at attached
amd3990x_elog_cold.png)

I ran pgbench -S T 600 -P 10 with each patch and for the AMD machine I got:

master = 27817.32167 tps
v2 =  28991.65667 tps (104.22% of master)
v3 =  28622.775 tps (102.90% of master)
v4 = 29648.91 tps (106.58% of master)

(I attribute the speedup here not being the same as my last report due
to noise. A recent bios update partially fixed the problem, but not
completely)

For the intel laptop I got:

master = 25452.38167 tps
v2 = 25473.695 tps (100.08% of master)
v3 = 25434.89333 tps (99.93% of master)
v4 = 25389.02833 tps (99.75% of master)

Looking at the assembly for the v3 patch, it does appear that the
elevel <= DEBUG1 calls were correctly moved to the cold area and that
the ones > DEBUG1 and < ERROR were left alone. However, I did only
look at xlog.s. The intel results don't look very promising, but
perhaps this is not the ideal test to show improvements with
instruction cache efficiency.

> > > +bool
> > > +pg_attribute_cold errstart_cold(int elevel, const char *domain)
> > > +{
> > > +     return errstart(elevel, domain);
> > > +}
> > > +#endif
> >
> > Hm. Would it make sense to have this be a static inline?

I didn't find a way to make this work (using gcc-10). Inlining, of
course, makes the assembly just call errstart().  errstart_cold() is
nowhere to be seen. The __attribute(cold) does not seem to be applied
to the errstart() call where the errstart_cold() call was inlined.

I've attached a graph showing the results for the AMD and Intel runs
and also csv files of the pgbench tps output.  I've also attached each
version of the patch I tested.

It would be good to see some testing done on other hardware.

David

amd3990x_elog_cold.png (278K) Download Attachment
intel-i78565u_elog_cold.png (107K) Download Attachment
amd3990x_elog_cold.csv (2K) Download Attachment
intel_i7-8565U_elog_cold.csv (2K) Download Attachment
elog_ereport_attribute_cold_v2.patch (3K) Download Attachment
elog_ereport_attribute_cold_v3.patch (3K) Download Attachment
elog_ereport_attribute_cold_v4.patch (3K) Download Attachment