In PG12, query with float calculations is slower than PG11

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
37 messages Options
12
Reply | Threaded
Open this post in threaded view
|

In PG12, query with float calculations is slower than PG11

keisuke kuroda
Hi,

I am testing performance both PG12 and PG11.
I found the case of performance degradation in PG12.

Amit Langote help me to analyze and to create patch.
Thanks!

* environment

CentOS Linux release 7.6.1810 (Core)
postgresql 12.1
postgresql 11.6

* postgresql.conf

shared_buffers = 2048MB
max_parallel_workers_per_gather = 0
work_mem = '64MB'
jit = off

* test case

CREATE TABLE realtest(a real, b real, c real, d real, e real);
INSERT INTO realtest SELECT i,i,i,i,i FROM generate_series(0,10000000) AS i;

EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
 select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
 from realtest;

* result

 PG12.1 5878.389 ms
 PG11.6 4533.554 ms

** PostgreSQL 12.1

pgbench=#  EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
  select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
  from realtest;
                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.realtest  (cost=0.00..288697.59 rows=10000115 width=40) (actual time=0.040..5195.328 rows=10000001 loops=1)
   Output: ('2'::double precision * a), ('2'::double precision * b), ('2'::double precision * c), ('2'::double precision * d), ('2'::double precision * e)
   Buffers: shared hit=63695
 Planning Time: 0.051 ms
 Execution Time: 5878.389 ms
(5 行)

Samples: 6K of event 'cpu-clock', Event count (approx.): 1577750000
Overhead  Command   Shared Object      Symbol
  25.48%  postgres  postgres           [.] ExecInterpExpr
★18.65%  postgres  libc-2.17.so       [.] __isinf
  14.36%  postgres  postgres           [.] float84mul
   8.54%  postgres  [vdso]             [.] __vdso_clock_gettime
   4.02%  postgres  postgres           [.] ExecScan
   3.69%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
   2.63%  postgres  libc-2.17.so       [.] __clock_gettime
   2.55%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
   2.00%  postgres  postgres           [.] heapgettup_pagemode

** PostgreSQL 11.6

pgbench=#  EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
  select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
  from realtest;
                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.realtest  (cost=0.00..288697.59 rows=10000115 width=40) (actual time=0.012..3845.480 rows=10000001 loops=1)
   Output: ('2'::double precision * a), ('2'::double precision * b), ('2'::double precision * c), ('2'::double precision * d), ('2'::double precision * e)
   Buffers: shared hit=63695
 Planning Time: 0.033 ms
 Execution Time: 4533.554 ms
(5 行)

Samples: 4K of event 'cpu-clock', Event count (approx.): 1192000000
Overhead  Command   Shared Object      Symbol
  32.30%  postgres  postgres           [.] ExecInterpExpr
  14.95%  postgres  postgres           [.] float84mul
  10.57%  postgres  [vdso]             [.] __vdso_clock_gettime
★ 6.84%  postgres  libc-2.17.so       [.] __isinf
   3.96%  postgres  postgres           [.] ExecScan
   3.50%  postgres  libc-2.17.so       [.] __clock_gettime
   3.31%  postgres  postgres           [.] heap_getnext
   3.08%  postgres  postgres           [.] HeapTupleSatisfiesMVCC
   2.77%  postgres  postgres           [.] slot_deform_tuple
   2.37%  postgres  postgres           [.] ExecProcNodeInstr
   2.08%  postgres  postgres           [.] standard_ExecutorRun

* cause

Obviously, even in common cases where no overflow occurs,
you can tell that PG 12 is performing isinf() 3 times on every call of
float8_mul() once for each of val1, val2, result where as PG 11
is performing only once for result.

That's because check_float8_val() (in PG 12) is a function
whose arguments must be evaluated before
it is called (it is inline, but that's irrelevant),
whereas CHECKFLOATVAL() (in PG11) is a macro
whose arguments are only substituted into its body.

By the way, this change of float8mul() implementation is
mostly due to the following commit in PG 12 development cycle:
commit 6bf0bc842bd75877e31727eb559c6a69e237f831

Especially, the following diff:

@@ -894,13 +746,8 @@ float8mul(PG_FUNCTION_ARGS)  {
    float8      arg1 = PG_GETARG_FLOAT8(0);
    float8      arg2 = PG_GETARG_FLOAT8(1);
-   float8      result;
-
-   result = arg1 * arg2;

-   CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2),
-                 arg1 == 0 || arg2 == 0);
-   PG_RETURN_FLOAT8(result);
+   PG_RETURN_FLOAT8(float8_mul(arg1, arg2));
 }

* patch

This patch uses MACRO which was used by PG11.
I tried attached patch, which can be applied to PG 12 source and performed a benchmark:

 PG12.1 5878.389 ms
 PG11.6 4533.554 ms

 PG12.1 + Patch 4679.162 ms

** PostgreSQL 12.1 + Patch

postgres=# EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
 select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
 from realtest;
                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.realtest  (cost=0.00..307328.38 rows=10828150 width=40) (actual time=0.012..4009.012 rows=10000001 loops=1)
   Output: ('2'::double precision * a), ('2'::double precision * b), ('2'::double precision * c), ('2'::double precision * d), ('2'::double precision * e)
   Buffers: shared hit=63695
 Planning Time: 0.038 ms
 Execution Time: 4679.162 ms
(5 rows)

Samples: 5K of event 'cpu-clock', Event count (approx.): 1376750000
Overhead  Command   Shared Object      Symbol
  31.43%  postgres  postgres           [.] ExecInterpExpr
  14.24%  postgres  postgres           [.] float84mul
  10.40%  postgres  [vdso]             [.] __vdso_clock_gettime
★ 5.41%  postgres  libc-2.17.so       [.] __isinf
   4.63%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
   4.03%  postgres  postgres           [.] ExecScan
   3.54%  postgres  libc-2.17.so       [.] __clock_gettime
   3.12%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
   2.36%  postgres  postgres           [.] heap_getnextslot
   2.16%  postgres  postgres           [.] heapgettup_pagemode
   2.09%  postgres  postgres           [.] standard_ExecutorRun
   2.07%  postgres  postgres           [.] SeqNext
   2.03%  postgres  postgres           [.] ExecProcNodeInstr
   2.03%  postgres  postgres           [.] tts_virtual_clear

PG 12 is still slower compared to PG 11, but the __isinf() situation is better with the patch.

Best Regards,
Keisuke Kuroda

check-float-val.patch (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Andres Freund
Hi,

On 2020-02-06 14:25:03 +0900, keisuke kuroda wrote:
> That's because check_float8_val() (in PG 12) is a function
> whose arguments must be evaluated before
> it is called (it is inline, but that's irrelevant),
> whereas CHECKFLOATVAL() (in PG11) is a macro
> whose arguments are only substituted into its body.

Hm - it's not that clear to me that it is irrelevant that the function
gets inlined. The compiler should know that isinf is side-effect free,
and that it doesn't have to evaluate before necessary.

Normally isinf is implemented by a compiler intrisic within the system
headers. But not in your profile:
> ★ 5.41%  postgres  libc-2.17.so       [.] __isinf

I checked, and I don't see any references to isinf from within float.c
(looking at the disassembly - there's some debug strings containing the
word, but that's it).

What compiler & compiler version on what kind of architecture is this?

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Amit Langote
Hi,

On Thu, Feb 6, 2020 at 2:55 PM Andres Freund <[hidden email]> wrote:

> On 2020-02-06 14:25:03 +0900, keisuke kuroda wrote:
> > That's because check_float8_val() (in PG 12) is a function
> > whose arguments must be evaluated before
> > it is called (it is inline, but that's irrelevant),
> > whereas CHECKFLOATVAL() (in PG11) is a macro
> > whose arguments are only substituted into its body.
>
> Hm - it's not that clear to me that it is irrelevant that the function
> gets inlined. The compiler should know that isinf is side-effect free,
> and that it doesn't have to evaluate before necessary.
>
> Normally isinf is implemented by a compiler intrisic within the system
> headers. But not in your profile:
> > ★ 5.41%  postgres  libc-2.17.so       [.] __isinf
>
> I checked, and I don't see any references to isinf from within float.c
> (looking at the disassembly - there's some debug strings containing the
> word, but that's it).
>
> What compiler & compiler version on what kind of architecture is this?

As Kuroda-san mentioned, I also checked the behavior that he reports.
The compiler I used is an ancient one (CentOS 7 default):

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)

Compiler dependent behavior of inlining might be relevant here, but
there is one more thing to consider. The if () condition in
check_float8_val (PG 12) and CHECKFLOATVAL (PG 11) is calculated
differently, causing isinf() to be called more times in PG 12:

static inline void
check_float8_val(const float8 val, const bool inf_is_valid,
                 const bool zero_is_valid)
{
    if (!inf_is_valid && unlikely(isinf(val)))
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("value out of range: overflow")));

#define CHECKFLOATVAL(val, inf_is_valid, zero_is_valid)         \
do {                                                            \
    if (isinf(val) && !(inf_is_valid))                          \
        ereport(ERROR,                                          \
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),   \
          errmsg("value out of range: overflow")));             \

called thusly:

    check_float8_val(result, isinf(val1) || isinf(val2),
                     val1 == 0.0 || val2 == 0.0);

and

    CHECKFLOATVAL(result, isinf(arg1) || isinf(arg2),
                  arg1 == 0 || arg2 == 0);

from float8_mul() and float8mul() in PG 12 and PG 11, respectively.

You may notice that the if () condition is reversed, so while PG 12
calculates isinf(arg1) || isinf(arg2) first and isinf(result) only if
the first is false, which it is in most cases, PG 11 calculates
isinf(result) first, followed by isinf(arg1) || isinf(arg2) if the
former is true.  I don't understand why such reversal was necessary,
but it appears to be the main factor behind this slowdown.  So, even
if PG 12's check_float8_val() is perfectly inlined, this slowdown
couldn't be helped.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Tom Lane-2
So it appears to me that what commit 6bf0bc842 did in this area was
not just wrong, but disastrously so.  Before that, we had a macro that
evaluated isinf(val) before it evaluated the inf_is_valid condition.
Now we have check_float[48]_val which do it the other way around.
That would be okay if the inf_is_valid condition were cheap to
evaluate, but in common code paths it's actually twice as expensive
as isinf().

Andres seems to be of the opinion that the compiler should be willing
to ignore the semantic requirements of the C standard in order
to rearrange the code back into the cheaper order.  That sounds like
wishful thinking to me ... even if it actually works on his compiler,
it certainly isn't going to work for everyone.

The patch looks unduly invasive to me, but I think that it might be
right that we should go back to a macro-based implementation, because
otherwise we don't have a good way to be certain that the function
parameter won't get evaluated first.  (Another reason to do so is
so that the file/line numbers generated for the error reports go back
to being at least a little bit useful.)  We could use local variables
within the macro to avoid double evals, if anyone thinks that's
actually important --- I don't.

I think the current code is probably also misusing unlikely(),
and that the right way would be more like

        if (unlikely(isinf(val) && !(inf_is_valid)))

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Robert Haas
On Thu, Feb 6, 2020 at 11:04 AM Tom Lane <[hidden email]> wrote:
> So it appears to me that what commit 6bf0bc842 did in this area was
> not just wrong, but disastrously so.  Before that, we had a macro that
> evaluated isinf(val) before it evaluated the inf_is_valid condition.
> Now we have check_float[48]_val which do it the other way around.
> That would be okay if the inf_is_valid condition were cheap to
> evaluate, but in common code paths it's actually twice as expensive
> as isinf().

Well, if the previous coding was a deliberate attempt to dodge this
performance issue, the evidence seems to be well-concealed. Neither
the comments for that macro nor the related commit messages make any
mention of it. When subtle things like this are performance-critical,
good comments are pretty critical, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2020-02-06 11:03:51 -0500, Tom Lane wrote:
> Andres seems to be of the opinion that the compiler should be willing
> to ignore the semantic requirements of the C standard in order
> to rearrange the code back into the cheaper order.  That sounds like
> wishful thinking to me ... even if it actually works on his compiler,
> it certainly isn't going to work for everyone.

Sorry, but, uh, what are you talking about?  Please tell me which single
standards violation I'm advocating for?

I was asking about the inlining bit because the first email of the topic
explained that as the problem, which I don't believe can be the full
explanation - and it turns out it isn't. As Amit Langote's followup
email explained, there's the whole issue of the order of checks being
inverted - which is clearly bad. And wholly unrelated to inlining.

And I asked about __isinf() being used because there are issues with
accidentally ending up with the non-intrinsic version of isinf() when
not using gcc, due to badly written standard library headers.


> The patch looks unduly invasive to me, but I think that it might be
> right that we should go back to a macro-based implementation, because
> otherwise we don't have a good way to be certain that the function
> parameter won't get evaluated first.

I'd first like to see some actual evidence of this being a problem,
rather than just the order of the checks.


> (Another reason to do so is so that the file/line numbers generated
> for the error reports go back to being at least a little bit useful.)
> We could use local variables within the macro to avoid double evals,
> if anyone thinks that's actually important --- I don't.

I don't think that's necessarily a good idea. In fact, I think we should
probably do the exact opposite, and move the error messages further out
of line. All these otherwise very small functions having their own
ereports makes them much bigger. Our low code density, and the resulting
rate of itlb misses, is pretty significant cost (cf [1]).

master:
   text   data    bss    dec    hex filename
  36124     44     65  36233   8d89 float.o
error messages moved out of line:
   text   data    bss    dec    hex filename
  32883     44     65  32992   80e0 float.o

Taking int4pl as an example - solely because it is simpler assembly to
look at - we get:

master:
   0x00000000004ac190 <+0>: mov    0x30(%rdi),%rax
   0x00000000004ac194 <+4>: add    0x20(%rdi),%eax
   0x00000000004ac197 <+7>: jo     0x4ac19c <int4pl+12>
   0x00000000004ac199 <+9>: cltq
   0x00000000004ac19b <+11>: retq
   0x00000000004ac19c <+12>: push   %rbp
   0x00000000004ac19d <+13>: lea    0x1a02c4(%rip),%rsi        # 0x64c468
   0x00000000004ac1a4 <+20>: xor    %r8d,%r8d
   0x00000000004ac1a7 <+23>: lea    0x265da1(%rip),%rcx        # 0x711f4f <__func__.26823>
   0x00000000004ac1ae <+30>: mov    $0x30b,%edx
   0x00000000004ac1b3 <+35>: mov    $0x14,%edi
   0x00000000004ac1b8 <+40>: callq  0x586060 <errstart>
   0x00000000004ac1bd <+45>: lea    0x147e0e(%rip),%rdi        # 0x5f3fd2
   0x00000000004ac1c4 <+52>: xor    %eax,%eax
   0x00000000004ac1c6 <+54>: callq  0x5896a0 <errmsg>
   0x00000000004ac1cb <+59>: mov    $0x3000082,%edi
   0x00000000004ac1d0 <+64>: mov    %eax,%ebp
   0x00000000004ac1d2 <+66>: callq  0x589540 <errcode>
   0x00000000004ac1d7 <+71>: mov    %eax,%edi
   0x00000000004ac1d9 <+73>: mov    %ebp,%esi
   0x00000000004ac1db <+75>: xor    %eax,%eax
   0x00000000004ac1dd <+77>: callq  0x588fb0 <errfinish>

out-of-line error:
   0x00000000004b04e0 <+0>: mov    0x30(%rdi),%rax
   0x00000000004b04e4 <+4>: add    0x20(%rdi),%eax
   0x00000000004b04e7 <+7>: jo     0x4b04ec <int4pl+12>
   0x00000000004b04e9 <+9>: cltq
   0x00000000004b04eb <+11>: retq
   0x00000000004b04ec <+12>: push   %rax
   0x00000000004b04ed <+13>: callq  0x115e17 <out_of_range_err>

With the out-of-line error, we can fit multiple of these functions into one
cache line. With the inline error, not even one.

Greetings,

Andres Freund

[1] https://twitter.com/AndresFreundTec/status/1214305610172289024


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

keisuke kuroda
Hi,

I have been testing with newer compiler (clang-7)
and result is a bit different at least with clang-7.
Compiling PG 12.1 (even without patch) with clang-7
results in __isinf() no longer being a bottleneck,
that is, you don't see it in profiler at all.

So, there is no issue for people who use the modern clang toolchain,
but maybe that's not everyone.
So there would still be some interest in doing something about this.

* clang

bash-4.2$ which clang
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang

bash-4.2$ clang -v
clang version 7.0.1 (tags/RELEASE_701/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rh/llvm-toolset-7.0/root/usr/bin
Found candidate GCC installation: /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7
Found candidate GCC installation: /opt/rh/devtoolset-8/root/usr/lib/gcc/x86_64-redhat-linux/8
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Selected GCC installation: /opt/rh/devtoolset-8/root/usr/lib/gcc/x86_64-redhat-linux/8
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64

** pg_config

---
CONFIGURE = '--prefix=/var/lib/pgsql/pgsql/12.1' 'CC=/opt/rh/llvm-toolset-7.0/root/usr/bin/clang' 'PKG_CONFIG_PATH=/opt/rh/llvm-toolset-7.0/root/usr/lib64/pkgconfig'
CC = /opt/rh/llvm-toolset-7.0/root/usr/bin/clang
---

* result(PostgreSQL 12.1 (even without patch))

postgres=# EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
 select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
 from realtest;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.realtest  (cost=0.00..288697.59 rows=10000115 width=40) (actual time=0.012..3878.284 rows=10000001 loops=1)
   Output: ('2'::double precision * a), ('2'::double precision * b), ('2'::double precision * c), ('2'::double precision * d), ('2'::double precision * e)
   Buffers: shared hit=63695
 Planning Time: 0.038 ms
 Execution Time: 4533.767 ms
(5 rows)

Samples: 5K of event 'cpu-clock', Event count (approx.): 1275000000
Overhead  Command   Shared Object      Symbol
  33.92%  postgres  postgres           [.] ExecInterpExpr
  13.27%  postgres  postgres           [.] float84mul
  10.86%  postgres  [vdso]             [.] __vdso_clock_gettime
   5.49%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
   3.96%  postgres  postgres           [.] ExecScan
   3.25%  postgres  libc-2.17.so       [.] __clock_gettime
   3.16%  postgres  postgres           [.] heap_getnextslot
   2.41%  postgres  postgres           [.] tts_virtual_clear
   2.39%  postgres  postgres           [.] SeqNext
   2.22%  postgres  postgres           [.] InstrStopNode

Best Regards,
Keisuke Kuroda

2020年2月7日(金) 3:48 Andres Freund <[hidden email]>:
Hi,

On 2020-02-06 11:03:51 -0500, Tom Lane wrote:
> Andres seems to be of the opinion that the compiler should be willing
> to ignore the semantic requirements of the C standard in order
> to rearrange the code back into the cheaper order.  That sounds like
> wishful thinking to me ... even if it actually works on his compiler,
> it certainly isn't going to work for everyone.

Sorry, but, uh, what are you talking about?  Please tell me which single
standards violation I'm advocating for?

I was asking about the inlining bit because the first email of the topic
explained that as the problem, which I don't believe can be the full
explanation - and it turns out it isn't. As Amit Langote's followup
email explained, there's the whole issue of the order of checks being
inverted - which is clearly bad. And wholly unrelated to inlining.

And I asked about __isinf() being used because there are issues with
accidentally ending up with the non-intrinsic version of isinf() when
not using gcc, due to badly written standard library headers.


> The patch looks unduly invasive to me, but I think that it might be
> right that we should go back to a macro-based implementation, because
> otherwise we don't have a good way to be certain that the function
> parameter won't get evaluated first.

I'd first like to see some actual evidence of this being a problem,
rather than just the order of the checks.


> (Another reason to do so is so that the file/line numbers generated
> for the error reports go back to being at least a little bit useful.)
> We could use local variables within the macro to avoid double evals,
> if anyone thinks that's actually important --- I don't.

I don't think that's necessarily a good idea. In fact, I think we should
probably do the exact opposite, and move the error messages further out
of line. All these otherwise very small functions having their own
ereports makes them much bigger. Our low code density, and the resulting
rate of itlb misses, is pretty significant cost (cf [1]).

master:
   text    data     bss     dec     hex filename
  36124      44      65   36233    8d89 float.o
error messages moved out of line:
   text    data     bss     dec     hex filename
  32883      44      65   32992    80e0 float.o

Taking int4pl as an example - solely because it is simpler assembly to
look at - we get:

master:
   0x00000000004ac190 <+0>:     mov    0x30(%rdi),%rax
   0x00000000004ac194 <+4>:     add    0x20(%rdi),%eax
   0x00000000004ac197 <+7>:     jo     0x4ac19c <int4pl+12>
   0x00000000004ac199 <+9>:     cltq
   0x00000000004ac19b <+11>:    retq
   0x00000000004ac19c <+12>:    push   %rbp
   0x00000000004ac19d <+13>:    lea    0x1a02c4(%rip),%rsi        # 0x64c468
   0x00000000004ac1a4 <+20>:    xor    %r8d,%r8d
   0x00000000004ac1a7 <+23>:    lea    0x265da1(%rip),%rcx        # 0x711f4f <__func__.26823>
   0x00000000004ac1ae <+30>:    mov    $0x30b,%edx
   0x00000000004ac1b3 <+35>:    mov    $0x14,%edi
   0x00000000004ac1b8 <+40>:    callq  0x586060 <errstart>
   0x00000000004ac1bd <+45>:    lea    0x147e0e(%rip),%rdi        # 0x5f3fd2
   0x00000000004ac1c4 <+52>:    xor    %eax,%eax
   0x00000000004ac1c6 <+54>:    callq  0x5896a0 <errmsg>
   0x00000000004ac1cb <+59>:    mov    $0x3000082,%edi
   0x00000000004ac1d0 <+64>:    mov    %eax,%ebp
   0x00000000004ac1d2 <+66>:    callq  0x589540 <errcode>
   0x00000000004ac1d7 <+71>:    mov    %eax,%edi
   0x00000000004ac1d9 <+73>:    mov    %ebp,%esi
   0x00000000004ac1db <+75>:    xor    %eax,%eax
   0x00000000004ac1dd <+77>:    callq  0x588fb0 <errfinish>

out-of-line error:
   0x00000000004b04e0 <+0>:     mov    0x30(%rdi),%rax
   0x00000000004b04e4 <+4>:     add    0x20(%rdi),%eax
   0x00000000004b04e7 <+7>:     jo     0x4b04ec <int4pl+12>
   0x00000000004b04e9 <+9>:     cltq
   0x00000000004b04eb <+11>:    retq
   0x00000000004b04ec <+12>:    push   %rax
   0x00000000004b04ed <+13>:    callq  0x115e17 <out_of_range_err>

With the out-of-line error, we can fit multiple of these functions into one
cache line. With the inline error, not even one.

Greetings,

Andres Freund

[1] https://twitter.com/AndresFreundTec/status/1214305610172289024
Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Andres Freund
Hi,

On February 6, 2020 11:42:30 PM PST, keisuke kuroda <[hidden email]> wrote:
>Hi,
>
>I have been testing with newer compiler (clang-7)
>and result is a bit different at least with clang-7.
>Compiling PG 12.1 (even without patch) with clang-7
>results in __isinf() no longer being a bottleneck,
>that is, you don't see it in profiler at all.

I don't think that's necessarily the right conclusion. What's quite possibly happening is that you do not see the external isinf function anymore, because it is implemented as an intrinsic,  but that there still are more computations being done. Due to the changed order of the isinf checks. You'd have to compare with 11 using the same compiler.

Andres


>* result(PostgreSQL 12.1 (even without patch))
>
>postgres=# EXPLAIN (ANALYZE on, VERBOSE on, BUFFERS on)
> select (2 * a) , (2 * b) , (2 * c), (2 * d),  (2 * e)
> from realtest;
>
>QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------
>Seq Scan on public.realtest  (cost=0.00..288697.59 rows=10000115
>width=40)
>(actual time=0.012..3878.284 rows=10000001 loops=1)
>   Output: ('2'::double precision * a), ('2'::double precision * b),
>('2'::double precision * c), ('2'::double precision * d), ('2'::double
>precision * e)
>   Buffers: shared hit=63695
> Planning Time: 0.038 ms
> Execution Time: 4533.767 ms
>(5 rows)
>
>Samples: 5K of event 'cpu-clock', Event count (approx.): 1275000000
>Overhead  Command   Shared Object      Symbol
>  33.92%  postgres  postgres           [.] ExecInterpExpr
>  13.27%  postgres  postgres           [.] float84mul
>  10.86%  postgres  [vdso]             [.] __vdso_clock_gettime
>   5.49%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
>   3.96%  postgres  postgres           [.] ExecScan
>   3.25%  postgres  libc-2.17.so       [.] __clock_gettime
>   3.16%  postgres  postgres           [.] heap_getnextslot
>   2.41%  postgres  postgres           [.] tts_virtual_clear
>   2.39%  postgres  postgres           [.] SeqNext
>   2.22%  postgres  postgres           [.] InstrStopNode
>
>Best Regards,
>Keisuke Kuroda
>
>2020年2月7日(金) 3:48 Andres Freund <[hidden email]>:
>
>> Hi,
>>
>> On 2020-02-06 11:03:51 -0500, Tom Lane wrote:
>> > Andres seems to be of the opinion that the compiler should be
>willing
>> > to ignore the semantic requirements of the C standard in order
>> > to rearrange the code back into the cheaper order.  That sounds
>like
>> > wishful thinking to me ... even if it actually works on his
>compiler,
>> > it certainly isn't going to work for everyone.
>>
>> Sorry, but, uh, what are you talking about?  Please tell me which
>single
>> standards violation I'm advocating for?
>>
>> I was asking about the inlining bit because the first email of the
>topic
>> explained that as the problem, which I don't believe can be the full
>> explanation - and it turns out it isn't. As Amit Langote's followup
>> email explained, there's the whole issue of the order of checks being
>> inverted - which is clearly bad. And wholly unrelated to inlining.
>>
>> And I asked about __isinf() being used because there are issues with
>> accidentally ending up with the non-intrinsic version of isinf() when
>> not using gcc, due to badly written standard library headers.
>>
>>
>> > The patch looks unduly invasive to me, but I think that it might be
>> > right that we should go back to a macro-based implementation,
>because
>> > otherwise we don't have a good way to be certain that the function
>> > parameter won't get evaluated first.
>>
>> I'd first like to see some actual evidence of this being a problem,
>> rather than just the order of the checks.
>>
>>
>> > (Another reason to do so is so that the file/line numbers generated
>> > for the error reports go back to being at least a little bit
>useful.)
>> > We could use local variables within the macro to avoid double
>evals,
>> > if anyone thinks that's actually important --- I don't.
>>
>> I don't think that's necessarily a good idea. In fact, I think we
>should
>> probably do the exact opposite, and move the error messages further
>out
>> of line. All these otherwise very small functions having their own
>> ereports makes them much bigger. Our low code density, and the
>resulting
>> rate of itlb misses, is pretty significant cost (cf [1]).
>>
>> master:
>>    text    data     bss     dec     hex filename
>>   36124      44      65   36233    8d89 float.o
>> error messages moved out of line:
>>    text    data     bss     dec     hex filename
>>   32883      44      65   32992    80e0 float.o
>>
>> Taking int4pl as an example - solely because it is simpler assembly
>to
>> look at - we get:
>>
>> master:
>>    0x00000000004ac190 <+0>:     mov    0x30(%rdi),%rax
>>    0x00000000004ac194 <+4>:     add    0x20(%rdi),%eax
>>    0x00000000004ac197 <+7>:     jo     0x4ac19c <int4pl+12>
>>    0x00000000004ac199 <+9>:     cltq
>>    0x00000000004ac19b <+11>:    retq
>>    0x00000000004ac19c <+12>:    push   %rbp
>>    0x00000000004ac19d <+13>:    lea    0x1a02c4(%rip),%rsi        #
>> 0x64c468
>>    0x00000000004ac1a4 <+20>:    xor    %r8d,%r8d
>>    0x00000000004ac1a7 <+23>:    lea    0x265da1(%rip),%rcx        #
>> 0x711f4f <__func__.26823>
>>    0x00000000004ac1ae <+30>:    mov    $0x30b,%edx
>>    0x00000000004ac1b3 <+35>:    mov    $0x14,%edi
>>    0x00000000004ac1b8 <+40>:    callq  0x586060 <errstart>
>>    0x00000000004ac1bd <+45>:    lea    0x147e0e(%rip),%rdi        #
>> 0x5f3fd2
>>    0x00000000004ac1c4 <+52>:    xor    %eax,%eax
>>    0x00000000004ac1c6 <+54>:    callq  0x5896a0 <errmsg>
>>    0x00000000004ac1cb <+59>:    mov    $0x3000082,%edi
>>    0x00000000004ac1d0 <+64>:    mov    %eax,%ebp
>>    0x00000000004ac1d2 <+66>:    callq  0x589540 <errcode>
>>    0x00000000004ac1d7 <+71>:    mov    %eax,%edi
>>    0x00000000004ac1d9 <+73>:    mov    %ebp,%esi
>>    0x00000000004ac1db <+75>:    xor    %eax,%eax
>>    0x00000000004ac1dd <+77>:    callq  0x588fb0 <errfinish>
>>
>> out-of-line error:
>>    0x00000000004b04e0 <+0>:     mov    0x30(%rdi),%rax
>>    0x00000000004b04e4 <+4>:     add    0x20(%rdi),%eax
>>    0x00000000004b04e7 <+7>:     jo     0x4b04ec <int4pl+12>
>>    0x00000000004b04e9 <+9>:     cltq
>>    0x00000000004b04eb <+11>:    retq
>>    0x00000000004b04ec <+12>:    push   %rax
>>    0x00000000004b04ed <+13>:    callq  0x115e17 <out_of_range_err>
>>
>> With the out-of-line error, we can fit multiple of these functions
>into one
>> cache line. With the inline error, not even one.
>>
>> Greetings,
>>
>> Andres Freund
>>
>> [1] https://twitter.com/AndresFreundTec/status/1214305610172289024
>>

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Amit Langote
On Fri, Feb 7, 2020 at 4:54 PM Andres Freund <[hidden email]> wrote:

> On February 6, 2020 11:42:30 PM PST, keisuke kuroda <[hidden email]> wrote:
> >Hi,
> >
> >I have been testing with newer compiler (clang-7)
> >and result is a bit different at least with clang-7.
> >Compiling PG 12.1 (even without patch) with clang-7
> >results in __isinf() no longer being a bottleneck,
> >that is, you don't see it in profiler at all.
>
> I don't think that's necessarily the right conclusion. What's quite possibly happening is that you do not see the external isinf function anymore, because it is implemented as an intrinsic,  but that there still are more computations being done. Due to the changed order of the isinf checks. You'd have to compare with 11 using the same compiler.

I did some tests using two relatively recent compilers: gcc 8 and
clang-7 and here are the results:

Setup:

create table realtest (a real, b real, c real, d real, e real);
insert into realtest select i, i, i, i, i from generate_series(1, 1000000) i;

Test query:

/tmp/query.sql
select avg(2*dsqrt(a)), avg(2*dsqrt(b)), avg(2*dsqrt(c)),
avg(2*dsqrt(d)), avg(2*dsqrt(e)) from realtest;

pgbench -n -T 60 -f /tmp/query.sql

Latency and profiling results:

gcc 8 (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3))
====

11.6

latency average = 463.968 ms

    40.62%  postgres  postgres           [.] ExecInterpExpr
     9.74%  postgres  postgres           [.] float8_accum
     6.12%  postgres  libc-2.17.so       [.] __isinf
     5.96%  postgres  postgres           [.] float8mul
     5.33%  postgres  postgres           [.] dsqrt
     3.90%  postgres  postgres           [.] ftod
     3.53%  postgres  postgres           [.] Float8GetDatum
     2.34%  postgres  postgres           [.] DatumGetFloat8
     2.15%  postgres  postgres           [.] AggCheckCallContext
     2.03%  postgres  postgres           [.] slot_deform_tuple
     1.95%  postgres  libm-2.17.so       [.] __sqrt
     1.19%  postgres  postgres           [.] check_float8_array

HEAD

latency average = 549.071 ms

    31.74%  postgres  postgres           [.] ExecInterpExpr
    11.02%  postgres  libc-2.17.so       [.] __isinf
    10.58%  postgres  postgres           [.] float8_accum
     4.84%  postgres  postgres           [.] check_float8_val
     4.66%  postgres  postgres           [.] dsqrt
     3.91%  postgres  postgres           [.] float8mul
     3.56%  postgres  postgres           [.] ftod
     3.26%  postgres  postgres           [.] Float8GetDatum
     2.91%  postgres  postgres           [.] float8_mul
     2.30%  postgres  postgres           [.] DatumGetFloat8
     2.19%  postgres  postgres           [.] slot_deform_heap_tuple
     1.81%  postgres  postgres           [.] AggCheckCallContext
     1.31%  postgres  libm-2.17.so       [.] __sqrt
     1.25%  postgres  postgres           [.] check_float8_array

HEAD + patch

latency average = 546.624 ms

    33.51%  postgres  postgres           [.] ExecInterpExpr
    10.35%  postgres  postgres           [.] float8_accum
    10.06%  postgres  libc-2.17.so       [.] __isinf
     4.58%  postgres  postgres           [.] dsqrt
     4.14%  postgres  postgres           [.] check_float8_val
     4.03%  postgres  postgres           [.] ftod
     3.54%  postgres  postgres           [.] float8mul
     2.96%  postgres  postgres           [.] Float8GetDatum
     2.38%  postgres  postgres           [.] slot_deform_heap_tuple
     2.23%  postgres  postgres           [.] DatumGetFloat8
     2.09%  postgres  postgres           [.] float8_mul
     1.88%  postgres  postgres           [.] AggCheckCallContext
     1.65%  postgres  libm-2.17.so       [.] __sqrt
     1.22%  postgres  postgres           [.] check_float8_array


clang-7 (clang version 7.0.1 (tags/RELEASE_701/final))
=====

11.6

latency average = 419.014 ms

    47.57%  postgres  postgres           [.] ExecInterpExpr
     7.99%  postgres  postgres           [.] float8_accum
     5.96%  postgres  postgres           [.] dsqrt
     4.88%  postgres  postgres           [.] float8mul
     4.23%  postgres  postgres           [.] ftod
     3.30%  postgres  postgres           [.] slot_deform_tuple
     3.19%  postgres  postgres           [.] DatumGetFloat8
     1.92%  postgres  libm-2.17.so       [.] __sqrt
     1.72%  postgres  postgres           [.] check_float8_array

HEAD

latency average = 452.958 ms

    40.55%  postgres  postgres           [.] ExecInterpExpr
    10.61%  postgres  postgres           [.] float8_accum
     4.58%  postgres  postgres           [.] dsqrt
     3.59%  postgres  postgres           [.] slot_deform_heap_tuple
     3.54%  postgres  postgres           [.] check_float8_val
     3.48%  postgres  postgres           [.] ftod
     3.42%  postgres  postgres           [.] float8mul
     3.22%  postgres  postgres           [.] DatumGetFloat8
     2.69%  postgres  postgres           [.] Float8GetDatum
     2.46%  postgres  postgres           [.] float8_mul
     2.29%  postgres  libm-2.17.so       [.] __sqrt
     1.47%  postgres  postgres           [.] check_float8_array

HEAD + patch

latency average = 452.533 ms

    41.05%  postgres  postgres           [.] ExecInterpExpr
    10.15%  postgres  postgres           [.] float8_accum
     5.62%  postgres  postgres           [.] dsqrt
     3.86%  postgres  postgres           [.] check_float8_val
     3.27%  postgres  postgres           [.] float8mul
     3.09%  postgres  postgres           [.] slot_deform_heap_tuple
     2.91%  postgres  postgres           [.] ftod
     2.88%  postgres  postgres           [.] DatumGetFloat8
     2.62%  postgres  postgres           [.] float8_mul
     2.03%  postgres  libm-2.17.so       [.] __sqrt
     2.00%  postgres  postgres           [.] check_float8_array

The patch mentioned above is this:

diff --git a/src/include/utils/float.h b/src/include/utils/float.h
index e2c5dc0f57..dc97d19293 100644
--- a/src/include/utils/float.h
+++ b/src/include/utils/float.h
@@ -136,12 +136,12 @@ static inline void
 check_float4_val(const float4 val, const bool inf_is_valid,
                  const bool zero_is_valid)
 {
-    if (!inf_is_valid && unlikely(isinf(val)))
+    if (unlikely(isinf(val)) && !inf_is_valid)
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("value out of range: overflow")));

-    if (!zero_is_valid && unlikely(val == 0.0))
+    if (unlikely(val == 0.0) && !zero_is_valid)
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("value out of range: underflow")));
@@ -151,12 +151,12 @@ static inline void
 check_float8_val(const float8 val, const bool inf_is_valid,
                  const bool zero_is_valid)
 {
-    if (!inf_is_valid && unlikely(isinf(val)))
+    if (unlikely(isinf(val)) && !inf_is_valid)
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("value out of range: overflow")));

-    if (!zero_is_valid && unlikely(val == 0.0))
+    if (unlikely(val == 0.0) && !zero_is_valid)
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("value out of range: underflow")));

So, the patch appears to do very little here. I can only conclude that
the check_float{8|4}_val() (PG 12) is slower than CHECKFLOATVAL() (PG
11) due to arguments being evaluated first.  It's entirely possible
though that the patch shown above is not enough.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Amit Langote
Fwiw, also tried the patch that Kuroda-san had posted yesterday.

On Fri, Feb 7, 2020 at 5:17 PM Amit Langote <[hidden email]> wrote:

> Latency and profiling results:
>
> gcc 8 (gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3))
> ====
>
> 11.6
>
> latency average = 463.968 ms
>
>     40.62%  postgres  postgres           [.] ExecInterpExpr
>      9.74%  postgres  postgres           [.] float8_accum
>      6.12%  postgres  libc-2.17.so       [.] __isinf
>      5.96%  postgres  postgres           [.] float8mul
>      5.33%  postgres  postgres           [.] dsqrt
>      3.90%  postgres  postgres           [.] ftod
>      3.53%  postgres  postgres           [.] Float8GetDatum
>      2.34%  postgres  postgres           [.] DatumGetFloat8
>      2.15%  postgres  postgres           [.] AggCheckCallContext
>      2.03%  postgres  postgres           [.] slot_deform_tuple
>      1.95%  postgres  libm-2.17.so       [.] __sqrt
>      1.19%  postgres  postgres           [.] check_float8_array
>
> HEAD
>
> latency average = 549.071 ms
>
>     31.74%  postgres  postgres           [.] ExecInterpExpr
>     11.02%  postgres  libc-2.17.so       [.] __isinf
>     10.58%  postgres  postgres           [.] float8_accum
>      4.84%  postgres  postgres           [.] check_float8_val
>      4.66%  postgres  postgres           [.] dsqrt
>      3.91%  postgres  postgres           [.] float8mul
>      3.56%  postgres  postgres           [.] ftod
>      3.26%  postgres  postgres           [.] Float8GetDatum
>      2.91%  postgres  postgres           [.] float8_mul
>      2.30%  postgres  postgres           [.] DatumGetFloat8
>      2.19%  postgres  postgres           [.] slot_deform_heap_tuple
>      1.81%  postgres  postgres           [.] AggCheckCallContext
>      1.31%  postgres  libm-2.17.so       [.] __sqrt
>      1.25%  postgres  postgres           [.] check_float8_array
>
> HEAD + patch
>
> latency average = 546.624 ms
>
>     33.51%  postgres  postgres           [.] ExecInterpExpr
>     10.35%  postgres  postgres           [.] float8_accum
>     10.06%  postgres  libc-2.17.so       [.] __isinf
>      4.58%  postgres  postgres           [.] dsqrt
>      4.14%  postgres  postgres           [.] check_float8_val
>      4.03%  postgres  postgres           [.] ftod
>      3.54%  postgres  postgres           [.] float8mul
>      2.96%  postgres  postgres           [.] Float8GetDatum
>      2.38%  postgres  postgres           [.] slot_deform_heap_tuple
>      2.23%  postgres  postgres           [.] DatumGetFloat8
>      2.09%  postgres  postgres           [.] float8_mul
>      1.88%  postgres  postgres           [.] AggCheckCallContext
>      1.65%  postgres  libm-2.17.so       [.] __sqrt
>      1.22%  postgres  postgres           [.] check_float8_array

HEAD + Kuroda-san's patch (compiled with gcc 8)

latency average = 484.604 ms

    37.41%  postgres  postgres            [.] ExecInterpExpr
    10.83%  postgres  postgres            [.] float8_accum
     5.62%  postgres  postgres            [.] dsqrt
     4.23%  postgres  libc-2.17.so        [.] __isinf
     4.05%  postgres  postgres            [.] float8mul
     3.85%  postgres  postgres            [.] ftod
     3.18%  postgres  postgres            [.] Float8GetDatum
     2.81%  postgres  postgres            [.] slot_deform_heap_tuple
     2.63%  postgres  postgres            [.] DatumGetFloat8
     2.46%  postgres  postgres            [.] float8_mul
     1.91%  postgres  libm-2.17.so        [.] __sqrt

> clang-7 (clang version 7.0.1 (tags/RELEASE_701/final))
> =====
>
> 11.6
>
> latency average = 419.014 ms
>
>     47.57%  postgres  postgres           [.] ExecInterpExpr
>      7.99%  postgres  postgres           [.] float8_accum
>      5.96%  postgres  postgres           [.] dsqrt
>      4.88%  postgres  postgres           [.] float8mul
>      4.23%  postgres  postgres           [.] ftod
>      3.30%  postgres  postgres           [.] slot_deform_tuple
>      3.19%  postgres  postgres           [.] DatumGetFloat8
>      1.92%  postgres  libm-2.17.so       [.] __sqrt
>      1.72%  postgres  postgres           [.] check_float8_array
>
> HEAD
>
> latency average = 452.958 ms
>
>     40.55%  postgres  postgres           [.] ExecInterpExpr
>     10.61%  postgres  postgres           [.] float8_accum
>      4.58%  postgres  postgres           [.] dsqrt
>      3.59%  postgres  postgres           [.] slot_deform_heap_tuple
>      3.54%  postgres  postgres           [.] check_float8_val
>      3.48%  postgres  postgres           [.] ftod
>      3.42%  postgres  postgres           [.] float8mul
>      3.22%  postgres  postgres           [.] DatumGetFloat8
>      2.69%  postgres  postgres           [.] Float8GetDatum
>      2.46%  postgres  postgres           [.] float8_mul
>      2.29%  postgres  libm-2.17.so       [.] __sqrt
>      1.47%  postgres  postgres           [.] check_float8_array
>
> HEAD + patch
>
> latency average = 452.533 ms
>
>     41.05%  postgres  postgres           [.] ExecInterpExpr
>     10.15%  postgres  postgres           [.] float8_accum
>      5.62%  postgres  postgres           [.] dsqrt
>      3.86%  postgres  postgres           [.] check_float8_val
>      3.27%  postgres  postgres           [.] float8mul
>      3.09%  postgres  postgres           [.] slot_deform_heap_tuple
>      2.91%  postgres  postgres           [.] ftod
>      2.88%  postgres  postgres           [.] DatumGetFloat8
>      2.62%  postgres  postgres           [.] float8_mul
>      2.03%  postgres  libm-2.17.so       [.] __sqrt
>      2.00%  postgres  postgres           [.] check_float8_array

HEAD + Kuroda-san's patch (compiled with clang-7)

latency average = 435.454 ms

    43.02%  postgres  postgres           [.] ExecInterpExpr
    10.86%  postgres  postgres           [.] float8_accum
     3.97%  postgres  postgres           [.] dsqrt
     3.97%  postgres  postgres           [.] float8mul
     3.51%  postgres  postgres           [.] ftod
     3.42%  postgres  postgres           [.] slot_deform_heap_tuple
     3.36%  postgres  postgres           [.] DatumGetFloat8
     1.97%  postgres  libm-2.17.so       [.] __sqrt
     1.97%  postgres  postgres           [.] check_float8_array
     1.88%  postgres  postgres           [.] float8_mul

Needless to say, that one makes a visible difference, although still
slower compared to PG 11.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

hasegeli
> Fwiw, also tried the patch that Kuroda-san had posted yesterday.

I run the same test case too:

clang version 7.0.0:

HEAD 2548.119 ms
with patch 2320.974 ms

clang version 8.0.0:

HEAD 2431.766 ms
with patch 2419.439 ms

clang version 9.0.0:

HEAD 2477.493 ms
with patch 2365.509 ms

gcc version 7.4.0:

HEAD 2451.261 ms
with patch 2343.393 ms

gcc version 8.3.0:

HEAD 2540.626 ms
with patch 2299.653 ms


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

hasegeli
In reply to this post by Andres Freund
> > The patch looks unduly invasive to me, but I think that it might be
> > right that we should go back to a macro-based implementation, because
> > otherwise we don't have a good way to be certain that the function
> > parameter won't get evaluated first.
>
> I'd first like to see some actual evidence of this being a problem,
> rather than just the order of the checks.

There seem to be enough evidence of this being the problem.  We are
better off going back to the macro-based implementation.  I polished
Keisuke Kuroda's patch commenting about the performance issue, removed
the check_float*_val() functions completely, and added unlikely() as
Tom Lane suggested.  It is attached.  I confirmed with different
compilers that the macro, and unlikely() makes this noticeably faster.

0001-Bring-back-CHECKFLOATVAL-macro-v01.patch (33K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Tels
Moin,

On 2020-02-07 15:42, Emre Hasegeli wrote:

>> > The patch looks unduly invasive to me, but I think that it might be
>> > right that we should go back to a macro-based implementation, because
>> > otherwise we don't have a good way to be certain that the function
>> > parameter won't get evaluated first.
>>
>> I'd first like to see some actual evidence of this being a problem,
>> rather than just the order of the checks.
>
> There seem to be enough evidence of this being the problem.  We are
> better off going back to the macro-based implementation.  I polished
> Keisuke Kuroda's patch commenting about the performance issue, removed
> the check_float*_val() functions completely, and added unlikely() as
> Tom Lane suggested.  It is attached.  I confirmed with different
> compilers that the macro, and unlikely() makes this noticeably faster.
Hm, the diff has the macro tests as:

  + if (unlikely(isinf(val) && !(inf_is_valid)))
  ...
  +      if (unlikely((val) == 0.0 && !(zero_is_valid)))

But the comment does not explain that this test has to be in that
order, or the compiler will for non-constant arguments evalute
the (now) right-side first. E.g. if I understand this correctly:

  +      if (!(zero_is_valid) && unlikely((val) == 0.0)

would have the same problem of evaluating "zero_is_valid" (which
might be an isinf(arg1) || isinf(arg2)) first and so be the same thing
we try to avoid with the macro? Maybe adding this bit of info to the
comment makes it clearer?

Also, a few places use the macro as:

  + CHECKFLOATVAL(result, true, true);

which evaluates to a complete NOP in both cases. IMHO this could be
replaced with a comment like:

  + // No CHECKFLOATVAL() needed, as both inf and 0.0 are valid

(or something along the lines of "no error can occur"), as otherwise
CHECKFLOATVAL() implies to the casual reader that there are some checks
done, while in reality no real checks are done at all (and hopefully
the compiler optimizes everything away, which might not be true for
debug builds).

--
Best regards,

Tels

0001-Bring-back-CHECKFLOATVAL-macro-v01.patch (33K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Andres Freund
In reply to this post by Amit Langote
Hi,

On 2020-02-07 17:17:21 +0900, Amit Langote wrote:
> I did some tests using two relatively recent compilers: gcc 8 and
> clang-7 and here are the results:

Hm, these very much look like they've been done in an unoptimized build?

>     40.62%  postgres  postgres           [.] ExecInterpExpr
>      9.74%  postgres  postgres           [.] float8_accum
>      6.12%  postgres  libc-2.17.so       [.] __isinf
>      5.96%  postgres  postgres           [.] float8mul
>      5.33%  postgres  postgres           [.] dsqrt
>      3.90%  postgres  postgres           [.] ftod
>      3.53%  postgres  postgres           [.] Float8GetDatum
>      2.34%  postgres  postgres           [.] DatumGetFloat8
>      2.15%  postgres  postgres           [.] AggCheckCallContext
>      2.03%  postgres  postgres           [.] slot_deform_tuple
>      1.95%  postgres  libm-2.17.so       [.] __sqrt
>      1.19%  postgres  postgres           [.] check_float8_array

> HEAD
>
> latency average = 549.071 ms
>
>     31.74%  postgres  postgres           [.] ExecInterpExpr
>     11.02%  postgres  libc-2.17.so       [.] __isinf
>     10.58%  postgres  postgres           [.] float8_accum
>      4.84%  postgres  postgres           [.] check_float8_val
>      4.66%  postgres  postgres           [.] dsqrt
>      3.91%  postgres  postgres           [.] float8mul
>      3.56%  postgres  postgres           [.] ftod
>      3.26%  postgres  postgres           [.] Float8GetDatum
>      2.91%  postgres  postgres           [.] float8_mul
>      2.30%  postgres  postgres           [.] DatumGetFloat8
>      2.19%  postgres  postgres           [.] slot_deform_heap_tuple
>      1.81%  postgres  postgres           [.] AggCheckCallContext
>      1.31%  postgres  libm-2.17.so       [.] __sqrt
>      1.25%  postgres  postgres           [.] check_float8_array

Because DatumGetFloat8, Float8GetDatum, etc aren't functions that
normally stay separate.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Amit Langote
On Sat, Feb 8, 2020 at 3:13 AM Andres Freund <[hidden email]> wrote:

> On 2020-02-07 17:17:21 +0900, Amit Langote wrote:
> > I did some tests using two relatively recent compilers: gcc 8 and
> > clang-7 and here are the results:
>
> Hm, these very much look like they've been done in an unoptimized build?
>
> >     40.62%  postgres  postgres           [.] ExecInterpExpr
> >      9.74%  postgres  postgres           [.] float8_accum
> >      6.12%  postgres  libc-2.17.so       [.] __isinf
> >      5.96%  postgres  postgres           [.] float8mul
> >      5.33%  postgres  postgres           [.] dsqrt
> >      3.90%  postgres  postgres           [.] ftod
> >      3.53%  postgres  postgres           [.] Float8GetDatum
> >      2.34%  postgres  postgres           [.] DatumGetFloat8
> >      2.15%  postgres  postgres           [.] AggCheckCallContext
> >      2.03%  postgres  postgres           [.] slot_deform_tuple
> >      1.95%  postgres  libm-2.17.so       [.] __sqrt
> >      1.19%  postgres  postgres           [.] check_float8_array
>
> > HEAD
> >
> > latency average = 549.071 ms
> >
> >     31.74%  postgres  postgres           [.] ExecInterpExpr
> >     11.02%  postgres  libc-2.17.so       [.] __isinf
> >     10.58%  postgres  postgres           [.] float8_accum
> >      4.84%  postgres  postgres           [.] check_float8_val
> >      4.66%  postgres  postgres           [.] dsqrt
> >      3.91%  postgres  postgres           [.] float8mul
> >      3.56%  postgres  postgres           [.] ftod
> >      3.26%  postgres  postgres           [.] Float8GetDatum
> >      2.91%  postgres  postgres           [.] float8_mul
> >      2.30%  postgres  postgres           [.] DatumGetFloat8
> >      2.19%  postgres  postgres           [.] slot_deform_heap_tuple
> >      1.81%  postgres  postgres           [.] AggCheckCallContext
> >      1.31%  postgres  libm-2.17.so       [.] __sqrt
> >      1.25%  postgres  postgres           [.] check_float8_array
>
> Because DatumGetFloat8, Float8GetDatum, etc aren't functions that
> normally stay separate.

Okay, fair.

Here are numbers after compiling with -O3:

gcc 8
=====

HEAD

latency average = 350.187 ms

    34.67%  postgres  postgres           [.] ExecInterpExpr
    20.94%  postgres  libc-2.17.so       [.] __isinf
    10.74%  postgres  postgres           [.] float8_accum
     8.22%  postgres  postgres           [.] dsqrt
     6.63%  postgres  postgres           [.] float8mul
     3.45%  postgres  postgres           [.] ftod
     2.32%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs

HEAD + reverse-if-condition patch

latency average = 346.710 ms

    34.48%  postgres  postgres           [.] ExecInterpExpr
    21.00%  postgres  libc-2.17.so       [.] __isinf
    12.26%  postgres  postgres           [.] float8_accum
     8.31%  postgres  postgres           [.] dsqrt
     6.32%  postgres  postgres           [.] float8mul
     3.23%  postgres  postgres           [.] ftod
     2.25%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs

HEAD + revert-to-macro patch

latency average = 297.493 ms

    39.25%  postgres  postgres           [.] ExecInterpExpr
    14.44%  postgres  postgres           [.] float8_accum
    11.02%  postgres  libc-2.17.so       [.] __isinf
     8.21%  postgres  postgres           [.] dsqrt
     5.55%  postgres  postgres           [.] float8mul
     4.15%  postgres  postgres           [.] ftod
     2.78%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs

11.6

latency average = 290.301 ms

    42.78%  postgres  postgres            [.] ExecInterpExpr
    12.27%  postgres  postgres            [.] float8_accum
    12.12%  postgres  libc-2.17.so        [.] __isinf
     8.96%  postgres  postgres            [.] dsqrt
     5.77%  postgres  postgres            [.] float8mul
     3.94%  postgres  postgres            [.] ftod
     2.61%  postgres  postgres            [.] AggCheckCallContext


clang-7
=======

HEAD

latency average = 246.278 ms

    44.47%  postgres  postgres           [.] ExecInterpExpr
    14.56%  postgres  postgres           [.] float8_accum
     7.25%  postgres  postgres           [.] float8mul
     7.22%  postgres  postgres           [.] dsqrt
     5.40%  postgres  postgres           [.] ftod
     4.09%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
     2.20%  postgres  postgres           [.] check_float8_val

HEAD + reverse-if-condition patch

latency average = 240.212 ms

    45.49%  postgres  postgres           [.] ExecInterpExpr
    13.69%  postgres  postgres           [.] float8_accum
     8.32%  postgres  postgres           [.] dsqrt
     5.28%  postgres  postgres           [.] ftod
     5.19%  postgres  postgres           [.] float8mul
     3.68%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
     2.90%  postgres  postgres           [.] float8_mul

HEAD + revert-to-macro patch

latency average = 240.620 ms

    44.04%  postgres  postgres           [.] ExecInterpExpr
    13.72%  postgres  postgres           [.] float8_accum
     9.26%  postgres  postgres           [.] dsqrt
     5.30%  postgres  postgres           [.] ftod
     4.66%  postgres  postgres           [.] float8mul
     3.53%  postgres  postgres           [.] tts_buffer_heap_getsomeattrs
     3.39%  postgres  postgres           [.] float8_mul

11.6

latency average = 237.045 ms

    46.85%  postgres  postgres            [.] ExecInterpExpr
    11.39%  postgres  postgres            [.] float8_accum
     8.02%  postgres  postgres            [.] dsqrt
     7.29%  postgres  postgres            [.] slot_deform_tuple
     6.04%  postgres  postgres            [.] float8mul
     5.49%  postgres  postgres            [.] ftod

PG 12 is worse than PG 11 when compiled with gcc.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Amit Langote
In reply to this post by hasegeli
On Fri, Feb 7, 2020 at 11:43 PM Emre Hasegeli <[hidden email]> wrote:

> > > The patch looks unduly invasive to me, but I think that it might be
> > > right that we should go back to a macro-based implementation, because
> > > otherwise we don't have a good way to be certain that the function
> > > parameter won't get evaluated first.
> >
> > I'd first like to see some actual evidence of this being a problem,
> > rather than just the order of the checks.
>
> There seem to be enough evidence of this being the problem.  We are
> better off going back to the macro-based implementation.  I polished
> Keisuke Kuroda's patch commenting about the performance issue, removed
> the check_float*_val() functions completely, and added unlikely() as
> Tom Lane suggested.  It is attached.  I confirmed with different
> compilers that the macro, and unlikely() makes this noticeably faster.

Thanks for updating the patch.

Should we update the same macro in contrib/btree_gist/btree_utils_num.h too?

Regards,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

hasegeli
In reply to this post by Tels
> But the comment does not explain that this test has to be in that
> order, or the compiler will for non-constant arguments evalute
> the (now) right-side first. E.g. if I understand this correctly:
>
>   +      if (!(zero_is_valid) && unlikely((val) == 0.0)
>
> would have the same problem of evaluating "zero_is_valid" (which
> might be an isinf(arg1) || isinf(arg2)) first and so be the same thing
> we try to avoid with the macro? Maybe adding this bit of info to the
> comment makes it clearer?

Added.

> Also, a few places use the macro as:
>
>   +     CHECKFLOATVAL(result, true, true);
>
> which evaluates to a complete NOP in both cases. IMHO this could be
> replaced with a comment like:
>
>   +     // No CHECKFLOATVAL() needed, as both inf and 0.0 are valid
>
> (or something along the lines of "no error can occur"), as otherwise
> CHECKFLOATVAL() implies to the casual reader that there are some checks
> done, while in reality no real checks are done at all (and hopefully
> the compiler optimizes everything away, which might not be true for
> debug builds).
I don't know why those trigonometric functions don't check for
overflow/underflow like all the rest of float.c.  I'll submit another
patch to make them error when overflow/underflow.

The new version is attached.

0001-Bring-back-CHECKFLOATVAL-macro-v02.patch (35K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

hasegeli
In reply to this post by Amit Langote
> Should we update the same macro in contrib/btree_gist/btree_utils_num.h too?

I posted another version incorporating this.


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

Andres Freund
In reply to this post by hasegeli
Hi,

On 2020-02-12 11:54:13 +0000, Emre Hasegeli wrote:

> From fb5052b869255ef9465b1de92e84b2fb66dd6eb3 Mon Sep 17 00:00:00 2001
> From: Emre Hasegeli <[hidden email]>
> Date: Fri, 7 Feb 2020 10:27:25 +0000
> Subject: [PATCH] Bring back CHECKFLOATVAL() macro
>
> The inline functions added by 6bf0bc842b caused the conditions of
> overflow/underflow checks to be evaluated when no overflow/underflow
> happen.  This slowed down floating point operations.  This commit brings
> back the macro that was in use before 6bf0bc842b to fix the performace
> regression.

Wait, no. Didn't we get to the point that we figured out that the
primary issue is the reversal of the order of what is checked is the
primary problem, rather than the macro/inline piece?

Nor do I see how it's going to be ok to just rename the function in a
stable branch.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: In PG12, query with float calculations is slower than PG11

hasegeli
> Wait, no. Didn't we get to the point that we figured out that the
> primary issue is the reversal of the order of what is checked is the
> primary problem, rather than the macro/inline piece?

Reversal of the order makes a small or no difference.  The
macro/inline change causes the real slowdown at least on GCC.

> Nor do I see how it's going to be ok to just rename the function in a
> stable branch.

I'll post another version to keep them around.


12