Why is pq_begintypsend so slow?

Jeff Janes
I was using COPY recently and was wondering why BINARY format is not much (if any) faster than the default format.  Once I switched from mostly exporting ints to mostly exporting double precisions (7e6 rows of 100 columns, randomly generated), it was faster, but not by as much as I intuitively thought it should be.  
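
(A rough sketch of a comparable setup, with made-up table and column names; details of the actual test may have differed:)

-- build a 100-column float8 table and fill it with 7e6 random rows
SELECT 'CREATE TABLE wide_f8 (' ||
       string_agg(format('c%s float8', i), ', ' ORDER BY i) || ')'
FROM generate_series(1, 100) AS i
\gexec

SELECT 'INSERT INTO wide_f8 SELECT ' ||
       string_agg('random()', ', ') ||
       ' FROM generate_series(1, 7000000)'
FROM generate_series(1, 100)
\gexec

VACUUM FREEZE wide_f8;

COPY BINARY wide_f8 TO '/dev/null';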

Running 'perf top' to profile a "COPY BINARY .. TO '/dev/null'" on an AWS m5.large machine running Ubuntu 18.04, with a self-compiled PostgreSQL:

PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

I saw that the hotspot was pq_begintypsend at 20%, which was twice the percentage of the next-place finisher (AllocSetAlloc).  If I drill down into the function, I see something like the below.  I don't really speak assembly, but usually when I see an assembly instruction being especially hot and not being the innermost instruction in a loop, I blame it on CPU cache misses.  But everything being touched here should already be well cached, since initStringInfo has just finished setting it up.  And even if it weren't, by the 2nd invocation of appendStringInfoCharMacro it certainly should be in the cache, yet that one is even slower than the 1st appendStringInfoCharMacro.

Why is this such a bottleneck?
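
(For reference, appendStringInfoCharMacro expands to roughly the following, paraphrased from src/include/lib/stringinfo.h, so the exact text may differ.  The two stores per call, the character plus a fresh terminating nul, and the read-modify-write of len are what show up in the annotation below.)

#define appendStringInfoCharMacro(str,ch) \
	(((str)->len + 1 >= (str)->maxlen) ? \
	 appendStringInfoChar(str, ch) : \
	 (void)((str)->data[(str)->len] = (ch), (str)->data[++(str)->len] = '\0'))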

pq_begintypsend  /usr/local/pgsql/bin/postgres

 0.15 |    push  %rbx
 0.09 |    mov  %rdi,%rbx
   |       initStringInfo(buf);
 3.03 |    callq initStringInfo
   |       /* Reserve four bytes for the bytea length word */
   |       appendStringInfoCharMacro(buf, '\0');
   |    movslq 0x8(%rbx),%rax
 1.05 |    lea  0x1(%rax),%edx
 0.72 |    cmp  0xc(%rbx),%edx
   |    jge  b0
 2.92 |    mov  (%rbx),%rdx
   |    movb  $0x0,(%rdx,%rax,1)
13.76 |    mov  0x8(%rbx),%eax
 0.81 |    mov  (%rbx),%rdx
 0.52 |    add  $0x1,%eax
 0.12 |    mov  %eax,0x8(%rbx)
 2.85 |    cltq
 0.01 |    movb  $0x0,(%rdx,%rax,1)
   |       appendStringInfoCharMacro(buf, '\0');
10.65 |    movslq 0x8(%rbx),%rax
   |    lea  0x1(%rax),%edx
 0.90 |    cmp  0xc(%rbx),%edx
   |    jge  ca
 0.54 | 42:  mov  (%rbx),%rdx
 1.84 |    movb  $0x0,(%rdx,%rax,1)
13.88 |    mov  0x8(%rbx),%eax
 0.03 |    mov  (%rbx),%rdx
   |    add  $0x1,%eax
 0.33 |    mov  %eax,0x8(%rbx)
 2.60 |    cltq
 0.06 |    movb  $0x0,(%rdx,%rax,1)
   |       appendStringInfoCharMacro(buf, '\0');
 3.21 |    movslq 0x8(%rbx),%rax
 0.23 |    lea  0x1(%rax),%edx
 1.74 |    cmp  0xc(%rbx),%edx
   |    jge  e0
 0.21 | 67:  mov  (%rbx),%rdx
 1.18 |    movb  $0x0,(%rdx,%rax,1)
 9.29 |    mov  0x8(%rbx),%eax
 0.18 |    mov  (%rbx),%rdx
   |    add  $0x1,%eax
 0.19 |    mov  %eax,0x8(%rbx)
 3.14 |    cltq
 0.12 |    movb  $0x0,(%rdx,%rax,1)
   |       appendStringInfoCharMacro(buf, '\0');
 5.29 |    movslq 0x8(%rbx),%rax
 0.03 |    lea  0x1(%rax),%edx
 1.45 |    cmp  0xc(%rbx),%edx
   |    jge  f6
 0.41 | 8c:  mov  (%rbx),%rdx

Cheers,

Jeff

Re: Why is pq_begintypsend so slow?

Tom Lane-2
Jeff Janes <[hidden email]> writes:
> I saw that the hotspot was pq_begintypsend at 20%, which was twice the
> percentage as the next place winner (AllocSetAlloc).

Weird.

> Why is this such a bottleneck?

Not sure, but it seems like a pretty dumb way to push the stringinfo's
len forward.  We're reading/updating the len word in each line, and
if your perf measurements are to be believed, it's the re-fetches of
the len values that are bottlenecked --- maybe your CPU isn't too
bright about that?  The bytes of the string value are getting written
twice too, thanks to uselessly setting up a terminating nul each time.

I'd be inclined to replace the appendStringInfoCharMacro calls with
appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
is inserted into those bytes at this point.  And maybe
appendStringInfoSpaces could stand some micro-optimization, too.
Use a memset and a single len adjustment, perhaps?
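
Something along these lines, perhaps (untested sketch):

void
appendStringInfoSpaces(StringInfo str, int count)
{
	if (count > 0)
	{
		/* Make more room if needed */
		enlargeStringInfo(str, count);

		/* one memset and one len adjustment, rather than a byte-at-a-time loop */
		memset(str->data + str->len, ' ', count);
		str->len += count;
		str->data[str->len] = '\0';
	}
}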

                        regards, tom lane



Re: Why is pq_begintypsend so slow?

David Fetter
On Sat, Jan 11, 2020 at 03:19:37PM -0500, Tom Lane wrote:

> Jeff Janes <[hidden email]> writes:
> > I saw that the hotspot was pq_begintypsend at 20%, which was twice the
> > percentage as the next place winner (AllocSetAlloc).
>
> Weird.
>
> > Why is this such a bottleneck?
>
> Not sure, but it seems like a pretty dumb way to push the stringinfo's
> len forward.  We're reading/updating the len word in each line, and
> if your perf measurements are to be believed, it's the re-fetches of
> the len values that are bottlenecked --- maybe your CPU isn't too
> bright about that?  The bytes of the string value are getting written
> twice too, thanks to uselessly setting up a terminating nul each time.
>
> I'd be inclined to replace the appendStringInfoCharMacro calls with
> appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
> is inserted into those bytes at this point.  And maybe
> appendStringInfoSpaces could stand some micro-optimization, too.
> Use a memset and a single len adjustment, perhaps?
Please find attached a patch that does both of the things you
suggested.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

v1-0001-Make-pq_begintypsend-more-efficient.patch (1K)

Re: Why is pq_begintypsend so slow?

Tom Lane-2
David Fetter <[hidden email]> writes:
> On Sat, Jan 11, 2020 at 03:19:37PM -0500, Tom Lane wrote:
>> Jeff Janes <[hidden email]> writes:
>>> I saw that the hotspot was pq_begintypsend at 20%, which was twice the
>>> percentage as the next place winner (AllocSetAlloc).

>> I'd be inclined to replace the appendStringInfoCharMacro calls with
>> appendStringInfoSpaces(buf, 4) --- I don't think we care exactly what
>> is inserted into those bytes at this point.  And maybe
>> appendStringInfoSpaces could stand some micro-optimization, too.
>> Use a memset and a single len adjustment, perhaps?

> Please find attached a patch that does it both of the things you
> suggested.

I've been fooling around with this here.  On the test case Jeff
describes --- COPY BINARY tab TO '/dev/null' where tab contains
100 float8 columns filled from random() --- I can reproduce his
results.  pq_begintypsend is the top hotspot and if perf's
localization is accurate, it's the instructions that fetch
str->len that hurt the most.  Still not very clear why that is...

Converting pq_begintypsend to use appendStringInfoSpaces helps
a bit; it takes my test case from 91725 ms to 88847 ms, or about
3% savings.  Noodling around with appendStringInfoSpaces doesn't
help noticeably; I tried memset, as well as open-coding (cf
patch below) but the results were all well within the noise
threshold.

I saw at this point that the remaining top spots were
enlargeStringInfo and appendBinaryStringInfo, so I experimented
with inlining them (again, see patch below).  That *did* move
the needle: down to 72691 ms, or 20% better than HEAD.  Of
course, that comes at a code-size cost, but it's only about
13kB growth:

before:
$ size src/backend/postgres
   text    data     bss     dec     hex filename
7485285   58088  203328 7746701  76348d src/backend/postgres
after:
$ size src/backend/postgres
   text    data     bss     dec     hex filename
7498652   58088  203328 7760068  7668c4 src/backend/postgres

That's under two-tenths of a percent.  (This'd affect frontend
executables too, and I didn't check them.)

Seems like this is worth pursuing, especially if it can be
shown to improve any other cases noticeably.  It might be
worth inlining some of the other trivial stringinfo functions,
though I'd tread carefully on that.

                        regards, tom lane


diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index a6f990c..0fc8c3f 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -328,11 +328,8 @@ void
 pq_begintypsend(StringInfo buf)
 {
  initStringInfo(buf);
- /* Reserve four bytes for the bytea length word */
- appendStringInfoCharMacro(buf, '\0');
- appendStringInfoCharMacro(buf, '\0');
- appendStringInfoCharMacro(buf, '\0');
- appendStringInfoCharMacro(buf, '\0');
+ /* Reserve four bytes for the bytea length word; contents not important */
+ appendStringInfoSpaces(buf, 4);
 }
 
 /* --------------------------------
diff --git a/src/common/stringinfo.c b/src/common/stringinfo.c
index 0badc46..8f8bb0d 100644
--- a/src/common/stringinfo.c
+++ b/src/common/stringinfo.c
@@ -207,43 +207,21 @@ appendStringInfoSpaces(StringInfo str, int count)
 {
  if (count > 0)
  {
+ char   *ptr;
+
  /* Make more room if needed */
  enlargeStringInfo(str, count);
 
  /* OK, append the spaces */
+ ptr = str->data + str->len;
+ str->len += count;
  while (--count >= 0)
- str->data[str->len++] = ' ';
- str->data[str->len] = '\0';
+ *ptr++ = ' ';
+ *ptr = '\0';
  }
 }
 
 /*
- * appendBinaryStringInfo
- *
- * Append arbitrary binary data to a StringInfo, allocating more space
- * if necessary. Ensures that a trailing null byte is present.
- */
-void
-appendBinaryStringInfo(StringInfo str, const char *data, int datalen)
-{
- Assert(str != NULL);
-
- /* Make more room if needed */
- enlargeStringInfo(str, datalen);
-
- /* OK, append the data */
- memcpy(str->data + str->len, data, datalen);
- str->len += datalen;
-
- /*
- * Keep a trailing null in place, even though it's probably useless for
- * binary data.  (Some callers are dealing with text but call this because
- * their input isn't null-terminated.)
- */
- str->data[str->len] = '\0';
-}
-
-/*
  * appendBinaryStringInfoNT
  *
  * Append arbitrary binary data to a StringInfo, allocating more space
@@ -263,7 +241,7 @@ appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen)
 }
 
 /*
- * enlargeStringInfo
+ * enlargeStringInfo_OOL
  *
  * Make sure there is enough space for 'needed' more bytes
  * ('needed' does not include the terminating null).
@@ -274,13 +252,16 @@ appendBinaryStringInfoNT(StringInfo str, const char *data, int datalen)
  * can save some palloc overhead by enlarging the buffer before starting
  * to store data in it.
  *
+ * We normally reach here only if enlargement is needed, since callers
+ * go through the inline function which doesn't call this otherwise.
+ *
  * NB: In the backend, because we use repalloc() to enlarge the buffer, the
  * string buffer will remain allocated in the same memory context that was
  * current when initStringInfo was called, even if another context is now
  * current.  This is the desired and indeed critical behavior!
  */
 void
-enlargeStringInfo(StringInfo str, int needed)
+enlargeStringInfo_OOL(StringInfo str, int needed)
 {
  int newlen;
 
diff --git a/src/include/lib/stringinfo.h b/src/include/lib/stringinfo.h
index 5a2a3db..30ba826 100644
--- a/src/include/lib/stringinfo.h
+++ b/src/include/lib/stringinfo.h
@@ -87,6 +87,25 @@ extern void initStringInfo(StringInfo str);
 extern void resetStringInfo(StringInfo str);
 
 /*------------------------
+ * enlargeStringInfo
+ * Make sure a StringInfo's buffer can hold at least 'needed' more bytes
+ * ('needed' does not include the terminating null).
+ * The inlined code eliminates the common case where no work is needed.
+ */
+extern void enlargeStringInfo_OOL(StringInfo str, int needed);
+
+static inline void
+enlargeStringInfo(StringInfo str, int needed)
+{
+ /*
+ * Do the test in unsigned arithmetic so that if "needed" is negative,
+ * we'll go to the out-of-line code which will error out.
+ */
+ if ((unsigned) needed >= (unsigned) (str->maxlen - str->len))
+ enlargeStringInfo_OOL(str, needed);
+}
+
+/*------------------------
  * appendStringInfo
  * Format text data under the control of fmt (an sprintf-style format string)
  * and append it to whatever is already in str.  More space is allocated
@@ -139,10 +158,27 @@ extern void appendStringInfoSpaces(StringInfo str, int count);
 /*------------------------
  * appendBinaryStringInfo
  * Append arbitrary binary data to a StringInfo, allocating more space
- * if necessary.
+ * if necessary.  Ensures that a trailing null byte is present.
  */
-extern void appendBinaryStringInfo(StringInfo str,
-   const char *data, int datalen);
+static inline void
+appendBinaryStringInfo(StringInfo str, const char *data, int datalen)
+{
+ Assert(str != NULL);
+
+ /* Make more room if needed */
+ enlargeStringInfo(str, datalen);
+
+ /* OK, append the data */
+ memcpy(str->data + str->len, data, datalen);
+ str->len += datalen;
+
+ /*
+ * Keep a trailing null in place, even though it's probably useless for
+ * binary data.  (Some callers are dealing with text but call this because
+ * their input isn't null-terminated.)
+ */
+ str->data[str->len] = '\0';
+}
 
 /*------------------------
  * appendBinaryStringInfoNT
@@ -152,10 +188,4 @@ extern void appendBinaryStringInfo(StringInfo str,
 extern void appendBinaryStringInfoNT(StringInfo str,
  const char *data, int datalen);
 
-/*------------------------
- * enlargeStringInfo
- * Make sure a StringInfo's buffer can hold at least 'needed' more bytes.
- */
-extern void enlargeStringInfo(StringInfo str, int needed);
-
 #endif /* STRINGINFO_H */

Re: Why is pq_begintypsend so slow?

Tom Lane-2
I wrote:
> I saw at this point that the remaining top spots were
> enlargeStringInfo and appendBinaryStringInfo, so I experimented
> with inlining them (again, see patch below).  That *did* move
> the needle: down to 72691 ms, or 20% better than HEAD.

Oh ... marking the test in the inline part of enlargeStringInfo()
as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
Might be over-optimizing for this particular case, perhaps, but
I think that's a reasonable marking given that we overallocate
the stringinfo buffer for most uses.

(But ... I'm not finding these numbers to be super reproducible
across different ASLR layouts.  So take it with a grain of salt.)

                        regards, tom lane



Re: Why is pq_begintypsend so slow?

Andres Freund
Hi,

On 2020-01-11 22:32:45 -0500, Tom Lane wrote:

> I wrote:
> > I saw at this point that the remaining top spots were
> > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > with inlining them (again, see patch below).  That *did* move
> > the needle: down to 72691 ms, or 20% better than HEAD.
>
> Oh ... marking the test in the inline part of enlargeStringInfo()
> as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> Might be over-optimizing for this particular case, perhaps
>
> (But ... I'm not finding these numbers to be super reproducible
> across different ASLR layouts.  So take it with a grain of salt.)

FWIW, I've also observed, in another thread (the node func generation
thing [1]), that inlining enlargeStringInfo() helps a lot, especially
when inlining some of its callers. Moving e.g. appendStringInfo() inline
allows the compiler to sometimes optimize away the strlen. But if
e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
unconditionally, successive appends cannot optimize away memory accesses
for ->len/->data.

For the case of send functions, we really ought to have at least
pq_begintypsend(), enlargeStringInfo() and pq_endtypsend() inline. That
way the compiler ought to be able to avoid repeatedly loading/storing
->len, after the initial initStringInfo() call. Might even make sense to
also have initStringInfo() inline, because then the compiler would
probably never actually materialize the StringInfoData (and would
automatically have good aliasing information too).


The commit referenced above is obviously quite WIP-ey, and contains
things that should be split into separate commits. But I think it might
be worth moving more functions into the header, like I've done in that
commit.

The commit also adds appendStringInfoU?Int(32,64) operations - I've
unsurprisingly found these to be *considerably* faster than going through
appendStringInfoString().


> but I think that's a reasonable marking given that we overallocate
> the stringinfo buffer for most uses.

Wonder if it's worth providing a function to initialize the stringinfo
differently for the many cases where we have at least a very good idea
of how long the string will be. It's sad to allocate 1kb just for
e.g. int4send to send an integer plus length.
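
Roughly something like this, say (the name and exact shape are just for illustration):

static inline void
initStringInfoWithSize(StringInfo str, int initsize)
{
	str->data = (char *) palloc(initsize);
	str->maxlen = initsize;
	resetStringInfo(str);
}

	/* e.g. int4send needs only the 4-byte bytea header, 4 data bytes, and a nul */
	initStringInfoWithSize(&buf, VARHDRSZ + sizeof(int32) + 1);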

Greetings,

Andres Freund

[1] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=127e860cf65f50434e0bb97acbba4b0ea6f38cfd



Re: Why is pq_begintypsend so slow?

Andres Freund
Hi,

On 2020-01-13 15:18:04 -0800, Andres Freund wrote:

> On 2020-01-11 22:32:45 -0500, Tom Lane wrote:
> > I wrote:
> > > I saw at this point that the remaining top spots were
> > > enlargeStringInfo and appendBinaryStringInfo, so I experimented
> > > with inlining them (again, see patch below).  That *did* move
> > > the needle: down to 72691 ms, or 20% better than HEAD.
> >
> > Oh ... marking the test in the inline part of enlargeStringInfo()
> > as unlikely() helps quite a bit more: 66100 ms, a further 9% gain.
> > Might be over-optimizing for this particular case, perhaps
> >
> > (But ... I'm not finding these numbers to be super reproducible
> > across different ASLR layouts.  So take it with a grain of salt.)
>
> FWIW, I've also observed, in another thread (the node func generation
> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> allows the compiler to sometimes optimize away the strlen. But if
> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> unconditionally, successive appends cannot optimize away memory accesses
> for ->len/->data.
With a set of patches doing so, int4send itself is not a significant
factor for my test benchmark [1] anymore. The assembly looks about as
good as one could hope, I think:

# save rbx on the stack
   0x00000000004b7f90 <+0>: push   %rbx
   0x00000000004b7f91 <+1>: sub    $0x20,%rsp
# store integer to be sent into rbx
   0x00000000004b7f95 <+5>: mov    0x20(%rdi),%rbx
# palloc length argument
   0x00000000004b7f99 <+9>: mov    $0x9,%edi
   0x00000000004b7f9e <+14>: callq  0x5d9aa0 <palloc>
# store integer in buffer (ebx is 4 byte portion of rbx)
   0x00000000004b7fa3 <+19>: movbe  %ebx,0x4(%rax)
# store varlena header
   0x00000000004b7fa8 <+24>: movl   $0x20,(%rax)
# restore stack and rbx registers
   0x00000000004b7fae <+30>: add    $0x20,%rsp
   0x00000000004b7fb2 <+34>: pop    %rbx
   0x00000000004b7fb3 <+35>: retq

All the $0x20 constants are a bit confusing, but they just happen to be
the same for int4send. It's the size of the stack frame,
offset for FunctionCallInfoBaseData->args[0], the varlena header (and then the stack
frame again) respectively.

Note that I had to annotate palloc with __attribute__((malloc)) to make
the compiler understand that palloc's returned value will not alias with
anything problematic (e.g. the potential of aliasing with fcinfo
prevents optimizing to the above without that annotation).  I think such
annotations would be a good idea anyway, precisely because they allow
the compiler to optimize code significantly better.
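
The annotation amounts to roughly this (the wrapper macro name here is made up, and a GCC/Clang-style compiler is assumed):

#if defined(__GNUC__) || defined(__clang__)
#define pg_attribute_malloc_hypothetical __attribute__((malloc))
#else
#define pg_attribute_malloc_hypothetical
#endif

extern void *palloc(Size size) pg_attribute_malloc_hypothetical;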


These together yield about a 1.8x speedup for me. The profile shows
that the overhead now is overwhelmingly elsewhere:
+   26.30%  postgres  postgres          [.] CopyOneRowTo
+   13.40%  postgres  postgres          [.] tts_buffer_heap_getsomeattrs
+   10.61%  postgres  postgres          [.] AllocSetAlloc
+    9.26%  postgres  libc-2.29.so      [.] __memmove_avx_unaligned_erms
+    7.32%  postgres  postgres          [.] SendFunctionCall
+    6.02%  postgres  postgres          [.] palloc
+    4.45%  postgres  postgres          [.] int4send
+    3.68%  postgres  libc-2.29.so      [.] _IO_fwrite
+    2.71%  postgres  postgres          [.] heapgettup_pagemode
+    1.96%  postgres  postgres          [.] AllocSetReset
+    1.83%  postgres  postgres          [.] CopySendEndOfRow
+    1.75%  postgres  libc-2.29.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.60%  postgres  postgres          [.] ExecStoreBufferHeapTuple
+    1.57%  postgres  postgres          [.] DoCopyTo
+    1.16%  postgres  postgres          [.] memcpy@plt
+    1.07%  postgres  postgres          [.] heapgetpage

Even without using the new pq_begintypsend_ex()/initStringInfoEx(), the
generated code is still considerably better than before, yielding a
1.58x speedup. The allocator overhead is unsurprisingly higher:
+   24.93%  postgres  postgres          [.] CopyOneRowTo
+   17.10%  postgres  postgres          [.] AllocSetAlloc
+   10.09%  postgres  postgres          [.] tts_buffer_heap_getsomeattrs
+    6.50%  postgres  libc-2.29.so      [.] __memmove_avx_unaligned_erms
+    5.99%  postgres  postgres          [.] SendFunctionCall
+    5.11%  postgres  postgres          [.] palloc
+    3.95%  postgres  libc-2.29.so      [.] _int_malloc
+    3.38%  postgres  postgres          [.] int4send
+    2.54%  postgres  postgres          [.] heapgettup_pagemode
+    2.11%  postgres  libc-2.29.so      [.] _int_free
+    2.06%  postgres  postgres          [.] MemoryContextReset
+    2.02%  postgres  postgres          [.] AllocSetReset
+    1.97%  postgres  libc-2.29.so      [.] _IO_fwrite
+    1.47%  postgres  postgres          [.] DoCopyTo
+    1.14%  postgres  postgres          [.] ExecStoreBufferHeapTuple
+    1.06%  postgres  libc-2.29.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
+    1.04%  postgres  libc-2.29.so      [.] malloc


Adding a few pg_restrict*, and using appendBinaryStringInfoNT instead of
appendBinaryStringInfo in CopySend* gains another 1.05x.


This does result in some code growth, but given the size of the
improvements, and that the improvements are significant even without
code changes to callsites, that seems worth it.

before:
   text   data    bss    dec    hex filename
8482739 172304 204240 8859283 872e93 src/backend/postgres
after:
   text   data    bss    dec    hex filename
8604300 172304 204240 8980844 89096c src/backend/postgres

Regards,

Andres

[1]
CREATE TABLE lotsaints4(c01 int4 NOT NULL, c02 int4 NOT NULL, c03 int4 NOT NULL, c04 int4 NOT NULL, c05 int4 NOT NULL, c06 int4 NOT NULL, c07 int4 NOT NULL, c08 int4 NOT NULL, c09 int4 NOT NULL, c10 int4 NOT NULL);
INSERT INTO lotsaints4 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000);
VACUUM FREEZE lotsaints4;
COPY lotsaints4 TO '/dev/null' WITH binary;

CREATE TABLE lotsaints8(c01 int8 NOT NULL, c02 int8 NOT NULL, c03 int8 NOT NULL, c04 int8 NOT NULL, c05 int8 NOT NULL, c06 int8 NOT NULL, c07 int8 NOT NULL, c08 int8 NOT NULL, c09 int8 NOT NULL, c10 int8 NOT NULL);
INSERT INTO lotsaints8 SELECT ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int, ((random() * 2^31) - 1)::int FROM generate_series(1, 2000000);
VACUUM FREEZE lotsaints8;
COPY lotsaints8 TO '/dev/null' WITH binary;

v1-0001-stringinfo-Move-more-functions-inline-provide-ini.patch (13K)
v1-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch (14K)
v1-0003-pqformat-Move-functions-for-type-sending-inline-a.patch (4K)
v1-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch (1K)
v1-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch (1K)
v1-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch (1K)

Re: Why is pq_begintypsend so slow?

Tom Lane-2
Andres Freund <[hidden email]> writes:
>> FWIW, I've also observed, in another thread (the node func generation
>> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
>> when inlining some of its callers. Moving e.g. appendStringInfo() inline
>> allows the compiler to sometimes optimize away the strlen. But if
>> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
>> unconditionally, successive appends cannot optimize away memory accesses
>> for ->len/->data.

> With a set of patches doing so, int4send itself is not a significant
> factor for my test benchmark [1] anymore.

This thread seems to have died out, possibly because the last set of
patches that Andres posted was sufficiently complicated and invasive
that nobody wanted to review it.  I thought about this again after
seeing that [1] is mostly about pq_begintypsend overhead, and had
an epiphany: there isn't really a strong reason for pq_begintypsend
to be inserting bits into the buffer at all.  The bytes will be
filled by pq_endtypsend, and nothing in between should be touching
them.  So I propose 0001 attached.  It's poking into the stringinfo
abstraction a bit more than I would want to do if there weren't a
compelling performance reason to do so, but there evidently is.

With 0001, pq_begintypsend drops from being the top single routine
in a profile of a test case like [1] to being well down the list.
The next biggest cost compared to text-format output is that
printtup() itself is noticeably more expensive.  A lot of the extra
cost there seems to be from pq_sendint32(), which is getting inlined
into printtup(), and there probably isn't much we can do to make that
cheaper. But eliminating a common subexpression as in 0002 below does
help noticeably, at least with the rather old gcc I'm using.

For me, the combination of these two eliminates most but not quite
all of the cost penalty of binary over text output as seen in [1].

                        regards, tom lane

[1] https://www.postgresql.org/message-id/CAMovtNoHFod2jMAKQjjxv209PCTJx5Kc66anwWvX0mEiaXwgmA%40mail.gmail.com


diff --git a/src/backend/libpq/pqformat.c b/src/backend/libpq/pqformat.c
index a6f990c..03b7404 100644
--- a/src/backend/libpq/pqformat.c
+++ b/src/backend/libpq/pqformat.c
@@ -328,11 +328,16 @@ void
 pq_begintypsend(StringInfo buf)
 {
  initStringInfo(buf);
- /* Reserve four bytes for the bytea length word */
- appendStringInfoCharMacro(buf, '\0');
- appendStringInfoCharMacro(buf, '\0');
- appendStringInfoCharMacro(buf, '\0');
- appendStringInfoCharMacro(buf, '\0');
+
+ /*
+ * Reserve four bytes for the bytea length word.  We don't need to fill
+ * them with anything (pq_endtypsend will do that), and this function is
+ * enough of a hot spot that it's worth cheating to save some cycles. Note
+ * in particular that we don't bother to guarantee that the buffer is
+ * null-terminated.
+ */
+ Assert(buf->maxlen > 4);
+ buf->len = 4;
 }
 
 /* --------------------------------

diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index dd1bac0..a9315c6 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -438,11 +438,12 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
  {
  /* Binary output */
  bytea   *outputbytes;
+ int outputlen;
 
  outputbytes = SendFunctionCall(&thisState->finfo, attr);
- pq_sendint32(buf, VARSIZE(outputbytes) - VARHDRSZ);
- pq_sendbytes(buf, VARDATA(outputbytes),
- VARSIZE(outputbytes) - VARHDRSZ);
+ outputlen = VARSIZE(outputbytes) - VARHDRSZ;
+ pq_sendint32(buf, outputlen);
+ pq_sendbytes(buf, VARDATA(outputbytes), outputlen);
  }
  }
 

Re: Why is pq_begintypsend so slow?

Ranier Vilela-2
On Mon, May 18, 2020 at 1:38 PM Tom Lane <[hidden email]> wrote:
Andres Freund <[hidden email]> writes:
>> FWIW, I've also observed, in another thread (the node func generation
>> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
>> when inlining some of its callers. Moving e.g. appendStringInfo() inline
>> allows the compiler to sometimes optimize away the strlen. But if
>> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
>> unconditionally, successive appends cannot optimize away memory accesses
>> for ->len/->data.

> With a set of patches doing so, int4send itself is not a significant
> factor for my test benchmark [1] anymore.

This thread seems to have died out, possibly because the last set of
patches that Andres posted was sufficiently complicated and invasive
that nobody wanted to review it.  I thought about this again after
seeing that [1] is mostly about pq_begintypsend overhead, and had
an epiphany: there isn't really a strong reason for pq_begintypsend
to be inserting bits into the buffer at all.  The bytes will be
filled by pq_endtypsend, and nothing in between should be touching
them.  So I propose 0001 attached.  It's poking into the stringinfo
abstraction a bit more than I would want to do if there weren't a
compelling performance reason to do so, but there evidently is.

With 0001, pq_begintypsend drops from being the top single routine
in a profile of a test case like [1] to being well down the list.
The next biggest cost compared to text-format output is that
printtup() itself is noticeably more expensive.  A lot of the extra
cost there seems to be from pq_sendint32(), which is getting inlined
into printtup(), and there probably isn't much we can do to make that
cheaper. But eliminating a common subexpression as in 0002 below does
help noticeably, at least with the rather old gcc I'm using.
Again, I see problems with the types declared in Postgres.
1. pq_sendint32 (StringInfo buf, uint32 i)
2. extern void pq_sendbytes (StringInfo buf, const char * data, int datalen);

Wouldn't it be better to declare outputlen (0002) as uint32?
To avoid converting from (int) to (uint32), even if afterwards there is a conversion from (uint32) to (int)?

regards,
Ranier Vilela

Re: Why is pq_begintypsend so slow?

Tom Lane-2
Ranier Vilela <[hidden email]> writes:
> Again, I see problems with the types declared in Postgres.
> 1. pq_sendint32 (StringInfo buf, uint32 i)
> 2. extern void pq_sendbytes (StringInfo buf, const char * data, int
> datalen);

We could spend the next ten years cleaning up minor discrepancies
like that, and have nothing much to show for the work.

> To avoid converting from (int) to (uint32), even if afterwards there is a
> conversion from (uint32) to (int)?

You do realize that that conversion costs nothing?

                        regards, tom lane



Re: Why is pq_begintypsend so slow?

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2020-05-18 12:38:05 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> >> FWIW, I've also observed, in another thread (the node func generation
> >> thing [1]), that inlining enlargeStringInfo() helps a lot, especially
> >> when inlining some of its callers. Moving e.g. appendStringInfo() inline
> >> allows the compiler to sometimes optimize away the strlen. But if
> >> e.g. an inlined appendBinaryStringInfo() still calls enlargeStringInfo()
> >> unconditionally, successive appends cannot optimize away memory accesses
> >> for ->len/->data.
>
> > With a set of patches doing so, int4send itself is not a significant
> > factor for my test benchmark [1] anymore.
>
> This thread seems to have died out, possibly because the last set of
> patches that Andres posted was sufficiently complicated and invasive
> that nobody wanted to review it.
Well, I wasn't really planning to try to get that patchset into 13, and
it wasn't in the CF...


> I thought about this again after seeing that [1] is mostly about
> pq_begintypsend overhead

I'm not really convinced that that's the whole problem. Using the
benchmark from upthread, I get (median of three):
master: 1181.581
yours: 1171.445
mine: 598.031

That's a very significant difference, imo. It helps a bit with the
benchmark from your [1], but not that much.


> With 0001, pq_begintypsend drops from being the top single routine
> in a profile of a test case like [1] to being well down the list.
> The next biggest cost compared to text-format output is that
> printtup() itself is noticeably more expensive.  A lot of the extra
> cost there seems to be from pq_sendint32(), which is getting inlined
> into printtup(), and there probably isn't much we can do to make that
> cheaper. But eliminating a common subexpression as in 0002 below does
> help noticeably, at least with the rather old gcc I'm using.

I think there's plenty more we can do:

First, it's unnecessary to re-initialize a FunctionCallInfo for every
send/recv/out/in call. Instead we can reuse the same over and over.
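
In rough sketch form (thisState->finfo, attr, and the loop structure here are just placeholders from printtup-style code, not the actual patch):

	LOCAL_FCINFO(fcinfo, 1);

	InitFunctionCallInfoData(*fcinfo, &thisState->finfo, 1,
							 InvalidOid, NULL, NULL);

	/* per row / per attribute: only refresh the argument slots */
	fcinfo->args[0].value = attr;
	fcinfo->args[0].isnull = false;
	fcinfo->isnull = false;
	result = FunctionCallInvoke(fcinfo);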


After that, the biggest remaining overhead for Jack's test is the palloc
for the stringinfo, as far as I can see.  I've complained about that
before...

I've just hacked up a modification where, for send functions,
fcinfo->context contains a stringinfo set up by printtup/CopyTo. That,
combined with using a FunctionCallInfo set up beforehand, instead of
re-initializing it in every printtup cycle, results in a pretty good
saving.

Making the binary protocol 20% faster than text in Jack's testcase. And
my lotsaints4 test goes further down to ~410ms (2.5x faster than where
we started).

Now obviously, the hack with passing a StringInfo in ->context is just
that, a hack. A somewhat gross one even. But I think it pretty clearly
shows the problem and the way out.
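
In skeletal form it looks something like this (glossing over buffer ownership; the returned bytea points into the shared buffer, so the caller has to consume it before invoking the next send function):

Datum
int4send(PG_FUNCTION_ARGS)
{
	int32		arg1 = PG_GETARG_INT32(0);
	/* reusable buffer stashed in fcinfo->context by printtup/CopyTo */
	StringInfo	buf = (StringInfo) fcinfo->context;

	resetStringInfo(buf);
	appendStringInfoSpaces(buf, VARHDRSZ);	/* room for the bytea length word */
	pq_sendint32(buf, arg1);
	PG_RETURN_BYTEA_P(pq_endtypsend(buf));
}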

I don't know what the best non-gross solution for the overhead of the
out/send functions is. There seem to be at least the following
major options (and lots of variants thereof):

1) Just continue to incur significant overhead for every datum
2) Accept the uglyness of passing in a buffer via
   FunctionCallInfo->context. Change nearly all in-core send functions
   over to that.
3) Pass string buffer through an new INTERNAL argument to send/output
   function, allow both old/new style send functions. Use a wrapper
   function to adapt the "old style" to the "new style".
4) Like 3, but make the new argument optional, and use ad-hoc
   stringbuffer if not provided. I don't like the unnecessary branches
   this adds.

The biggest problem after that is that we waste a lot of time memcpying
stuff around repeatedly. There is:
1) send function: datum -> per datum stringinfo
2) printtup: per datum stringinfo -> per row stringinfo
3) socket_putmessage: per row stringinfo -> PqSendBuffer
4) send(): PqSendBuffer -> kernel buffer

It's obviously hard to avoid 1) and 4) in the common case, but the
number of other copies seem pretty clearly excessive.


If we change the signature of the out/send function to always target a
string buffer, we could pretty easily avoid 2), and for out functions
we'd not have to redundantly call strlen (as the argument to
pq_sendcountedtext) anymore, which seems substantial too.

As I argued before, I think it's unnecessary to have a separate buffer
between 3-4). We should construct the outgoing message inside the send
buffer. I still don't understand what "recursion" danger there is,
nothing below printtup should ever send protocol messages, no?


Sometimes there's also 0) in the above, when detoasting a datum...

Greetings,

Andres Freund

v2-0001-stringinfo-Move-more-functions-inline-provide-ini.patch (13K)
v2-0002-stringinfo-Remove-in-core-use-of-appendStringInfo.patch (14K)
v2-0003-pqformat-Move-functions-for-type-sending-inline-a.patch (4K)
v2-0004-WIP-Annotate-palloc-with-malloc-and-other-compile.patch (1K)
v2-0005-Move-int4-int8-send-to-use-pq_begintypsend_ex.patch (1K)
v2-0006-copy-Use-appendBinaryStringInfoNT-for-sending-bin.patch (1K)
v2-0007-wip-make-send-recv-calls-in-printtup.c-copy.c-che.patch (15K)
v2-0008-inline-pq_sendbytes-stop-maintaining-trailing-nul.patch (2K)
v2-0009-heavy-wip-Allow-string-buffer-reuse-in-send-funct.patch (8K)

Re: Why is pq_begintypsend so slow?

Robert Haas
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <[hidden email]> wrote:
> The biggest problem after that is that we waste a lot of time memcpying
> stuff around repeatedly. There is:
> 1) send function: datum -> per datum stringinfo
> 2) printtup: per datum stringinfo -> per row stringinfo
> 3) socket_putmessage: per row stringinfo -> PqSendBuffer
> 4) send(): PqSendBuffer -> kernel buffer
>
> It's obviously hard to avoid 1) and 4) in the common case, but the
> number of other copies seem pretty clearly excessive.

I too have seen recent benchmarking data where this was a big problem.
Basically, you need a workload where the server doesn't have much or
any actual query processing to do, but is just returning a lot of
stuff to a really fast client - e.g. a locally connected client.
That's not necessarily the most common case but, if you have it, all
this extra copying is really pretty expensive.

My first thought was to wonder about changing all of our send/output
functions to write into a buffer passed as an argument rather than
returning something which we then have to copy into a different
buffer, but that would be a somewhat painful change, so it is probably
better to first pursue the idea of getting rid of some of the other
copies that happen in more centralized places (e.g. printtup). I
wonder if we could replace the whole
pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
something a bit better-designed. For instance, suppose we get rid of
the idea that the caller supplies the buffer, and we move the
responsibility for error recovery into the pqcomm layer. So you do
something like:

my_message = xyz_beginmessage('D');
xyz_sendint32(my_message, 42);
xyz_endmessage(my_message);

Maybe what happens here under the hood is we keep a pool of free
message buffers sitting around, and you just grab one and put your
data into it. When you end the message we add it to a list of used
message buffers that are waiting to be sent, and once we send the data
it goes back on the free list. If an error occurs after
xyz_beginmessage() and before xyz_endmessage(), we put the buffer back
on the free list. That would allow us to merge (2) and (3) into a
single copy. To go further, we could allow send/output functions to
opt in to receiving a message buffer rather than returning a value,
and then we could get rid of (1) for types that participate. (4) seems
unavoidable AFAIK.
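
A bare-bones sketch of that sort of pool (every name here is hypothetical, and allocation/context details are hand-waved):

typedef struct MsgBuf
{
	StringInfoData body;
	struct MsgBuf *next;
} MsgBuf;

static MsgBuf *free_msgbufs = NULL;
static MsgBuf *pending_msgbufs = NULL;

static MsgBuf *
xyz_beginmessage(char msgtype)
{
	MsgBuf	   *m = free_msgbufs;

	if (m != NULL)
		free_msgbufs = m->next;
	else
	{
		/* a real version would allocate these in a long-lived context */
		m = (MsgBuf *) palloc(sizeof(MsgBuf));
		initStringInfo(&m->body);
	}
	resetStringInfo(&m->body);
	appendStringInfoChar(&m->body, msgtype);
	return m;
}

static void
xyz_endmessage(MsgBuf *m)
{
	/* hand off for sending (a real version would keep these in order);
	 * after the send, or on error before it, the buffer goes back onto
	 * free_msgbufs */
	m->next = pending_msgbufs;
	pending_msgbufs = m;
}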

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

Andres Freund
Hi,

On 2020-06-03 11:30:42 -0400, Robert Haas wrote:
> I too have seen recent benchmarking data where this was a big problem.
> Basically, you need a workload where the server doesn't have much or
> any actual query processing to do, but is just returning a lot of
> stuff to a really fast client - e.g. a locally connected client.
> That's not necessarily the most common case but, if you have it, all
> this extra copying is really pretty expensive.

Even when the query actually is doing something, it's still quite
possible to get the memcpies to be be measurable (say > 10% of
cycles). Obviously not in a huge aggregating query. Even in something
like pgbench -M prepared -S, which is obviously spending most of its
cycles elsewhere, the patches upthread improve throughput by ~1.5% (and
that's not eliding several unnecessary copies).


> My first thought was to wonder about changing all of our send/output
> functions to write into a buffer passed as an argument rather than
> returning something which we then have to copy into a different
> buffer, but that would be a somewhat painful change, so it is probably
> better to first pursue the idea of getting rid of some of the other
> copies that happen in more centralized places (e.g. printtup).

For those I think the allocator overhead is the bigger issue than the
memcpy itself. I wonder how much we could transparently hide in
pq_begintypsend()/pq_endtypsend().


> I
> wonder if we could replace the whole
> pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
> something a bit better-designed. For instance, suppose we get rid of
> the idea that the caller supplies the buffer, and we move the
> responsibility for error recovery into the pqcomm layer. So you do
> something like:
>
> my_message = xyz_beginmessage('D');
> xyz_sendint32(my_message, 42);
> xyz_endmessage(my_message);
>
> Maybe what happens here under the hood is we keep a pool of free
> message buffers sitting around, and you just grab one and put your
> data into it.

Why do we need multiple buffers? ISTM we don't want to just send
messages at endmsg() time, because that implies unnecessary syscall
overhead. Nor do we want to incur the overhead of the copy from the
message buffer to the network buffer.

To me that seems to imply that the best approach would be to have
PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
record the starting position of the current message somewhere
(->cursor?). When an error is thrown, we reset the position to be where
the in-progress message would have begun.
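
Sketched out (hypothetical names, error handling elided):

static StringInfoData PqSendBuf;	/* imagined stringbuffer-like send buffer */
static int	PqMsgStart = -1;		/* start of the in-progress message, or -1 */

static void
xyz_beginmessage(char msgtype)
{
	PqMsgStart = PqSendBuf.len;		/* remember where this message starts */
	appendStringInfoChar(&PqSendBuf, msgtype);
	appendStringInfoSpaces(&PqSendBuf, 4);	/* length word, filled in at endmessage */
}

static void
xyz_endmessage(void)
{
	uint32		n32 = pg_hton32((uint32) (PqSendBuf.len - PqMsgStart - 1));

	memcpy(PqSendBuf.data + PqMsgStart + 1, &n32, 4);
	PqMsgStart = -1;
}

static void
xyz_abortmessage(void)
{
	/* error path: throw away the partial message instead of desyncing the protocol */
	if (PqMsgStart >= 0)
		PqSendBuf.len = PqMsgStart;
	PqMsgStart = -1;
}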

I've previously outlined a slightly more complicated scheme, where we
have "proxy" stringinfos that point into another stringinfo, instead of
their own buffer. And know how to resize the "outer" buffer when
needed. That'd have some advantages, but I'm not sure it's really
needed.


There's some disadvantages with what I describe above, in particular
when dealing with send() sending only parts of our network buffer. We
couldn't cheaply reuse the already sent memory in that case.

I've before wondered / suggested that we should have StringInfos not
insist on having one consecutive buffer (which obviously implies needing
to copy contents when growing). Instead it should have a list of buffers
containing chunks of the data, and never copy contents around while the
string is being built. We'd only allocate a buffer big enough for all
data when the caller actually wants to have all the resulting data in
one string (rather than using an API that can iterate over chunks).

For the network buffer case that'd allow us to reuse the earlier buffers
even in the "partial send" case. And more generally it'd allow us to be
less wasteful with buffer sizes, and perhaps even have a small "inline"
buffer inside StringInfoData avoiding unnecessary memory allocations in
a lot of cases where the common case is only a small amount of data
being used. And I think the overhead while appending data to such a
stringinfo should be neglegible, because it'd just require the exact
same checks we already have to do for enlargeStringInfo().
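
Roughly the shape I have in mind (hypothetical types, growth policy hand-waved):

typedef struct BufChunk
{
	struct BufChunk *next;
	int			len;
	int			maxlen;
	char		data[FLEXIBLE_ARRAY_MEMBER];
} BufChunk;

typedef struct ChunkedBuf
{
	BufChunk   *head;
	BufChunk   *tail;
	Size		total_len;
} ChunkedBuf;

static void
chunked_append(ChunkedBuf *cb, const char *data, int datalen)
{
	/* start a new chunk when the tail is full, instead of repalloc+memcpy */
	if (cb->tail == NULL || cb->tail->len + datalen > cb->tail->maxlen)
	{
		int			sz = Max(datalen, 8192);
		BufChunk   *c = (BufChunk *) palloc(offsetof(BufChunk, data) + sz);

		c->next = NULL;
		c->len = 0;
		c->maxlen = sz;
		if (cb->tail)
			cb->tail->next = c;
		else
			cb->head = c;
		cb->tail = c;
	}
	memcpy(cb->tail->data + cb->tail->len, data, datalen);
	cb->tail->len += datalen;
	cb->total_len += datalen;
}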


> (4) seems unavoidable AFAIK.

Not entirely. Linux can do zero-copy sends, but it does require somewhat
complicated black magic rituals. Including more complex buffer
management for the application, because the memory containing the
to-be-sent data cannot be reused until the kernel notifies that it's
done with the buffer.

See https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html
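
The core of the dance, roughly (per that document; error handling and notification batching elided):

	int			one = 1;
	char		ctrl[128];
	struct msghdr msg;

	/* opt the socket in, once */
	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	/* send without copying into the kernel; buf must stay untouched for now */
	send(fd, buf, len, MSG_ZEROCOPY);

	/* later: reap the completion from the error queue, which is what tells
	 * us the kernel is finished with buf and it can be reused */
	memset(&msg, 0, sizeof(msg));
	msg.msg_control = ctrl;
	msg.msg_controllen = sizeof(ctrl);
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0)
	{
		/* walk the cmsg headers; SO_EE_ORIGIN_ZEROCOPY entries mark completions */
	}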

That might be something worth pursuing in the future (since it, I think,
basically avoids spending any cpu cycles on copying data around in the
happy path, relying on DMA instead), but I think for now there's much
bigger fish to fry.

I am hoping that somebody will write a nicer abstraction for zero-copy
sends using io_uring, avoiding the need for a separate completion queue
by simply only signalling completion for the sendmsg operation once the
buffer isn't needed anymore. There's no corresponding completion logic
for normal sendmsg() calls, so it makes sense that something had to be
invented before something like io_uring existed.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

Robert Haas
On Wed, Jun 3, 2020 at 2:10 PM Andres Freund <[hidden email]> wrote:
> Why do we need multiple buffers? ISTM we don't want to just send
> messages at endmsg() time, because that implies unnecessary syscall
> overhead. Nor do we want to imply the overhead of the copy from the
> message buffer to the network buffer.

It would only matter if there are multiple messages being constructed
at the same time, and that's probably not common, but maybe there's
some way it can happen. It doesn't seem like it really costs anything
to allow for it, and it might be useful sometime. For instance,
consider your idea of using Linux black magic to do zero-copy sends.
Now you either need multiple buffers, or you need one big buffer that
you can recycle a bit at a time.

> To me that seems to imply that the best approach would be to have
> PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
> record the starting position of the current message somewhere
> (->cursor?). When an error is thrown, we reset the position to be where
> the in-progress message would have begun.

Yeah, I thought about that, but then how do you detect the case where two
different people try to undertake message construction at the same
time?

Like, with the idea I was proposing, you could still decide to limit
yourself to 1 buffer at the same time, and just elog() if someone
tries to allocate a second buffer when you've already reached the
maximum number of allocated buffers (i.e. one). But if you just have
one buffer in a global variable and everybody writes into it, you
might not notice if some unrelated code writes data into that buffer
in the middle of someone else's message construction. Doing it the way
I proposed, writing data requires passing a buffer pointer, so you can
be sure that somebody had to get the buffer from somewhere... and any
rules you want to enforce can be enforced at that point.

> I've before wondered / suggested that we should have StringInfos not
> insist on having one consecutive buffer (which obviously implies needing
> to copy contents when growing). Instead it should have a list of buffers
> containing chunks of the data, and never copy contents around while the
> string is being built. We'd only allocate a buffer big enough for all
> data when the caller actually wants to have all the resulting data in
> one string (rather than using an API that can iterate over chunks).

It's a thought. I doubt it's worth it for small amounts of data, but
for large amounts it might be. On the other hand, a better idea still
might be to size the buffer correctly from the start...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

Andres Freund
Hi,

On 2020-06-09 11:46:09 -0400, Robert Haas wrote:
> On Wed, Jun 3, 2020 at 2:10 PM Andres Freund <[hidden email]> wrote:
> > Why do we need multiple buffers? ISTM we don't want to just send
> > messages at endmsg() time, because that implies unnecessary syscall
> > overhead. Nor do we want to imply the overhead of the copy from the
> > message buffer to the network buffer.
>
> It would only matter if there are multiple messages being constructed
> at the same time, and that's probably not common, but maybe there's
> some way it can happen.

ISTM that it'd be pretty broken if it could happen. We cannot have two
different parts of the system send messages to the client
independently. The protocol is pretty stateful...


> > To me that seems to imply that the best approach would be to have
> > PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
> > record the starting position of the current message somewhere
> > (->cursor?). When an error is thrown, we reset the position to be where
> > the in-progress message would have begun.
>
> Yeah, I thought about that, but then how do you detect the case where two
> different people try to undertake message construction at the same
> time?

Set a boolean and assert out if one already is in progress? We'd need
some state to know where to reset the position to on error anyway.


> Like, with the idea I was proposing, you could still decide to limit
> yourself to 1 buffer at the same time, and just elog() if someone
> tries to allocate a second buffer when you've already reached the
> maximum number of allocated buffers (i.e. one). But if you just have
> one buffer in a global variable and everybody writes into it, you
> might not notice if some unrelated code writes data into that buffer
> in the middle of someone else's message construction. Doing it the way
> I proposed, writing data requires passing a buffer pointer, so you can
> be sure that somebody had to get the buffer from somewhere... and any
> rules you want to enforce can be enforced at that point.

I'd hope that we'd encapsulate the buffer management into file local
variables in pqcomm.c or such, and that code outside of that cannot
access the out buffer directly without using the appropriate helpers.


> > I've before wondered / suggested that we should have StringInfos not
> > insist on having one consecutive buffer (which obviously implies needing
> > to copy contents when growing). Instead it should have a list of buffers
> > containing chunks of the data, and never copy contents around while the
> > string is being built. We'd only allocate a buffer big enough for all
> > data when the caller actually wants to have all the resulting data in
> > one string (rather than using an API that can iterate over chunks).
>
> It's a thought. I doubt it's worth it for small amounts of data, but
> for large amounts it might be. On the other hand, a better idea still
> might be to size the buffer correctly from the start...

I think those are complementary. I do agree that it's useful to size
stringinfos more appropriately immediately (there's an upthread patch
adding a version of initStringInfo() that does so, quite useful for
small stringinfos in particular). But there's enough cases where that's
not really knowable ahead of time that I think it'd be quite useful to
have support for the type of buffer I describe above.

Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

Robert Haas
On Tue, Jun 9, 2020 at 3:23 PM Andres Freund <[hidden email]> wrote:
> ISTM that it'd be pretty broken if it could happen. We cannot have two
> different parts of the system send messages to the client
> independently. The protocol is pretty stateful...

There's a difference between building messages concurrently and
sending them concurrently.

> Set a boolean and assert out if one already is in progress? We'd need
> some state to know where to reset the position to on error anyway.

Sure, that's basically just different notation for the same thing. I
might prefer my notation over yours, but you might prefer the reverse.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

Robert Haas
In reply to this post by Andres Freund
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund <[hidden email]> wrote:

> I don't know what the best non-gross solution for the overhead of the
> out/send functions is. There seems to be at least the following
> major options (and a lots of variants thereof):
>
> 1) Just continue to incur significant overhead for every datum
> 2) Accept the uglyness of passing in a buffer via
>    FunctionCallInfo->context. Change nearly all in-core send functions
>    over to that.
> 3) Pass string buffer through an new INTERNAL argument to send/output
>    function, allow both old/new style send functions. Use a wrapper
>    function to adapt the "old style" to the "new style".
> 4) Like 3, but make the new argument optional, and use ad-hoc
>    stringbuffer if not provided. I don't like the unnecessary branches
>    this adds.

I ran into this problem in another context today while poking at some
pg_basebackup stuff. There's another way of solving this problem which
I think we should consider: just get rid of the per-row stringinfo and
push the bytes directly from wherever they are into PqSendBuffer. Once
we start doing this, we can't error out, because internal_flush()
might've been called, sending a partial message to the client. Trying
to now switch to sending an ErrorResponse will break protocol sync.
But it seems possible to avoid that. Just call all of the output
functions first, and also do any required encoding conversions
(pq_sendcountedtext -> pg_server_to_client). Then, do a bunch of
pq_putbytes() calls to shove the message out -- there's the small
matter of an assertion failure, but whatever. This effectively
collapses two copies into one. Or alternatively, build up an array of
iovecs and then have a variant of pq_putmessage(), like
pq_putmessage_iovec(), that knows what to do with them.

One advantage of this approach is that it approximately doubles the
size of the DataRow message we can send. We're currently limited to
<1GB because of palloc, but the wire protocol just needs it to be <2GB
so that a signed integer does not overflow. It would be nice to buy
more than a factor of two here, but that would require a wire protocol
change, and 2x is not bad.

Another advantage of this approach is that it doesn't require passing
StringInfos all over the place. For the use case that I was looking
at, that appears awkward. I'm not saying I couldn't make it work, but
it wouldn't be my first choice. Right now, I've got data bubbling down
a chain of handlers until it eventually gets sent off to the client;
with your approach, I think I'd need to bubble buffers up and then
bubble data down, which seems quite a bit more complex.

A disadvantage of this approach is that we still end up doing three
copies: one from the datum to the per-datum StringInfo, a second into
PqSendBuffer, and a third from there to the kernel. However, we could
probably improve on this. Whenever we internal_flush(), consider
whether the chunk of data we're in the process of copying (the current
call to pq_putbytes(), or the current iovec) has enough bytes
remaining to completely refill the buffer. If so, secure_write() a
buffer's worth of bytes (or more) directly, bypassing PqSendBuffer.
That way, we avoid individual system calls (or calls to OpenSSL or
GSS) for small numbers of bytes, but we also avoid extra copying when
transmitting larger amounts of data.

Even with that optimization, this still seems like it could end up
being less efficient than your proposal (surprise, surprise). If we've
got a preallocated buffer which we won't be forced to resize during
message construction -- and for DataRow messages we can get there just
by keeping the buffer around, so that we only need to reallocate when
we see a larger message than we've ever seen before -- and we write
all the data directly into that buffer and then send it from there
straight to the kernel, we only ever do 2 copies, whereas what I'm
proposing sometimes does 3 copies and sometimes only 2.

While I admit that's not great, it seems likely to still be a
significant win over what we have now, and it's a *lot* less invasive
than your proposal. Not only does your approach require changing all
of the type-output and type-sending functions inside and outside core
to use this new model, admittedly with the possibility of backward
compatibility, but it also means that we could need similarly invasive
changes in any other place that wants to use this new style of message
construction. You can't write any data anywhere that you might want to
later incorporate into a protocol message unless you write it into a
StringInfo; and not only that, but you have to be able to get the
right amount of data into the right place in the StringInfo right from
the start. I think that in some cases that will require fairly complex
orchestration.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

Robert Haas
In reply to this post by Andres Freund
Here is some review for the first few patches in this series.

I am generally in favor of applying 0001-0003 no matter what else we
decide to do here. However, as might be expected, I have a few
questions and comments.

Regarding 0001:

I dislike the name initStringInfoEx() because we do not use the "Ex"
convention for function names anywhere in the code base. We do
sometimes use "extended", so this could be initStringInfoExtended(),
perhaps, or something more specific, like initStringInfoWithLength().

Regarding the FIXME in that function, I suggest that this should be
the caller's job. Otherwise, there will probably be some caller which
doesn't want the add-one behavior, and then said caller will subtract
one to compensate, and that will be silly.

I am not familiar with pg_restrict and don't entirely understand the
motivation for it here. I suspect you have done some experiments and
figured out that it produces better code, but it would be interesting
to hear more about how you got there. Perhaps there could even be some
brief comments about it. Locutions like this are particularly
confusing to me:

+static inline void
+resetStringInfo(StringInfoData *pg_restrict str)
+{
+       *(char *pg_restrict) (str->data) = '\0';
+       str->len = 0;
+       str->cursor = 0;
+}

I don't understand how this can be saving anything. I think the
restrict definitions here mean that str->data does not overlap with
str itself, but considering that these are unconditional stores, so
what? If the code were doing something like memset(str->data, 0,
str->len) then I'd get it: it might be useful to know for optimization
purposes that the memset isn't overwriting str->len. But what code can
we produce for this that wouldn't be valid if str->data = &str? I
assume this is my lack of understanding rather than an actual problem
with the patch, but I would be grateful if you could explain.

It is easier to see why the pg_restrict stuff you've introduced into
appendBinaryStringInfoNT is potentially helpful: e.g. in
appendBinaryStringInfoNT, it promises that memcpy can't clobber
str->len, so the compiler is free to reorder without changing the
results. Or so I imagine. But then the one in appendBinaryStringInfo()
confuses me again: if str->data is already declared as a restricted
pointer, then why do we need to cast str->data + str->len to be
restricted also?

In appendStringInfoChar, why do we need to cast to restrict twice? Can
we not just do something like this:

char *pg_restrict ep = str->data+str->len;
ep[0] = ch;
ep[1] = '\0';

Regarding 0002:

Totally mechanical, seems fine.

Regarding 0003:

For the same reasons as above, I suggest renaming pq_begintypsend_ex()
to pq_begintypsend_extended() or pq_begintypsend_with_length() or
something of that sort, rather than using _ex.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why is pq_begintypsend so slow?

Andres Freund
Hi,

On 2020-07-31 11:14:46 -0400, Robert Haas wrote:
> Here is some review for the first few patches in this series.

Thanks!


> I am generally in favor of applying 0001-0003 no matter what else we
> decide to do here. However, as might be expected, I have a few
> questions and comments.
>
> Regarding 0001:
>
> I dislike the name initStringInfoEx() because we do not use the "Ex"
> convention for function names anywhere in the code base. We do
> sometimes use "extended", so this could be initStringInfoExtended(),
> perhaps, or something more specific, like initStringInfoWithLength().

I dislike the length of the function name, but ...


> Regarding the FIXME in that function, I suggest that this should be
> the caller's job. Otherwise, there will probably be some caller which
> doesn't want the add-one behavior, and then said caller will subtract
> one to compensate, and that will be silly.

Fair point.


> I am not familiar with pg_restrict and don't entirely understand the
> motivation for it here. I suspect you have done some experiments and
> figured out that it produces better code, but it would be interesting
> to hear more about how you got there. Perhaps there could even be some
> brief comments about it. Locutions like this are particularly
> confusing to me:
>
> +static inline void
> +resetStringInfo(StringInfoData *pg_restrict str)
> +{
> +       *(char *pg_restrict) (str->data) = '\0';
> +       str->len = 0;
> +       str->cursor = 0;
> +}

The restrict tells the compiler that 'str' and 'str->data' won't be
pointing to the same memory, which can simplify the code it generates.
E.g. it allows the compiler to keep str->data in a register instead of
reloading it for the next appendStringInfo*. Without the restrict it
can't, because the '\0' store through str->data could otherwise
overwrite part of the ->data pointer itself, if ->data pointed into the
StringInfo. Similarly, the store to str->len could overwrite str->data
in some other cases.
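
To illustrate with a contrived, standalone example (not actual
StringInfo code; plain C99 restrict here stands in for pg_restrict):

typedef struct Buf
{
	char	   *data;
	int			len;
} Buf;

/*
 * Without restrict, the char store may alias b->data or b->len, so the
 * compiler has to reload both fields before the second append.
 */
static void
two_appends_plain(Buf *b, char c1, char c2)
{
	b->data[b->len++] = c1;
	b->data[b->len++] = c2;
}

/*
 * With restrict on both pointers, the compiler may assume the stores
 * through 'd' cannot touch *b, so it can keep d and b->len in
 * registers across the two appends.
 */
static void
two_appends_restrict(Buf *restrict b, char c1, char c2)
{
	char	   *restrict d = b->data;

	d[b->len++] = c1;
	d[b->len++] = c2;
}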

Part of the reason we need to add the markers is that we compile with
-fno-strict-aliasing. But even if we didn't, this case wouldn't be
solved without an explicit marker, because char * is allowed to alias
anything...

Besides keeping ->data in a register, the restrict also lets the
compiler elide the null-byte write entirely in some cases, e.g. when the
resetStringInfo() is followed by an appendStringInfoString(, "constant").


> I don't understand how this can be saving anything. I think the
> restrict definitions here mean that str->data does not overlap with
> str itself, but considering that these are unconditional stores, so
> what? If the code were doing something like memset(str->data, 0,
> str->len) then I'd get it: it might be useful to know for optimization
> purposes that the memset isn't overwriting str->len. But what code can
> we produce for this that wouldn't be valid if str->data = &str? I
> assume this is my lack of understanding rather than an actual problem
> with the patch, but I would be grateful if you could explain.

I hope the above makes this make sense now? It's about subsequent uses of
the StringInfo, rather than the body of resetStringInfo itself.


> It is easier to see why the pg_restrict stuff you've introduced into
> appendBinaryStringInfoNT is potentially helpful: e.g. in
> appendBinaryStringInfoNT, it promises that memcpy can't clobber
> str->len, so the compiler is free to reorder without changing the
> results. Or so I imagine. But then the one in appendBinaryStringInfo()
> confuses me again: if str->data is already declared as a restricted
> pointer, then why do we need to cast str->data + str->len to be
> restricted also?

But str->data isn't declared restricted without the explicit use of
restrict? str is restrict'ed, but it doesn't apply "recursively" to all
pointers contained therein.


> In appendStringInfoChar, why do we need to cast to restrict twice? Can
> we not just do something like this:
>
> char *pg_restrict ep = str->data+str->len;
> ep[0] = ch;
> ep[1] = '\0';

I don't think that'd tell the compiler that this couldn't overlap with
str itself? A single 'restrict' can never (?) help, you need *two*
things that are marked as not overlapping in any way.


Greetings,

Andres Freund



Re: Why is pq_begintypsend so slow?

Robert Haas
On Fri, Jul 31, 2020 at 11:50 AM Andres Freund <[hidden email]> wrote:
> I hope the above makes this make sense now? It's about subsequent uses of
> the StringInfo, rather than the body of resetStringInfo itself.

That does make sense, except that
https://en.cppreference.com/w/c/language/restrict says "During each
execution of a block in which a restricted pointer P is declared
(typically each execution of a function body in which P is a function
parameter), if some object that is accessible through P (directly or
indirectly) is modified, by any means, then all accesses to that
object (both reads and writes) in that block must occur through P
(directly or indirectly), otherwise the behavior is undefined." So my
interpretation of this was that it couldn't really affect what
happened outside of the function itself, even if the compiler chose to
perform inlining. But I think I see what you're saying: the *inference*
is only valid with respect to restrict pointers in a particular
function, but what can be optimized as a result of that inference may
be something further afield, if inlining is performed. Perhaps we
could add a comment about this, e.g.

Marking these pointers with pg_restrict tells the compiler that str
and str->data can't overlap, which may allow the compiler to optimize
better when this code is inlined. For example, it may be possible to
keep str->data in a register across consecutive appendStringInfoString
operations.

Since pg_restrict is not widely used, I think it's worth adding this
kind of annotation, lest other hackers get confused. I'm probably not
the only one who isn't on top of this.

> > In appendStringInfoChar, why do we need to cast to restrict twice? Can
> > we not just do something like this:
> >
> > char *pg_restrict ep = str->data+str->len;
> > ep[0] = ch;
> > ep[1] = '\0';
>
> I don't think that'd tell the compiler that this couldn't overlap with
> str itself? A single 'restrict' can never (?) help, you need *two*
> things that are marked as not overlapping in any way.

But what's the difference between:

+       *(char *pg_restrict) (str->data + str->len) = ch;
+       str->len++;
+       *(char *pg_restrict) (str->data + str->len) = '\0';

And:

char *pg_restrict ep = str->data+str->len;
ep[0] = ch;
ep[1] = '\0';
++str->len;

Whether or not str itself is marked restricted is another question;
what I'm talking about is why we need to repeat (char *pg_restrict)
(str->data + str->len).

I don't have any further comment on the remainder of your reply.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

