A few more opportunities to use TupleTableSlotOps fields

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

A few more opportunities to use TupleTableSlotOps fields

Antonin Houska-2
Perhaps it's just a matter of taste, but I think the TupleTableSlotOps
structure, once initialized, should be used wherever possible. At least for me
personally, when I read the code, the particular callback function name is a
bit disturbing wherever it's not necessary.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com


Use_TupleTableSlotOps_where_possible.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: A few more opportunities to use TupleTableSlotOps fields

Robert Haas
On Tue, May 21, 2019 at 8:02 AM Antonin Houska <[hidden email]> wrote:
> Perhaps it's just a matter of taste, but I think the TupleTableSlotOps
> structure, once initialized, should be used wherever possible. At least for me
> personally, when I read the code, the particular callback function name is a
> bit disturbing wherever it's not necessary.

But it's significantly more efficient.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: A few more opportunities to use TupleTableSlotOps fields

Antonin Houska-2
Robert Haas <[hidden email]> wrote:

> On Tue, May 21, 2019 at 8:02 AM Antonin Houska <[hidden email]> wrote:
> > Perhaps it's just a matter of taste, but I think the TupleTableSlotOps
> > structure, once initialized, should be used wherever possible. At least for me
> > personally, when I read the code, the particular callback function name is a
> > bit disturbing wherever it's not necessary.
>
> But it's significantly more efficient.

Do you refer to the fact that for example the address of

        tts_virtual_clear(dstslot);

is immediately available in the text section while in this case

        dstslot->tts_ops->clear(dstslot);

the CPU first needs to fetch the address of tts_ops and also that of the
->clear function?

I admit I didn't think about this problem. Nevertheless I imagine that due to
constness of the variables like TTSOpsVirtual (and due to several other const
declarations) the compiler might be able to compute the address of the
tts_ops->clear() expression. Thus the only extra work for the CPU would be to
fetch tts_ops from dstslot, but that should not be a big deal because other
fields of dstslot are accessed by the surrounding code (so all of them might
be available in the CPU cache).

I don't pretend to be an expert in this area though, it's possible that I
still miss something.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com


Reply | Threaded
Open this post in threaded view
|

Re: A few more opportunities to use TupleTableSlotOps fields

Robert Haas
On Tue, May 21, 2019 at 10:48 AM Antonin Houska <[hidden email]> wrote:

> Do you refer to the fact that for example the address of
>
>         tts_virtual_clear(dstslot);
>
> is immediately available in the text section while in this case
>
>         dstslot->tts_ops->clear(dstslot);
>
> the CPU first needs to fetch the address of tts_ops and also that of the
> ->clear function?

Yes.  And since tts_virtual_clear() is marked static, it seems like it
might even possible for it to inline that function at compile time.

> I admit I didn't think about this problem. Nevertheless I imagine that due to
> constness of the variables like TTSOpsVirtual (and due to several other const
> declarations) the compiler might be able to compute the address of the
> tts_ops->clear() expression. Thus the only extra work for the CPU would be to
> fetch tts_ops from dstslot, but that should not be a big deal because other
> fields of dstslot are accessed by the surrounding code (so all of them might
> be available in the CPU cache).

I think the issue is pipeline stalls.  If the compiler can see a
direct call coming up, it can start fetching the instructions from the
target address.  If it sees an indirect call, that's probably harder
to do.

But really, I'm not an expect on this area.  I would write the code as
Andres did on the general principle of making things as easy for the
compiler and CPU as possible, but I do not know how much it really
matters.  Andres probably does know...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply | Threaded
Open this post in threaded view
|

Re: A few more opportunities to use TupleTableSlotOps fields

Andres Freund
Hi,

On 2019-05-21 12:16:22 -0400, Robert Haas wrote:

> On Tue, May 21, 2019 at 10:48 AM Antonin Houska <[hidden email]> wrote:
> > Do you refer to the fact that for example the address of
> >
> >         tts_virtual_clear(dstslot);
> >
> > is immediately available in the text section while in this case
> >
> >         dstslot->tts_ops->clear(dstslot);
> >
> > the CPU first needs to fetch the address of tts_ops and also that of the
> > ->clear function?
>
> Yes.  And since tts_virtual_clear() is marked static, it seems like it
> might even possible for it to inline that function at compile time.

I sure hope so. And I did verify that at some point. We, for example,
don't want changes to slot->tts_flags tts_virtual_clear does to be
actually written to memory, if it's called from a callsite that knows
it's a virtual slot.

I just checked, and for me tts_virtual_copyslot() inlines all of
tts_virtual_clear(), and eliminates most of its contents except for the
pfree() branch. All the rest of the state changes are overwritten by
tts_virtual_copyslot() anyway.


> > I admit I didn't think about this problem. Nevertheless I imagine that due to
> > constness of the variables like TTSOpsVirtual (and due to several other const
> > declarations) the compiler might be able to compute the address of the
> > tts_ops->clear() expression. Thus the only extra work for the CPU would be to
> > fetch tts_ops from dstslot, but that should not be a big deal because other
> > fields of dstslot are accessed by the surrounding code (so all of them might
> > be available in the CPU cache).
>
> I think the issue is pipeline stalls.  If the compiler can see a
> direct call coming up, it can start fetching the instructions from the
> target address.  If it sees an indirect call, that's probably harder
> to do.

Some CPUs can do so, but it'll often be more expensive / have a higher
chance of misspeculating (rather than the 0 chance in case of a straight
line code.


> But really, I'm not an expect on this area.  I would write the code as
> Andres did on the general principle of making things as easy for the
> compiler and CPU as possible, but I do not know how much it really
> matters.  Andres probably does know...

I think the inlining bit is more crucial, but that having as few
indirect calls as possible matters here. It wasn't that easy to get the
slot virtualization to not regress performance meaningfully.

If anything, I really want to go the *opposite* direction, i.e. remove
*more* indirect calls from within execTuples.c, and get the compiler to
realize at callsites external to execTuples.c that it can cache tts_ops
in more places.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: A few more opportunities to use TupleTableSlotOps fields

Andres Freund
In reply to this post by Antonin Houska-2
Hi,

On 2019-05-21 16:47:50 +0200, Antonin Houska wrote:
> I admit I didn't think about this problem. Nevertheless I imagine that due to
> constness of the variables like TTSOpsVirtual (and due to several other const
> declarations) the compiler might be able to compute the address of the
> tts_ops->clear() expression.

It really can't, without actually fetching tts_ops, and reading the
callback's location. How would e.g. tts_virtual_copyslot() know that the
slot's tts_ops point to TTSOpsVirtual? There's simply no way to express
that in C.  If this were a class in C++, the compiler would have decent
chance at it these days (because if it's a final method it can infer
that it has to be, and because whole program optimization allows
devirtualization passes to do so), but well, it's not.

And then there's the whole inlining issue explained in my other recent
mail on the topic.

Greetings,

Andres Freund