why doesn't DestroyPartitionDirectory hash_destroy?

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

why doesn't DestroyPartitionDirectory hash_destroy?

Amit Langote-2
Hi,

I'm curious why DestroyPartitionDirectory doesn't do
hash_destroy(pdir->pdir_hash)?

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Kyotaro HORIGUCHI-2
At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>
> Hi,
>
> I'm curious why DestroyPartitionDirectory doesn't do
> hash_destroy(pdir->pdir_hash)?

Maybe it is trashed involved in destruction of es_query_cxt or
planner_cxt?

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Amit Langote-2
On 2019/03/14 16:32, Kyotaro HORIGUCHI wrote:
> At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>
>> Hi,
>>
>> I'm curious why DestroyPartitionDirectory doesn't do
>> hash_destroy(pdir->pdir_hash)?
>
> Maybe it is trashed involved in destruction of es_query_cxt or
> planner_cxt?

Hmm, the executor's partition directory (its hash table) is indeed
allocated under es_query_cxt.  But, the planner's partition directory is
not allocated under planner_cxt, it appears to be using memory under
MessageContext.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Amit Langote-2
On 2019/03/14 16:46, Amit Langote wrote:

> On 2019/03/14 16:32, Kyotaro HORIGUCHI wrote:
>> At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>
>>> Hi,
>>>
>>> I'm curious why DestroyPartitionDirectory doesn't do
>>> hash_destroy(pdir->pdir_hash)?
>>
>> Maybe it is trashed involved in destruction of es_query_cxt or
>> planner_cxt?
>
> Hmm, the executor's partition directory (its hash table) is indeed
> allocated under es_query_cxt.  But, the planner's partition directory is
> not allocated under planner_cxt, it appears to be using memory under
> MessageContext.

I forgot to mention that it would be wrong to put it under planner_cxt, as
it's the context for planning a given subquery, whereas a partition
directory is maintained throughout the whole planning.

Thanks,
Amit



Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Kyotaro HORIGUCHI-2
At Thu, 14 Mar 2019 17:18:29 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>

> On 2019/03/14 16:46, Amit Langote wrote:
> > On 2019/03/14 16:32, Kyotaro HORIGUCHI wrote:
> >> At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>
> >>> Hi,
> >>>
> >>> I'm curious why DestroyPartitionDirectory doesn't do
> >>> hash_destroy(pdir->pdir_hash)?
> >>
> >> Maybe it is trashed involved in destruction of es_query_cxt or
> >> planner_cxt?
> >
> > Hmm, the executor's partition directory (its hash table) is indeed
> > allocated under es_query_cxt.  But, the planner's partition directory is
> > not allocated under planner_cxt, it appears to be using memory under
> > MessageContext.

CurrentMemoryContext? It is PortalContext while planning CLUSTER
scan. And it seems to be the same with planner_cxt with several
narrow exceptions..

I think everything linked from PlannrInfo ought to be allocated
in the top planner's planner_cxt if finally not expilcitly
delinked and I *believe* subquery's planner_cxt is always same
with that of the top-level subquery_planner.

> I forgot to mention that it would be wrong to put it under planner_cxt, as
> it's the context for planning a given subquery, whereas a partition
> directory is maintained throughout the whole planning.

So if the ParitionDirectory is allocaed explicitly in
MessageContext, it would be danger on CLUSTER command. But as I
see in the code, if it is in CurrentMemoryContext it would be
safe but I think it should be fixed to use planner_cxt.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Robert Haas
In reply to this post by Amit Langote-2
On Thu, Mar 14, 2019 at 3:13 AM Amit Langote
<[hidden email]> wrote:
> I'm curious why DestroyPartitionDirectory doesn't do
> hash_destroy(pdir->pdir_hash)?

What would be the point?  It's more efficient to let context teardown
take care of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Thu, Mar 14, 2019 at 3:13 AM Amit Langote
> <[hidden email]> wrote:
>> I'm curious why DestroyPartitionDirectory doesn't do
>> hash_destroy(pdir->pdir_hash)?

> What would be the point?  It's more efficient to let context teardown
> take care of it.

Agreed, but the comments in this area are crap.  Why doesn't
CreatePartitionDirectory say something like

 * The object lives inside the given memory context and will be
 * freed when that context is destroyed.  Nonetheless, the caller
 * must *also* ensure that (unless the transaction is aborted)
 * DestroyPartitionDirectory is called before that happens, else
 * we may leak some relcache reference counts.

It's completely not acceptable that every reader of this code should
have to reverse-engineer these design assumptions, especially given
how shaky they are.

There's an independent question as to whether the planner's use of
the feature is specifying a safe memory context.  Has this code been
exercised under GEQO?

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Tom Lane-2
I wrote:
> Agreed, but the comments in this area are crap.

Actually, now that I've absorbed a bit more about 898e5e329,
I don't like very much about it at all.  I think having it
try to hang onto pointers into the relcache is a completely
wrongheaded design decision, and the right way for it to work
is to just copy the PartitionDescs out of the relcache so that
they're fully owned by the PartitionDirectory.  I don't see
a CopyPartitionDesc function anywhere (maybe it's named something
else?) but it doesn't look like it'd be hard to build; most
of the work is in partition_bounds_copy() which does exist already.

Also, at least so far as the planner's usage is concerned, claiming
that we're saving something by not copying is completely bogus,
because if we look into set_relation_partition_info, what do we
find but a partition_bounds_copy call.  That wouldn't be necessary
if we could rely on the PartitionDirectory to own the data structure.
(Maybe it's not necessary today.  But given what a house of cards
this is, I wouldn't propose ripping it out, just moving it into
the PartitionDirectory code.)

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Robert Haas
In reply to this post by Tom Lane-2
On Thu, Mar 14, 2019 at 12:56 PM Tom Lane <[hidden email]> wrote:

> Agreed, but the comments in this area are crap.  Why doesn't
> CreatePartitionDirectory say something like
>
>  * The object lives inside the given memory context and will be
>  * freed when that context is destroyed.  Nonetheless, the caller
>  * must *also* ensure that (unless the transaction is aborted)
>  * DestroyPartitionDirectory is called before that happens, else
>  * we may leak some relcache reference counts.
>
> It's completely not acceptable that every reader of this code should
> have to reverse-engineer these design assumptions, especially given
> how shaky they are.

Well, one reason is that everything you just said is basically
self-evident.  If you spend 5 seconds looking at the header file,
you'll see that there is a CreatePartitionDirectory() function and a
DestroyPartitionDirectory() function, and so you'll probably figure
out that the latter is intended to be called rather than just ignored.
You will probably also guess that it doesn't need to be called if
there's an ERROR, just like basically everything else in PostgreSQL.
And if you want to know why, you can look at the code and you
shouldn't have any trouble determining that it releases relcache ref
counts, which may tip you off that if you don't call it, some relcache
refcounts will not be released.

But, look, I wrote the code.  What's clear to me may not be clear to
everybody.  I have to admit I'm kinda surprised that this is the thing
that is confusing to anybody, but if it is, then sure, let's add the
comment!

> There's an independent question as to whether the planner's use of
> the feature is specifying a safe memory context.  Has this code been
> exercised under GEQO?

Probably not.  That's a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Tom Lane-2
Robert Haas <[hidden email]> writes:
> On Thu, Mar 14, 2019 at 12:56 PM Tom Lane <[hidden email]> wrote:
>> It's completely not acceptable that every reader of this code should
>> have to reverse-engineer these design assumptions, especially given
>> how shaky they are.

> Well, one reason is that everything you just said is basically
> self-evident.  If you spend 5 seconds looking at the header file,
> you'll see that there is a CreatePartitionDirectory() function and a
> DestroyPartitionDirectory() function, and so you'll probably figure
> out that the latter is intended to be called rather than just ignored.
> You will probably also guess that it doesn't need to be called if
> there's an ERROR, just like basically everything else in PostgreSQL.
> And if you want to know why, you can look at the code and you
> shouldn't have any trouble determining that it releases relcache ref
> counts, which may tip you off that if you don't call it, some relcache
> refcounts will not be released.

So here's my problem with that argument: you're effectively saying that
you needn't write any API spec for the PartitionDirectory functions
because you intend that every person calling them will read their code,
carefully and fully, before using them.  This is not my idea of sound
software engineering.  If you need me to spell out why not, I will do
so, but I'd like to think that I needn't explain abstraction to a senior
committer.

But anyway, it's somewhat moot because I think we need to change this
API spec anyhow, per the other thread.  PartitionDirectory should not
be holding on to relcache refcounts, which would make
DestroyPartitionDirectory unnecessary.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Robert Haas
In reply to this post by Tom Lane-2
On Thu, Mar 14, 2019 at 1:16 PM Tom Lane <[hidden email]> wrote:
> Actually, now that I've absorbed a bit more about 898e5e329,
> I don't like very much about it at all.  I think having it
> try to hang onto pointers into the relcache is a completely
> wrongheaded design decision, and the right way for it to work
> is to just copy the PartitionDescs out of the relcache so that
> they're fully owned by the PartitionDirectory.  I don't see
> a CopyPartitionDesc function anywhere (maybe it's named something
> else?) but it doesn't look like it'd be hard to build; most
> of the work is in partition_bounds_copy() which does exist already.

Yeah, we could do that.  I have to admit that I don't necessarily
understand why trying to hang onto pointers into the relcache is a bad
idea.  It is a bit complicated, but the savings in both memory and CPU
time seem worth pursuing.  There are a lot of users who wish we scaled
to a million partitions rather than a hundred, and just copying
everything all over the place all the time won't get us closer to that
goal.

More generally, I think get_relation_info() is becoming an
increasingly nasty piece of work.  It copies more and more stuff
instead of just pointing to it, which is necessary mostly because it
closes the table instead of arranging to do that at the end of query
planning.  If it did the opposite, the refcount held by the planner
would make it unnecessary for the PartitionDirectory to hold one, and
I bet we could also just point to a bunch of the other stuff in this
function rather than copying that stuff, too.  As time goes by,
relcache entries are getting more and more complex, and the optimizer
wants to use more and more data from them for planning purposes, but,
probably partly because of inertia, we're clinging to an old design
where everything has to be copied.  Every time someone gets that
wrong, and it's happened a number of times, we yell at them and tell
them to copy more stuff instead of thinking up a design where stuff
doesn't need to be copied.  I think that's a mistake.  You have
previously disagreed with that position so you probably will now, too,
but I still think it.

> Also, at least so far as the planner's usage is concerned, claiming
> that we're saving something by not copying is completely bogus,
> because if we look into set_relation_partition_info, what do we
> find but a partition_bounds_copy call.  That wouldn't be necessary
> if we could rely on the PartitionDirectory to own the data structure.
> (Maybe it's not necessary today.  But given what a house of cards
> this is, I wouldn't propose ripping it out, just moving it into
> the PartitionDirectory code.)

Ugh, I didn't notice the partition_bounds_copy() call in that
function.  Oops.  Given the foregoing griping, you won't be surprised
to hear that I'd rather just remove the copying step.  However, it
sounds like we can do that no matter whether we stick with my design
or switch to yours, because PartitionDirectory holds a relcache refcnt
then the pointer will be stable, and if we deep-copy the whole data
structure then the pointer will also be stable.  Prior to the commit
at issue, we weren't doing either of those things, so that copy was
needed until very recently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Robert Haas
In reply to this post by Tom Lane-2
On Thu, Mar 14, 2019 at 1:40 PM Tom Lane <[hidden email]> wrote:
> So here's my problem with that argument: you're effectively saying that
> you needn't write any API spec for the PartitionDirectory functions
> because you intend that every person calling them will read their code,
> carefully and fully, before using them.  This is not my idea of sound
> software engineering.  If you need me to spell out why not, I will do
> so, but I'd like to think that I needn't explain abstraction to a senior
> committer.

I think you're attacking a straw man.  I expect you don't seriously
believe that I lack an understanding of abstraction.  However,
abstraction doesn't mean that the comment for CreatePartitionDirectory
must describe what DestroyPartitionDirectory is going to do
internally, as you seem to be proposing.  Had I thought about this
issue more sooner, I think I would have guessed you would be opposed
to such a comment, since the chances of someone neglecting to update
it when changing DestroyPartitionDirectory seem to be non-negligible.
At the same time, and as I already said, I am also fine to improve the
comments for these functions.  The fact that both you and Amit found
them inadequate - albeit in somewhat different ways - shows that they
need improvement.  However, that doesn't mean that what I did was
flagrantly unreasonable or that I'm full of crap.  Those things may be
true, but this isn't believable evidence of it.

If we're going to start talking about comments that make inadequate
mention of important memory management details, I think a much better
place to start than anything that I did in this commit would be the
get_relation_info() function we were just discussing in a different
part of this email thread.  As I said over there, people keep failing
to understand that any data you want to use during query planning has
got to be copied out of the relcache in that function -- and this is a
pretty subtle hazard, actually, because it works fine unless you get a
cache flush at the wrong time or build with CLOBBER_CACHE_ALWAYS.
Failing to call DestroyPartitionDirectory() breaks in a much more
obvious way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Amit Langote-2
In reply to this post by Robert Haas
On 2019/03/15 1:02, Robert Haas wrote:
> On Thu, Mar 14, 2019 at 3:13 AM Amit Langote
> <[hidden email]> wrote:
>> I'm curious why DestroyPartitionDirectory doesn't do
>> hash_destroy(pdir->pdir_hash)?
>
> What would be the point?  It's more efficient to let context teardown
> take care of it.

Yeah, I only noticed that after posting my email.

As I said in another reply, while the executor's partition directory is
set up and torn down under a dedicated memory context used for execution
(es_query_context), planner's is stuck into MessageContext.  But all of
the other stuff that planner allocates goes into it too, so maybe it's fine.

Thanks,
Amit


Reply | Threaded
Open this post in threaded view
|

Re: why doesn't DestroyPartitionDirectory hash_destroy?

Amit Langote-2
In reply to this post by Robert Haas
On 2019/03/15 2:56, Robert Haas wrote:

> On Thu, Mar 14, 2019 at 1:16 PM Tom Lane <[hidden email]> wrote:
>> Actually, now that I've absorbed a bit more about 898e5e329,
>> I don't like very much about it at all.  I think having it
>> try to hang onto pointers into the relcache is a completely
>> wrongheaded design decision, and the right way for it to work
>> is to just copy the PartitionDescs out of the relcache so that
>> they're fully owned by the PartitionDirectory.  I don't see
>> a CopyPartitionDesc function anywhere (maybe it's named something
>> else?) but it doesn't look like it'd be hard to build; most
>> of the work is in partition_bounds_copy() which does exist already.
>
> Yeah, we could do that.  I have to admit that I don't necessarily
> understand why trying to hang onto pointers into the relcache is a bad
> idea.  It is a bit complicated, but the savings in both memory and CPU
> time seem worth pursuing.  There are a lot of users who wish we scaled
> to a million partitions rather than a hundred, and just copying
> everything all over the place all the time won't get us closer to that
> goal.
>
> More generally, I think get_relation_info() is becoming an
> increasingly nasty piece of work.  It copies more and more stuff
> instead of just pointing to it, which is necessary mostly because it
> closes the table instead of arranging to do that at the end of query
> planning.  If it did the opposite, the refcount held by the planner
> would make it unnecessary for the PartitionDirectory to hold one, and
> I bet we could also just point to a bunch of the other stuff in this
> function rather than copying that stuff, too.  As time goes by,
> relcache entries are getting more and more complex, and the optimizer
> wants to use more and more data from them for planning purposes, but,
> probably partly because of inertia, we're clinging to an old design
> where everything has to be copied.  Every time someone gets that
> wrong, and it's happened a number of times, we yell at them and tell
> them to copy more stuff instead of thinking up a design where stuff
> doesn't need to be copied.  I think that's a mistake.  You have
> previously disagreed with that position so you probably will now, too,
> but I still think it.

+1.

>> Also, at least so far as the planner's usage is concerned, claiming
>> that we're saving something by not copying is completely bogus,
>> because if we look into set_relation_partition_info, what do we
>> find but a partition_bounds_copy call.  That wouldn't be necessary
>> if we could rely on the PartitionDirectory to own the data structure.
>> (Maybe it's not necessary today.  But given what a house of cards
>> this is, I wouldn't propose ripping it out, just moving it into
>> the PartitionDirectory code.)
>
> Ugh, I didn't notice the partition_bounds_copy() call in that
> function.  Oops.  Given the foregoing griping, you won't be surprised
> to hear that I'd rather just remove the copying step.  However, it
> sounds like we can do that no matter whether we stick with my design
> or switch to yours, because PartitionDirectory holds a relcache refcnt
> then the pointer will be stable, and if we deep-copy the whole data
> structure then the pointer will also be stable.  Prior to the commit
> at issue, we weren't doing either of those things, so that copy was
> needed until very recently.

One of the patches I've proposed in the "speed up planning with
partitions" thread [1] gets rid of the partition_bounds_copy() call,
because: 1) I too think it's unnecessary as the PartitionBoundInfo won't
change (logically or physically) as long as we've got some lock on the
table, which we do, 2) I've seen it become a significant bottleneck as the
number of partitions crosses thousands.

Thanks,
Amit

[1] https://commitfest.postgresql.org/22/1778/