segmentation fault using currtid and partitioned tables

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

segmentation fault using currtid and partitioned tables

Jaime Casanova-4
Hi,

Another one caught by sqlsmith, on the regression database run this
query (using any non-partitioned table works fine):

"""
select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid;
"""

This works on 11 (well it gives an error because the file doesn't
exists) but crash the server on 12+

attached the stack trace from master

--
Jaime Casanova                      www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

gdb.txt (13K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Tom Lane-2
Jaime Casanova <[hidden email]> writes:
> Another one caught by sqlsmith, on the regression database run this
> query (using any non-partitioned table works fine):
> select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid;

Hm, so

(1) currtid_byreloid and currtid_byrelname lack any check to see
if they're dealing with a relkind that lacks storage.

(2) The proximate cause of the crash is that rd_tableam is zero,
so that the interface functions in tableam.h just crash hard.
This seems like a pretty bad thing; does anyone want to promise
that there are no other oversights of the same ilk elsewhere,
and never will be?

I think it might be a good idea to make relations-without-storage
set up rd_tableam as a vector of dummy functions that will throw
some suitable complaint about "relation lacks storage".  NULL is
a horrible default for this.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Sun, Apr 05, 2020 at 12:51:56PM -0400, Tom Lane wrote:
> I think it might be a good idea to make relations-without-storage
> set up rd_tableam as a vector of dummy functions that will throw
> some suitable complaint about "relation lacks storage".  NULL is
> a horrible default for this.

Yeah, that's not good, but I am not really comfortable with the
concept of implying that (pg_class.relam == InvalidOid) maps to a
dummy AM callback set instead of NULL for rd_tableam.  That feels less
natural.  As mentioned upthread, the error that we get in ~11 is
confusing as well when using a relation that has no storage:
ERROR:  58P01: could not open file "base/16384/16385": No such file or directory

I have been looking at the tree and the use of the table AM APIs, and
those TID lookups are really a particular case compared to the other
callers of the table AM callbacks.  Anyway, I have not spotted similar
problems, so I find very tempting the option to just add some
RELKIND_HAS_STORAGE() to tid.c where it matters and call it a day.

Andres, what do you think?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Wed, Apr 08, 2020 at 04:13:31PM +0900, Michael Paquier wrote:
> I have been looking at the tree and the use of the table AM APIs, and
> those TID lookups are really a particular case compared to the other
> callers of the table AM callbacks.  Anyway, I have not spotted similar
> problems, so I find very tempting the option to just add some
> RELKIND_HAS_STORAGE() to tid.c where it matters and call it a day.

Playing more with this stuff, it happens that we have zero code
coverage for currtid() and currtid2(), and the main user of those
functions I can find around is the ODBC driver:
https://coverage.postgresql.org/src/backend/utils/adt/tid.c.gcov.html

There are multiple cases to consider, particularly for views:
- Case of a view with ctid as attribute taken from table.
- Case of a view with ctid as attribute with incorrect attribute
type.
It is worth noting that all those code paths can trigger various
elog() errors, which is not something that a user should be able to do
using a SQL-callable function.  There are also two code paths for
cases where a view has no or more-than-one SELECT rules, which cannot
normally be reached.

All in that, I propose something like the attached to patch the
surroundings with tests to cover everything I could think of, which I
guess had better be backpatched?  While on it, I have noticed that we
lack coverage for max(tid) and min(tid), so I have included a bonus
test.

Another issue is that we don't have any documentation for those
functions, in which case the best fit is a subsection for TID
operators under "Functions and Operators"?

For now, I am adding a patch to next CF so as we don't forget about
this set of issues.  Any thoughts?
--
Michael

tid-funcs-fixes-v1.patch (12K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Alvaro Herrera-9
On 2020-Apr-09, Michael Paquier wrote:

> Playing more with this stuff, it happens that we have zero code
> coverage for currtid() and currtid2(), and the main user of those
> functions I can find around is the ODBC driver:
> https://coverage.postgresql.org/src/backend/utils/adt/tid.c.gcov.html

Yeah, they're there solely for ODBC as far as I know.

> There are multiple cases to consider, particularly for views:
> - Case of a view with ctid as attribute taken from table.
> - Case of a view with ctid as attribute with incorrect attribute
> type.
> It is worth noting that all those code paths can trigger various
> elog() errors, which is not something that a user should be able to do
> using a SQL-callable function.  There are also two code paths for
> cases where a view has no or more-than-one SELECT rules, which cannot
> normally be reached.

> All in that, I propose something like the attached to patch the
> surroundings with tests to cover everything I could think of, which I
> guess had better be backpatched?

I don't know, but this stuff is so unused that your patch seems
excessive ... and I think we'd rather not backpatch something so large.
I propose we do something less invasive in the backbranches, like just
throw elog() errors (nothing fancy) where necessary to avoid the
crashes.  Even for pg12 it seems that that should be sufficient.

For pg13 and beyond, I liked Tom's idea of installing dummy functions
for tables without storage -- that seems safer.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Fri, May 22, 2020 at 07:32:57PM -0400, Alvaro Herrera wrote:
> I don't know, but this stuff is so unused that your patch seems
> excessive ... and I think we'd rather not backpatch something so large.
> I propose we do something less invasive in the backbranches, like just
> throw elog() errors (nothing fancy) where necessary to avoid the
> crashes.  Even for pg12 it seems that that should be sufficient.

Even knowing that those trigger a bunch of elog()s which are not
something that should be user-triggerable?  :)

Perhaps you are right though, and that we don't need to spend this
much energy into improving the error messages so I am fine to discard
this part.  At the end, in order to remove the crashes, you just need
to keep around the two RELKIND_HAS_STORAGE() checks.  But I would
rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED)
instead of elog(), and keep the test coverage of the previous patch
(including the tests for the aggregates I noticed were missing).
Would you be fine with that?

> For pg13 and beyond, I liked Tom's idea of installing dummy functions
> for tables without storage -- that seems safer.

Not sure about that for v13.  That would be invasive post-beta.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Mon, May 25, 2020 at 06:29:10PM +0900, Michael Paquier wrote:
> Perhaps you are right though, and that we don't need to spend this
> much energy into improving the error messages so I am fine to discard
> this part.  At the end, in order to remove the crashes, you just need
> to keep around the two RELKIND_HAS_STORAGE() checks.  But I would
> rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED)
> instead of elog(), and keep the test coverage of the previous patch
> (including the tests for the aggregates I noticed were missing).
> Would you be fine with that?

And this means the attached.  Thoughts are welcome.
--
Michael

tid-funcs-fixes-v2.patch (9K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Jaime Casanova-4
On Mon, 25 May 2020 at 22:01, Michael Paquier <[hidden email]> wrote:

>
> On Mon, May 25, 2020 at 06:29:10PM +0900, Michael Paquier wrote:
> > Perhaps you are right though, and that we don't need to spend this
> > much energy into improving the error messages so I am fine to discard
> > this part.  At the end, in order to remove the crashes, you just need
> > to keep around the two RELKIND_HAS_STORAGE() checks.  But I would
> > rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED)
> > instead of elog(), and keep the test coverage of the previous patch
> > (including the tests for the aggregates I noticed were missing).
> > Would you be fine with that?
>
> And this means the attached.  Thoughts are welcome.

so, currently the patch just installs protections on both currtid_*
functions and adds some tests... therefore we can consider it as a bug
fix and let it go in 13? actually also backpatch in 12...

patch works, server doesn't crash anymore

only point to mention is a typo (a missing "l") in an added comment:

+ *     currtid_byrename

--
Jaime Casanova                      www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Wed, May 27, 2020 at 12:29:39AM -0500, Jaime Casanova wrote:
> so, currently the patch just installs protections on both currtid_*
> functions and adds some tests... therefore we can consider it as a bug
> fix and let it go in 13? actually also backpatch in 12...

Yes, and it has the advantage to be simple.

> only point to mention is a typo (a missing "l") in an added comment:
>
> + *     currtid_byrename

Oops, thanks.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Alvaro Herrera-9
In reply to this post by Michael Paquier-2
On 2020-May-26, Michael Paquier wrote:

> On Mon, May 25, 2020 at 06:29:10PM +0900, Michael Paquier wrote:
> > Perhaps you are right though, and that we don't need to spend this
> > much energy into improving the error messages so I am fine to discard
> > this part.  At the end, in order to remove the crashes, you just need
> > to keep around the two RELKIND_HAS_STORAGE() checks.  But I would
> > rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED)
> > instead of elog(), and keep the test coverage of the previous patch
> > (including the tests for the aggregates I noticed were missing).
> > Would you be fine with that?
>
> And this means the attached.  Thoughts are welcome.

Yeah, this looks good to me.  I would have used elog() instead, but
I don't care enough ... as a translator, I can come up with a message as
undecipherable as the original without worrying too much, since I
suspect nobody will ever see it in practice.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Andres Freund
In reply to this post by Alvaro Herrera-9
Hi,

On 2020-05-22 19:32:57 -0400, Alvaro Herrera wrote:
> On 2020-Apr-09, Michael Paquier wrote:
>
> > Playing more with this stuff, it happens that we have zero code
> > coverage for currtid() and currtid2(), and the main user of those
> > functions I can find around is the ODBC driver:
> > https://coverage.postgresql.org/src/backend/utils/adt/tid.c.gcov.html
>
> Yeah, they're there solely for ODBC as far as I know.

And there only for very old servers (< 8.2), according to Hiroshi
Inoue. Found that out post 12 freeze. I was planning to drop them for
13, but I unfortunately didn't get around to do so :(

I guess we could decide to make a freeze exception to remove them now,
although I'm not sure the reasons for doing so are strong enough.


> > There are multiple cases to consider, particularly for views:
> > - Case of a view with ctid as attribute taken from table.
> > - Case of a view with ctid as attribute with incorrect attribute
> > type.
> > It is worth noting that all those code paths can trigger various
> > elog() errors, which is not something that a user should be able to do
> > using a SQL-callable function.  There are also two code paths for
> > cases where a view has no or more-than-one SELECT rules, which cannot
> > normally be reached.
>
> > All in that, I propose something like the attached to patch the
> > surroundings with tests to cover everything I could think of, which I
> > guess had better be backpatched?
>
> I don't know, but this stuff is so unused that your patch seems
> excessive ... and I think we'd rather not backpatch something so large.
> I propose we do something less invasive in the backbranches, like just
> throw elog() errors (nothing fancy) where necessary to avoid the
> crashes.  Even for pg12 it seems that that should be sufficient.
>
> For pg13 and beyond, I liked Tom's idea of installing dummy functions

I concur that it seems unnecessary to make these translatable, even with
the reduced scope from
https://www.postgresql.org/message-id/20200526025959.GE6155%40paquier.xyz

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2020-04-05 12:51:56 -0400, Tom Lane wrote:

> (2) The proximate cause of the crash is that rd_tableam is zero,
> so that the interface functions in tableam.h just crash hard.
> This seems like a pretty bad thing; does anyone want to promise
> that there are no other oversights of the same ilk elsewhere,
> and never will be?
>
> I think it might be a good idea to make relations-without-storage
> set up rd_tableam as a vector of dummy functions that will throw
> some suitable complaint about "relation lacks storage".  NULL is
> a horrible default for this.

I don't have particularly strong views here. I can see a benefit to such
a pseudo AM. I can even imagine that there might some cases where we
would actually introduce some tableam functions for e.g. partitioned or
views tables, to centralize their handling more, instead of having such
considerations more distributed.  Clearly not worth actively trying to
do that for all existing code dealing with such relkinds, but there
might be cases where it's worthwhile.

OTOH, it's kinda annoying having to maintain a not insignificant number
of functions that needs to be updated whenever the tableam interface
evolves.  I guess we could partially hack our way through that by having
one such function, and just assigning it to all the mandatory callbacks
by way of a void cast. But that'd be mighty ugly.

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Tom Lane-2
Andres Freund <[hidden email]> writes:
> On 2020-04-05 12:51:56 -0400, Tom Lane wrote:
>> I think it might be a good idea to make relations-without-storage
>> set up rd_tableam as a vector of dummy functions that will throw
>> some suitable complaint about "relation lacks storage".  NULL is
>> a horrible default for this.

> OTOH, it's kinda annoying having to maintain a not insignificant number
> of functions that needs to be updated whenever the tableam interface
> evolves.

That argument sounds pretty weak.  If you're making breaking changes
in the tableam API, updating the signatures (not even any code) of
some dummy functions seems like by far the easiest part.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
In reply to this post by Andres Freund
On Thu, May 28, 2020 at 05:55:59PM -0700, Andres Freund wrote:
> And there only for very old servers (< 8.2), according to Hiroshi
> Inoue. Found that out post 12 freeze. I was planning to drop them for
> 13, but I unfortunately didn't get around to do so :(

[... digging ...]
Ah, I think I see your point from the code.  That's related to the use
of RETURNING for ctids.

> I guess we could decide to make a freeze exception to remove them now,
> although I'm not sure the reasons for doing so are strong enough.

Not sure that's a good thing to do after beta1 for 13, but there is an
argument for that in 14.  FWIW, my company is a huge user of the ODBC
driver (perhaps the biggest one?), and we have nothing even close to
8.2.

> I concur that it seems unnecessary to make these translatable, even with
> the reduced scope from
> https://www.postgresql.org/message-id/20200526025959.GE6155%40paquier.xyz

Okay, I have switched the patch to do that.  Any comments or
objections?
--
Michael

tid-funcs-fixes-v3.patch (9K) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Fri, May 29, 2020 at 03:48:40PM +0900, Michael Paquier wrote:
> Okay, I have switched the patch to do that.  Any comments or
> objections?

Applied this one then.  I also got to check the ODBC driver in more
details, and I am indeed not seeing those functions getting used.
One extra thing to know is that the ODBC driver requires libpq from at
least 9.2, which may give one more argument to just remove them.

NB: prion has been failing, just looking into it.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Mon, Jun 01, 2020 at 10:57:29AM +0900, Michael Paquier wrote:
> Applied this one then.  I also got to check the ODBC driver in more
> details, and I am indeed not seeing those functions getting used.
> One extra thing to know is that the ODBC driver requires libpq from at
> least 9.2, which may give one more argument to just remove them.
>
> NB: prion has been failing, just looking into it.

Woah.  This one is old, good catch from -DRELCACHE_FORCE_RELEASE.  It
happens that since its introduction in a3519a2 from 2002,
currtid_for_view() in tid.c closes the view and then looks at a RTE
from it.  I have reproduced the issue and the patch attached takes
care of the problem.  Would it be better to backpatch all the way down
or is that not worth caring about?
--
Michael

currtid-view-relcache.patch (694 bytes) Download Attachment
signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Tom Lane-2
Michael Paquier <[hidden email]> writes:
> Woah.  This one is old, good catch from -DRELCACHE_FORCE_RELEASE.  It
> happens that since its introduction in a3519a2 from 2002,
> currtid_for_view() in tid.c closes the view and then looks at a RTE
> from it.  I have reproduced the issue and the patch attached takes
> care of the problem.  Would it be better to backpatch all the way down
> or is that not worth caring about?

Ugh.  Aside from the stale-pointer-deref problem, once we drop the lock
we can't even be sure the table still exists.  +1 for back-patch.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: segmentation fault using currtid and partitioned tables

Michael Paquier-2
On Sun, May 31, 2020 at 10:26:54PM -0400, Tom Lane wrote:
> Ugh.  Aside from the stale-pointer-deref problem, once we drop the lock
> we can't even be sure the table still exists.  +1 for back-patch.

Thanks.  Fixed down to 9.5 then to make prion happier.
--
Michael

signature.asc (849 bytes) Download Attachment