Avoid useless retrieval of defaults and check constraints in pg_dump -a

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Avoid useless retrieval of defaults and check constraints in pg_dump -a

Julien Rouhaud
Hi,

Currently, getTableAttrs() always retrieves info about columns defaults and
check constraints, while this will never be used if --data-only option if used.
This seems like a waste of resources, so here's a patch to skip those parts
when the DDL won't be generated.

v1-pg_dump_data_only.diff (712 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Justin Pryzby
On Sun, Jul 12, 2020 at 07:48:50AM +0200, Julien Rouhaud wrote:
> Currently, getTableAttrs() always retrieves info about columns defaults and
> check constraints, while this will never be used if --data-only option if used.
> This seems like a waste of resources, so here's a patch to skip those parts
> when the DDL won't be generated.

Note that the speed of default and constraint handling has come up before:
https://www.postgresql.org/message-id/flat/CAMkU%3D1xPqHP%3D7YPeChq6n1v_qd4WGf%2BZvtnR-b%2BgyzFqtJqMMQ%40mail.gmail.com
https://www.postgresql.org/message-id/CAMkU=1x-e+maqefhM1yMeSiJ8J9Z+SJHgW7c9bqo3E3JMG4iJA@...

I'd pointed out that a significant fraction of our pg_upgrade time was in
pg_dump, due to having wide tables with many child tables, and "default 0" on
every column.  (I've since dropped our defaults so this is no longer an issue
here).

It appears your patch would avoid doing unnecessary work in the --data-only
case, but it wouldn't help the pg_upgrade case.

--
Justin


Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Julien Rouhaud
On Sun, Jul 12, 2020 at 4:29 PM Justin Pryzby <[hidden email]> wrote:

>
> On Sun, Jul 12, 2020 at 07:48:50AM +0200, Julien Rouhaud wrote:
> > Currently, getTableAttrs() always retrieves info about columns defaults and
> > check constraints, while this will never be used if --data-only option if used.
> > This seems like a waste of resources, so here's a patch to skip those parts
> > when the DDL won't be generated.
>
> Note that the speed of default and constraint handling has come up before:
> https://www.postgresql.org/message-id/flat/CAMkU%3D1xPqHP%3D7YPeChq6n1v_qd4WGf%2BZvtnR-b%2BgyzFqtJqMMQ%40mail.gmail.com
> https://www.postgresql.org/message-id/CAMkU=1x-e+maqefhM1yMeSiJ8J9Z+SJHgW7c9bqo3E3JMG4iJA@...

Oh, I wasn't aware of that.

> I'd pointed out that a significant fraction of our pg_upgrade time was in
> pg_dump, due to having wide tables with many child tables, and "default 0" on
> every column.  (I've since dropped our defaults so this is no longer an issue
> here).
>
> It appears your patch would avoid doing unnecessary work in the --data-only
> case, but it wouldn't help the pg_upgrade case.

Indeed.  Making the schema part faster would probably require a bigger
refactoring.  I'm wondering if we could introduce some facility to
temporarily deny any DDL change, so that the initial pg_dump -s done
by pg_upgrade can be performed before shutting down the instance.

Note that those extraneous queries were found while trying to dump
data out of a corrupted database.  The issue wasn't an excessive
runtime but corrupted catalog entries, so bypassing this code (since I
was only interested in the data anyway) was simpler than trying to
recover yet other corrupted rows.


Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Daniel Gustafsson
In reply to this post by Julien Rouhaud
> On 12 Jul 2020, at 07:48, Julien Rouhaud <[hidden email]> wrote:

> Currently, getTableAttrs() always retrieves info about columns defaults and
> check constraints, while this will never be used if --data-only option if used.
> This seems like a waste of resources, so here's a patch to skip those parts
> when the DDL won't be generated.

Given how unintrusive this optimization is, +1 from me to go ahead with this
patch.  pg_dump tests pass.  Personally I would've updated the nearby comments
to reflect why the check for dataOnly is there, but MMV there.  I'm moving this
patch to Ready for Committer.

I'm fairly sure there is a lot more we can do to improve the performance of
data-only dumps, but this nicely chips away at the problem.

cheers ./daniel

Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Michael Paquier-2
In reply to this post by Julien Rouhaud
On Tue, Jul 14, 2020 at 11:14:50AM +0200, Julien Rouhaud wrote:
> Note that those extraneous queries were found while trying to dump
> data out of a corrupted database.  The issue wasn't an excessive
> runtime but corrupted catalog entries, so bypassing this code (since I
> was only interested in the data anyway) was simpler than trying to
> recover yet other corrupted rows.

Yeah, I don't see actually why this argument can prevent us from doing
a micro optimization if it proves to work correctly.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Michael Paquier-2
In reply to this post by Daniel Gustafsson
On Thu, Sep 10, 2020 at 02:31:32PM +0200, Daniel Gustafsson wrote:
> Given how unintrusive this optimization is, +1 from me to go ahead with this
> patch.  pg_dump tests pass.  Personally I would've updated the nearby comments
> to reflect why the check for dataOnly is there, but MMV there.  I'm moving this
> patch to Ready for Committer.

We need two comments here.  I would suggest something like:
"Skip default/check for a data-only dump, as this is only needed for
dumps of the table schema."

Better wording is of course welcome.

> I'm fairly sure there is a lot more we can do to improve the performance of
> data-only dumps, but this nicely chips away at the problem.

I was looking at that, and wondered about cases like the following,
artistic, thing:
CREATE FUNCTION check_data_zz() RETURNS boolean
    LANGUAGE sql STABLE STRICT
    AS $$select count(a) > 0 from zz$$;
CREATE TABLE public.yy (
    a integer,
    CONSTRAINT yy_check CHECK (check_data_zz())
);
CREATE TABLE zz (a integer);
INSERT INTO zz VALUES (1);
INSERT INTO yy VALUES (1);

Even on HEAD, this causes the data load to fail because yy's data is
inserted before zz, so keeping track of the CHECK dependency could
have made sense for --data-only if we were to make a better work at
detecting the dependency between both tables and made sure that zz
data needs to appear before yy, but it is not like this would happen
easily in pg_dump, and we document it this way (see the warning about
dump/reload in ddl.sgml for check constraints).  In short, I think
that this patch looks like a good thing to have.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Tom Lane-2
In reply to this post by Michael Paquier-2
Michael Paquier <[hidden email]> writes:
> On Tue, Jul 14, 2020 at 11:14:50AM +0200, Julien Rouhaud wrote:
>> Note that those extraneous queries were found while trying to dump
>> data out of a corrupted database.  The issue wasn't an excessive
>> runtime but corrupted catalog entries, so bypassing this code (since I
>> was only interested in the data anyway) was simpler than trying to
>> recover yet other corrupted rows.

> Yeah, I don't see actually why this argument can prevent us from doing
> a micro optimization if it proves to work correctly.

The main thing I'm wondering about is whether not fetching these objects
could lead to failing to detect an important dependency chain.  IIRC,
pg_dump simply ignores pg_depend entries that mention objects it has not
loaded, so there is a possible mechanism for that.  However, it's hard to
see how a --data-only dump could end up choosing an invalid dump order on
that basis.  It doesn't seem like safe load orders for the table data
objects could depend on what is referenced by defaults or CHECK
constraints.

But ... I've only spent a few minutes thinking about it, so maybe
I'm missing something.

(Note that we disallow sub-queries in CHECK constraints, and also
disclaim responsibility for what happens if you cheat by hiding
the subquery in a function.  So while it's certainly possible to
build CHECK constraints that only work if table X is loaded before
table Y, pg_dump already doesn't guarantee that'll work, --data-only
or otherwise.)

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Michael Paquier-2
On Mon, Sep 14, 2020 at 10:56:01PM -0400, Tom Lane wrote:
> (Note that we disallow sub-queries in CHECK constraints, and also
> disclaim responsibility for what happens if you cheat by hiding
> the subquery in a function.  So while it's certainly possible to
> build CHECK constraints that only work if table X is loaded before
> table Y, pg_dump already doesn't guarantee that'll work, --data-only
> or otherwise.)

Yep, exactly what I was thinking upthread by cheating with a schema
having cross-table references in a check constraint.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Julien Rouhaud
In reply to this post by Michael Paquier-2
On Tue, Sep 15, 2020 at 4:48 AM Michael Paquier <[hidden email]> wrote:

>
> On Thu, Sep 10, 2020 at 02:31:32PM +0200, Daniel Gustafsson wrote:
> > Given how unintrusive this optimization is, +1 from me to go ahead with this
> > patch.  pg_dump tests pass.  Personally I would've updated the nearby comments
> > to reflect why the check for dataOnly is there, but MMV there.  I'm moving this
> > patch to Ready for Committer.
>
> We need two comments here.  I would suggest something like:
> "Skip default/check for a data-only dump, as this is only needed for
> dumps of the table schema."
>
> Better wording is of course welcome.

FWIW I'm fine with those news comments!


Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Michael Paquier-2
On Tue, Sep 15, 2020 at 10:46:56AM +0200, Julien Rouhaud wrote:
> FWIW I'm fine with those news comments!

Okay, I got again on this one today and finished by committing the
patch as of 5423853.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

Julien Rouhaud
On Wed, Sep 16, 2020 at 3:06 PM Michael Paquier <[hidden email]> wrote:
>
> On Tue, Sep 15, 2020 at 10:46:56AM +0200, Julien Rouhaud wrote:
> > FWIW I'm fine with those news comments!
>
> Okay, I got again on this one today and finished by committing the
> patch as of 5423853.

Thanks Michael!