pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

tutiluren
After wasting yet another full day on this what should be a non-problem, I have come to the equally sad and maddening conclusion that the following crucial features of `pg_dump` are utterly broken, possibly "only" on Windows (only tested there):

--exclude-table
--exclude-table-data

The problem: the options *refuse* to work if your schema or table name contains any non-lowercase, non-ASII character. I've tested it heavily today (with PostgreSQL 11.7 on Windows 10). I'm 99% sure by now that I'm not making some kind of "obvious mistake", including setting the encoding in every possible manner...

Pardon me if it's actually been fixed recently, but it seems **extremely** unlikely. I couldn't find any mention of anything related to "--exclude-table" having been fixed in any changelog.

PostGIS did the same thing: it stubbornly ignored PostgreSQL's own rules of how you can name a schema/table, eventually bullying me into renaming its schema "postgis" because I *needed* it to work, which now looks awfully inconsistent with my other, properly named schemas.

I *need* this to work as well, because I can't keep making full backups of gigantic tables full of longterm-useless debug data. However, I will not let `pg_dump` bully me into renaming my entire database structure. I'm never, ever going through that again in my life. Plus I'm actually following its long-established rules, so I objectively am not doing anything wrong -- `pg_dump` is. Yet I'm the one who has to suffer... And yes, I know that it's volunteers and so on. I've paid to the PostgreSQL project with a large part of my life instead of money, with endless nights struggling with things like this.

PostGIS is a third-party(-ish) extension, so it's *slightly* more understandable there. `pg_dump`, on the other hand, is the **official tool for backing up a PostgreSQL database**, yet it still somehow doesn't understand PostgreSQL's own syntax! The PostgreSQL manual very clearly states that you can name your schemas and tables to (almost) anything you want, as long as you double-quote them when referring to them, as expected/natural.

Only its own `pg_dump` tool doesn't follow this rule. It demands -- *assumes*! -- that everything is in all-lowercase, only ASCII, with no spaces. It doesn't matter if you add quotes, which always works in PG SQL queries and which is just "how you do it".

Try it out yourself, by creating a test schema called "Personal stöff" and a table in it called "My däiary". Then create a text column and make it PK and then add the text "This is supposed to be ignored.". Then try to run this command:

pg_dump --format plain --verbose --file "C:\test.txt" --exclude-table-data="Personal stöff"."My däiary" --host="localhost" --port="5432" --username="postgres" --dbname="TestDB"

I've also tried a million variations of those quotes as well, including nested ones, and also tried with and without the --encoding option as well as setting the environment variable for encoding. It doesn't seem to matter -- it either fails to run the command or runs it but just ignores the exclude rules.

The resulting `test.txt` will contain the `This is supposed to be ignored.` text, even though we have told `pg_dump` to exclude the data in the specified table.

If you rename the schema name and table name to `personal_stuff.my_diary`, it will suddenly work.

This feature is apparently nine years old, if a blog post I found is accurate, yet is still at what can only be described as a "pre-alpha test stage". What I wonder, other than what to do now, is what all-lowercase PG users used to do a decade ago when they needed to exclude certain tables' data...? Surely that is a very basic feature for a backup tool?

Just to avoid wasting time, when the command doesn't work at all, it outputs things like this:

pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x72 0x66 0x72
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(table name)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(schema name)$'

Note how "table name" and "schema name" are both in all-lowercase even though they were inputted as "Table name" and "Schema name". This indicates to me that it uses some extremely crude and inaccurate logic for its internal mechanism/queries.

PS: The thousand-step ordeal I had to go through to finally get to send this e-mail to your e-mail list did not exactly improve my already bad mood... You must have got *extreme* amounts of abuse here to actually implement such a system... I hope that my e-mail is taken seriously, because I find this (and what PostGIS does) very serious.
Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

Pavel Stehule


út 21. 7. 2020 v 8:30 odesílatel <[hidden email]> napsal:
After wasting yet another full day on this what should be a non-problem, I have come to the equally sad and maddening conclusion that the following crucial features of `pg_dump` are utterly broken, possibly "only" on Windows (only tested there):

--exclude-table
--exclude-table-data

The problem: the options *refuse* to work if your schema or table name contains any non-lowercase, non-ASII character. I've tested it heavily today (with PostgreSQL 11.7 on Windows 10). I'm 99% sure by now that I'm not making some kind of "obvious mistake", including setting the encoding in every possible manner...

Pardon me if it's actually been fixed recently, but it seems **extremely** unlikely. I couldn't find any mention of anything related to "--exclude-table" having been fixed in any changelog.

PostGIS did the same thing: it stubbornly ignored PostgreSQL's own rules of how you can name a schema/table, eventually bullying me into renaming its schema "postgis" because I *needed* it to work, which now looks awfully inconsistent with my other, properly named schemas.

I *need* this to work as well, because I can't keep making full backups of gigantic tables full of longterm-useless debug data. However, I will not let `pg_dump` bully me into renaming my entire database structure. I'm never, ever going through that again in my life. Plus I'm actually following its long-established rules, so I objectively am not doing anything wrong -- `pg_dump` is. Yet I'm the one who has to suffer... And yes, I know that it's volunteers and so on. I've paid to the PostgreSQL project with a large part of my life instead of money, with endless nights struggling with things like this.

PostGIS is a third-party(-ish) extension, so it's *slightly* more understandable there. `pg_dump`, on the other hand, is the **official tool for backing up a PostgreSQL database**, yet it still somehow doesn't understand PostgreSQL's own syntax! The PostgreSQL manual very clearly states that you can name your schemas and tables to (almost) anything you want, as long as you double-quote them when referring to them, as expected/natural.

Only its own `pg_dump` tool doesn't follow this rule. It demands -- *assumes*! -- that everything is in all-lowercase, only ASCII, with no spaces. It doesn't matter if you add quotes, which always works in PG SQL queries and which is just "how you do it".

Try it out yourself, by creating a test schema called "Personal stöff" and a table in it called "My däiary". Then create a text column and make it PK and then add the text "This is supposed to be ignored.". Then try to run this command:

pg_dump --format plain --verbose --file "C:\test.txt" --exclude-table-data="Personal stöff"."My däiary" --host="localhost" --port="5432" --username="postgres" --dbname="TestDB"

I've also tried a million variations of those quotes as well, including nested ones, and also tried with and without the --encoding option as well as setting the environment variable for encoding. It doesn't seem to matter -- it either fails to run the command or runs it but just ignores the exclude rules.

The resulting `test.txt` will contain the `This is supposed to be ignored.` text, even though we have told `pg_dump` to exclude the data in the specified table.

If you rename the schema name and table name to `personal_stuff.my_diary`, it will suddenly work.

This feature is apparently nine years old, if a blog post I found is accurate, yet is still at what can only be described as a "pre-alpha test stage". What I wonder, other than what to do now, is what all-lowercase PG users used to do a decade ago when they needed to exclude certain tables' data...? Surely that is a very basic feature for a backup tool?

Just to avoid wasting time, when the command doesn't work at all, it outputs things like this:

pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x72 0x66 0x72
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(table name)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(schema name)$'

Note how "table name" and "schema name" are both in all-lowercase even though they were inputted as "Table name" and "Schema name". This indicates to me that it uses some extremely crude and inaccurate logic for its internal mechanism/queries.

PS: The thousand-step ordeal I had to go through to finally get to send this e-mail to your e-mail list did not exactly improve my already bad mood... You must have got *extreme* amounts of abuse here to actually implement such a system... I hope that my e-mail is taken seriously, because I find this (and what PostGIS does) very serious.

It is working on Linux, but you can be careful when you are writing case sensitive identifiers:

postgres=# create table "Foo"(a int);
CREATE TABLE
postgres=# insert into "Foo" values(10);
INSERT 0 1

[pavel@nemesis tvision]$ pg_dump -T '"Foo"'
--
-- PostgreSQL database dump
--

postgres=# create schema "ŽlutýPes";
CREATE SCHEMA
postgres=# create table "ŽlutýPes"."ŽlutáKočka"(a int);
CREATE TABLE
postgres=# insert into "ŽlutýPes"."ŽlutáKočka" values(10);
INSERT 0 1

pg_dump -T '"ŽlutýPes"."ŽlutáKočka"'
pg_dump -N '"ŽlutýPes"'

Regards

Pavel

Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

David G Johnston
In reply to this post by tutiluren
On Monday, July 20, 2020, <[hidden email]> wrote:
possibly "only" on Windows (only tested there):

It doesn't matter if you add quotes, which always works in PG SQL queries and which is just "how you do it".

pg_dump --format plain --verbose --file "C:\test.txt" --exclude-table-data="Personal stöff"."My däiary" --host="localhost" --port="5432" --username="postgres" --dbname="TestDB"

I've also tried a million variations of those quotes as well, including nested ones, and also tried with and without the --encoding option as well as setting the environment variable for encoding. It doesn't seem to matter -- it either fails to run the command or runs it but just ignores the exclude rules.

Not sure about encoding dynamics but your issue is likely with writing shell commands.  Instead of trying to write a full pg_dump command I suggest trying to set the value of an environment variable to the double-quoted value you want (checking it with echo) and then just supplying that variable with the pg_dump command.

Shells tend to use backslash for escaping (see the example on the pg_dump doc reference page) so you may want to try that.  Or research on the internet for the information and examples relevant to which ever shell you are using in Windows.  I’m personally experienced with Linux. 

David J.

Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

Juan José Santamaría Flecha
In reply to this post by tutiluren

On Tue, Jul 21, 2020 at 8:30 AM <[hidden email]> wrote:

Try it out yourself, by creating a test schema called "Personal stöff" and a table in it called "My däiary". Then create a text column and make it PK and then add the text "This is supposed to be ignored.". Then try to run this command:

pg_dump --format plain --verbose --file "C:\test.txt" --exclude-table-data="Personal stöff"."My däiary" --host="localhost" --port="5432" --username="postgres" --dbname="TestDB"

Just to avoid wasting time, when the command doesn't work at all, it outputs things like this:

pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x72 0x66 0x72
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(table name)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(schema name)$'


The source of the problem is coming from how CMD works with UTF8 (or does not). The error you are getting is using code page Windows-1252 [1], 0xf6 is ö, but pg_dump is expecting UTF8 and crashes.

You can try to configure UTF8 as your CMD encoding, see [2]. Please tell us if this works for you.


Regards,

Juan José Santamaría Flecha
Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

David G Johnston
In reply to this post by David G Johnston

You need to reply-all so that the list as a whole can contribute.

On Tue, Jul 21, 2020 at 10:36 AM <[hidden email]> wrote:

Not sure about encoding dynamics but your issue is likely with writing shell commands.  Instead of trying to write a full pg_dump command I suggest trying to set the value of an environment variable to the double-quoted value you want (checking it with echo) and then just supplying that variable with the pg_dump command.
This sounded promising at first, and I have been forced to do something similar before to provide the password with "PGPASSWORD" as it wouldn't let me enter this on the command line as an argument, but unless I misunderstand you, my:

--exclude-table-data=%TEMPTESTFORPG%

... isn't turned into the value for "TEMPTESTFORPG",

ok, what is it turned into?  Though looking at it now if it is simple text substitution it may not work the way I was hoping...I would need to experiment.  Sticking with just the original command in the shell may indeed be easier - though practicing with "echo" instead "pg_dump" is still a good idea, quicker iterations.

which I set with this in PHP:

putenv('TEMPTESTFORPG="Schema name"."Table name"');

I also tried with:

--exclude-table-data=' . getenv('TEMPTESTFORPG')

.... but it just has the same stupid result with pg_dump either dumping the nonsensical errors or executing the command but ignoring my "exclude rule" (and thus dumping the table's data).


You should probably remove PHP from the equation as well - but that fact you are not doing this in a shell but instead through a separate programming language is information that is helpful to provide upfront.  It should actually be easier to do this in a language that provides a decent abstraction of the system's process since you can just stick stuff in normal variables and let the language take care of escaping and the like.

And keep in mind that the rule is a pattern matching expression and not a simple literal.

David J.

Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

Juan José Santamaría Flecha
In reply to this post by tutiluren
Please keep the list in CC for future reference, and so the subscribers can contribute.

On Tue, Jul 21, 2020 at 7:32 PM <[hidden email]> wrote:
Jul 21, 2020, 11:12 AM by [hidden email]:

On Tue, Jul 21, 2020 at 8:30 AM <[hidden email]> wrote:

Try it out yourself, by creating a test schema called "Personal stöff" and a table in it called "My däiary". Then create a text column and make it PK and then add the text "This is supposed to be ignored.". Then try to run this command:

pg_dump --format plain --verbose --file "C:\test.txt" --exclude-table-data="Personal stöff"."My däiary" --host="localhost" --port="5432" --username="postgres" --dbname="TestDB"

Just to avoid wasting time, when the command doesn't work at all, it outputs things like this:

pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x72 0x66 0x72
pg_dump: [archiver (db)] query was: SELECT c.oid
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n
ON n.oid OPERATOR(pg_catalog.=) c.relnamespace
WHERE c.relkind OPERATOR(pg_catalog.=) ANY
(array['r', 'S', 'v', 'm', 'f', 'p'])
  AND c.relname OPERATOR(pg_catalog.~) '^(table name)$'
  AND n.nspname OPERATOR(pg_catalog.~) '^(schema name)$'


The source of the problem is coming from how CMD works with UTF8 (or does not). The error you are getting is using code page Windows-1252 [1], 0xf6 is ö, but pg_dump is expecting UTF8 and crashes.

You can try to configure UTF8 as your CMD encoding, see [2]. Please tell us if this works for you.


I actually have very carefully made sure (from past problems) that the cmd.exe uses UTF-8 and the same goes for my databases and the connection and everything. It truly doesn't seem to have anything to do with this. Isn't it obvious from the output that pg_dump is lowercasing/changing the input?

The problem with that query is not that it does not return any rows because of case folding. Actually it crashes because it is expecting UTF8 input but is getting something else: "pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x72 0x66 0x72"

I can reproduce a test case in an English_United States.1252 WIndows 10 machine, and the setting "Beta: Use unicode UTF-8 for worldwide language support", as mentioned above, worked in that case.

Regards,

Juan José Santamaría Flecha
Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

tutiluren
In reply to this post by tutiluren
Jul 22, 2020, 9:36 AM by [hidden email]:
Please keep the list in CC for future reference, and so the subscribers can contribute.
It seems to delete the list and just send it to the person who sent the e-mail to the list... I had to manually add it in now.

The problem with that query is not that it does not return any rows because of case folding. Actually it crashes because it is expecting UTF8 input but is getting something else: "pg_dump: [archiver (db)] query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xf6 0x72 0x66 0x72"

I can reproduce a test case in an English_United States.1252 WIndows 10 machine, and the setting "Beta: Use unicode UTF-8 for worldwide language support", as mentioned above, worked in that case.

I've tried to explain this both in my initial question and as a response to those few who replies: the error output is not what always happens. I just included it for the cases where it does error out. When it does run the command, it *ignores the exclude rules*. That is, it dumps the data anyway. I thought I was crystal-clear when describing my problem, but no matter how careful I am, people always quote some specific part and focus on that, ignoring my overall explanation.

To be clear: The problem is not solved. I'm not one step closer to a solution. It's ignoring the rule or displaying the nonsensical errors, but never actually works.
Reply | Threaded
Open this post in threaded view
|

Re: pg_dump's "--exclude-table" and "--exclude-table-data" options are ignored and/or cause the dump to fail entirely unless both the schema and table name use 1950s-era identifiers.

Juan José Santamaría Flecha

On Wed, Jul 22, 2020 at 5:17 PM <[hidden email]> wrote:
Jul 22, 2020, 9:36 AM by [hidden email]:
I can reproduce a test case in an English_United States.1252 WIndows 10 machine, and the setting "Beta: Use unicode UTF-8 for worldwide language support", as mentioned above, worked in that case.

I've tried to explain this both in my initial question and as a response to those few who replies: the error output is not what always happens. I just included it for the cases where it does error out. When it does run the command, it *ignores the exclude rules*. That is, it dumps the data anyway. I thought I was crystal-clear when describing my problem, but no matter how careful I am, people always quote some specific part and focus on that, ignoring my overall explanation.

To be clear: The problem is not solved. I'm not one step closer to a solution. It's ignoring the rule or displaying the nonsensical errors, but never actually works.

I am afraid I must insist, pg_dump is expecting a shell that honors "client_encoding = 'UTF8'", and all the issues you are describing fit the mold of a CMD that is not doing so: fails when using extended characters with "invalid byte sequence for encoding "UTF8"" or gives wrong results, and works when you use plain ASCII.

Please read an excerpt from [1]:

"The current changes also don’t cover what is required for our “processed input mode” that presents an editable input line for applications like CMD.exe. We are planning and actively updating the code for popup windows, command aliases, command history, and the editable input line itself to support full true Unicode as well."

Not sure I can help any further if your system is below Windows 10 April 2018 Update (Version 1803) and cannot active true (but beta) UTF8 support. Maybe using a client from a POSIX machine as an alternative.


Regards,

Juan José Santamaría Flecha