BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12


PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16045
Logged by:          Hans Buschmann
Email address:      [hidden email]
PostgreSQL version: 12.0
Operating system:   Windows 10 64bit
Description:        

I just did a pg_upgrade from pg 11.5 to pg 12.0 on my development machine
under Windows 64bit (both distributions from EDB).

cpsdb=# select version ();
                          version
------------------------------------------------------------
 PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
(1 row)

The pg_upgrade with --link went flawlessly, I started (only!) the new server
12.0 and could connect and access individual databases.

As recommended by the resulting analyze_new_cluster.bat I tried a full
vacuumdb with:

"N:/pgsql/bin/vacuumdb" -U postgres --all --analyze-only

which crashed with
vacuumdb: vacuuming database "cpsdb"
vacuumdb: error: vacuuming of table "admin.q_tbl_archiv" in database "cpsdb"
failed: ERROR:  compressed data is corrupted

I connected to the database through psql and looked at the table
"admin.q_tbl_archiv":

cpsdb=# \d+ q_tbl_archiv;
                                          Table "admin.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default | Storage  | Stats target | Description
------------------+------------------------------------+-----------+----------+---------+----------+--------------+-------------
 table_name       | information_schema.sql_identifier  |           |          |         | plain    |              |
 column_name      | information_schema.sql_identifier  |           |          |         | plain    |              |
 ordinal_position | information_schema.cardinal_number |           |          |         | plain    |              |
 col_qualifier    | text                               |           |          |         | extended |              |
 id_column        | information_schema.sql_identifier  |           |          |         | plain    |              |
 id_default       | information_schema.character_data  |           |          |         | extended |              |
Access method: heap

When trying to select * from q_tbl_archiv I got:

cpsdb=# select * from q_tbl_archiv;
ERROR:  invalid memory alloc request size 18446744073709551613

This table was created a long time ago, under 9.5 or 9.6, with the
following (truncated here) command:


create table q_tbl_archiv as
with
qseason as (
select table_name,column_name, ordinal_position
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns
where
column_name like '%_season'
and ordinal_position < 10
and table_name in (
 'table1'
,'table2'
-- here truncated:
-- ... (here were listed all tables of mine having columns like xxx_season)
-- to reproduce, change to your own tablenames in a test database
)
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from
        qseason qs
        left join information_schema.columns c on c.table_name=qs.table_name and
c.column_name like 'id_%'
;

Until now this table was always restored without error when migrating to a new
major version through pg_dump/initdb/pg_restore.
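
A workaround sketch for keeping such a snapshot table portable across
upgrades, assuming a plain-text copy of the metadata is acceptable: cast the
catalog domain types away on the old cluster. The column names are those from
the statement above; q_tbl_archiv_portable is a hypothetical name.

    -- cast the information_schema domain types to built-in types, so the
    -- resulting table no longer depends on their on-disk representation
    create table q_tbl_archiv_portable as
    select table_name::text,
           column_name::text,
           ordinal_position::int,
           col_qualifier,           -- already plain text
           id_column::text,
           id_default::text
    from q_tbl_archiv;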

To verify the integrity of the table, I restored the dump taken with pg_dump
from pg 11.5 just before the pg_upgrade onto another machine.

The restore and analyze went OK, and select * from q_tbl_archiv showed all
tuples, e.g. (edited):

cpsdb_dev=# select * from q_tbl_archiv;
        table_name        | column_name  | ordinal_position | col_qualifier | id_column |                        id_default
--------------------------+--------------+------------------+---------------+-----------+----------------------------------------------------------
 table1                   | chm_season   |                2 | chm           |           |
 table2                   | cs_season    |                2 | cs            | id_cs     | nextval('table2_id_cs_seq'::regclass)
...

In conclusion, this seems to me like an error/omission in pg_upgrade.

It seems not to handle tables derived from information_schema types
correctly, resulting in failures in the upgraded database.

For me, this error is not so crucial, because this table is only used for
administrative purposes and can easily be restored from backup.

But I want to share my findings for the sake of other users of pg_upgrade.

Thanks for investigating!

Hans Buschmann


Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Tue, Oct 08, 2019 at 05:08:53PM +0000, PG Bug reporting form wrote:

>[...]
>In conclusion, this seems to me like an error/omission of pg_upgrade.
>

There's clearly something bad happening. It's a bit strange, though. Had
this been a data corruption issue, I'd expect the pg_dump to fail too,
but it succeeds.

>It seems not to handle tables derived from information_schema types
>correctly, resulting in failures in the upgraded database.
>

Well, I don't see how that should make any difference. It's a CTAS and
that should create a regular table, that's not an issue. I wonder if
there were some changes to the data types involved, but that would be
essentially a break in on-disk format and we're careful about not doing
that ...

>For me, this error is not so crucial, because this table is only used for
>administrative purposes and can easily be restored from backup.
>
>But I want to share my findings for the sake of other users of pg_upgrade.
>

OK, thanks. Could you maybe set

 log_error_verbosity = verbose

before invoking the vacuum (you can set that in that session)? That
should give us more details about where exactly the error is triggered.
Even better, if you could attach a debugger to the session, set
breakpoints on locations triggering 'invalid memory alloc request size'
and then show the backtrace (obviously, that's more complicated).
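
For instance, a minimal sketch of the first suggestion, run in the failing
session:

  SET log_error_verbosity = verbose;
  ANALYZE q_tbl_archiv;
  -- the server log should then carry ERROR/LOCATION detail pointing at
  -- the file and line that raised the alloc error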


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
FWIW I can reproduce this - it's enough to do this on the 11 cluster

create table q_tbl_archiv as
with
qseason as (
select table_name,column_name, ordinal_position
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from
        qseason qs
        left join information_schema.columns c on c.table_name=qs.table_name and
c.column_name like 'id_%';


and then

    analyze q_tbl_archiv

which produces backtrace like this:

No symbol "stats" in current context.
(gdb) bt
#0  0x0000746095262951 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x0000000000890a8e in varstrfastcmp_locale (a1p=0x17716b4 "per_language\a", len1=<optimized out>, a2p=0x176af28 '\177' <repeats 136 times>, "\021\004", len2=-4, ssup=<optimized out>, ssup=<optimized out>) at varlena.c:2320
#2  0x0000000000890cb1 in varlenafastcmp_locale (x=24581808, y=24555300, ssup=0x7ffc649463f0) at varlena.c:2219
#3  0x00000000005b73b4 in ApplySortComparator (ssup=0x7ffc649463f0, isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized out>) at ../../../src/include/utils/sortsupport.h:224
#4  compare_scalars (a=<optimized out>, b=<optimized out>, arg=0x7ffc649463e0) at analyze.c:2700
#5  0x00000000008f9953 in qsort_arg (a=a@entry=0x178fdc0, n=<optimized out>, n@entry=2158, es=es@entry=16, cmp=cmp@entry=0x5b7390 <compare_scalars>, arg=arg@entry=0x7ffc649463e0) at qsort_arg.c:140
#6  0x00000000005b86a6 in compute_scalar_stats (stats=0x176a208, fetchfunc=<optimized out>, samplerows=<optimized out>, totalrows=2158) at analyze.c:2273
#7  0x00000000005b9d95 in do_analyze_rel (onerel=onerel@entry=0x74608c00d3e8, params=params@entry=0x7ffc64946970, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=22, inh=inh@entry=false, in_outer_xact=false, elevel=13)
    at analyze.c:529
#8  0x00000000005bb2c9 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffc64946970, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260
#9  0x000000000062c7b0 in vacuum (relations=0x1727120, params=params@entry=0x7ffc64946970, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413
#10 0x000000000062cd49 in ExecVacuum (pstate=pstate@entry=0x16c9518, vacstmt=vacstmt@entry=0x16a82b8, isTopLevel=isTopLevel@entry=true) at vacuum.c:199
#11 0x00000000007a6d64 in standard_ProcessUtility (pstmt=0x16a8618, queryString=0x16a77a8 "", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at utility.c:670
#12 0x00000000007a4006 in PortalRunUtility (portal=0x170f368, pstmt=0x16a8618, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:1175
#13 0x00000000007a4b61 in PortalRunMulti (portal=portal@entry=0x170f368, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
    completionTag=completionTag@entry=0x7ffc64946cb0 "") at pquery.c:1321
#14 0x00000000007a5864 in PortalRun (portal=portal@entry=0x170f368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
    completionTag=0x7ffc64946cb0 "") at pquery.c:796
#15 0x00000000007a174e in exec_simple_query (query_string=0x16a77a8 "") at postgres.c:1215

Looking at compute_scalar_stats, the "stats" parameter does not seem
particularly healthy:

(gdb) p *stats
$3 = {attr = 0x10, attrtypid = 12, attrtypmod = 0, attrtype = 0x1762e00, attrcollid = 356, anl_context = 0x7f7f7f7e00000000, compute_stats = 0x100, minrows = 144, extra_data = 0x1762e00, stats_valid = false, stanullfrac = 0,
  stawidth = 0, stadistinct = 0, stakind = {0, 0, 0, 0, 0}, staop = {0, 0, 0, 0, 0}, stacoll = {0, 0, 0, 0, 0}, numnumbers = {0, 0, 0, 0, 0}, stanumbers = {0x0, 0x0, 0x0, 0x0, 0x0}, numvalues = {0, 0, 0, 0, 2139062142}, stavalues = {
    0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f}, statypid = {2139062143, 2139062143, 2139062143, 2139062143, 2139062143}, statyplen = {32639, 32639, 32639, 32639, 32639},
  statypbyval = {127, 127, 127, 127, 127}, statypalign = "\177\177\177\177\177", tupattnum = 2139062143, rows = 0x7f7f7f7f7f7f7f7f, tupDesc = 0x7f7f7f7f7f7f7f7f, exprvals = 0x8, exprnulls = 0x4, rowstride = 24522240}

Not sure about the root cause yet.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane
Tomas Vondra <[hidden email]> writes:
> FWIW I can reproduce this - it's enough to do this on the 11 cluster

I failed to reproduce any problem from your example, but I was trying
in C locale on a Linux machine.  What environment are you testing?

                        regards, tom lane



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Wed, Oct 09, 2019 at 03:59:07PM +0200, Tomas Vondra wrote:

>FWIW I can reproduce this - it's enough to do this on the 11 cluster
>[...]

OK, a couple more observations - the table schema looks like this:

                              Table "public.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default
------------------+------------------------------------+-----------+----------+---------
 table_name       | information_schema.sql_identifier  |           |          |
 column_name      | information_schema.sql_identifier  |           |          |
 ordinal_position | information_schema.cardinal_number |           |          |
 col_qualifier    | text                               |           |          |
 id_column        | information_schema.sql_identifier  |           |          |
 id_default       | information_schema.character_data  |           |          |

and I can successfully do this:

  test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default);
  ANALYZE

but as soon as I include the col_qualifier column, it fails:

  test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default, col_qualifier);
  ERROR:  compressed data is corrupted

But it fails differently (with the segfault) when analyzing just the one
column:

  test=# analyze q_tbl_archiv (col_qualifier);
  server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
  The connection to the server was lost. Attempting reset: Succeeded.

Moreover, there are some other interesting failures - I can do

  select max(table_name) from q_tbl_archiv;
  select max(column_name) from q_tbl_archiv;
  select max(ordinal_position) from q_tbl_archiv;

but as soon as I try doing that with col_qualifier, it crashes and
burns:

  select max(col_qualifier) from q_tbl_archiv;

The backtrace is rather strange in this case (a lot of missing calls,
etc.). However, when called for the next two columns, it still crashes,
but the backtraces look somewhat saner:

  select max(id_column) from q_tbl_archiv;

Program received signal SIGSEGV, Segmentation fault.
0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
#1  0x0000000000894ced in cstring_to_text (s=0x7db32ce38935 <error: Cannot access memory at address 0x7db32ce38935>) at varlena.c:173
#2  name_text (fcinfo=<optimized out>) at varlena.c:3573
#3  0x000000000063860d in ExecInterpExpr (state=0x1136900, econtext=0x1135128, isnull=<optimized out>) at execExprInterp.c:649
#4  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#5  advance_aggregates (aggstate=0x1134ef0, aggstate=0x1134ef0) at nodeAgg.c:679
#6  agg_retrieve_direct (aggstate=0x1134ef0) at nodeAgg.c:1847
#7  ExecAgg (pstate=0x1134ef0) at nodeAgg.c:1572
#8  0x000000000063b58b in ExecProcNode (node=0x1134ef0) at ../../../src/include/executor/executor.h:239
#9  ExecutePlan (execute_once=<optimized out>, dest=0x1144248, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134ef0, estate=0x1134c98)
    at execMain.c:1646
#10 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#11 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#12 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1144248, altdest=altdest@entry=0x1144248,
    completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#13 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_column) from q_tbl_archiv ;") at postgres.c:1215
#14 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e400, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#15 0x00000000007237ce in BackendRun (port=0x1097c30, port=0x1097c30) at postmaster.c:4437
#16 BackendStartup (port=0x1097c30) at postmaster.c:4128
#17 ServerLoop () at postmaster.c:1704
#18 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#19 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228

  select max(id_default) from q_tbl_archiv;

Program received signal SIGABRT, Aborted.
0x00007db3185a1e35 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3185a1e35 in raise () from /lib64/libc.so.6
#1  0x00007db31858c895 in abort () from /lib64/libc.so.6
#2  0x00000000008b4470 in ExceptionalCondition (conditionName=conditionName@entry=0xabe49e "1", errorType=errorType@entry=0x907128 "unrecognized TOAST vartag", fileName=fileName@entry=0xa4965b "execTuples.c",
    lineNumber=lineNumber@entry=971) at assert.c:54
#3  0x00000000006466d3 in slot_deform_heap_tuple (natts=6, offp=0x1135170, tuple=<optimized out>, slot=0x1135128) at execTuples.c:985
#4  tts_buffer_heap_getsomeattrs (slot=0x1135128, natts=<optimized out>) at execTuples.c:676
#5  0x00000000006489fc in slot_getsomeattrs_int (slot=slot@entry=0x1135128, attnum=6) at execTuples.c:1877
#6  0x00000000006379a3 in slot_getsomeattrs (attnum=<optimized out>, slot=0x1135128) at ../../../src/include/executor/tuptable.h:345
#7  ExecInterpExpr (state=0x11364b0, econtext=0x1134cd8, isnull=<optimized out>) at execExprInterp.c:441
#8  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#9  advance_aggregates (aggstate=0x1134aa0, aggstate=0x1134aa0) at nodeAgg.c:679
#10 agg_retrieve_direct (aggstate=0x1134aa0) at nodeAgg.c:1847
#11 ExecAgg (pstate=0x1134aa0) at nodeAgg.c:1572
#12 0x000000000063b58b in ExecProcNode (node=0x1134aa0) at ../../../src/include/executor/executor.h:239
#13 ExecutePlan (execute_once=<optimized out>, dest=0x11439d8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134aa0, estate=0x1134848)
    at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#16 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x11439d8, altdest=altdest@entry=0x11439d8,
    completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#17 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_default) from q_tbl_archiv ;") at postgres.c:1215
#18 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e4f0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#19 0x00000000007237ce in BackendRun (port=0x10976f0, port=0x10976f0) at postmaster.c:4437
#20 BackendStartup (port=0x10976f0) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#23 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228


It's quite puzzling, though. If I had to guess, I'd say it's some sort
of memory management issue (either we're corrupting it somehow, or
perhaps using it after pfree).

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Wed, Oct 09, 2019 at 10:07:01AM -0400, Tom Lane wrote:
>Tomas Vondra <[hidden email]> writes:
>> FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>I failed to reproduce any problem from your example, but I was trying
>in C locale on a Linux machine.  What environment are you testing?
>
> regards, tom lane

test=# show lc_collate ;
 lc_collate
------------
 C.UTF-8
(1 row)


I can reproduce this pretty easily like this:

1) build 11

git checkout REL_11_STABLE
./configure --prefix=/home/user/pg-11 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

2) build 12

git checkout REL_12_STABLE
./configure --prefix=/home/user/pg-12 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

3) create the 11 cluster

/home/user/pg-11/bin/pg_ctl -D /tmp/data-11 init
/home/user/pg-11/bin/pg_ctl -D /tmp/data-11 -l /tmp/pg-11.log start
/home/user/pg-11/bin/createdb test
/home/user/pg-11/bin/psql test

4) create the table

    create table q_tbl_archiv as
    with
    qseason as (
    select table_name,column_name, ordinal_position
    ,replace(column_name,'_season','') as col_qualifier
    -- ,'id_'||replace(column_name,'_season','') as id_column
    from information_schema.columns
    order by table_name
    )
    select qs.*,c.column_name as id_column, c.column_default as id_default
    from
            qseason qs
            left join information_schema.columns c on c.table_name=qs.table_name and
    c.column_name like 'id_%';

5) shutdown the 11 cluster

  /home/user/pg-11/bin/pg_ctl -D /tmp/data-11 stop

6) init 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 init

7) do the pg_upgrade thing

  /home/user/pg-12/bin/pg_upgrade -b /home/user/pg-11/bin -B /home/user/pg-12/bin -d /tmp/data-11 -D /tmp/data-12 -k

8) start 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 -l /tmp/pg-12.log start

9) kabooom

  /home/user/pg-12/bin/psql test -c "analyze q_tbl_archiv"


On my system (Fedora 30 on x86_64) this reliably results in a crash (and
various other crashes, as demonstrated in my previous message).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Hans Buschmann
Hi Tomas,

Nice that you could reproduce it.
That is exactly the procedure I followed.

For your info, here are my non-standard config params:

                name                |         current_setting        
------------------------------------+---------------------------------
 application_name                   | psql                            
 auto_explain.log_analyze           | on                              
 auto_explain.log_min_duration      | 0                              
 auto_explain.log_nested_statements | on                              
 client_encoding                    | WIN1252                        
 cluster_name                       | HB_DEV                          
 data_checksums                     | on                              
 DateStyle                          | ISO, DMY                        
 default_text_search_config         | pg_catalog.german              
 dynamic_shared_memory_type         | windows                        
 effective_cache_size               | 8GB                            
 lc_collate                         | C                              
 lc_ctype                           | German_Germany.1252            
 lc_messages                        | C                              
 lc_monetary                        | German_Germany.1252            
 lc_numeric                         | German_Germany.1252            
 lc_time                            | German_Germany.1252            
 log_destination                    | stderr                          
 log_directory                      | N:/ZZ_log/pg_log_hbdev          
 log_error_verbosity                | verbose                        
 log_file_mode                      | 0640                            
 log_line_prefix                    | WHB %a %t %i %e %2l:>          
 log_statement                      | mod                            
 log_temp_files                     | 0                              
 log_timezone                       | CET                            
 logging_collector                  | on                              
 maintenance_work_mem               | 128MB                          
 max_connections                    | 100                            
 max_stack_depth                    | 2MB                            
 max_wal_size                       | 1GB                            
 min_wal_size                       | 80MB                            
 pg_stat_statements.max             | 5000                            
 pg_stat_statements.track           | all                            
 random_page_cost                   | 1                              
 search_path                        | public, archiv, ablage, admin  
 server_encoding                    | UTF8                            
 server_version                     | 12.0                            
 shared_buffers                     | 1GB                            
 shared_preload_libraries           | auto_explain,pg_stat_statements
 temp_buffers                       | 32MB                            
 TimeZone                           | CET                            
 transaction_deferrable             | off                            
 transaction_isolation              | read committed                  
 transaction_read_only              | off                            
 update_process_title               | off                            
 wal_buffers                        | 16MB                            
 wal_segment_size                   | 16MB                            
 work_mem                           | 32MB                            
(48 rows)

Indeed, the database has UTF8 encoding.

The extended error log (I have auto_explain enabled):



WHB psql 2019-10-09 15:45:03 CEST  XX000  7:> ERROR:  XX000: invalid memory alloc request size 18446744073709551613
WHB psql 2019-10-09 15:45:03 CEST  XX000  8:> LOCATION:  palloc, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\utils\mmgr\mcxt.c:934
WHB psql 2019-10-09 15:45:03 CEST  XX000  9:> STATEMENT:  select * from q_tbl_archiv;
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  1:> LOG:  00000: duration: 0.022 ms  plan:
        Query Text: SELECT pg_catalog.set_config('search_path', '', false);
        Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.014..0.015 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  2:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  3:> LOG:  00000: duration: 0.072 ms  plan:
        Query Text: SELECT datname FROM pg_database WHERE datallowconn ORDER BY 1;
        Sort  (cost=1.16..1.16 rows=1 width=64) (actual time=0.063..0.064 rows=14 loops=1)
          Sort Key: datname
          Sort Method: quicksort  Memory: 26kB
          ->  Seq Scan on pg_database  (cost=0.00..1.15 rows=1 width=64) (actual time=0.018..0.022 rows=14 loops=1)
                Filter: datallowconn
                Rows Removed by Filter: 1
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  4:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  1:> LOG:  00000: duration: 0.027 ms  plan:
        Query Text: SELECT pg_catalog.set_config('search_path', '', false);
        Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.013 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  2:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  3:> LOG:  00000: duration: 1.036 ms  plan:
        Query Text: SELECT c.relname, ns.nspname FROM pg_catalog.pg_class c
         JOIN pg_catalog.pg_namespace ns ON c.relnamespace OPERATOR(pg_catalog.=) ns.oid
         LEFT JOIN pg_catalog.pg_class t ON c.reltoastrelid OPERATOR(pg_catalog.=) t.oid
         WHERE c.relkind OPERATOR(pg_catalog.=) ANY (array['r', 'm'])
         ORDER BY c.relpages DESC;
        Sort  (cost=56.56..56.59 rows=13 width=132) (actual time=0.843..0.854 rows=320 loops=1)
          Sort Key: c.relpages DESC
          Sort Method: quicksort  Memory: 110kB
          ->  Hash Join  (cost=1.23..56.32 rows=13 width=132) (actual time=0.082..0.649 rows=320 loops=1)
                Hash Cond: (c.relnamespace = ns.oid)
                ->  Seq Scan on pg_class c  (cost=0.00..55.05 rows=13 width=76) (actual time=0.034..0.545 rows=320 loops=1)
                      Filter: ((relkind)::text = ANY ('{r,m}'::text[]))
                      Rows Removed by Filter: 950
                ->  Hash  (cost=1.10..1.10 rows=10 width=68) (actual time=0.022..0.022 rows=10 loops=1)
                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                      ->  Seq Scan on pg_namespace ns  (cost=0.00..1.10 rows=10 width=68) (actual time=0.010..0.011 rows=10 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  4:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  5:> LOG:  00000: duration: 0.011 ms  plan:
        Query Text: SELECT pg_catalog.set_config('search_path', '', false);
        Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.008..0.008 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  6:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB  2019-10-09 15:47:01 CEST  00000 22:> LOG:  00000: server process (PID 4708) was terminated by exception 0xC0000005
WHB  2019-10-09 15:47:01 CEST  00000 23:> DETAIL:  Failed process was running: ANALYZE admin.q_tbl_archiv;
WHB  2019-10-09 15:47:01 CEST  00000 24:> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
WHB  2019-10-09 15:47:01 CEST  00000 25:> LOCATION:  LogChildExit, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3670
WHB  2019-10-09 15:47:01 CEST  00000 26:> LOG:  00000: terminating any other active server processes
WHB  2019-10-09 15:47:01 CEST  00000 27:> LOCATION:  HandleChildCrash, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3400
WHB psql 2019-10-09 15:47:01 CEST  57P02 10:> WARNING:  57P02: terminating connection because of crash of another server process
WHB psql 2019-10-09 15:47:01 CEST  57P02 11:> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB psql 2019-10-09 15:47:01 CEST  57P02 12:> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WHB psql 2019-10-09 15:47:01 CEST  57P02 13:> LOCATION:  quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB  2019-10-09 15:47:02 CEST  57P02  3:> WARNING:  57P02: terminating connection because of crash of another server process
WHB  2019-10-09 15:47:02 CEST  57P02  4:> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB  2019-10-09 15:47:02 CEST  57P02  5:> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WHB  2019-10-09 15:47:02 CEST  57P02  6:> LOCATION:  quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB  2019-10-09 15:47:02 CEST  00000 28:> LOG:  00000: all server processes terminated; reinitializing
WHB  2019-10-09 15:47:02 CEST  00000 29:> LOCATION:  PostmasterStateMachine, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3912
WHB  2019-10-09 15:47:02 CEST  00000  1:> LOG:  00000: database system was interrupted; last known up at 2019-10-09 15:46:03 CEST
WHB  2019-10-09 15:47:02 CEST  00000  2:> LOCATION:  StartupXLOG, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\access\transam\xlog.c:6277


The table was imported successively via pg_dump/pg_restore from the previous versions into pg 11.

This is the same as what I did on the other machine (pg 11.5). On that test machine I could successfully export the table with pg_dump -t.

On the broken PG 12 cluster I succeeded in recreating a similar table with the original create table statement: no errors.

On the upgraded PG 12 cluster I tried selecting only the first column (select table_name from q_tbl_archiv) and got erroneous results (first 2 entries shown):

cpsdb=# select table_name from q_tbl_archiv;
                 table_name
---------------------------------------------
 \x11chemmat\x17chm_season
 !collectionsheet\x15cs_season

It seems that the length bytes are present in the output; apparently the old one-byte varlena length headers are now being read as part of the string.

Hope this information helps.

Hans Buschmann




Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
Well, I think I found the root cause. It's because of 7c15cef86d, which
changed the definition of sql_identifier so that it's a domain over name
instead of varchar. So we now have this:

  SELECT typname, typlen FROM pg_type WHERE typname = 'sql_identifier';

  -[ RECORD 1 ]--+---------------
  typname        | sql_identifier
  typlen         | -1

instead of this

  -[ RECORD 1 ]--+---------------
  typname        | sql_identifier
  typlen         | 64
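
The on-disk consequence can be seen directly; a small sketch (sizes assume
the stock catalogs on each version):

  SELECT pg_column_size('foo'::information_schema.sql_identifier);
  -- PG 11: 4   (short varlena: one header byte + 3 data bytes)
  -- PG 12: 64  (fixed-width name)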

Unfortunately, that seems very much like a break of on-disk format, and
after pg_upgrade any table containing sql_identifier columns is pretty
much guaranteed to be badly mangled. For example, the first row from the
table used in the original report looks like this on PostgreSQL 11:

  test=# select ctid, * from q_tbl_archiv limit 1;
  -[ RECORD 1 ]----+--------------------------
  ctid             | (0,1)
  table_name       | _pg_foreign_data_wrappers
  column_name      | foreign_data_wrapper_name
  ordinal_position | 5
  col_qualifier    | foreign_data_wrapper_name
  id_column        |
  id_default       |

while on PostgreSQL 12 after pg_upgrade it looks like this

  test=# select ctid, table_name, column_name, ordinal_position from q_tbl_archiv limit 1;
  -[ RECORD 1 ]----+---------------------------------------------------------
  ctid             | (0,1)
  table_name       | 5_pg_foreign_data_wrappers5foreign_data_wrapper_name\x05
  column_name      | _data_wrapper_name
  ordinal_position | 0

Not sure what to do about this :-(


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane
Tomas Vondra <[hidden email]> writes:
> Well, I think I found the root cause. It's because of 7c15cef86d, which
> changed the definition of sql_identifier so that it's a domain over name
> instead of varchar.

Ah...

> Not sure what to do about this :-(

Fortunately, there should be close to zero people with user tables
depending on sql_identifier.  I think we should just add a test in
pg_upgrade that refuses to upgrade if there are any such columns.
It won't be the first such restriction.

                        regards, tom lane



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:

>Tomas Vondra <[hidden email]> writes:
>> Well, I think I found the root cause. It's because of 7c15cef86d, which
>> changed the definition of sql_identifier so that it's a domain over name
>> instead of varchar.
>
>Ah...
>
>> Not sure what to do about this :-(
>
>Fortunately, there should be close to zero people with user tables
>depending on sql_identifier.  I think we should just add a test in
>pg_upgrade that refuses to upgrade if there are any such columns.
>It won't be the first such restriction.
>

Hmmm, yeah.  I agree the number of people using sql_identifier in user
tables is low, but OTOH we got this report within a week after release,
so maybe it's higher than we think.

Another option would be to teach pg_upgrade to switch the columns to
'text' or 'varchar'; not sure if that's possible or how much work that
would be.
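
For an affected user willing to do that by hand, the manual equivalent would
be to rewrite the columns on the old cluster before running pg_upgrade; a
sketch against the table from this report:

  ALTER TABLE q_tbl_archiv
    ALTER COLUMN table_name  TYPE text,
    ALTER COLUMN column_name TYPE text,
    ALTER COLUMN id_column   TYPE text,
    ALTER COLUMN id_default  TYPE text;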

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane
Tomas Vondra <[hidden email]> writes:
> On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>> Fortunately, there should be close to zero people with user tables
>> depending on sql_identifier.  I think we should just add a test in
>> pg_upgrade that refuses to upgrade if there are any such columns.
>> It won't be the first such restriction.

> Hmmm, yeah.  I agree the number of people using sql_identifier in user
> tables is low, but OTOH we got this report within a week after release,
> so maybe it's higher than we think.

True.

> Another option would be to teach pg_upgrade to switch the columns to
> 'text' or 'varchar', not sure if that's possible or how much work would
> that be.

I think it'd be a mess --- the actual hacking would have to happen in
pg_dump, I think, and it'd be a kluge because pg_dump doesn't normally
understand what server version its output is going to.  So we'd more
or less have to control it through a new pg_dump switch that pg_upgrade
would use.  Ick.

Also, even if we did try to silently convert such columns that way,
I bet we'd get other bug reports about "why'd my columns suddenly
change type?".  So I'd rather force the user to be involved.

                        regards, tom lane



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Andres Freund
On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
> Also, even if we did try to silently convert such columns that way,
> I bet we'd get other bug reports about "why'd my columns suddenly
> change type?".  So I'd rather force the user to be involved.

+1



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Wed, Oct 09, 2019 at 06:48:13PM -0700, Andres Freund wrote:
>On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
>> Also, even if we did try to silently convert such columns that way,
>> I bet we'd get other bug reports about "why'd my columns suddenly
>> change type?".  So I'd rather force the user to be involved.
>
>+1

Fair enough, attached is a patch doing that, I think. Maybe the file
should be named differently, as it contains other objects than just
tables.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: pg-upgrade-sql-identifier-fix.patch (4K)

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane
Tomas Vondra <[hidden email]> writes:
> Fair enough, attached is a patch doing that, I think. Maybe the file
> should be named differently, as it contains other objects than just
> tables.

Seems about right, though I notice it will not detect domains over
sql_identifier.  How much do we care about that?

To identify such domains, I think we'd need something like
WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
We did support those starting with 8.4, which is as far back as
pg_upgrade will go, so in theory it should work.  But I think we
had bugs with such cases in old releases.  Do we want to assume
that the source server has been updated enough to avoid any such
bugs?  The expense of such a query might be daunting, too.

                        regards, tom lane



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:

>Tomas Vondra <[hidden email]> writes:
>> Fair enough, attached is a patch doing that, I think. Maybe the file
>> should be named differently, as it contains other objects than just
>> tables.
>
>Seems about right, though I notice it will not detect domains over
>sql_identifier.  How much do we care about that?
>
>To identify such domains, I think we'd need something like
>WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>We did support those starting with 8.4, which is as far back as
>pg_upgrade will go, so in theory it should work.  But I think we
>had bugs with such cases in old releases.  Do we want to assume
>that the source server has been updated enough to avoid any such
>bugs?  The expense of such a query might be daunting, too.
>

Not sure.

Regarding bugs, I think it's fine to assume the users are running
sufficiently recent version - they may not, but then they're probably
subject to various other bugs (data corruption, queries). If they're
not, then they'll either get false positives (in which case they'll be
forced to update) or false negatives (which is just as if we did
nothing).

For the query cost, I think we can assume the domain hierarchies are not
particularly deep (in practice I'd expect just domains directly on the
sql_identifier type). And I doubt people are using that very widely,
it's probably more like this report - ad-hoc CTAS, so just a couple of
items. So I wouldn't expect it to be a huge deal in most cases. But even
if it takes a second or two, it's a one-time cost.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane
Tomas Vondra <[hidden email]> writes:
> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>> To identify such domains, I think we'd need something like
>> WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>> We did support those starting with 8.4, which is as far back as
>> pg_upgrade will go, so in theory it should work.  But I think we
>> had bugs with such cases in old releases.  Do we want to assume
>> that the source server has been updated enough to avoid any such
>> bugs?  The expense of such a query might be daunting, too.

> For the query cost, I think we can assume the domain hierarchies are not
> particularly deep (in practice I'd expect just domains directly on the
> sql_identifier type). And I doubt people are using that very widely,
> it's probably more like this report - ad-hoc CTAS, so just a couple of
> items. So I wouldn't expect it to be a huge deal in most cases. But even
> if it takes a second or two, it's a one-time cost.

What I was worried about was the planner possibly trying to apply the
atttypid restriction as a scan qual using a subplan, which might be rather
awful.  But it doesn't look like that happens.  I get a hash semijoin to
the CTE output, in all branches back to 8.4, on this trial query:

explain
with recursive sqlidoids(toid) as (
select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
union
select oid from pg_catalog.pg_type, sqlidoids
  where typtype = 'd' and typbasetype = sqlidoids.toid
)                      
SELECT n.nspname, c.relname, a.attname
FROM    pg_catalog.pg_class c,
        pg_catalog.pg_namespace n,
        pg_catalog.pg_attribute a
WHERE   c.oid = a.attrelid AND
        NOT a.attisdropped AND
        a.atttypid in (select toid from sqlidoids) AND
        c.relkind IN ('r','v','i') and
        c.relnamespace = n.oid AND
        n.nspname !~ '^pg_temp_' AND
        n.nspname !~ '^pg_toast_temp_' AND
        n.nspname NOT IN ('pg_catalog', 'information_schema');

                        regards, tom lane



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Thu, Oct 10, 2019 at 04:14:20PM -0400, Tom Lane wrote:

>What I was worried about was the planner possibly trying to apply the
>atttypid restriction as a scan qual using a subplan, which might be rather
>awful.  But it doesn't look like that happens.

OK.

> I get a hash semijoin to
>the CTE output, in all branches back to 8.4, on this trial query:
>
>[...]

I think that's not quite sufficient - the problem is that we can have
domains and composite types on sql_identifier, in some arbitrary order.
And the recursive CTE won't handle that the way it's written - it will
miss domains on composite types containing sql_identifier. And we have
quite a few of them in the information schema, so maybe someone created
a domain on one of those (however unlikely it may seem).

I think this recursive CTE does it correctly:

WITH RECURSIVE oids AS (
    -- type itself
    SELECT 'information_schema.sql_identifier'::regtype AS oid
    UNION ALL
    SELECT * FROM (
        -- domains on the type
        WITH x AS (SELECT oid FROM oids)
        SELECT t.oid FROM pg_catalog.pg_type t, x WHERE typbasetype = x.oid AND typtype = 'd'
        UNION
        -- composite types containing the type
        SELECT t.oid FROM pg_catalog.pg_type t, pg_catalog.pg_class c, pg_catalog.pg_attribute a, x
        WHERE t.typtype = 'c' AND
              t.oid = c.reltype AND
              c.oid = a.attrelid AND
              a.atttypid = x.oid
    ) foo
)

I had to use CTE within CTE, because the 'oids' can be referenced only
once, but we have two subqueries there. Maybe there's a better solution.
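
As a usage sketch, here is that CTE wired into the column scan from the
earlier trial query (restricted to relkinds 'r' and 'm', since only tables
and matviews actually store the values on disk):

  WITH RECURSIVE oids AS (
      -- type itself
      SELECT 'information_schema.sql_identifier'::regtype AS oid
      UNION ALL
      SELECT * FROM (
          -- domains on the type
          WITH x AS (SELECT oid FROM oids)
          SELECT t.oid FROM pg_catalog.pg_type t, x
          WHERE typbasetype = x.oid AND typtype = 'd'
          UNION
          -- composite types containing the type
          SELECT t.oid FROM pg_catalog.pg_type t, pg_catalog.pg_class c,
                            pg_catalog.pg_attribute a, x
          WHERE t.typtype = 'c' AND
                t.oid = c.reltype AND
                c.oid = a.attrelid AND
                a.atttypid = x.oid
      ) foo
  )
  SELECT n.nspname, c.relname, a.attname
  FROM pg_catalog.pg_class c,
       pg_catalog.pg_namespace n,
       pg_catalog.pg_attribute a
  WHERE c.oid = a.attrelid AND
        NOT a.attisdropped AND
        a.atttypid IN (SELECT oid FROM oids) AND
        c.relkind IN ('r', 'm') AND
        c.relnamespace = n.oid AND
        n.nspname NOT IN ('pg_catalog', 'information_schema');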


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
OK,

here is an updated patch, with the recursive CTE. I've done a fair
amount of testing on it on older versions (back to 9.4), and it seems to
work just fine.

Another thing that I noticed is that the query does not need to look at
RELKIND_COMPOSITE_TYPE, because we only really care about cases when
sql_identifier is stored on-disk. Composite type alone does not do that,
and the CTE includes OIDs of composite types that we then check against
relations and matviews.

Barring objections, I'll push this early next week.


BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
that seems wrong too. The comment explicitly says:

 * Also check composite types, in case they are used for table columns.

but even a simple "create type c as (a unknown, b int)" without any
table using it is enough to trigger the failure. But maybe it's
intentional, not sure.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: pg-upgrade-sql-identifier-fix-v2.patch (5K)

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane
Tomas Vondra <[hidden email]> writes:
> here is an updated patch, with the recursive CTE. I've done a fair
> amount of testing on it on older versions (up to 9.4), and it seems to
> work just fine.

Might be a good idea to exclude attisdropped columns in the part of the
recursive query that's looking for sql_identifier columns of composite
types.  I'm not sure if composites can have dropped columns today,
but even if they can't it seems like a wise bit of future-proofing.
(We'll no doubt have occasion to use this logic again...)

Looks good other than that nit.
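
Concretely, that nit would amount to something like this in the
composite-type branch of the CTE (a sketch):

  -- composite types containing the type
  SELECT t.oid FROM pg_catalog.pg_type t, pg_catalog.pg_class c,
                    pg_catalog.pg_attribute a, x
  WHERE t.typtype = 'c' AND
        t.oid = c.reltype AND
        c.oid = a.attrelid AND
        NOT a.attisdropped AND
        a.atttypid = x.oid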

>> BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
> that seems wrong too.

Yeah, we should back-port this logic into that check too, IMO.

                        regards, tom lane



Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra
On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:

>Tomas Vondra <[hidden email]> writes:
>> here is an updated patch, with the recursive CTE. I've done a fair
>> amount of testing on it on older versions (up to 9.4), and it seems to
>> work just fine.
>
>Might be a good idea to exclude attisdropped columns in the part of the
>recursive query that's looking for sql_identifier columns of composite
>types.  I'm not sure if composites can have dropped columns today,
>but even if they can't it seems like a wise bit of future-proofing.
>(We'll no doubt have occasion to use this logic again...)
>

Hmm? How could that be safe? Let's say we have a composite type with a
sql_identifier column, it's used in a table with data, and we drop the
column. We need the pg_type information to parse the existing data, so
how could we skip attisdropped columns?

>Looks good other than that nit.
>
>> BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
>> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
>> that seems wrong too.
>
>Yeah, we should back-port this logic into that check too, IMO.
>

You mean the recursive CTE, removal of RELKIND_COMPOSITE_TYPE or the
proposed change w.r.t. dropped columns?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

