BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop


BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
The following bug has been logged on the website:

Bug reference:      14932
Logged by:          Todd Cook
Email address:      [hidden email]
PostgreSQL version: 10.1
Operating system:   CentOS Linux release 7.4.1708 (Core)
Description:        

It hangs on a table with 167834 rows, though it works fine with only 167833
rows.  When it hangs, CTRL-C does not interrupt it, and the backend has to
be killed to stop it.

Some sample stack traces:

#0  0x00007f66f1ee9860 in __memset_sse2 () from /lib64/libc.so.6
#1  0x000000000083e085 in memset (__len=51539607552, __ch=0,
__dest=0x7f4cdf35c048) at /usr/include/bits/string3.h:84
#2  MemoryContextAllocExtended (context=<optimized out>, size=51539607552,
flags=flags@entry=5) at mcxt.c:843
#3  0x00000000005ec804 in tuplehash_allocate (type=0x163ecc8,
size=<optimized out>) at ../../../src/include/lib/simplehash.h:305
#4  tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>) at
../../../src/include/lib/simplehash.h:379
#5  0x00000000005ece35 in tuplehash_insert (tb=0x163ecc8, key=key@entry=0x0,
found=found@entry=0x7ffdaff3eb77 "") at
../../../src/include/lib/simplehash.h:504
#6  0x00000000005ed3ea in LookupTupleHashEntry (hashtable=0x163ec38,
slot=slot@entry=0x163e220, isnew=isnew@entry=0x7ffdaff3ebd7 "") at
execGrouping.c:387
#7  0x00000000005fae62 in lookup_hash_entry (aggstate=0x163ce88) at
nodeAgg.c:2075
#8  lookup_hash_entries (aggstate=aggstate@entry=0x163ce88) at
nodeAgg.c:2106
#9  0x00000000005fc5da in agg_fill_hash_table (aggstate=0x163ce88) at
nodeAgg.c:2536
#10 ExecAgg (pstate=0x163ce88) at nodeAgg.c:2140
#11 0x00000000005eec32 in ExecProcNode (node=0x163ce88) at
../../../src/include/executor/executor.h:250
#12 ExecutePlan (execute_once=<optimized out>, dest=0x163bbf8,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x163ce88, estate=0x163cc28) at execMain.c:1722
#13 standard_ExecutorRun (queryDesc=0x163c818, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#14 0x0000000000718e3b in PortalRunSelect (portal=portal@entry=0x158a568,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x163bbf8) at pquery.c:932
#15 0x000000000071a1ef in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x163bbf8, altdest=0x163bbf8, completionTag=0x7ffdaff3eed0 "") at
pquery.c:773
#16 0x0000000000716163 in exec_simple_query (query_string=<optimized out>)
at postgres.c:1099
#17 0x000000000071745c in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4088
#18 0x000000000047ad1c in BackendRun (port=0x1592ab0) at postmaster.c:4357
#19 BackendStartup (port=0x1592ab0) at postmaster.c:4029
#20 ServerLoop () at postmaster.c:1753
#21 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x156bed0) at postmaster.c:1361
#22 0x000000000047bb4f in main (argc=3, argv=0x156bed0) at main.c:228
Continuing.

Program received signal SIGINT, Interrupt.
0x00000000005ec8ac in tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>)
at ../../../src/include/lib/simplehash.h:443
443 newentry = &newdata[curelem];
#0  0x00000000005ec8ac in tuplehash_grow (tb=0x163ecc8, newsize=<optimized
out>) at ../../../src/include/lib/simplehash.h:443
#1  0x00000000005ece35 in tuplehash_insert (tb=0x163ecc8, key=key@entry=0x0,
found=found@entry=0x7ffdaff3eb77 "") at
../../../src/include/lib/simplehash.h:504
#2  0x00000000005ed3ea in LookupTupleHashEntry (hashtable=0x163ec38,
slot=slot@entry=0x163e220, isnew=isnew@entry=0x7ffdaff3ebd7 "") at
execGrouping.c:387
#3  0x00000000005fae62 in lookup_hash_entry (aggstate=0x163ce88) at
nodeAgg.c:2075
#4  lookup_hash_entries (aggstate=aggstate@entry=0x163ce88) at
nodeAgg.c:2106
#5  0x00000000005fc5da in agg_fill_hash_table (aggstate=0x163ce88) at
nodeAgg.c:2536
#6  ExecAgg (pstate=0x163ce88) at nodeAgg.c:2140
#7  0x00000000005eec32 in ExecProcNode (node=0x163ce88) at
../../../src/include/executor/executor.h:250
#8  ExecutePlan (execute_once=<optimized out>, dest=0x163bbf8,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x163ce88, estate=0x163cc28) at execMain.c:1722
#9  standard_ExecutorRun (queryDesc=0x163c818, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#10 0x0000000000718e3b in PortalRunSelect (portal=portal@entry=0x158a568,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x163bbf8) at pquery.c:932
#11 0x000000000071a1ef in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x163bbf8, altdest=0x163bbf8, completionTag=0x7ffdaff3eed0 "") at
pquery.c:773
#12 0x0000000000716163 in exec_simple_query (query_string=<optimized out>)
at postgres.c:1099
#13 0x000000000071745c in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4088
#14 0x000000000047ad1c in BackendRun (port=0x1592ab0) at postmaster.c:4357
#15 BackendStartup (port=0x1592ab0) at postmaster.c:4029
#16 ServerLoop () at postmaster.c:1753
#17 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x156bed0) at postmaster.c:1361
#18 0x000000000047bb4f in main (argc=3, argv=0x156bed0) at main.c:228
Continuing.

Program received signal SIGINT, Interrupt.
0x00000000005ec8ac in tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>)
at ../../../src/include/lib/simplehash.h:443
443 newentry = &newdata[curelem];
#0  0x00000000005ec8ac in tuplehash_grow (tb=0x163ecc8, newsize=<optimized
out>) at ../../../src/include/lib/simplehash.h:443
#1  0x00000000005ece35 in tuplehash_insert (tb=0x163ecc8, key=key@entry=0x0,
found=found@entry=0x7ffdaff3eb77 "") at
../../../src/include/lib/simplehash.h:504
#2  0x00000000005ed3ea in LookupTupleHashEntry (hashtable=0x163ec38,
slot=slot@entry=0x163e220, isnew=isnew@entry=0x7ffdaff3ebd7 "") at
execGrouping.c:387
#3  0x00000000005fae62 in lookup_hash_entry (aggstate=0x163ce88) at
nodeAgg.c:2075
#4  lookup_hash_entries (aggstate=aggstate@entry=0x163ce88) at
nodeAgg.c:2106
#5  0x00000000005fc5da in agg_fill_hash_table (aggstate=0x163ce88) at
nodeAgg.c:2536
#6  ExecAgg (pstate=0x163ce88) at nodeAgg.c:2140
#7  0x00000000005eec32 in ExecProcNode (node=0x163ce88) at
../../../src/include/executor/executor.h:250
#8  ExecutePlan (execute_once=<optimized out>, dest=0x163bbf8,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x163ce88, estate=0x163cc28) at execMain.c:1722
#9  standard_ExecutorRun (queryDesc=0x163c818, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#10 0x0000000000718e3b in PortalRunSelect (portal=portal@entry=0x158a568,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x163bbf8) at pquery.c:932
#11 0x000000000071a1ef in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x163bbf8, altdest=0x163bbf8, completionTag=0x7ffdaff3eed0 "") at
pquery.c:773
#12 0x0000000000716163 in exec_simple_query (query_string=<optimized out>)
at postgres.c:1099
#13 0x000000000071745c in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4088
#14 0x000000000047ad1c in BackendRun (port=0x1592ab0) at postmaster.c:4357
#15 BackendStartup (port=0x1592ab0) at postmaster.c:4029
#16 ServerLoop () at postmaster.c:1753
#17 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x156bed0) at postmaster.c:1361
#18 0x000000000047bb4f in main (argc=3, argv=0x156bed0) at main.c:228
Continuing.

Program received signal SIGINT, Interrupt.
tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>) at
../../../src/include/lib/simplehash.h:450
450 curelem = SH_NEXT(tb, curelem, startelem);
#0  tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>) at
../../../src/include/lib/simplehash.h:450
#1  0x00000000005ece35 in tuplehash_insert (tb=0x163ecc8, key=key@entry=0x0,
found=found@entry=0x7ffdaff3eb77 "") at
../../../src/include/lib/simplehash.h:504
#2  0x00000000005ed3ea in LookupTupleHashEntry (hashtable=0x163ec38,
slot=slot@entry=0x163e220, isnew=isnew@entry=0x7ffdaff3ebd7 "") at
execGrouping.c:387
#3  0x00000000005fae62 in lookup_hash_entry (aggstate=0x163ce88) at
nodeAgg.c:2075
#4  lookup_hash_entries (aggstate=aggstate@entry=0x163ce88) at
nodeAgg.c:2106
#5  0x00000000005fc5da in agg_fill_hash_table (aggstate=0x163ce88) at
nodeAgg.c:2536
#6  ExecAgg (pstate=0x163ce88) at nodeAgg.c:2140
#7  0x00000000005eec32 in ExecProcNode (node=0x163ce88) at
../../../src/include/executor/executor.h:250
#8  ExecutePlan (execute_once=<optimized out>, dest=0x163bbf8,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x163ce88, estate=0x163cc28) at execMain.c:1722
#9  standard_ExecutorRun (queryDesc=0x163c818, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#10 0x0000000000718e3b in PortalRunSelect (portal=portal@entry=0x158a568,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x163bbf8) at pquery.c:932
#11 0x000000000071a1ef in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x163bbf8, altdest=0x163bbf8, completionTag=0x7ffdaff3eed0 "") at
pquery.c:773
#12 0x0000000000716163 in exec_simple_query (query_string=<optimized out>)
at postgres.c:1099
#13 0x000000000071745c in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4088
#14 0x000000000047ad1c in BackendRun (port=0x1592ab0) at postmaster.c:4357
#15 BackendStartup (port=0x1592ab0) at postmaster.c:4029
#16 ServerLoop () at postmaster.c:1753
#17 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x156bed0) at postmaster.c:1361
#18 0x000000000047bb4f in main (argc=3, argv=0x156bed0) at main.c:228
Continuing.

Program received signal SIGINT, Interrupt.
0x00000000005ec8ac in tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>)
at ../../../src/include/lib/simplehash.h:443
443 newentry = &newdata[curelem];
#0  0x00000000005ec8ac in tuplehash_grow (tb=0x163ecc8, newsize=<optimized
out>) at ../../../src/include/lib/simplehash.h:443
#1  0x00000000005ece35 in tuplehash_insert (tb=0x163ecc8, key=key@entry=0x0,
found=found@entry=0x7ffdaff3eb77 "") at
../../../src/include/lib/simplehash.h:504
#2  0x00000000005ed3ea in LookupTupleHashEntry (hashtable=0x163ec38,
slot=slot@entry=0x163e220, isnew=isnew@entry=0x7ffdaff3ebd7 "") at
execGrouping.c:387
#3  0x00000000005fae62 in lookup_hash_entry (aggstate=0x163ce88) at
nodeAgg.c:2075
#4  lookup_hash_entries (aggstate=aggstate@entry=0x163ce88) at
nodeAgg.c:2106
#5  0x00000000005fc5da in agg_fill_hash_table (aggstate=0x163ce88) at
nodeAgg.c:2536
#6  ExecAgg (pstate=0x163ce88) at nodeAgg.c:2140
#7  0x00000000005eec32 in ExecProcNode (node=0x163ce88) at
../../../src/include/executor/executor.h:250
#8  ExecutePlan (execute_once=<optimized out>, dest=0x163bbf8,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x163ce88, estate=0x163cc28) at execMain.c:1722
#9  standard_ExecutorRun (queryDesc=0x163c818, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#10 0x0000000000718e3b in PortalRunSelect (portal=portal@entry=0x158a568,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x163bbf8) at pquery.c:932
#11 0x000000000071a1ef in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x163bbf8, altdest=0x163bbf8, completionTag=0x7ffdaff3eed0 "") at
pquery.c:773
#12 0x0000000000716163 in exec_simple_query (query_string=<optimized out>)
at postgres.c:1099
#13 0x000000000071745c in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4088
#14 0x000000000047ad1c in BackendRun (port=0x1592ab0) at postmaster.c:4357
#15 BackendStartup (port=0x1592ab0) at postmaster.c:4029
#16 ServerLoop () at postmaster.c:1753
#17 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x156bed0) at postmaster.c:1361
#18 0x000000000047bb4f in main (argc=3, argv=0x156bed0) at main.c:228
Continuing.

Program received signal SIGINT, Interrupt.
0x00000000005ec8b4 in tuplehash_grow (tb=0x163ecc8, newsize=<optimized out>)
at ../../../src/include/lib/simplehash.h:445
445 if (newentry->status == SH_STATUS_EMPTY)
#0  0x00000000005ec8b4 in tuplehash_grow (tb=0x163ecc8, newsize=<optimized
out>) at ../../../src/include/lib/simplehash.h:445
#1  0x00000000005ece35 in tuplehash_insert (tb=0x163ecc8, key=key@entry=0x0,
found=found@entry=0x7ffdaff3eb77 "") at
../../../src/include/lib/simplehash.h:504
#2  0x00000000005ed3ea in LookupTupleHashEntry (hashtable=0x163ec38,
slot=slot@entry=0x163e220, isnew=isnew@entry=0x7ffdaff3ebd7 "") at
execGrouping.c:387
#3  0x00000000005fae62 in lookup_hash_entry (aggstate=0x163ce88) at
nodeAgg.c:2075
#4  lookup_hash_entries (aggstate=aggstate@entry=0x163ce88) at
nodeAgg.c:2106
#5  0x00000000005fc5da in agg_fill_hash_table (aggstate=0x163ce88) at
nodeAgg.c:2536
#6  ExecAgg (pstate=0x163ce88) at nodeAgg.c:2140
#7  0x00000000005eec32 in ExecProcNode (node=0x163ce88) at
../../../src/include/executor/executor.h:250
#8  ExecutePlan (execute_once=<optimized out>, dest=0x163bbf8,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x163ce88, estate=0x163cc28) at execMain.c:1722
#9  standard_ExecutorRun (queryDesc=0x163c818, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:363
#10 0x0000000000718e3b in PortalRunSelect (portal=portal@entry=0x158a568,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x163bbf8) at pquery.c:932
#11 0x000000000071a1ef in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x163bbf8, altdest=0x163bbf8, completionTag=0x7ffdaff3eed0 "") at
pquery.c:773
#12 0x0000000000716163 in exec_simple_query (query_string=<optimized out>)
at postgres.c:1099
#13 0x000000000071745c in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4088
#14 0x000000000047ad1c in BackendRun (port=0x1592ab0) at postmaster.c:4357
#15 BackendStartup (port=0x1592ab0) at postmaster.c:4029
#16 ServerLoop () at postmaster.c:1753
#17 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x156bed0) at postmaster.c:1361
#18 0x000000000047bb4f in main (argc=3, argv=0x156bed0) at main.c:228


select name, setting, unit, source, pending_restart from pg_settings where source <> 'default' and context <> 'internal' order by lower(name) ;

             name             |      setting       | unit |        source        | pending_restart
------------------------------+--------------------+------+----------------------+-----------------
 application_name             | psql               |      | client               | f
 autovacuum_work_mem          | 131072             | kB   | configuration file   | f
 checkpoint_completion_target | 0.8                |      | configuration file   | f
 checkpoint_timeout           | 1800               | s    | configuration file   | f
 client_encoding              | SQL_ASCII          |      | client               | f
 cluster_name                 | PG 10              |      | configuration file   | f
 DateStyle                    | ISO, MDY           |      | configuration file   | f
 default_text_search_config   | pg_catalog.english |      | configuration file   | f
 dynamic_shared_memory_type   | posix              |      | configuration file   | f
 effective_cache_size         | 8388608            | 8kB  | configuration file   | f
 lc_messages                  | C                  |      | configuration file   | f
 lc_monetary                  | C                  |      | configuration file   | f
 lc_numeric                   | C                  |      | configuration file   | f
 lc_time                      | C                  |      | configuration file   | f
 listen_addresses             | *                  |      | configuration file   | f
 log_destination              | stderr             |      | configuration file   | f
 log_line_prefix              | %m [%p]            |      | configuration file   | f
 log_lock_waits               | on                 |      | configuration file   | f
 log_min_duration_statement   | 20000              | ms   | configuration file   | f
 log_rotation_age             | 1440               | min  | configuration file   | f
 log_rotation_size            | 0                  | kB   | configuration file   | f
 log_temp_files               | 1024               | kB   | configuration file   | f
 log_timezone                 | US/Eastern         |      | configuration file   | f
 log_truncate_on_rotation     | off                |      | configuration file   | f
 logging_collector            | on                 |      | configuration file   | f
 maintenance_work_mem         | 1048576            | kB   | configuration file   | f
 max_connections              | 100                |      | configuration file   | f
 max_stack_depth              | 2048               | kB   | environment variable | f
 max_wal_senders              | 0                  |      | configuration file   | f
 max_wal_size                 | 4096               | MB   | configuration file   | f
 port                         | 54310              |      | configuration file   | f
 shared_buffers               | 1048576            | 8kB  | configuration file   | f
 synchronous_commit           | off                |      | configuration file   | f
 TimeZone                     | US/Eastern         |      | configuration file   | f
 transaction_deferrable       | off                |      | override             | f
 transaction_isolation        | read committed     |      | override             | f
 transaction_read_only        | off                |      | override             | f
 wal_buffers                  | 2048               | 8kB  | override             | f
 wal_level                    | minimal            |      | configuration file   | f
 work_mem                     | 65536              | kB   | configuration file   | f

select version();
                                                 version
-----------------------------------------------------------------------------------------------------------
 PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Tomas Vondra-4
Hi,

On 11/27/2017 07:57 PM, [hidden email] wrote:

> The following bug has been logged on the website:
>
> Bug reference:      14932
> Logged by:          Todd Cook
> Email address:      [hidden email]
> PostgreSQL version: 10.1
> Operating system:   CentOS Linux release 7.4.1708 (Core)
> Description:        
>
> It hangs on a table with 167834 rows, though it works fine with only 167833
> rows.  When it hangs, CTRL-C does not interrupt it, and the backend has to
> be killed to stop it.
>

Can you share the query and data, so that we can reproduce the issue?

Based on the stack traces this smells like a bug in the simplehash,
introduced in PostgreSQL 10. Perhaps somewhere in tuplehash_grow(),
which gets triggered for 167834 rows (but not for 167833).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
On 11/27/17 14:17, Tomas Vondra wrote:

> Hi,
>
> On 11/27/2017 07:57 PM, [hidden email] wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      14932
>> Logged by:          Todd Cook
>> Email address:      [hidden email]
>> PostgreSQL version: 10.1
>> Operating system:   CentOS Linux release 7.4.1708 (Core)
>> Description:
>>
>> It hangs on a table with 167834 rows, though it works fine with only 167833
>> rows.  When it hangs, CTRL-C does not interrupt it, and the backend has to
>> be killed to stop it.
>>
>
> Can you share the query and data, so that we can reproduce the issue?
>
> Based on the stack traces this smells like a bug in the simplehash,
> introduced in PostgreSQL 10. Perhaps somewhere in tuplehash_grow(),
> which gets triggered for 167834 rows (but not for 167833).
I've attached a reproducer using real data.  (FWIW, I wasn't able to reproduce with
fake data made with generate_series().)

Also, I forgot to mention that this problem is not present in 9.6 or 9.5.

-- todd

Attachment: reproducer.pgsql (4M)

Re: BUG #14891: Old cancel request presented by pgbouncer honored after skipping a query.

postgresql-11
As a workaround for this issue we've dropped the pgbouncer connection
lifetime to 60 seconds and that seems to have alleviated this for the
most part. No response from pgbouncer about this (either to the recently
created issue or the mailing list message last year when I initially
reported this).

Based on the comments in the past discussions on the postgres cancel
protocol it seems that this is not viewed as a big issue because there's
no real reports of it causing problems. Are other people just not using
pgbouncer in transaction mode with the default settings (or not having
two instances of pgbouncer between client and server)? Or typically
don't send many cancel requests? Or is there just something silly I'm
missing?

In one of our medium databases we see 5-15 cancel requests per day and
with pgbouncer on the hard coded default setting (3600 second connection
lifetime) we would get around 1 or 2 relevant erroneous cancels (one
that causes an insert to fail, typically failing a larger transaction)
per week. This was up from about 1 a month using the config default of
1800 seconds that we lived with for a long time.

Is there something better we should be using other than pgbouncer for
connection pooling?

On 2017-11-08 22:55, Skarsol wrote:

The following bug has been logged on the website:

Bug reference:      14891
Logged by:          Skarsol
Email address:      postgresql(at)skarsol(dot)com
PostgreSQL version: 9.6.3
Operating system:   Linux 4.4.8-hardened-r1 #4 SMP Mon Jun 12
Description:

This might be a symptom of the issue discussed in the ToDo "Changes to
make
cancellations more reliable and more secure" but as it is related to the
pgbouncer bug I've opened at
https://github.com/pgbouncer/pgbouncer/issues/245 I figured I'd post it
over
here just to make sure.

As the last step of this bug, pgbouncer 1.7.2 presents a cancel request
to
postgres 9.6.3. This request targets pid 29330 which is connected to
pgbouncer on port 33024. That pid then accepts a new query, returns a
result
set, accepts another new query, and then cancels that one out.

Expected behavior would have been for either no cancel (as that pid was
between queries at the time) or to cancel the first query. Cancelling
the
2nd query is just weird (to me).

I have no idea how much of this is related to whatever pgbouncer is
doing to
delay the cancel in the first place before presenting it to postgres.

I'm aware that we're 2 minor versions behind, but I don't see anything
that
seems relevant to this in the changelogs.

Image of the relevant wireshark display at
https://user-images.githubusercontent.com/1915152/32578433-d5d4a71c-c4a2-11e7-9d25-f59d5afbb06b.jpg


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Andres Freund
In reply to this post by Todd A. Cook
Hi,

On 2017-11-27 15:59:37 -0500, Todd A. Cook wrote:

> COPY reproducer (val) FROM stdin;
> 2976219712004784288
> -6429122065899879392
> -7471109962990387136
> -7471109962990387136
> -2895470491222113184
> -4083509061952565472
> 1019481548263425664
> 4639248884787347648
> -6999443831165647744
> -4199917803455020480
> -4110530183001439680

How are these values generated? They look an awful lot like hash values
(~same length, full numerical range)...

I'm kinda neck deep in something else, but I'll try to have a look
afterwards.

Greetings,

Andres Freund


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Thomas Munro-3
On Tue, Nov 28, 2017 at 11:10 AM, Andres Freund <[hidden email]> wrote:

> On 2017-11-27 15:59:37 -0500, Todd A. Cook wrote:
>> COPY reproducer (val) FROM stdin;
>> 2976219712004784288
>> -6429122065899879392
>> -7471109962990387136
>> -7471109962990387136
>> -2895470491222113184
>> -4083509061952565472
>> 1019481548263425664
>> 4639248884787347648
>> -6999443831165647744
>> -4199917803455020480
>> -4110530183001439680
>
> How are these values generated? They look an awful lot like hash values
> (~same length, full numerical range)...

When SH_INSERT tries to insert that final extra value, insertdist
keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
double the size (at least until my computer gives up, somewhere around
11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
to 26 then it succeeds, but I guess some other attack could be crafted
for that.  What is the theory behind this parameter?
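
For illustration, here is a toy model of that behaviour (a paraphrase, not
the actual simplehash.h insert code, and the duplicated hash constant is
arbitrary): once more keys share a single hash value than SH_GROW_MAX_DIB
allows, the probe chain exceeds the cap at every table size, so the
grow-and-retry path can never succeed.

    /*
     * Toy model only, paraphrased from memory rather than copied from
     * simplehash.h: a linear-probing insert that asks for the table to be
     * grown whenever an entry would land more than SH_GROW_MAX_DIB buckets
     * past its initial bucket.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SH_GROW_MAX_DIB 25

    typedef struct { int used; uint32_t hash; } bucket;

    /* Returns 1 on success, 0 if the probe-distance cap was exceeded. */
    static int
    toy_insert(bucket *data, uint32_t mask, uint32_t hash)
    {
        uint32_t cur = hash & mask;     /* initial bucket = low bits of hash */
        uint32_t insertdist = 0;

        for (;;)
        {
            if (!data[cur].used)
            {
                data[cur].used = 1;
                data[cur].hash = hash;
                return 1;
            }
            cur = (cur + 1) & mask;     /* linear probing */
            if (++insertdist > SH_GROW_MAX_DIB)
                return 0;               /* caller grows the table and retries */
        }
    }

    int
    main(void)
    {
        const uint32_t dup_hash = 0xdeadbeef;   /* arbitrary duplicated hash */
        const int      ndups = 30;              /* more duplicates than the cap */

        for (uint32_t size = 1024; size <= (1u << 22); size <<= 2)
        {
            bucket *data = calloc(size, sizeof(bucket));
            int     ok = 1;

            for (int i = 0; i < ndups && ok; i++)
                ok = toy_insert(data, size - 1, dup_hash);

            /* prints "grow requested" for every table size */
            printf("table size %10u: %s\n", (unsigned) size,
                   ok ? "all fit" : "grow requested");
            free(data);
        }
        return 0;
    }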

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Tom Lane-2
Thomas Munro <[hidden email]> writes:
> When SH_INSERT tries to insert that final extra value, insertdist
> keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
> double the size (at least until my computer gives up, somewhere around
> 11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
> to 26 then it succeeds, but I guess some other attack could be crafted
> for that.  What is the theory behind this parameter?

You beat me to it --- after looking at simplehash.h I'd guessed that
either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
an infinite loop, but I'd not gotten to determining which one yet.

I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
of them are obviously loop-proof.

Note that the sample data has a lot of collisions:

regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
  hashint8   | count
-------------+-------
   441526644 |  2337
 -1117776826 |  1221
 -1202007016 |   935
 -2068831050 |   620
  1156644653 |   538
   553783815 |   510
   259780770 |   444
   371047036 |   394
   915722575 |   359
 ... etc etc ...

It's evidently more complicated than just that the code fails with
more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
by a lot.  There needs to be a safety valve that prevents letting
the hash fill factor approach zero, which is what's happening in
this test case.
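
One possible shape of such a safety valve, sketched for illustration only
(this is not a description of any committed fix, and SH_GROW_MIN_FILLFACTOR
is a name invented here): only let a long probe chain trigger a grow when
the table is reasonably full, since doubling a nearly empty table cannot
break up a chain of identical, or masked-identical, hash values.

    #include <stdint.h>

    #define SH_GROW_MIN_FILLFACTOR 0.1      /* hypothetical threshold */

    /*
     * Sketch only: treat a too-long probe chain as a reason to grow only if
     * the table is at least 10% full; otherwise accept the longer chain,
     * keeping the fill factor from collapsing toward zero.
     */
    static int
    long_chain_should_grow(uint64_t members, uint64_t size)
    {
        return (double) members / (double) size >= SH_GROW_MIN_FILLFACTOR;
    }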

(I wonder whether these loops oughtn't contain CHECK_FOR_INTERRUPTS,
btw.)

                        regards, tom lane


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Thomas Munro-3
On Tue, Nov 28, 2017 at 5:03 PM, Tom Lane <[hidden email]> wrote:

> You beat me to it --- after looking at simplehash.h I'd guessed that
> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
> an infinite loop, but I'd not gotten to determining which one yet.
>
> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
> of them are obviously loop-proof.
>
> Note that the sample data has a lot of collisions:
>
> regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
>   hashint8   | count
> -------------+-------
>    441526644 |  2337
>  -1117776826 |  1221
>  -1202007016 |   935
>  -2068831050 |   620
>   1156644653 |   538
>    553783815 |   510
>    259780770 |   444
>    371047036 |   394
>    915722575 |   359
>  ... etc etc ...
>
> It's evidently more complicated than just that the code fails with
> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
> by a lot.  There needs to be a safety valve that prevents letting
> the hash fill factor approach zero, which is what's happening in
> this test case.

Yeah.  If you count distinct hashint8(val) of *distinct* values, you get:

postgres=# select hashint8(val), count(*) from (select distinct val
from reproducer) ss group by 1 order by 2 desc limit 1;
  hashint8  | count
------------+-------
 1288181237 |    26
(1 row)

It doesn't matter how many bits of that you use, you'll get at least
26 collisions, so our loop can't terminate just by increasing the mask
size.  My guess is that you'd either need a defence based on something
like a secondary hash function, or at least a dynamic SH_GROW_MAX_DIB
that wouldn't allow the fill factor to plummet as you say.  A dataset
could be found that would exceed any static value of SH_GROW_MAX_DIB
by brute force.  In this case, considering the definition of hashint8,
there are more straightforward ways to find distinct bigint values
that hash to the same value... just swap some bits between the high
and low words, or something like that.
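
To illustrate that last point, assuming (worth double-checking against
src/backend/access/hash/hashfunc.c) that hashint8 folds the two 32-bit
halves together with XOR before hashing the 32-bit result: any two
non-negative bigints whose halves XOR to the same value must then collide,
for example by moving a bit pattern from the high word into the low word.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Assumed folding step of hashint8 for non-negative inputs. */
    static uint32_t
    fold_int8(int64_t val)
    {
        uint32_t lo = (uint32_t) val;
        uint32_t hi = (uint32_t) ((uint64_t) val >> 32);

        return lo ^ hi;
    }

    int
    main(void)
    {
        int64_t a = INT64_C(0x0123456700000000);    /* pattern in the high word */
        int64_t b = INT64_C(0x0000000001234567);    /* same pattern, low word */

        /* Equal folded values imply equal hashint8 results (assumed). */
        printf("fold(a) = %08" PRIx32 "\n", fold_int8(a));
        printf("fold(b) = %08" PRIx32 "\n", fold_int8(b));
        return 0;
    }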

--
Thomas Munro
http://www.enterprisedb.com


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Tomas Vondra-4
In reply to this post by Thomas Munro-3


On 11/28/2017 03:20 AM, Thomas Munro wrote:

> On Tue, Nov 28, 2017 at 11:10 AM, Andres Freund <[hidden email]> wrote:
>> On 2017-11-27 15:59:37 -0500, Todd A. Cook wrote:
>>> COPY reproducer (val) FROM stdin;
>>> 2976219712004784288
>>> -6429122065899879392
>>> -7471109962990387136
>>> -7471109962990387136
>>> -2895470491222113184
>>> -4083509061952565472
>>> 1019481548263425664
>>> 4639248884787347648
>>> -6999443831165647744
>>> -4199917803455020480
>>> -4110530183001439680
>>
>> How are these values generated? They look an awful lot like hash values
>> (~same length, full numerical range)...
>
> When SH_INSERT tries to insert that final extra value, insertdist
> keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
> double the size (at least until my computer gives up, somewhere around
> 11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
> to 26 then it succeeds, but I guess some other attack could be crafted
> for that.  What is the theory behind this parameter?
>

Yeah, I came to the same hypothesis yesterday, but I see I failed to
keep pgsql-bugs on CC :-(

FWIW I believe the last doubling (from 2147483648 to 2*2147483648) is
doomed to fail due to the sizemask=0 bug. I mean, if oldsize=2147483648,
then newsize=2*2147483648, which triggers this:

    if (tb->size == SH_MAX_SIZE)
        tb->sizemask = 0;

which would explain why the last grow did not complete even after 40
minutes (while the 1073741824->2147483648 grow took 15 seconds): with
sizemask=0, SH_NEXT/SH_PREV/etc. can only ever return 0.
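
A minimal sketch of why that loops forever (the helper below paraphrases
SH_NEXT rather than copying it): every bucket position is masked with
tb->sizemask, so with sizemask = 0 the probe in tuplehash_grow() collapses
back to bucket 0 on every step and never advances.

    #include <stdint.h>
    #include <stdio.h>

    /* Paraphrase of SH_NEXT: advance one bucket, wrapping via the size mask. */
    static uint32_t
    sh_next(uint32_t curelem, uint32_t sizemask)
    {
        return (curelem + 1) & sizemask;
    }

    int
    main(void)
    {
        uint32_t cur = 0;

        for (int i = 0; i < 5; i++)
        {
            cur = sh_next(cur, 0);                  /* the buggy SH_MAX_SIZE case */
            printf("next bucket: %u\n", (unsigned) cur);    /* always 0 */
        }
        return 0;
    }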

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
In reply to this post by Andres Freund
On 11/27/17 17:10, Andres Freund wrote:

> Hi,
>
> On 2017-11-27 15:59:37 -0500, Todd A. Cook wrote:
>> COPY reproducer (val) FROM stdin;
>> 2976219712004784288
>> -6429122065899879392
>> -7471109962990387136
>> -7471109962990387136
>> -2895470491222113184
>> -4083509061952565472
>> 1019481548263425664
>> 4639248884787347648
>> -6999443831165647744
>> -4199917803455020480
>> -4110530183001439680
>
> How are these values generated? They look an awful lot like hash values
> (~same length, full numerical range)...

They are biased hashes.  FWIW, the reproducer data is from one column of a
three-column primary key.

-- todd


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
In reply to this post by Tom Lane-2
On 11/27/17 23:03, Tom Lane wrote:

>
> Note that the sample data has a lot of collisions:
>
> regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
>    hashint8   | count
> -------------+-------
>     441526644 |  2337
>   -1117776826 |  1221
>   -1202007016 |   935
>   -2068831050 |   620
>    1156644653 |   538
>     553783815 |   510
>     259780770 |   444
>     371047036 |   394
>     915722575 |   359
>   ... etc etc ...

In case it matters, the complete data set will have some outlier values with 10k to 100k
collisions in this column.

-- todd


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Tomas Vondra-4


On 11/28/2017 03:55 PM, Todd A. Cook wrote:

> On 11/27/17 23:03, Tom Lane wrote:
>>
>> Note that the sample data has a lot of collisions:
>>
>> regression=# select hashint8(val), count(*) from reproducer group by 1
>> order by 2 desc;
>>    hashint8   | count
>> -------------+-------
>>     441526644 |  2337
>>   -1117776826 |  1221
>>   -1202007016 |   935
>>   -2068831050 |   620
>>    1156644653 |   538
>>     553783815 |   510
>>     259780770 |   444
>>     371047036 |   394
>>     915722575 |   359
>>   ... etc etc ...
>
> In case it matters, the complete data set will have some outlier values
> with 10k to 100k collisions in this column.
>

In the original values? Not a big deal, I guess. It's the hashint8
collisions that are causing the infinite loop, i.e. different values with
the same hashint8 result.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Andres Freund
In reply to this post by Todd A. Cook
On 2017-11-28 09:55:13 -0500, Todd A. Cook wrote:

> On 11/27/17 23:03, Tom Lane wrote:
> >
> > Note that the sample data has a lot of collisions:
> >
> > regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
> >    hashint8   | count
> > -------------+-------
> >     441526644 |  2337
> >   -1117776826 |  1221
> >   -1202007016 |   935
> >   -2068831050 |   620
> >    1156644653 |   538
> >     553783815 |   510
> >     259780770 |   444
> >     371047036 |   394
> >     915722575 |   359
> >   ... etc etc ...
>
> In case it matters, the complete data set will have some outlier values with 10k to 100k
> collisions in this column.

To make sure we're on the same page, this is data intentionally created
to have a lot of hash collisions, is that right?

Greetings,

Andres Freund


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
On 11/28/17 13:14, Andres Freund wrote:

> On 2017-11-28 09:55:13 -0500, Todd A. Cook wrote:
>> On 11/27/17 23:03, Tom Lane wrote:
>>>
>>> Note that the sample data has a lot of collisions:
>>>
>>> regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
>>>     hashint8   | count
>>> -------------+-------
>>>      441526644 |  2337
>>>    -1117776826 |  1221
>>>    -1202007016 |   935
>>>    -2068831050 |   620
>>>     1156644653 |   538
>>>      553783815 |   510
>>>      259780770 |   444
>>>      371047036 |   394
>>>      915722575 |   359
>>>    ... etc etc ...
>>
>> In case it matters, the complete data set will have some outlier values with 10k to 100k
>> collisions in this column.
>
> To make sure we're on the same page, this is data intentionally created
> to have a lot of hash collisions, is that right?

More or less.  It might be more accurate to say that it's created in such a way that
we expect to get lots of collisions.

-- todd


RE: BUG #14891: Old cancel request presented by pgbouncer honored after skipping a query.

postgresql-11
In reply to this post by postgresql-11
Well, taking pgbouncer down to a 60 second connection lifetime was
insufficient to completely alleviate this. We had another transaction
get killed yesterday due to a different query being cancelled. I'm still
unsure how this isn't affecting other people.

At this point, we're going to change our application to just abandon the
current postgres connection when a user hits cancel instead of actually
sending the cancel request. Ugly hack, but I'm not sure what else to do
at this point. I'm still open to suggestions though. Or really, any
discussion about this issue at all. Would this belong better on a
different list?

-----

The following bug has been logged on the website:

Bug reference:      14891
Logged by:          Skarsol
Email address:      postgresql(at)skarsol(dot)com
PostgreSQL version: 9.6.3
Operating system:   Linux 4.4.8-hardened-r1 #4 SMP Mon Jun 12
Description:

This might be a symptom of the issue discussed in the ToDo "Changes to
make
cancellations more reliable and more secure" but as it is related to the
pgbouncer bug I've opened at
https://github.com/pgbouncer/pgbouncer/issues/245 I figured I'd post it
over
here just to make sure.

As the last step of this bug, pgbouncer 1.7.2 presents a cancel request
to
postgres 9.6.3. This request targets pid 29330 which is connected to
pgbouncer on port 33024. That pid then accepts a new query, returns a
result
set, accepts another new query, and then cancels that one out.

Expected behavior would have been for either no cancel (as that pid was
between queries at the time) or to cancel the first query. Cancelling
the
2nd query is just weird (to me).

I have no idea how much of this is related to whatever pgbouncer is
doing to
delay the cancel in the first place before presenting it to postgres.

I'm aware that we're 2 minor versions behind, but I don't see anything
that
seems relevant to this in the changelogs.

Image of the relevant wireshark display at
https://user-images.githubusercontent.com/1915152/32578433-d5d4a71c-c4a2-11e7-9d25-f59d5afbb06b.jpg


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
In reply to this post by Tom Lane-2
On 11/27/17 23:03, Tom Lane wrote:

> Thomas Munro <[hidden email]> writes:
>> When SH_INSERT tries to insert that final extra value, insertdist
>> keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
>> double the size (at least until my computer gives up, somewhere around
>> 11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
>> to 26 then it succeeds, but I guess some other attack could be crafted
>> for that.  What is the theory behind this parameter?
>
> You beat me to it --- after looking at simplehash.h I'd guessed that
> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
> an infinite loop, but I'd not gotten to determining which one yet.
>
> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
> of them are obviously loop-proof.
>
> Note that the sample data has a lot of collisions:
>
> regression=# select hashint8(val), count(*) from reproducer group by 1 order by 2 desc;
>    hashint8   | count
> -------------+-------
>     441526644 |  2337
>   -1117776826 |  1221
>   -1202007016 |   935
>   -2068831050 |   620
>    1156644653 |   538
>     553783815 |   510
>     259780770 |   444
>     371047036 |   394
>     915722575 |   359
>   ... etc etc ...
>
> It's evidently more complicated than just that the code fails with
> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
> by a lot.  There needs to be a safety valve that prevents letting
> the hash fill factor approach zero, which is what's happening in
> this test case.

FWIW, I can also reproduce the infinite loop with 167834 unique values.

It kinda looks like the problematic collisions arise from masking the
computed hash values; e.g.:

     SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
     {
        return hash & tb->sizemask;
     }

(Also FWIW, changing SH_FILLFACTOR to 0.5 (from 0.9) did not help any.)

-- todd


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Tomas Vondra-4


On 12/05/2017 10:23 PM, Todd A. Cook wrote:

> On 11/27/17 23:03, Tom Lane wrote:
>> Thomas Munro <[hidden email]> writes:
>>> When SH_INSERT tries to insert that final extra value, insertdist
>>> keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
>>> double the size (at least until my computer gives up, somewhere around
>>> 11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
>>> to 26 then it succeeds, but I guess some other attack could be crafted
>>> for that.  What is the theory behind this parameter?
>>
>> You beat me to it --- after looking at simplehash.h I'd guessed that
>> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
>> an infinite loop, but I'd not gotten to determining which one yet.
>>
>> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
>> of them are obviously loop-proof.
>>
>> Note that the sample data has a lot of collisions:
>>
>> regression=# select hashint8(val), count(*) from reproducer group by 1
>> order by 2 desc;
>>    hashint8   | count
>> -------------+-------
>>     441526644 |  2337
>>   -1117776826 |  1221
>>   -1202007016 |   935
>>   -2068831050 |   620
>>    1156644653 |   538
>>     553783815 |   510
>>     259780770 |   444
>>     371047036 |   394
>>     915722575 |   359
>>   ... etc etc ...
>>
>> It's evidently more complicated than just that the code fails with
>> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
>> by a lot.  There needs to be a safety valve that prevents letting
>> the hash fill factor approach zero, which is what's happening in
>> this test case.
>
> FWIW, I can also reproduce the infinite loop with 167834 unique values.
>

Unique values or unique *hash* values?

Can you share the data, so that whoever fixes the bug can verify it also
fixes your example?

> It kinda looks like the problematic collisions arise from masking the
> computed hash values; e.g.:
>
>     SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
>     {
>     return hash & tb->sizemask;
>     }
>

That's a pretty standard way to do modulo (when computing the index of the
hash bucket), and certainly not the root cause.

> (Also FWIW, changing SH_FILLFACTOR to 0.5 (from 0.9) did not help any.)
>

Based on the initial discussion, the problem here is twofold.

Firstly, the code assumes that if it hits a certain number of bucket
collisions (essentially, the initial bucket and a certain number of
neighboring buckets already being full), making the table larger is
guaranteed to reduce the number of collisions.

That is obviously untrue for duplicate hash values, but it can also
happen for distinct hash values that form a chain (think of a sequence of
hash values 1,2,3,4,5,6,7,...,K - growing is never going to break that up).

This causes the runaway growth to the largest possible hash table.

Secondly, there's a bug: for the largest hash table we set sizemask=0,
which means we never get a bucket index other than 0.

This causes the infinite loop.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
On 12/05/17 17:01, Tomas Vondra wrote:

>
>
> On 12/05/2017 10:23 PM, Todd A. Cook wrote:
>> On 11/27/17 23:03, Tom Lane wrote:
>>> Thomas Munro <[hidden email]> writes:
>>>> When SH_INSERT tries to insert that final extra value, insertdist
>>>> keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
>>>> double the size (at least until my computer gives up, somewhere around
>>>> 11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
>>>> to 26 then it succeeds, but I guess some other attack could be crafted
>>>> for that.  What is the theory behind this parameter?
>>>
>>> You beat me to it --- after looking at simplehash.h I'd guessed that
>>> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
>>> an infinite loop, but I'd not gotten to determining which one yet.
>>>
>>> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
>>> of them are obviously loop-proof.
>>>
>>> Note that the sample data has a lot of collisions:
>>>
>>> regression=# select hashint8(val), count(*) from reproducer group by 1
>>> order by 2 desc;
>>>     hashint8   | count
>>> -------------+-------
>>>      441526644 |  2337
>>>    -1117776826 |  1221
>>>    -1202007016 |   935
>>>    -2068831050 |   620
>>>     1156644653 |   538
>>>      553783815 |   510
>>>      259780770 |   444
>>>      371047036 |   394
>>>      915722575 |   359
>>>    ... etc etc ...
>>>
>>> It's evidently more complicated than just that the code fails with
>>> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
>>> by a lot.  There needs to be a safety valve that prevents letting
>>> the hash fill factor approach zero, which is what's happening in
>>> this test case.
>>
>> FWIW, I can also reproduce the infinite loop with 167834 unique values.
>>
>
> Unique values or unique *hash* values?
Unique values.


> Can you share the data, so that whoever fixes the bug can verify it also
> fixes your example?

Sure.  It's attached.

-- todd

Attachment: reproducer_2.pgsql (4M)

Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Tomas Vondra-4


On 12/05/2017 11:33 PM, Todd A. Cook wrote:

> On 12/05/17 17:01, Tomas Vondra wrote:
>>
>>
>> On 12/05/2017 10:23 PM, Todd A. Cook wrote:
>>> On 11/27/17 23:03, Tom Lane wrote:
>>>> Thomas Munro <[hidden email]> writes:
>>>>> When SH_INSERT tries to insert that final extra value, insertdist
>>>>> keeps exceeding SH_GROW_MAX_DIB (25) no matter how many times we
>>>>> double the size (at least until my computer gives up, somewhere around
>>>>> 11 doublings and 75GB of virtual memory).  If you set SH_GROW_MAX_DIB
>>>>> to 26 then it succeeds, but I guess some other attack could be crafted
>>>>> for that.  What is the theory behind this parameter?
>>>>
>>>> You beat me to it --- after looking at simplehash.h I'd guessed that
>>>> either the SH_GROW_MAX_DIB or SH_GROW_MAX_MOVE code path was causing
>>>> an infinite loop, but I'd not gotten to determining which one yet.
>>>>
>>>> I'd ask what's the theory behind SH_GROW_MAX_MOVE, as well.  Neither
>>>> of them are obviously loop-proof.
>>>>
>>>> Note that the sample data has a lot of collisions:
>>>>
>>>> regression=# select hashint8(val), count(*) from reproducer group by 1
>>>> order by 2 desc;
>>>>     hashint8   | count
>>>> -------------+-------
>>>>      441526644 |  2337
>>>>    -1117776826 |  1221
>>>>    -1202007016 |   935
>>>>    -2068831050 |   620
>>>>     1156644653 |   538
>>>>      553783815 |   510
>>>>      259780770 |   444
>>>>      371047036 |   394
>>>>      915722575 |   359
>>>>    ... etc etc ...
>>>>
>>>> It's evidently more complicated than just that the code fails with
>>>> more than SH_GROW_MAX_DIB duplicate hashcodes, but I suspect not
>>>> by a lot.  There needs to be a safety valve that prevents letting
>>>> the hash fill factor approach zero, which is what's happening in
>>>> this test case.
>>>
>>> FWIW, I can also reproduce the infinite loop with 167834 unique values.
>>>
>>
>> Unique values or unique *hash* values?
>
> Unique values.
>
>
>> Can you share the data, so that whoever fixes the bug can verify it also
>> fixes your example?
>
> Sure.  It's attached.
>

It seems the dataset has pretty much the same issue as the one reported
before, that is:

select hashint8(val), count(distinct val), count(*) from temp_f_03 group
by 1 order by 2 desc;

      hashint8   | count | count
    -------------+-------+-------
     -1971396144 |    45 |    45
      2035896403 |    42 |    42
     -1633843397 |    30 |    30
      1425704662 |    29 |    29
      -455482779 |    22 |    22
      -300163565 |    17 |    17
     -1803963420 |    17 |    17
      -537082846 |    14 |    14
       603707034 |    13 |    13
      -176369887 |    12 |    12
      1274957136 |    11 |    11
      1465522632 |    11 |    11
     -1589862230 |    10 |    10
     -1145403239 |    10 |    10

i.e. there are many hash collisions (more than in the other data set).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

Todd A. Cook
On 12/06/17 08:33, Tomas Vondra wrote:

>
>>> Can you share the data, so that whoever fixes the bug can verify it also
>>> fixes your example?
>>
>> Sure.  It's attached.
>>
>
> Seems the dataset has pretty much the same issue as the one reported
> before, that is
>
> select hashint8(val), count(distinct val), count(*) from temp_f_03 group
> by 1 order by 2 desc;
>
>        hashint8   | count | count
>      -------------+-------+-------
>       -1971396144 |    45 |    45
>        2035896403 |    42 |    42
>       -1633843397 |    30 |    30
>        1425704662 |    29 |    29
>        -455482779 |    22 |    22
>        -300163565 |    17 |    17
>       -1803963420 |    17 |    17
>        -537082846 |    14 |    14
>         603707034 |    13 |    13
>        -176369887 |    12 |    12
>        1274957136 |    11 |    11
>        1465522632 |    11 |    11
>       -1589862230 |    10 |    10
>       -1145403239 |    10 |    10
>
> i.e. there are many hash collisions (more than in the other data set).

If hashint8() is ultimately invoked by TupleHashTableHash() in execGrouping.c,
it might be magnifying the difficulties here.  The least significant bits,
which are used as the bucket number in simplehash.h, are not very well
distributed:

select val, to_hex(val), to_hex(hashint8(val)) from temp_f_03 limit 15 ;
          val          |      to_hex      |  to_hex
----------------------+------------------+----------
   4444319256653758784 | 3dad64d121468140 | 805ffffe
    554179993563924608 | 7b0d7c49a018880  | 84dffffb
  -3383965646518123872 | d109bd2c6b2982a0 | 9c3ffff7
  -4706811054739454944 | beae0c48915f8420 | 191ffff6
    618200668902031424 | 8944a3ba5d08040  | 2a3ffff0
   5074043922812601024 | 466aa01079f982c0 | 7effffee
  -8783188184262212928 | 861bd8e1b9a482c0 | a6bfffea
  -4597800992953433792 | c031545b6b128140 | b1dfffea
   8563040839807173408 | 76d608465dde8320 | 7d9fffe6
   6092569112843158816 | 548d27540c888520 | 6f9fffe6
  -7313351060369079936 | 9a81c1f558f98180 | 68ffffe5
  -1786712428165627488 | e7345283536981a0 | 73ffffe5
  -6153596242570280896 | aa9a08d20e6b8040 | ac3fffd8
     88426174078092128 | 13a272306c58360  | b57fffd8
  -5305589938458295680 | b65ec20faa4e8280 | ba9fffd3
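
As a quick check of that observation (for illustration only; the bucket
computation is the hash & sizemask from SH_INITIAL_BUCKET quoted earlier,
and the constants are copied from the to_hex(hashint8(val)) column above):
under a 65536-bucket mask all fifteen hashes land within about 45 slots of
the top of the table, i.e. in one dense probing cluster.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* hashint8 outputs taken from the query output above */
        const uint32_t hashes[] = {
            0x805ffffe, 0x84dffffb, 0x9c3ffff7, 0x191ffff6, 0x2a3ffff0,
            0x7effffee, 0xa6bfffea, 0xb1dfffea, 0x7d9fffe6, 0x6f9fffe6,
            0x68ffffe5, 0x73ffffe5, 0xac3fffd8, 0xb57fffd8, 0xba9fffd3,
        };
        const uint32_t sizemask = (1u << 16) - 1;   /* 65536-bucket table */

        for (size_t i = 0; i < sizeof(hashes) / sizeof(hashes[0]); i++)
            printf("hash %08" PRIx32 " -> bucket %" PRIu32 "\n",
                   hashes[i], hashes[i] & sizemask);  /* all near 65535 */
        return 0;
    }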

-- todd
