BUG #16036: Segmentation fault while doing an update


BUG #16036: Segmentation fault while doing an update

PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16036
Logged by:          Антон Власов
Email address:      [hidden email]
PostgreSQL version: 12.0
Operating system:   PostgreSQL 12.0 (Debian 12.0-1.pgdg90+1) on x86_64
Description:        

I am getting a segmentation fault like this:

2019-10-04 02:30:02 MSK [16919-6] LOG:  server process (PID 18415) was
terminated by signal 11: Segmentation fault
2019-10-04 02:30:02 MSK [16919-7] DETAIL:  Failed process was running:
update tracking.test_session set list_position = list_position + 1 where id
= $1
2019-10-04 02:30:02 MSK [16919-8] LOG:  terminating any other active server
processes

This happens when several queries are running simultaneously. I tried to get some
info from the core dump, but this is all I have:

(gdb) bt
#0  GetMemoryChunkContext (pointer=0x0) at
./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
Backtrace stopped: Cannot access memory at address 0x7ffc990e7218


Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
Got a normal backtrace:

#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000564c3f007aca in heap_freetuple (htup=<optimized out>) at ./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000564c3f1b5889 in tts_buffer_heap_clear (slot=0x564c40650298) at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000564c3f1b5cbe in ExecClearTuple (slot=0x564c40650298) at ./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x564c4068a6b0, shouldFree=false) at ./build/../src/backend/executor/execTuples.c:1165
#6  0x0000564c3f1abc40 in ExecEndPlan (estate=0x564c40689740, planstate=<optimized out>) at ./build/../src/backend/executor/execMain.c:1560
#7  standard_ExecutorEnd (queryDesc=0x564c40663d80) at ./build/../src/backend/executor/execMain.c:496
#8  0x0000564c3f30a6e8 in ProcessQuery (plan=<optimized out>, sourceText=0x564c40663c60 "update tracking.test_session set list_position = list_position + 1 where id = $1", params=0x564c40663cf0, queryEnv=0x0, dest=0x564c3f8b1c00 <donothingDR>,
    completionTag=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:204
#9  0x0000564c3f30a8c3 in PortalRunMulti (portal=portal@entry=0x564c405cb3b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x564c3f8b1c00 <donothingDR>, dest@entry=0x564c4052dc90, altdest=0x564c3f8b1c00 <donothingDR>,
    altdest@entry=0x564c4052dc90, completionTag=completionTag@entry=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:1283
#10 0x0000564c3f30b4db in PortalRun (portal=portal@entry=0x564c405cb3b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x564c4052dc90, altdest=altdest@entry=0x564c4052dc90,
    completionTag=0x7ffd3557c720 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:796
#11 0x0000564c3f308d05 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x564c4052d880 "") at ./build/../src/backend/tcop/postgres.c:2090
#12 PostgresMain (argc=<optimized out>, argv=argv@entry=0x564c40578e80, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4297
#13 0x0000564c3effcbe1 in BackendRun (port=0x564c40574ba0, port=0x564c40574ba0) at ./build/../src/backend/postmaster/postmaster.c:4431
#14 BackendStartup (port=0x564c40574ba0) at ./build/../src/backend/postmaster/postmaster.c:4122
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#16 0x0000564c3f28fe81 in PostmasterMain (argc=5, argv=0x564c40524b70) at ./build/../src/backend/postmaster/postmaster.c:1377
#17 0x0000564c3effdea8 in main (argc=5, argv=0x564c40524b70) at ./build/../src/backend/main/main.c:228

> On Oct 4, 2019, at 3:29, PG Bug reporting form <[hidden email]> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      16036
> Logged by:          Антон Власов
> Email address:      [hidden email]
> PostgreSQL version: 12.0
> Operating system:   PostgreSQL 12.0 (Debian 12.0-1.pgdg90+1) on x86_64
> Description:        
>
> I am having segmentation fault like this:
>
> 2019-10-04 02:30:02 MSK [16919-6] LOG:  server process (PID 18415) was
> terminated by signal 11: Segmentation fault
> 2019-10-04 02:30:02 MSK [16919-7] DETAIL:  Failed process was running:
> update tracking.test_session set list_position = list_position + 1 where id
> = $1
> 2019-10-04 02:30:02 MSK [16919-8] LOG:  terminating any other active server
> processes
>
> When there are several queries running simultaneously. Tried to get some
> info from coredump, but that's all i have:
>
> (gdb) bt
> #0  GetMemoryChunkContext (pointer=0x0) at
> ./build/../src/include/utils/memutils.h:127
> #1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
> Backtrace stopped: Cannot access memory at address 0x7ffc990e7218
>




Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
In reply to this post by PG Bug reporting form
Hi,

On October 3, 2019 5:29:44 PM PDT, PG Bug reporting form <[hidden email]> wrote:

>The following bug has been logged on the website:
>
>Bug reference:      16036
>Logged by:          Антон Власов
>Email address:      [hidden email]
>PostgreSQL version: 12.0
>Operating system:   PostgreSQL 12.0 (Debian 12.0-1.pgdg90+1) on x86_64
>Description:        
>
>I am having segmentation fault like this:
>
>2019-10-04 02:30:02 MSK [16919-6] LOG:  server process (PID 18415) was
>terminated by signal 11: Segmentation fault
>2019-10-04 02:30:02 MSK [16919-7] DETAIL:  Failed process was running:
>update tracking.test_session set list_position = list_position + 1
>where id
>= $1
>2019-10-04 02:30:02 MSK [16919-8] LOG:  terminating any other active
>server
>processes
>
>When there are several queries running simultaneously. Tried to get
>some
>info from coredump, but that's all i have:
>
>(gdb) bt
>#0  GetMemoryChunkContext (pointer=0x0) at
>./build/../src/include/utils/memutils.h:127
>#1  pfree (pointer=0x0) at
>./build/../src/backend/utils/mmgr/mcxt.c:1033
>Backtrace stopped: Cannot access memory at address 0x7ffc990e7218

Hm, can you reproduce this? If so, any chance to package this up in a reproducible manner?

If not, could you check whether the problem vanishes if you set jit=off? If so, could you try getting a backtrace after setting jit_debugging_support=1?


Regards,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
Hello,

Looks like disabling jit didn’t affect backtrace at all:

#0  GetMemoryChunkContext (pointer=0x0) at ./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at ./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at ./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118) at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at ./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false) at ./build/../src/backend/executor/execTuples.c:1165
#6  0x0000555d7290ec40 in ExecEndPlan (estate=0x555d73db4aa0, planstate=<optimized out>) at ./build/../src/backend/executor/execMain.c:1560
#7  standard_ExecutorEnd (queryDesc=0x555d73ce1f40) at ./build/../src/backend/executor/execMain.c:496
#8  0x0000555d72a6d6e8 in ProcessQuery (plan=<optimized out>, sourceText=0x555d73ce1e20 "update tracking.test_session set list_position = list_position + 1 where id = $1", params=0x555d73ce1eb0, queryEnv=0x0, dest=0x555d73014c00 <donothingDR>,
    completionTag=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:204
#9  0x0000555d72a6d8c3 in PortalRunMulti (portal=portal@entry=0x555d73c4d3f0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x555d73014c00 <donothingDR>, dest@entry=0x555d73bafc90, altdest=0x555d73014c00 <donothingDR>,
    altdest@entry=0x555d73bafc90, completionTag=completionTag@entry=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:1283
#10 0x0000555d72a6e4db in PortalRun (portal=portal@entry=0x555d73c4d3f0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=<optimized out>, dest=dest@entry=0x555d73bafc90, altdest=altdest@entry=0x555d73bafc90,
    completionTag=0x7ffdf0913200 "UPDATE 0") at ./build/../src/backend/tcop/pquery.c:796
#11 0x0000555d72a6bd05 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x555d73baf880 "") at ./build/../src/backend/tcop/postgres.c:2090
#12 PostgresMain (argc=<optimized out>, argv=argv@entry=0x555d73bfb2e0, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4297
#13 0x0000555d7275fbe1 in BackendRun (port=0x555d73bf6be0, port=0x555d73bf6be0) at ./build/../src/backend/postmaster/postmaster.c:4431
#14 BackendStartup (port=0x555d73bf6be0) at ./build/../src/backend/postmaster/postmaster.c:4122
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1704
#16 0x0000555d729f2e81 in PostmasterMain (argc=5, argv=0x555d73ba6b70) at ./build/../src/backend/postmaster/postmaster.c:1377
#17 0x0000555d72760ea8 in main (argc=5, argv=0x555d73ba6b70) at ./build/../src/backend/main/main.c:228

> On Oct 4, 2019, at 4:09, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On October 3, 2019 5:29:44 PM PDT, PG Bug reporting form <[hidden email]> wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      16036
>> Logged by:          Антон Власов
>> Email address:      [hidden email]
>> PostgreSQL version: 12.0
>> Operating system:   PostgreSQL 12.0 (Debian 12.0-1.pgdg90+1) on x86_64
>> Description:        
>>
>> I am having segmentation fault like this:
>>
>> 2019-10-04 02:30:02 MSK [16919-6] LOG:  server process (PID 18415) was
>> terminated by signal 11: Segmentation fault
>> 2019-10-04 02:30:02 MSK [16919-7] DETAIL:  Failed process was running:
>> update tracking.test_session set list_position = list_position + 1
>> where id
>> = $1
>> 2019-10-04 02:30:02 MSK [16919-8] LOG:  terminating any other active
>> server
>> processes
>>
>> When there are several queries running simultaneously. Tried to get
>> some
>> info from coredump, but that's all i have:
>>
>> (gdb) bt
>> #0  GetMemoryChunkContext (pointer=0x0) at
>> ./build/../src/include/utils/memutils.h:127
>> #1  pfree (pointer=0x0) at
>> ./build/../src/backend/utils/mmgr/mcxt.c:1033
>> Backtrace stopped: Cannot access memory at address 0x7ffc990e7218
>
> Hm, can you reproduce this? If so, any chance to package this up in a reproducible manner?
>
> If not, could you check whether the problem vanishes if you set jit=off? If so, could you try getting a backtrace after setting jit_debugging_support=1?
>
>
> Regards,
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
Hi,

On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <[hidden email]> wrote:

>Hello,
>
>Looks like disabling jit didn’t affect backtrace at all:
>
>0  GetMemoryChunkContext (pointer=0x0) at
>./build/../src/include/utils/memutils.h:127
>#1  pfree (pointer=0x0) at
>./build/../src/backend/utils/mmgr/mcxt.c:1033
>#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at
>./build/../src/backend/access/common/heaptuple.c:1340
>#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118)
>at ./build/../src/backend/executor/execTuples.c:652
>#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at
>./build/../src/include/executor/tuptable.h:428
>#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false)

That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?

Are there any details that you can provide? Schema? Any extensions in use?

Does the problem happen always, or just under concurrency?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
Hello,

This test script always causes a crash:

postgres@gp-app01:~$ cat test.sh
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &

create table tracking.test_result
(
        id serial not null
                constraint test_result_pkey
                        primary key,
        session_id integer
                constraint test_result_session_id_fkey
                        references tracking.test_session,
        reported_at timestamp(0) default now(),
        courier varchar(25) not null,
        tracking_number varchar(60) not null,
        status varchar(25) not null,
        fail_combo integer default 0 not null,
        response_test text,
        response_real text,
        response_time real,
        code smallint,
        message text,
        retry integer
);

create index test_result_session_id_idx
        on tracking.test_result (session_id);

I didn't finish isolating the bug from my database, so I'm not sure whether it will work for you. Single queries work as usual; only concurrent ones cause the crash.


> On Oct 4, 2019, at 4:20, Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <[hidden email]> wrote:
>> Hello,
>>
>> Looks like disabling jit didn’t affect backtrace at all:
>>
>> 0  GetMemoryChunkContext (pointer=0x0) at
>> ./build/../src/include/utils/memutils.h:127
>> #1  pfree (pointer=0x0) at
>> ./build/../src/backend/utils/mmgr/mcxt.c:1033
>> #2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at
>> ./build/../src/backend/access/common/heaptuple.c:1340
>> #3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118)
>> at ./build/../src/backend/executor/execTuples.c:652
>> #4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at
>> ./build/../src/include/executor/tuptable.h:428
>> #5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false)
>
> That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?
>
> Are there any details that you can provide? Schema? Any extensions in use?
>
> Does the problem happen always, or just under concurrency?
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
I tried with only two concurrent queries and the crash is still present, so it's not about the total query count.

> On Oct 4, 2019, at 4:24, Антон Власов <[hidden email]> wrote:
>
> Hello,
>
> This test script always causes a crash:
>
> postgres@gp-app01:~$ cat test.sh
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
> psql -d gdeposylka -c "update tracking.test_session set list_position = list_position + 1 where id = 85638;" &
>
> create table tracking.test_result
> (
> id serial not null
> constraint test_result_pkey
> primary key,
> session_id integer
> constraint test_result_session_id_fkey
> references tracking.test_session,
> reported_at timestamp(0) default now(),
> courier varchar(25) not null,
> tracking_number varchar(60) not null,
> status varchar(25) not null,
> fail_combo integer default 0 not null,
> response_test text,
> response_real text,
> response_time real,
> code smallint,
> message text,
> retry integer
> );
>
> create index test_result_session_id_idx
> on tracking.test_result (session_id);
>
> Didn’t finish isolating bug from my database so i’m not sure if it work out for you. Single queries work as usual, only concurrent ones cause crash.
>
>
>> On Oct 4, 2019, at 4:20, Andres Freund <[hidden email]> wrote:
>>
>> Hi,
>>
>> On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <[hidden email]> wrote:
>>> Hello,
>>>
>>> Looks like disabling jit didn’t affect backtrace at all:
>>>
>>> 0  GetMemoryChunkContext (pointer=0x0) at
>>> ./build/../src/include/utils/memutils.h:127
>>> #1  pfree (pointer=0x0) at
>>> ./build/../src/backend/utils/mmgr/mcxt.c:1033
>>> #2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at
>>> ./build/../src/backend/access/common/heaptuple.c:1340
>>> #3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118)
>>> at ./build/../src/backend/executor/execTuples.c:652
>>> #4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at
>>> ./build/../src/include/executor/tuptable.h:428
>>> #5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false)
>>
>> That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?
>>
>> Are there any details that you can provide? Schema? Any extensions in use?
>>
>> Does the problem happen always, or just under concurrency?
>>
>> Andres
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>




Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
In reply to this post by Andres Freund
I tried with the default Debian configuration - the problem persists. I got the full backtrace after reducing shared_buffers from 18G to 1G. I tried to simulate the problem in an artificial environment, but could not reproduce it there.
On real data it is 100% repeatable.

On Oct 4, 2019, at 4:20, Andres Freund <[hidden email]> wrote:

Hi, 

On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <[hidden email]> wrote:
Hello,

Looks like disabling jit didn’t affect backtrace at all:

0  GetMemoryChunkContext (pointer=0x0) at
./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at
./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at
./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118)
at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at
./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false)

That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?

Are there any details that you can provide? Schema? Any extensions in use? 

Does the problem happen always, or just under concurrency?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
In reply to this post by Andres Freund
At last, I've found the source of the problem.

I have a trigger:

create function test_session_bu() returns trigger
    language plpgsql
as
$$
begin
    if new.session_end is not null and old.session_end is null then
        new.is_finished := true;
        new.success_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'success');
        new.failing_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'failing');
        new.fail_count := (select count(*) from tracking.test_result where session_id = new.id and status = 'fail');
    end if;
    return new;
end;
$$;

create trigger test_session_bu
    before update
    on tracking.test_session
    for each row
execute procedure tracking.test_session_bu();

Everything works fine without it; when I put the trigger back, the problem returns too.

As you can see, session_end is not affected by the queries, so I tried to modify the trigger to:

create or replace function test_session_bu() returns trigger
    language plpgsql
as
$$
begin
    return new;
end;
$$;
And the problem was still in place.


On Oct 4, 2019, at 4:20, Andres Freund <[hidden email]> wrote:

Hi, 

On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <[hidden email]> wrote:
Hello,

Looks like disabling jit didn’t affect backtrace at all:

0  GetMemoryChunkContext (pointer=0x0) at
./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at
./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at
./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118)
at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at
./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false)

That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?

Are there any details that you can provide? Schema? Any extensions in use? 

Does the problem happen always, or just under concurrency?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: BUG #16036: Segmentation fault while doing an update

Антон Власов
In reply to this post by Andres Freund
Hello!

Reproducible schema:

create table test_table (
    id serial, field int default 0 not null
);

create or replace function test_table_bu() returns trigger
    language plpgsql
as
$$
begin
    return new;
end;
$$;

create trigger test_table_bu
    before update
    on test_table
    for each row
execute procedure test_table_bu();

insert into test_table (field) values (0);
test.sh:

psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &
psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &

On Oct 4, 2019, at 4:20, Andres Freund <[hidden email]> wrote:

Hi, 

On October 3, 2019 6:14:33 PM PDT, "Антон Власов" <[hidden email]> wrote:
Hello,

Looks like disabling jit didn’t affect backtrace at all:

0  GetMemoryChunkContext (pointer=0x0) at
./build/../src/include/utils/memutils.h:127
#1  pfree (pointer=0x0) at
./build/../src/backend/utils/mmgr/mcxt.c:1033
#2  0x0000555d7276aaca in heap_freetuple (htup=<optimized out>) at
./build/../src/backend/access/common/heaptuple.c:1340
#3  0x0000555d72918889 in tts_buffer_heap_clear (slot=0x555d73dbb118)
at ./build/../src/backend/executor/execTuples.c:652
#4  0x0000555d72918cbe in ExecClearTuple (slot=0x555d73dbb118) at
./build/../src/include/executor/tuptable.h:428
#5  ExecResetTupleTable (tupleTable=0x555d73db5a10, shouldFree=false)

That's good - I was only looking at that because of the truncated backtrace. Did you do anything to make that work better this time?

Are there any details that you can provide? Schema? Any extensions in use? 

Does the problem happen always, or just under concurrency?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
Hi,

On 2019-10-04 05:11:16 +0300, Антон Власов wrote:

> Reproducible schema:
>
> create table test_table (
>     id serial, field int default 0 not null
> );
>
> create or replace function test_table_bu() returns trigger
>     language plpgsql
> as
> $$
> begin
>   return new;
> end;
> $$;
>
> create trigger test_table_bu
>     before update
>     on test_table
>     for each row
> execute procedure test_table_bu();
>
> insert into test_table (field) values (0);
> test.sh:
>
> psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &
> psql -d gdeposylka -c "update test_table set field = field + 1 where id = 1;" &

Thanks, that's very helpful!

I've started to narrow down the problem:

bool
ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
                                         ResultRelInfo *relinfo,
                                         ItemPointer tupleid,
                                         HeapTuple fdw_trigtuple,
                                         TupleTableSlot *newslot)
{
...

                /*
                 * In READ COMMITTED isolation level it's possible that target tuple
                 * was changed due to concurrent update.  In that case we have a raw
                 * subplan output tuple in newSlot, and need to run it through the
                 * junk filter to produce an insertable tuple.
                 *
                 * Caution: more than likely, the passed-in slot is the same as the
                 * junkfilter's output slot, so we are clobbering the original value
                 * of slottuple by doing the filtering.  This is OK since neither we
                 * nor our caller have any more interest in the prior contents of that
                 * slot.
                 */
                if (newSlot != NULL)
                {
                        TupleTableSlot *slot = ExecFilterJunk(relinfo->ri_junkFilter, newSlot);

                        ExecCopySlot(newslot, slot);
                }

The problem is that 'newslot' and 'slot' are the same. As copying first
clears the target slot, that leaves the tuple slot in a bogus state, with
BufferHeapTupleTableSlot->base.tuple = NULL, but TTS_FLAG_EMPTY not set
and TTS_FLAG_SHOULDFREE still set.

The only reason that doesn't run into the assert in
ExecCopySlotHeapTuple() that the slot must not be empty is that we've
already cleared that flag on the destination slot (which is also the
source slot) at that point...

                dstslot->tts_flags &= ~TTS_FLAG_EMPTY;
                oldContext = MemoryContextSwitchTo(dstslot->tts_mcxt);
                bdstslot->base.tuple = ExecCopySlotHeapTuple(srcslot);
                MemoryContextSwitchTo(oldContext);

So I think what we need to do is:
1) only call ExecCopySlot() if slot != newslot - this fixes the crash
2) Add an assert to ExecCopySlot() ensuring source and target slot are
   distinct
3) Figure out why our tests didn't catch this, this afaict should be a
   tested scenario
4) Fix confusing variable naming around newSlot/newslot in ExecBRUpdateTriggers
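
To make point 1 concrete, here is a minimal sketch of the guard at the call site quoted above (variable names follow that excerpt; this is an illustrative sketch, not necessarily the final committed patch):

		if (newSlot != NULL)
		{
			TupleTableSlot *slot = ExecFilterJunk(relinfo->ri_junkFilter, newSlot);

			/*
			 * Sketch only: skip the copy when the junk filter hands back
			 * the very slot we were asked to fill, since copying a slot
			 * onto itself first clears it and then reads the freed tuple.
			 */
			if (slot != newslot)
				ExecCopySlot(newslot, slot);
		}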


Greetings,

Andres Freund



Re: BUG #16036: Segmentation fault while doing an update

Tom Lane
Andres Freund <[hidden email]> writes:
> So I think what we need to do is:
> 1) only call ExecCopySlot() if slot != newslot - this fixes the crash
> 2) Add an assert to ExecCopySlot() ensuring source and target slot are
>    distinct
> 3) Figure out why our tests didn't catch this, this afaict should be a
>    tested scenario
> 4) Fix confusing variable naming around newSlot/newslot in ExecBRUpdateTriggers

Maybe instead of 1+2, change ExecCopySlot like this:

- dstslot->tts_ops->copyslot(dstslot, srcslot);
+ if (dstslot != srcslot)
+    dstslot->tts_ops->copyslot(dstslot, srcslot);

That would fix other similar instances, not just this one caller.

                        regards, tom lane



Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
Hi,

On 2019-10-03 23:26:01 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > So I think what we need to do is:
> > 1) only call ExecCopySlot() if slot != newslot - this fixes the crash
> > 2) Add an assert to ExecCopySlot() ensuring source and target slot are
> >    distinct
> > 3) Figure out why our tests didn't catch this, this afaict should be a
> >    tested scenario
> > 4) Fix confusing variable naming around newSlot/newslot in ExecBRUpdateTriggers
>
> Maybe instead of 1+2, change ExecCopySlot like this:
>
> - dstslot->tts_ops->copyslot(dstslot, srcslot);
> + if (dstslot != srcslot)
> +    dstslot->tts_ops->copyslot(dstslot, srcslot);
>
> That would fix other similar instances, not just this one caller.

Yea, I was wondering about that too - but asking for a slot to be copied
into itself seems to indicate a logic bug more often than not? Expecting
the slots to be independent, afterwards. So I'm a bit hesitant doing so.

Greetings,

Andres Freund



Re: BUG #16036: Segmentation fault while doing an update

Tom Lane
Andres Freund <[hidden email]> writes:
> Yea, I was wondering about that too - but asking for a slot to be copied
> into itself seems to indicate a logic bug more often than not? Expecting
> the slots to be independent, afterwards. So I'm a bit hesitant doing so.

Oh, good point.  The Assert sounds better then.
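
For reference, the helper would then look roughly like this - a sketch following the general shape of ExecCopySlot() in src/include/executor/tuptable.h with the proposed assertion added; the actual committed change may differ in detail:

static inline TupleTableSlot *
ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	/* Sketch of the agreed-on change: a self-copy indicates a caller bug. */
	Assert(srcslot != dstslot);

	dstslot->tts_ops->copyslot(dstslot, srcslot);

	return dstslot;
}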

                        regards, tom lane



Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
In reply to this post by Tom Lane-2
Hi,

On 2019-10-03 23:26:01 -0400, Tom Lane wrote:

> Andres Freund <[hidden email]> writes:
> > So I think what we need to do is:
> > 1) only call ExecCopySlot() if slot != newslot - this fixes the crash
> > 2) Add an assert to ExecCopySlot() ensuring source and target slot are
> >    distinct
> > 3) Figure out why our tests didn't catch this, this afaict should be a
> >    tested scenario
> > 4) Fix confusing variable naming around newSlot/newslot in ExecBRUpdateTriggers
>
> Maybe instead of 1+2, change ExecCopySlot like this:
>
> - dstslot->tts_ops->copyslot(dstslot, srcslot);
> + if (dstslot != srcslot)
> +    dstslot->tts_ops->copyslot(dstslot, srcslot);
>
> That would fix other similar instances, not just this one caller.
Here's 1, 2 & 4, working on fixing 3).  While we now have OK-ish
coverage for nodeModifyTable.c and nodeLockRows.c triggered EPQ, there's
not much for the trigger case.

Greetings,

Andres Freund

Attachment: wip-fix-16036.patch (3K)

Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
Hi,

On 2019-10-03 21:30:21 -0700, Andres Freund wrote:
> Here's 1, 2 & 4, working on fixing 3).  While we now have OK-ish
> coverage for nodeModifyTable.c and nodeLockRows.c triggered EPQ, there's
> not much for the trigger case.

Here's both the fix and tests.

I think the test adds a good bit of coverage. What's lacking is primarily
testing foreign table related paths, but that seems separate enough, and
would need to be done in a separate contrib/ test anyway.

I think there's a good case to be made to backpatch the tests further
than 12, but I'm not sure about it? They do pass (with one error message
about a failure to delete changed to a failure to update, we didn't use
to be able to discern) back to 9.6, requiring

commit 38f8bdcac4982215beb9f65a19debecaf22fd470
Author: Robert Haas <[hidden email]>
Date:   2016-02-11 08:36:30 -0500

    Modify the isolation tester so that multiple sessions can wait.

Greetings,

Andres Freund

Attachment: v2-0001-Fix-crash-caused-by-EPQ-happening-with-a-before-u.patch (5K)
Attachment: v2-0002-Add-isolation-tests-for-the-combination-of-EPQ-an.patch (132K)

Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
Hi,

Thanks again for the bug report! I've pushed the fix, and tests. They'll
be included in 12.1 (which I assume will happen relatively soon), or you
can compile a slightly newer version yourself.

Regards,

Andres


On 2019-10-04 12:28:48 -0700, Andres Freund wrote:
> On 2019-10-03 21:30:21 -0700, Andres Freund wrote:
> > Here's 1, 2 & 4, working on fixing 3).  While we now have OK-ish
> > coverage for nodeModifyTable.c and nodeLockRows.c triggered EPQ, there's
> > not much for the trigger case.

> Here's both the fix and tests.
>
> I think the test adds a good bit coverage. What's lacking is primarily
> testing foreign table related paths, but that seems separate enough, and
> would need to be done in a separate contrib/ test anyway.
>
> I think there's a good case to be made to backpatch the tests further
> than 12, but I'm not sure about it? They do pass (with one error message
> about a failure to delete changed to a failure to update, we didn't use
> to be able to discern) back to 9.6, requiring

I did push the tests back to 9.6.

Greetings,

Andres Freund



Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
Hi,

On 2019-10-04 14:43:00 -0700, Andres Freund wrote:
> > I think there's a good case to be made to backpatch the tests further
> > than 12, but I'm not sure about it? They do pass (with one error message
> > about a failure to delete changed to a failure to update, we didn't use
> > to be able to discern) back to 9.6, requiring
>
> I did push the tests back to 9.6.

There's a few ordering violations in the tests, e.g.:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-10-04%2021%3A51%3A13

 key-a          val-a-s1-ups1-ups1
 step s1_c: COMMIT;
-s3: NOTICE:  upd: text key-a = text key-a: t
-s3: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
-s3: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
-s3: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
-step s3_upd_a_data: <... completed>
+s2: NOTICE:  upd: text key-a = text key-a: t
+s2: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
+s2: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
+s2: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
+step s2_upsert_a_data: <... completed>
 key            data          

I was under the assumption that it'd be deterministic who gets to
continue with a speculative insertion, but that ain't so.

Peter, do you see an easy way around that? Do you consider that a
problem? ISTM we ought to make this fair, but that doing so would
require a bit bigger changes than we'd want to make in the backbranches?

Unless somebody comes up with an idea, I'll disable (or rewrite) the
"three way" tests involving two waiters for one insertion.

Regards,

Andres



Re: BUG #16036: Segmentation fault while doing an update

Peter Geoghegan
On Fri, Oct 4, 2019 at 3:24 PM Andres Freund <[hidden email]> wrote:
> I was under the assumption that it'd be deterministic who gets to
> continue with a speculative insertion, but that ain't so.

There is no heap_lock_tuple() style lock arbitration built in to
speculative insertions:

https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards

I do not consider this to be a problem.

> Peter, do you see an easy way around that? Do you consider that a
> problem? ISTM we ought to make this fair, but that doing so would
> require a bit bigger changes that we'd want to make in the backbranches?

It would also require formally defining "fair", which doesn't seem
straightforward to me. The MVCC snapshot is barely involved at all.

--
Peter Geoghegan



Re: BUG #16036: Segmentation fault while doing an update

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2019-10-04 15:24:37 -0700, Andres Freund wrote:

> On 2019-10-04 14:43:00 -0700, Andres Freund wrote:
> > > I think there's a good case to be made to backpatch the tests further
> > > than 12, but I'm not sure about it? They do pass (with one error message
> > > about a failure to delete changed to a failure to update, we didn't use
> > > to be able to discern) back to 9.6, requiring
> >
> > I did push the tests back to 9.6.
>
> There's a few ordering violations in the tests, e.g.:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2019-10-04%2021%3A51%3A13

I've now disabled that portion of the test. It looks to me like this
might be a subtle upsert bug.  What happens is that in the
isolationtester sequence (all operating on one row, and all already in a
transaction):
    "s1_upd_a_data" "s3_upd_a_data" "s2_upsert_a_data" "s1_upd_a_data" "s1_c" "s3_del_a" "s3_c" "s2_c"

sometimes the s2_upsert_a_data gets to update the row before the
s3_upd_a_data. That does *not* happen with a plain update, even though
there's at that point always a target row to operate on (i.e. this
happens before s3_del_a is reached). I think the cause for that might be
that the check_exclusion_or_unique_constraint() check done for upsert
(via ExecCheckIndexConstraints() in ExecInsert()) does a
XactLockTableWait() without, like heapam.c would, ordering via a tuple
lock.

Since it's clearly not 12 specific, and not really related to triggers /
EPQ, I think disabling it for now is a sensible choice.


>  key-a          val-a-s1-ups1-ups1
>  step s1_c: COMMIT;
> -s3: NOTICE:  upd: text key-a = text key-a: t
> -s3: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
> -s3: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
> -s3: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-ups3)
> -step s3_upd_a_data: <... completed>
> +s2: NOTICE:  upd: text key-a = text key-a: t
> +s2: NOTICE:  upk: text val-a-s1-ups1-ups1 <> text mismatch: t
> +s2: NOTICE:  trigger: name rep_b_u; when: BEFORE; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
> +s2: NOTICE:  trigger: name rep_a_u; when: AFTER; lev: ROWs; op: UPDATE; old: (key-a,val-a-s1-ups1-ups1) new: (key-a,val-a-s1-ups1-ups1-upserts2)
> +step s2_upsert_a_data: <... completed>
>  key            data          
>
> I was under the assumption that it'd be deterministic who gets to
> continue with a speculative insertion, but that ain't so.

This was a preliminary theory, a wrong one: Speculative insertions
aren't actually involved, see above...

Greetings,

Andres Freund

