BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

apt.postgresql.org Repository Update
The following bug has been logged on the website:

Bug reference:      15720
Logged by:          Jason Madden
Email address:      [hidden email]
PostgreSQL version: 11.2
Operating system:   Red Hat Enterprise Linux 7.6
Description:        

Version: PostgreSQL 11.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-36), 64-bit

As described in the subject, we seem to have produced a rare, intermittent
issue in a trigger when calling a procedure that modifies a declaratively
hash partitioned table which results in executing a trigger referencing the
transition table. In several weeks of fairly heavy testing of our
application and this specific code path we've only produced this error once
(during a large automated concurrent run, which succeeded on the second
attempt), so sadly I don't have clear reproduction steps. Unfortunately, by
the time this got reported to me backend log information was no longer
available so I can only report what we saw on the fronted.

Here's the error, as reported from the JDBC driver (names shortened for
clarity):

```
SQL failure calling: SELECT * FROM wip_upload_finish(...)
...
Caused by: org.postgresql.util.PSQLException: ERROR: executor could not find
named tuplestore "updated_positions"
  Where: SQL statement "UPDATE w_instances pi
    SET last_modified = NOW()
    FROM updated_positions up
    WHERE pi.object_id = up.p_instance_id"
PL/pgSQL function wip_update_AC_trigger_func() line 24 at SQL statement
SQL statement "
      DELETE FROM wip
      WHERE p_instance_id = (
        SELECT object_id
        FROM w_instances
        WHERE c1 = $1 and c2 = $2
      )
    "
```

`wip` is a declaratively hash partitioned table with ten partitions:
```
CREATE TABLE wip (
    object_id object_id_type NOT NULL DEFAULT nextval('seq_object_id'),
    p_instance_id object_id_type NOT NULL,
    sm_id object_id_type NOT NULL,
    csn TEXT,
    tags jsonb,
    CONSTRAINT wip_pkey
             PRIMARY KEY (object_id, p_instance_id),
)
  PARTITION BY HASH(p_instance_id);
```

It has four statement triggers on it, two each for INSERT and DELETE, all of
which reference a transition table:

```
CREATE TRIGGER wip_update_AC_delete_trigger
  AFTER DELETE
  ON wip
  REFERENCING OLD TABLE AS updated_positions
  FOR EACH STATEMENT
  EXECUTE PROCEDURE wip_update_AC_trigger_func();

CREATE TRIGGER wip_update_AC_insert_trigger
  AFTER INSERT
  ON wip
  REFERENCING NEW TABLE AS updated_positions
  FOR EACH STATEMENT
  EXECUTE PROCEDURE wip_update_AC_trigger_func();
```

The trigger function references the transition table in a loop, and that
worked. After the loop there's a statement that resulted in the error (line
numbers should match up):

```
CREATE OR REPLACE FUNCTION wip_update_AC_trigger_func()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$$
DECLARE
  rec RECORD;
BEGIN
  -- comment
  -- comment
  -- comment
  -- comment
  FOR rec IN SELECT DISTINCT p_instance_id FROM updated_positions LOOP

    UPDATE w_instances pi
    SET c1 = COALESCE(
      (SELECT c1
       FROM view1 ap
       WHERE ap.p_instance_id = rec.p_instance_id),
     pi.c1)
    WHERE object_id = rec.p_instance_id
    AND EXISTS (SELECT 1 FROM wip WHERE p_instance_id =
rec.p_instance_id);
  END LOOP;

  -- comment
  -- comment
  -- comment
  UPDATE w_instances pi
  SET last_modified = NOW()
  FROM updated_positions up -- ERROR line
  WHERE pi.object_id = up.p_instance_id;

  RETURN NULL;
END;
$$;
```

The other pair of triggers (named `wip_stats_delete_trigger` and
`wip_stats_delete_trigger`) are similarly defined as AFTER STATEMENT
triggers that call a function passing the transition table. This function
*only* contains a `FOR rec IN SELECT ... FROM updated_positions` LOOP.

I believe the JIT was off.

I apologize for the dearth of information I'm able to provide and for the
lack of a simple reproducible example; I know that makes for an annoying bug
report. I also apologize if our app is doing something wrong or if this is a
known issue of some sort; I wasn't able to find anything related outside the
PostgreSQL source code itself. I'm happy to try to provide any other helpful
information.

Thanks,
Jason

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

Thomas Munro-5
On Fri, Mar 29, 2019 at 4:49 AM PG Bug reporting form
<[hidden email]> wrote:

> SQL failure calling: SELECT * FROM wip_upload_finish(...)
> ...
> Caused by: org.postgresql.util.PSQLException: ERROR: executor could not find
> named tuplestore "updated_positions"
>   Where: SQL statement "UPDATE w_instances pi
>     SET last_modified = NOW()
>     FROM updated_positions up
>     WHERE pi.object_id = up.p_instance_id"
> PL/pgSQL function wip_update_AC_trigger_func() line 24 at SQL statement
> SQL statement "
>       DELETE FROM wip
>       WHERE p_instance_id = (
>         SELECT object_id
>         FROM w_instances
>         WHERE c1 = $1 and c2 = $2
>       )
>     "

Hi,

I haven't had time to look into this properly yet but a couple of
ideas occurred to me just from the description: I wonder if there is a
rare case of row movement in your workload (ie an update that moves a
row from one partition to another), and we have a bug in the
transition table code in that path.  Then again you said you reran the
job and it worked, so unless the data that drives that jobs changed
underneath it, that might not work.  Or perhaps some combination of ON
CONFLICT, partitions and transition tables is going awry...  What does
wip_upload_finish() do?

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

Jason Madden

> On Mar 29, 2019, at 05:12, Thomas Munro <[hidden email]> wrote:
>
> On Fri, Mar 29, 2019 at 4:49 AM PG Bug reporting form
> <[hidden email]> wrote:
>> SQL failure calling: SELECT * FROM wip_upload_finish(...)
>> ...
>> Caused by: org.postgresql.util.PSQLException: ERROR: executor could not find
>> named tuplestore "updated_positions"
>>  Where: SQL statement "UPDATE w_instances pi
>>    SET last_modified = NOW()
>>    FROM updated_positions up
>>    WHERE pi.object_id = up.p_instance_id"
>> PL/pgSQL function wip_update_AC_trigger_func() line 24 at SQL statement
>> SQL statement "
>>      DELETE FROM wip
>>      WHERE p_instance_id = (
>>        SELECT object_id
>>        FROM w_instances
>>        WHERE c1 = $1 and c2 = $2
>>      )
>>    "

Hi!

>
> I haven't had time to look into this properly yet but a couple of
> ideas occurred to me just from the description: I wonder if there is a
> rare case of row movement in your workload (ie an update that moves a
> row from one partition to another), and we have a bug in the
> transition table code in that path.

In this application, the rows that make up `wip` are immutable. Rows may be added and rows may be deleted, but no individual row is ever updated so I think row movement can be ruled out.


>  Then again you said you reran the job and it worked, so unless the data that drives that jobs changed
> underneath it, that might not work.

Data changing underneath between runs is entirely possible. This was during a testing phase where data was being migrated from a previous system into this one while people were also interactively using the system, potentially editing that same data as it arrived.

>  Or perhaps some combination of ON
> CONFLICT, partitions and transition tables is going awry...  What does
> wip_upload_finish() do?

At the point this error was thrown, it hadn't had a chance to do very much yet :)

We have a set of rows sitting in a temporary table, `tmp_wip`. Those rows are detail rows ultimately belonging to a master row (1-many) in the `w_instances` table, which in turn are detail rows (1-many) of a master row in the `w` table (all relationships are supported with foreign key constraints; `wip.p_instance_id` -> `w_instances.object_id` (primary key) -> `w.object_id` (primary key)). (Somewhat like Vehicle -> Model -> Manufacturer tables for a car registry.)

`wip_upload_finish()` accepts a set of parameters that point it to the row in the `w` table and the row in the `w_instances` table that the details in `tmp_wip` belong to. At this point in the procedure, it has made sure the row in the `w` table exists (INSERT INTO w ON CONFLICT DO UPDATE SET last_modified = NOW() RETURNING object_id). It's just about to make sure that the row in the `w_instances` table exists, but the user has specified that the incoming rows in `tmp_wip` are to completely replace any existing `wip` detail rows (rather than append to them), so before we do that, we go ahead and remove any existing `wip` rows with the `DELETE FROM wip` statement (which failed). (Why delete first? The incoming data may not be consistent with the old data we're replacing, so this avoids any temporary situation where we'd have constraint violations, since CHECK constraints can't be deferred.) Ultimately, if the procedure had continued, the `w_instance` row would have been created/updated and the temporary data from `tmp_wip` would have been moved into `wip` (which perhaps would have necessitated making detail entries in yet more tables).

`wip_update_AC_trigger_func` is used to maintain summary data in `w_instances` based on the details in `wip`. The fact that `wip_update_AC_trigger_func` fired means that there were rows in `wip` that got deleted (all of them for this particular `p_instance_id`, which would reside in one partition of `wip`), which in turn means that the row in `w_instances` existed already, and so the `FOR rec in SELECT DISTINCT p_instance_id FROM updated_positions LOOP` (the first statement in `wip_update_AC_trigger_func`, before the UPDATE that failed) should have executed exactly exactly once.

I hope that at least begins to answer your question. Thanks for thinking about this!


Jason

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

Tom Lane-2
Jason Madden <[hidden email]> writes:
> I hope that at least begins to answer your question. Thanks for thinking about this!

FYI, we're now thinking that the problem here is unrelated to partitions
but instead is a bug in EPQ, which is a subsystem that's entered only when
a row to be locked/updated is found to have just been updated by some
concurrent transaction.  See

https://www.postgresql.org/message-id/flat/15900-bc482754fe8d7415%40postgresql.org

If you're in a position to build a custom version of Postgres, you
might try whether the patch proposed in that thread resolves your
problem.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

Jason Madden


> On Jul 9, 2019, at 10:45, Tom Lane <[hidden email]> wrote:
>
>
> FYI, we're now thinking that the problem here is unrelated to partitions
> but instead is a bug in EPQ, which is a subsystem that's entered only when
> a row to be locked/updated is found to have just been updated by some
> concurrent transaction.  See
>
> https://www.postgresql.org/message-id/flat/15900-bc482754fe8d7415%40postgresql.org
>
> If you're in a position to build a custom version of Postgres, you
> might try whether the patch proposed in that thread resolves your
> problem.

Thank you for the update. Unfortunately, I no longer work with that system and aren't able to try to reproduce the issue.

~Jason

Reply | Threaded
Open this post in threaded view
|

Re: BUG #15720: `executor could not find named tuplestore ABC` in AFTER DELETE trigger referencing OLD TABLE as ABC

Thomas Munro-5
On Wed, Jul 10, 2019 at 3:55 AM Jason Madden
<[hidden email]> wrote:

> > On Jul 9, 2019, at 10:45, Tom Lane <[hidden email]> wrote:
> > FYI, we're now thinking that the problem here is unrelated to partitions
> > but instead is a bug in EPQ, which is a subsystem that's entered only when
> > a row to be locked/updated is found to have just been updated by some
> > concurrent transaction.  See
> >
> > https://www.postgresql.org/message-id/flat/15900-bc482754fe8d7415%40postgresql.org
> >
> > If you're in a position to build a custom version of Postgres, you
> > might try whether the patch proposed in that thread resolves your
> > problem.
>
> Thank you for the update. Unfortunately, I no longer work with that system and aren't able to try to reproduce the issue.

For the record, that was committed and will appear in the upcoming
maintenance releases.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f5825853e3afdb4df4767d30cbfdd375b45d1f2a

--
Thomas Munro
https://enterprisedb.com