Failure in contrib test _int on loach

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Failure in contrib test _int on loach

Thomas Munro-5
Hi,

This is a strange failure:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00

test _int                         ... FAILED      649 ms

================= pgsql.build/contrib/intarray/regression.diffs
===================
diff -U3 /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
/usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
--- /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
2019-03-21 12:16:30.514677000 +0100
+++ /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
2019-04-05 07:23:10.005914000 +0200
@@ -453,13 +453,13 @@
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count
 -------
-   403
+   402
 (1 row)

 SELECT count(*) from test__int WHERE a @@ '23|50';
  count
 -------
-   403
+   402
 (1 row)

Those two queries are run immediately after:

CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Andrew Dunstan-8
On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <[hidden email]> wrote:

>
> Hi,
>
> This is a strange failure:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00
>
> test _int                         ... FAILED      649 ms
>
> ================= pgsql.build/contrib/intarray/regression.diffs
> ===================
> diff -U3 /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
> /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
> --- /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/expected/_int.out
> 2019-03-21 12:16:30.514677000 +0100
> +++ /usr/home/pgbf/buildroot/HEAD/pgsql.build/contrib/intarray/results/_int.out
> 2019-04-05 07:23:10.005914000 +0200
> @@ -453,13 +453,13 @@
>  SELECT count(*) from test__int WHERE a && '{23,50}';
>   count
>  -------
> -   403
> +   402
>  (1 row)
>
>  SELECT count(*) from test__int WHERE a @@ '23|50';
>   count
>  -------
> -   403
> +   402
>  (1 row)
>
> Those two queries are run immediately after:
>
> CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
>



There are a couple of other recent instances of this failure, on
francolin and whelk.

cheers

andrew


--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Tom Lane-2
Andrew Dunstan <[hidden email]> writes:
> On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <[hidden email]> wrote:
>> This is a strange failure:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00
>> [ wrong answers from queries using a GIST index ]

> There are a couple of other recent instances of this failure, on
> francolin and whelk.

Yeah.  Given three failures in a couple of days, we can reasonably
guess that the problem was introduced within a day or two prior to
the first one.  Looking at what's touched GIST in that time frame,
suspicion has to fall heavily on 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.

If I had to bet, I'd bet that there's something wrong with the
machinations described in the commit message:
   
    For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
    be marked with a valid (i.e. non-zero) LSN, so that the parent-child
    LSN-NSN interlock works correctly. We now use magic value 1 for that during
    index build. Change the fake LSN counter to begin from 1000, so that 1 is
    safely smaller than any real or fake LSN. 2 would've been enough for our
    purposes, but let's reserve a bigger range, in case we need more special
    values in the future.

I'll go add this as an open issue.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Anastasia Lubennikova

05.04.2019 18:01, Tom Lane writes:

> Andrew Dunstan <[hidden email]> writes:
>> On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <[hidden email]> wrote:
>>> This is a strange failure:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00
>>> [ wrong answers from queries using a GIST index ]
>> There are a couple of other recent instances of this failure, on
>> francolin and whelk.
> Yeah.  Given three failures in a couple of days, we can reasonably
> guess that the problem was introduced within a day or two prior to
> the first one.  Looking at what's touched GIST in that time frame,
> suspicion has to fall heavily on 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.
>
> If I had to bet, I'd bet that there's something wrong with the
> machinations described in the commit message:
>      
>      For GiST, the LSN-NSN interlock makes this a little tricky. All pages must
>      be marked with a valid (i.e. non-zero) LSN, so that the parent-child
>      LSN-NSN interlock works correctly. We now use magic value 1 for that during
>      index build. Change the fake LSN counter to begin from 1000, so that 1 is
>      safely smaller than any real or fake LSN. 2 would've been enough for our
>      purposes, but let's reserve a bigger range, in case we need more special
>      values in the future.
>
> I'll go add this as an open issue.
>
> regards, tom lane
>
Hi,
I've already noticed the same failure in our company buildfarm and
started the research.

You are right, it's the " Generate less WAL during GiST, GIN and SP-GiST
index build. " patch to blame.
Because of using the GistBuildLSN some pages are not linked correctly,
so index scan cannot find some entries, while seqscan finds them.

In attachment, you can find patch with a test that allows to reproduce
the bug not randomly, but on every run.
Now I'm trying to find a way to fix the issue.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gist_optimized_wal_intarray_test.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Anastasia Lubennikova
05.04.2019 19:41, Anastasia Lubennikova writes:

>
> 05.04.2019 18:01, Tom Lane writes:
>> Andrew Dunstan <[hidden email]> writes:
>>> On Fri, Apr 5, 2019 at 2:02 AM Thomas Munro <[hidden email]>
>>> wrote:
>>>> This is a strange failure:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2019-04-05%2005%3A15%3A00 
>>>>
>>>> [ wrong answers from queries using a GIST index ]
>>> There are a couple of other recent instances of this failure, on
>>> francolin and whelk.
>> Yeah.  Given three failures in a couple of days, we can reasonably
>> guess that the problem was introduced within a day or two prior to
>> the first one.  Looking at what's touched GIST in that time frame,
>> suspicion has to fall heavily on
>> 9155580fd5fc2a0cbb23376dfca7cd21f59c2c7b.
>>
>> If I had to bet, I'd bet that there's something wrong with the
>> machinations described in the commit message:
>>           For GiST, the LSN-NSN interlock makes this a little tricky.
>> All pages must
>>      be marked with a valid (i.e. non-zero) LSN, so that the
>> parent-child
>>      LSN-NSN interlock works correctly. We now use magic value 1 for
>> that during
>>      index build. Change the fake LSN counter to begin from 1000, so
>> that 1 is
>>      safely smaller than any real or fake LSN. 2 would've been enough
>> for our
>>      purposes, but let's reserve a bigger range, in case we need more
>> special
>>      values in the future.
>>
>> I'll go add this as an open issue.
>>
>>             regards, tom lane
>>
>
> Hi,
> I've already noticed the same failure in our company buildfarm and
> started the research.
>
> You are right, it's the " Generate less WAL during GiST, GIN and
> SP-GiST index build. " patch to blame.
> Because of using the GistBuildLSN some pages are not linked correctly,
> so index scan cannot find some entries, while seqscan finds them.
>
> In attachment, you can find patch with a test that allows to reproduce
> the bug not randomly, but on every run.
> Now I'm trying to find a way to fix the issue.
The problem was caused by incorrect detection of the page to insert new
tuple after split.
If gistinserttuple() of the tuple formed by gistgetadjusted() had to
split the page, we must to go back to the parent and
descend back to the child that's a better fit for the new tuple.

Previously this was handled by the code block with the following comment:

* Concurrent split detected. There's no guarantee that the
* downlink for this page is consistent with the tuple we're
* inserting anymore, so go back to parent and rechoose the best
* child.

After introducing GistBuildNSN this code path became unreachable.
To fix it, I added new flag to detect such splits during indexbuild.

The patches with the test and fix are attached.

Many thanks to Teodor Sigaev, who helped to find the bug.

--

Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gist_optimized_wal_intarray_fix_v1.patch (2K) Download Attachment
gist_optimized_wal_intarray_test_v1.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Heikki Linnakangas
On 09/04/2019 19:11, Anastasia Lubennikova wrote:

> 05.04.2019 19:41, Anastasia Lubennikova writes:
>> In attachment, you can find patch with a test that allows to reproduce
>> the bug not randomly, but on every run.
>> Now I'm trying to find a way to fix the issue.
>
> The problem was caused by incorrect detection of the page to insert new
> tuple after split.
> If gistinserttuple() of the tuple formed by gistgetadjusted() had to
> split the page, we must to go back to the parent and
> descend back to the child that's a better fit for the new tuple.
>
> Previously this was handled by the code block with the following comment:
>
> * Concurrent split detected. There's no guarantee that the
> * downlink for this page is consistent with the tuple we're
> * inserting anymore, so go back to parent and rechoose the best
> * child.
>
> After introducing GistBuildNSN this code path became unreachable.
> To fix it, I added new flag to detect such splits during indexbuild.

Isn't it possible that the grandparent page is also split, so that we'd
need to climb further up?

- Heikki


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Andrey Lepikhov


On 10/04/2019 20:25, Heikki Linnakangas wrote:

> On 09/04/2019 19:11, Anastasia Lubennikova wrote:
>> 05.04.2019 19:41, Anastasia Lubennikova writes:
>>> In attachment, you can find patch with a test that allows to reproduce
>>> the bug not randomly, but on every run.
>>> Now I'm trying to find a way to fix the issue.
>>
>> The problem was caused by incorrect detection of the page to insert new
>> tuple after split.
>> If gistinserttuple() of the tuple formed by gistgetadjusted() had to
>> split the page, we must to go back to the parent and
>> descend back to the child that's a better fit for the new tuple.
>>
>> Previously this was handled by the code block with the following comment:
>>
>> * Concurrent split detected. There's no guarantee that the
>> * downlink for this page is consistent with the tuple we're
>> * inserting anymore, so go back to parent and rechoose the best
>> * child.
>>
>> After introducing GistBuildNSN this code path became unreachable.
>> To fix it, I added new flag to detect such splits during indexbuild.
>
> Isn't it possible that the grandparent page is also split, so that we'd
> need to climb further up?
Based on Anastasia's idea i prepare alternative solution to fix the bug
(see attachment).
It utilizes the idea of linear increment of LSN/NSN. WAL write process
is used for change NSN value to 1 for each block of index relation.
I hope this can be a fairly clear and safe solution.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

0001-Alt-fix-for-gist_optimized_wal_intarray_test-bug.patch (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Heikki Linnakangas
On 11/04/2019 09:10, Andrey Lepikhov wrote:

> On 10/04/2019 20:25, Heikki Linnakangas wrote:
>> On 09/04/2019 19:11, Anastasia Lubennikova wrote:
>>> After introducing GistBuildNSN this code path became unreachable.
>>> To fix it, I added new flag to detect such splits during indexbuild.
>>
>> Isn't it possible that the grandparent page is also split, so that we'd
>> need to climb further up?
>
> Based on Anastasia's idea i prepare alternative solution to fix the bug
> (see attachment).
> It utilizes the idea of linear increment of LSN/NSN. WAL write process
> is used for change NSN value to 1 for each block of index relation.
> I hope this can be a fairly clear and safe solution.

That's basically the same idea as always using the "fake LSN" during
index build, like the original version of this patch did. It's got the
problem that I mentioned at
https://www.postgresql.org/message-id/090fb3cb-1ca4-e173-ecf7-47d41ebac620@...:

> * Using "fake" unlogged LSNs for GiST index build seemed fishy. I could
> not convince myself that it was safe in all corner cases. In a recently
> initdb'd cluster, it's theoretically possible that the fake LSN counter
> overtakes the real LSN value, and that could lead to strange behavior.
> For example, how would the buffer manager behave, if there was a dirty
> page in the buffer cache with an LSN value that's greater than the
> current WAL flush pointer? I think you'd get "ERROR: xlog flush request
> %X/%X is not satisfied --- flushed only to %X/%X".

Perhaps the risk is theoretical; the real WAL begins at XLOG_SEG_SIZE,
so with defaults WAL segment size, the index build would have to do
about 16 million page splits. The index would have to be at least 150 GB
for that. But it seems possible, and with non-default segment and page
size settings more so.

Perhaps we could start at 1, but instead of using a global counter,
whenever a page is split, we take the parent's LSN value and increment
it by one. So different branches of the tree could use the same values,
which would reduce the consumption of the counter values.

Yet another idea would be to start the counter at 1, but check that it
doesn't overtake the WAL insert pointer. If it's about to overtake it,
just generate some dummy WAL.

But it seems best to deal with this in gistdoinsert(). I think
Anastasia's approach of adding a flag to GISTInsertStack can be made to
work, if we set the flag somewhere in gistinserttuples() or
gistplacetopage(), whenever a page is split. That way, if it needs to
split multiple levels, the flag is set on all of the corresponding
GISTInsertStack entries.

Yet another trivial fix would be just always start the tree descend from
the root in gistdoinsert(), if a page is split. Not as efficient, but
probably negligible in practice.

- Heikki


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Andrey Lepikhov


On 11/04/2019 13:14, Heikki Linnakangas wrote:

> On 11/04/2019 09:10, Andrey Lepikhov wrote:
>> On 10/04/2019 20:25, Heikki Linnakangas wrote:
>>> On 09/04/2019 19:11, Anastasia Lubennikova wrote:
>>>> After introducing GistBuildNSN this code path became unreachable.
>>>> To fix it, I added new flag to detect such splits during indexbuild.
>>>
>>> Isn't it possible that the grandparent page is also split, so that we'd
>>> need to climb further up?
>>
>> Based on Anastasia's idea i prepare alternative solution to fix the bug
>> (see attachment).
>> It utilizes the idea of linear increment of LSN/NSN. WAL write process
>> is used for change NSN value to 1 for each block of index relation.
>> I hope this can be a fairly clear and safe solution.
>
> That's basically the same idea as always using the "fake LSN" during
> index build, like the original version of this patch did. It's got the
> problem that I mentioned at
> https://www.postgresql.org/message-id/090fb3cb-1ca4-e173-ecf7-47d41ebac620@...: 
>
>
>> * Using "fake" unlogged LSNs for GiST index build seemed fishy. I
>> could not convince myself that it was safe in all corner cases. In a
>> recently initdb'd cluster, it's theoretically possible that the fake
>> LSN counter overtakes the real LSN value, and that could lead to
>> strange behavior. For example, how would the buffer manager behave, if
>> there was a dirty page in the buffer cache with an LSN value that's
>> greater than the current WAL flush pointer? I think you'd get "ERROR:
>> xlog flush request %X/%X is not satisfied --- flushed only to %X/%X".
>
> Perhaps the risk is theoretical; the real WAL begins at XLOG_SEG_SIZE,
> so with defaults WAL segment size, the index build would have to do
> about 16 million page splits. The index would have to be at least 150 GB
> for that. But it seems possible, and with non-default segment and page
> size settings more so.
As i see in bufmgr.c, XLogFlush() can't called during index build. In
the log_newpage_range() call we can use mask to set value of NSN (and
LSN) to 1.

>
> Perhaps we could start at 1, but instead of using a global counter,
> whenever a page is split, we take the parent's LSN value and increment
> it by one. So different branches of the tree could use the same values,
> which would reduce the consumption of the counter values.
>
> Yet another idea would be to start the counter at 1, but check that it
> doesn't overtake the WAL insert pointer. If it's about to overtake it,
> just generate some dummy WAL.
>
> But it seems best to deal with this in gistdoinsert(). I think
> Anastasia's approach of adding a flag to GISTInsertStack can be made to
> work, if we set the flag somewhere in gistinserttuples() or
> gistplacetopage(), whenever a page is split. That way, if it needs to
> split multiple levels, the flag is set on all of the corresponding
> GISTInsertStack entries.
>
> Yet another trivial fix would be just always start the tree descend from
> the root in gistdoinsert(), if a page is split. Not as efficient, but
> probably negligible in practice.
Agree

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Anastasia Lubennikova
In reply to this post by Heikki Linnakangas
10.04.2019 18:25, Heikki Linnakangas writes:

> On 09/04/2019 19:11, Anastasia Lubennikova wrote:
>> 05.04.2019 19:41, Anastasia Lubennikova writes:
>>> In attachment, you can find patch with a test that allows to reproduce
>>> the bug not randomly, but on every run.
>>> Now I'm trying to find a way to fix the issue.
>>
>> The problem was caused by incorrect detection of the page to insert new
>> tuple after split.
>> If gistinserttuple() of the tuple formed by gistgetadjusted() had to
>> split the page, we must to go back to the parent and
>> descend back to the child that's a better fit for the new tuple.
>>
>> Previously this was handled by the code block with the following
>> comment:
>>
>> * Concurrent split detected. There's no guarantee that the
>> * downlink for this page is consistent with the tuple we're
>> * inserting anymore, so go back to parent and rechoose the best
>> * child.
>>
>> After introducing GistBuildNSN this code path became unreachable.
>> To fix it, I added new flag to detect such splits during indexbuild.
>
> Isn't it possible that the grandparent page is also split, so that
> we'd need to climb further up?
>
 From what I understand,
the only reason for grandparent's split during gistbuild is the
insertion of the newtup returned by gistgetadjusted().

After we stepped up the stack, we will do gistchoose() to choose new
correct child,
adjust the downlink key and insert it into grandparent. If this
insertion caused split, we will recursively follow the same codepath
and set stack->retry_from_parent again.

So it is possible, but it doesn't require any extra algorithm changes.
I didn't manage to generate dataset to reproduce grandparent split.
Though, I do agree that it's worth checking out. Do you have any ideas?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Tom Lane-2
Anastasia Lubennikova <[hidden email]> writes:
> So it is possible, but it doesn't require any extra algorithm changes.
> I didn't manage to generate dataset to reproduce grandparent split.
> Though, I do agree that it's worth checking out. Do you have any ideas?

Ping?  This thread has gone cold, but the bug is still there, and
IMV it's a beta blocker.

                        regards, tom lane


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Anastasia Lubennikova
27.04.2019 22:05, Tom Lane wrote:
> Anastasia Lubennikova <[hidden email]> writes:
>> So it is possible, but it doesn't require any extra algorithm changes.
>> I didn't manage to generate dataset to reproduce grandparent split.
>> Though, I do agree that it's worth checking out. Do you have any ideas?
> Ping?  This thread has gone cold, but the bug is still there, and
> IMV it's a beta blocker.

Hi,

Thank you for the reminder.
In a nutshell, this fix is ready for committer.

In previous emails, I have sent two patches with test and bugfix (see
attached).
After Heikki shared his concerns, I've rechecked the algorithm and
haven't found any potential error.
So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gist_optimized_wal_intarray_fix_v1.patch (2K) Download Attachment
gist_optimized_wal_intarray_test_v1.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Teodor Sigaev
Hi!

> So, if other hackers are agreed with my reasoning, the suggested fix is
> sufficient and can be committed.
>

Patch looks right, but I think that comment should be improved in follow piece:

         if (stack->blkno != GIST_ROOT_BLKNO &&
-           stack->parent->lsn < GistPageGetNSN(stack->page))
+           ((stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+            stack->retry_from_parent == true))
         {
             /*
              * Concurrent split detected. There's no guarantee that the
                ....
Not only concurrent split could be deteced here and it was missed long ago. But
this patch seems a good chance to change this comment.

--
Teodor Sigaev                                   E-mail: [hidden email]
                                                    WWW: http://www.sigaev.ru/


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Heikki Linnakangas
In reply to this post by Anastasia Lubennikova
(resending, previous attempt didn't make it to pgsql-hackers)

On 29/04/2019 16:16, Anastasia Lubennikova wrote:
> In previous emails, I have sent two patches with test and bugfix (see
> attached).
> After Heikki shared his concerns, I've rechecked the algorithm and
> haven't found any potential error.
> So, if other hackers are agreed with my reasoning, the suggested fix is
> sufficient and can be committed.

I still believe there is a problem with grandparent splits with this.
I'll try to construct a test case later this week, unless you manage to
create one before that.

- Heikki


Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Heikki Linnakangas
On 02/05/2019 10:37, Heikki Linnakangas wrote:

> (resending, previous attempt didn't make it to pgsql-hackers)
>
> On 29/04/2019 16:16, Anastasia Lubennikova wrote:
>> In previous emails, I have sent two patches with test and bugfix (see
>> attached).
>> After Heikki shared his concerns, I've rechecked the algorithm and
>> haven't found any potential error.
>> So, if other hackers are agreed with my reasoning, the suggested fix is
>> sufficient and can be committed.
>
> I still believe there is a problem with grandparent splits with this.
> I'll try to construct a test case later this week, unless you manage to
> create one before that.
Here you go. If you apply the two patches from
https://www.postgresql.org/message-id/5d48ce28-34cf-9b03-5d42-dbd5457926bf%40postgrespro.ru,
and run the attached script, it will print out something like this:

postgres=# \i grandparent.sql
DROP TABLE
CREATE TABLE
INSERT 0 150000
CREATE INDEX
psql:grandparent.sql:27: NOTICE:  working on 10000
psql:grandparent.sql:27: NOTICE:  working on 20000
psql:grandparent.sql:27: NOTICE:  working on 30000
psql:grandparent.sql:27: NOTICE:  working on 40000
psql:grandparent.sql:27: NOTICE:  working on 50000
psql:grandparent.sql:27: NOTICE:  working on 60000
psql:grandparent.sql:27: NOTICE:  working on 70000
psql:grandparent.sql:27: NOTICE:  working on 80000
psql:grandparent.sql:27: NOTICE:  working on 90000
psql:grandparent.sql:27: NOTICE:  working on 100000
psql:grandparent.sql:27: NOTICE:  working on 110000
psql:grandparent.sql:27: NOTICE:  failed for 114034
psql:grandparent.sql:27: NOTICE:  working on 120000
DO

That "failed for 114034" should not happen.

Fortunately, that's not too hard to fix. We just need to arrange things
so that the "retry_from_parent" flag also gets set for the grandparent,
when the grandparent is split. Like in the attached patch.

- Heikki

0001-Detect-internal-GiST-page-splits-correctly-during-in.patch (5K) Download Attachment
grandparent.sql (826 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Failure in contrib test _int on loach

Heikki Linnakangas
On 08/05/2019 01:31, Heikki Linnakangas wrote:

> On 02/05/2019 10:37, Heikki Linnakangas wrote:
>> On 29/04/2019 16:16, Anastasia Lubennikova wrote:
>>> In previous emails, I have sent two patches with test and bugfix (see
>>> attached).
>>> After Heikki shared his concerns, I've rechecked the algorithm and
>>> haven't found any potential error.
>>> So, if other hackers are agreed with my reasoning, the suggested fix is
>>> sufficient and can be committed.
>>
>> I still believe there is a problem with grandparent splits with this.
>> I'll try to construct a test case later this week, unless you manage to
>> create one before that.
>
> Here you go. If you apply the two patches from
> https://www.postgresql.org/message-id/5d48ce28-34cf-9b03-5d42-dbd5457926bf%40postgrespro.ru,
> and run the attached script, it will print out something like this:
>
> postgres=# \i grandparent.sql
> DROP TABLE
> CREATE TABLE
> INSERT 0 150000
> CREATE INDEX
> psql:grandparent.sql:27: NOTICE:  working on 10000
> psql:grandparent.sql:27: NOTICE:  working on 20000
> psql:grandparent.sql:27: NOTICE:  working on 30000
> psql:grandparent.sql:27: NOTICE:  working on 40000
> psql:grandparent.sql:27: NOTICE:  working on 50000
> psql:grandparent.sql:27: NOTICE:  working on 60000
> psql:grandparent.sql:27: NOTICE:  working on 70000
> psql:grandparent.sql:27: NOTICE:  working on 80000
> psql:grandparent.sql:27: NOTICE:  working on 90000
> psql:grandparent.sql:27: NOTICE:  working on 100000
> psql:grandparent.sql:27: NOTICE:  working on 110000
> psql:grandparent.sql:27: NOTICE:  failed for 114034
> psql:grandparent.sql:27: NOTICE:  working on 120000
> DO
>
> That "failed for 114034" should not happen.
>
> Fortunately, that's not too hard to fix. We just need to arrange things
> so that the "retry_from_parent" flag also gets set for the grandparent,
> when the grandparent is split. Like in the attached patch.

I hear no objections, so pushed that. But if you have a chance to review
this later, just to double-check, I'd still appreciate that.

- Heikki