PG 9.6.20 -- query misbehaves in replica

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

PG 9.6.20 -- query misbehaves in replica

Ernesto Hernández-Novich
Hello,

We have a PG 9.6.19/9.6.20 streaming replication set using PGDG
packages over Debian 9 and 10. We started experiencing an interesting
issue yesterday. This is the layout; all arrows are stream replication
built using `pg_basebackup` over replication slots.

Master N0 (9.6.19 / Debian 9) -> Replica A (9.6.19 / Debian 9)
    |        |
    |        +-----------------> Replica B (9.6.20 / Debian 10)
    |
    v
Replica N1 (9.6.20 / Debian 9) -> Replica C (9.6.20 / Debian 9)
             |
             +------------------> Replica D (9.6.20 / Debian 10)

We have a query that is a simple `select * from table where pk = 'fixed
value'`. It fetches a *single* row vía a PK Index Scan using `=` on a
TEXT value, no implicit conversions. Said query used to work fine
across the replication set.

After updating from 9.6.19 to 9.6.20, we noticed the query was not
working on D. It did NOT bring any rows. If the query is attempted
using LIKE or ~ (with a left-anchored pattern), the correct row is
fetched BUT using a Sequential Scan. We checked B, same behavior.
However, the query works normally (single row, PK IDX scan) on N0, A,
N1, and C.

We've tried:

1. REINDEX table / VACUUM ANALYZE table -- nothing changed.

2. REINDEX DATABASE / REINDEX SYSTEM on N0 -- nothing changed.

3. VACUUM FULL ANALYZE on N0 -- nothing changed.

3. Re-creating replicas B (and D) from scratch (remove $PGDATA,
   `pg_basebackup`, use replication slot) -- nothing changed.

4. Creating a NEW replica X (9.6.20 / Debian 10) on a DIFFERENT
   machine. First, using N1 as master, then using N0 as master.
   Nothing changes.

The query NEVER works on replicas B, D, nor the NEW X.

Looks like a the 9.6.20 over Debian 10 is the culprit, but I have
nothing else to work on.

Suggestions?
--
Ernesto Hernández-Novich - @iamemhn - Unix: Live free or die!
Geek by nature, Linux by choice, Debian of course.
If you can't aptitude it, it isn't useful or doesn't exist.
GPG Key Fingerprint = 0064 ADF5 EB5C DE16 99C1 6C56 F2A3 86B5 A757 E5A1



Reply | Threaded
Open this post in threaded view
|

Re: PG 9.6.20 -- query misbehaves in replica

Magnus Hagander-2


On Sat, Nov 14, 2020 at 12:17 AM Ernesto Hernández-Novich <[hidden email]> wrote:
Hello,

We have a PG 9.6.19/9.6.20 streaming replication set using PGDG
packages over Debian 9 and 10. We started experiencing an interesting
issue yesterday. This is the layout; all arrows are stream replication
built using `pg_basebackup` over replication slots.

Master N0 (9.6.19 / Debian 9) -> Replica A (9.6.19 / Debian 9)
    |        |
    |        +-----------------> Replica B (9.6.20 / Debian 10)
    |
    v
Replica N1 (9.6.20 / Debian 9) -> Replica C (9.6.20 / Debian 9)
             |
             +------------------> Replica D (9.6.20 / Debian 10)

We have a query that is a simple `select * from table where pk = 'fixed
value'`. It fetches a *single* row vía a PK Index Scan using `=` on a
TEXT value, no implicit conversions. Said query used to work fine
across the replication set.

After updating from 9.6.19 to 9.6.20, we noticed the query was not
working on D. It did NOT bring any rows. If the query is attempted
using LIKE or ~ (with a left-anchored pattern), the correct row is
fetched BUT using a Sequential Scan. We checked B, same behavior.
However, the query works normally (single row, PK IDX scan) on N0, A,
N1, and C.

We've tried:

1. REINDEX table / VACUUM ANALYZE table -- nothing changed.

2. REINDEX DATABASE / REINDEX SYSTEM on N0 -- nothing changed.

3. VACUUM FULL ANALYZE on N0 -- nothing changed.

3. Re-creating replicas B (and D) from scratch (remove $PGDATA,
   `pg_basebackup`, use replication slot) -- nothing changed.

4. Creating a NEW replica X (9.6.20 / Debian 10) on a DIFFERENT
   machine. First, using N1 as master, then using N0 as master.
   Nothing changes.

The query NEVER works on replicas B, D, nor the NEW X.

Looks like a the 9.6.20 over Debian 10 is the culprit, but I have
nothing else to work on.

Suggestions?

These issues are almost certainly because of the glibc locale changes between Debian 9 and Debian 10, and not because of the PostgreSQL upgrade.

If you have master and standby on different glibc versions (so debian 9 vs 10), all text based indexes can behave differently. All the nodes must run the same version for this to work. So in this scenario, you need to reinstall B and D with debian 9, or you need to upgrade your other nodes to debian 10 (and if you do that, then you have to reindex the master node once it has been upgraded).

 
--
Reply | Threaded
Open this post in threaded view
|

Re: PG 9.6.20 -- query misbehaves in replica

Andres Freund
In reply to this post by Ernesto Hernández-Novich
Hi,

On 2020-11-13 15:14:23 -0800, Ernesto Hernández-Novich wrote:

> We have a PG 9.6.19/9.6.20 streaming replication set using PGDG
> packages over Debian 9 and 10. We started experiencing an interesting
> issue yesterday. This is the layout; all arrows are stream replication
> built using `pg_basebackup` over replication slots.
>
> Master N0 (9.6.19 / Debian 9) -> Replica A (9.6.19 / Debian 9)
>     |        |
>     |        +-----------------> Replica B (9.6.20 / Debian 10)
>     |
>     v
> Replica N1 (9.6.20 / Debian 9) -> Replica C (9.6.20 / Debian 9)
>              |
>              +------------------> Replica D (9.6.20 / Debian 10)
>
> We have a query that is a simple `select * from table where pk = 'fixed
> value'`. It fetches a *single* row vía a PK Index Scan using `=` on a
> TEXT value, no implicit conversions. Said query used to work fine
> across the replication set.
>
> After updating from 9.6.19 to 9.6.20, we noticed the query was not
> working on D. It did NOT bring any rows. If the query is attempted
> using LIKE or ~ (with a left-anchored pattern), the correct row is
> fetched BUT using a Sequential Scan. We checked B, same behavior.
> However, the query works normally (single row, PK IDX scan) on N0, A,
> N1, and C.

That likely is related to the collation definitions differing between
the systems. Note that the systems where it doesn't work are Debian 10,
whereas the rest is Debian 9.

See https://wiki.postgresql.org/wiki/Locale_data_changes

Greetings,

Andres Freund


Reply | Threaded
Open this post in threaded view
|

Re: PG 9.6.20 -- query misbehaves in replica

Andres Freund
In reply to this post by Magnus Hagander-2
On 2020-11-14 00:59:01 +0100, Magnus Hagander wrote:
> So in this scenario, you need to reinstall B and D with debian 9, or
> you need to upgrade your other nodes to debian 10 (and if you do that,
> then you have to reindex the master node once it has been upgraded).

Changing the collation of all columns to "C" would probably also work.


Reply | Threaded
Open this post in threaded view
|

Re: PG 9.6.20 -- query misbehaves in replica

Ernesto Hernández-Novich
In reply to this post by Magnus Hagander-2
On Sat, 2020-11-14 at 00:59 +0100, Magnus Hagander wrote:
> On Sat, Nov 14, 2020 at 12:17 AM Ernesto Hernández-Novich <
> [hidden email]> wrote:
[...]
> > After updating from 9.6.19 to 9.6.20, we noticed the query was not
> > working on D.
[...]

> > Looks like a the 9.6.20 over Debian 10 is the culprit, but I have
> > nothing else to work on.
>
> These issues are almost certainly because of the glibc locale changes
> between Debian 9 and Debian 10, and not because of the PostgreSQL
> upgrade.
>
> If you have master and standby on different glibc versions (so debian
> 9 vs 10), all text based indexes can behave differently. All the
> nodes must run the same version for this to work. So in this
> scenario, you need to reinstall B and D with debian 9, or you need to
> upgrade your other nodes to debian 10 (and if you do that, then you
> have to reindex the master node once it has been upgraded).

I agree on the locale matching and will try your suggestion.

I must say the Debian 10 machines were working just fine with PG 9.6.19
for over two months or so. We noticed the issue *just* this week after
upgrading from 9.6.19 onto 9.6.20.

Regards,
--
Ernesto Hernández-Novich - @iamemhn - Unix: Live free or die!
Geek by nature, Linux by choice, Debian of course.
If you can't aptitude it, it isn't useful or doesn't exist.
GPG Key Fingerprint = 0064 ADF5 EB5C DE16 99C1 6C56 F2A3 86B5 A757 E5A1



Reply | Threaded
Open this post in threaded view
|

Re: PG 9.6.20 -- query misbehaves in replica

Magnus Hagander-2


On Sat, Nov 14, 2020 at 1:10 AM Ernesto Hernández-Novich <[hidden email]> wrote:
On Sat, 2020-11-14 at 00:59 +0100, Magnus Hagander wrote:
> On Sat, Nov 14, 2020 at 12:17 AM Ernesto Hernández-Novich <
> [hidden email]> wrote:
[...]
> > After updating from 9.6.19 to 9.6.20, we noticed the query was not
> > working on D.
[...]
> > Looks like a the 9.6.20 over Debian 10 is the culprit, but I have
> > nothing else to work on.
>
> These issues are almost certainly because of the glibc locale changes
> between Debian 9 and Debian 10, and not because of the PostgreSQL
> upgrade.
>
> If you have master and standby on different glibc versions (so debian
> 9 vs 10), all text based indexes can behave differently. All the
> nodes must run the same version for this to work. So in this
> scenario, you need to reinstall B and D with debian 9, or you need to
> upgrade your other nodes to debian 10 (and if you do that, then you
> have to reindex the master node once it has been upgraded).

I agree on the locale matching and will try your suggestion.

I must say the Debian 10 machines were working just fine with PG 9.6.19
for over two months or so. We noticed the issue *just* this week after
upgrading from 9.6.19 onto 9.6.20.

Most likely they weren't, but they looked like they were. It could be that the problem wasn't triggered until now, but it could also be that it "sometimes worked and sometimes didn't", and you just didn't notice -- I've come across several installations that had exactly that problem.
 
--