Errors with schema migration and logical replication — expected?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

Errors with schema migration and logical replication — expected?

Mike Lissner
Hi, first time poster.

I just ran into a rather messy problem when doing a schema migration with logical replication. I'm not entirely sure what went wrong, why, or how to prevent it in the future. The migration I ran was pretty simple (though auto-generated by Django):

BEGIN;
ALTER TABLE "search_docketentry" ADD COLUMN "pacer_sequence_number" smallint NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "pacer_sequence_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number" varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" ALTER COLUMN "entry_number" DROP NOT NULL;
ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number" SET DEFAULT '';
ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" DROP CONSTRAINT "search_docketentry_docket_id_12fd448b9aa007ca_uniq";
CREATE INDEX "search_docketentry_recap_sequence_number_1c82e51988e2d89f_idx" ON "search_docketentry" ("recap_sequence_number", "entry_number");
CREATE INDEX "search_docketentry_eb19fcf7" ON "search_docketentry" ("pacer_sequence_number");
CREATE INDEX "search_docketentry_bff4d47b" ON "search_docketentry" ("recap_sequence_number");
CREATE INDEX "search_docketentry_recap_sequence_number_d700f0391e8213a_like" ON "search_docketentry" ("recap_sequence_number" varchar_pattern_ops);

COMMIT;

BEGIN;
ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" TYPE integer;

COMMIT;

And after running this migration, I started getting this error on the subscriber:

2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column "recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical replication worker for subscription 18390 (PID 13373) exited with exit code 1

So, my migration created a new column with a null constraint and somehow the subscriber got data that violated that. I don't know how that's possible since this was a new column and it was never nullable.

I applied the above migration simultaneously on my publisher and subscriber thinking that postgresql was smart enough to do the right thing. I think the subscriber finished first (it has less traffic).

The docs hint that postgresql might be smart enough to not worry about the order you do migrations:

> Logical replication is robust when schema definitions change in a live database: When the schema is changed on the publisher and replicated data starts arriving at the subscriber but does not fit into the table schema, replication will error until the schema is updated.

And it even hints that doing a migration on the subscriber first is a good thing in some cases:

> In many cases, intermittent errors can be avoided by applying additive schema changes to the subscriber first.

But I'm now supremely skeptical that doing anything at the subscriber first is a good idea. Are the docs wrong? Does the above error make sense? Is the process for schema migrations documented somewhere beyond the above?

I have lots of questions because I thought this would have gone smoother than it did.

As for the fix: I made the column nullable on the subscriber and I'm waiting for it to catch up. Once it does I'll re-sync its schema with the publisher. Anybody interested in following along with all this (or finding this later and having questions) can follow the issue here:


Thank you for the lovely database! I hope this is helpful.

Mike
Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Adrian Klaver-4
On 12/8/18 11:26 PM, Mike Lissner wrote:
> Hi, first time poster.
>
> I just ran into a rather messy problem when doing a schema migration
> with logical replication. I'm not entirely sure what went wrong, why, or
> how to prevent it in the future. The migration I ran was pretty simple
> (though auto-generated by Django):

>
> And after running this migration, I started getting this error on the
> subscriber:
>
> |2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
> worker for subscription "replicasubscription" has started 2018-12-09
> 05:59:45 UTC::@:[13373]:ERROR: null value in column
> "recap_sequence_number" violates not-null constraint 2018-12-09 05:59:45
> UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07
> 04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214,
> null, null). 2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process:
> logical replication worker for subscription 18390 (PID 13373) exited
> with exit code 1|
>
>
> So, my migration created a new column with a null constraint and somehow
> the subscriber got data that violated that. I don't know how that's
> possible since this was a new column and it was never nullable.


The above seems to be the crux of the problem, how did NULL get into the
column data?

The DDL migration did what it was supposed to do.

>
> I applied the above migration simultaneously on my publisher and
> subscriber thinking that postgresql was smart enough to do the right
> thing. I think the subscriber finished first (it has less traffic).

And it did as far as the schema changes where concerned.

>
> The docs hint that postgresql might be smart enough to not worry about
> the order you do migrations:
>
>  > /Logical replication is robust when schema definitions change in a
> live database:/ When the schema is changed on the publisher and
> replicated data starts arriving at the subscriber but does not fit into
> the table schema, replication will error until the schema is updated.
>
> And it even hints that doing a migration on the subscriber first is a
> good thing in some cases:
>
>  > In many cases, intermittent errors can be avoided by applying
> additive schema changes to the subscriber first.
>
> But I'm now supremely skeptical that doing anything at the subscriber
> first is a good idea. Are the docs wrong? Does the above error make
> sense? Is the process for schema migrations documented somewhere beyond
> the above?
>
> I have lots of questions because I thought this would have gone smoother
> than it did.
>
> As for the fix: I made the column nullable on the subscriber and I'm
> waiting for it to catch up. Once it does I'll re-sync its schema with
> the publisher. Anybody interested in following along with all this (or
> finding this later and having questions) can follow the issue here:
>
> https://github.com/freelawproject/courtlistener/issues/919
>
> Thank you for the lovely database! I hope this is helpful.
>
> Mike


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner

The above seems to be the crux of the problem, how did NULL get into the
column data?


I agree. My queries are generated by Django (so I never write SQL myself), but:

 - the column has always been NOT NULL for its entire lifetime
 - we don't send *any* SQL commands to the replica yet, so that's not a factor (for now it's just a live backup)
 - the publisher now has a NOT NULL constraint on that column. I never had to clear out null values to put it in place. I assume that if that column ever had a null value and I tried to run a DDL to add a null constraint, the DDL would have failed, right?

Something feels wrong here, the more I think about it.
Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Adrian Klaver-4
On 12/9/18 8:03 AM, Mike Lissner wrote:

>
>     The above seems to be the crux of the problem, how did NULL get into
>     the
>     column data?
>
>
> I agree. My queries are generated by Django (so I never write SQL
> myself), but:
>
>   - the column has always been NOT NULL for its entire lifetime

The lifetime being since the migration did this?:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

Also does the column recap_sequence_number appear in any other tables.
Just wondering if the error was on another table?


>   - we don't send *any* SQL commands to the replica yet, so that's not a
> factor (for now it's just a live backup)
>   - the publisher now has a NOT NULL constraint on that column. I never
> had to clear out null values to put it in place. I assume that if that

This part confuses me. You seem to imply that the column existed before
the migration and you just added a NOT NULL constraint. The migration
shows the column being created with a NOT NULL constraint.

> column ever had a null value and I tried to run a DDL to add a null
> constraint, the DDL would have failed, right?
>
> Something feels wrong here, the more I think about it.

A start would be to figure out what generated?:

failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner


On Sun, Dec 9, 2018 at 8:43 AM Adrian Klaver <[hidden email]> wrote:
On 12/9/18 8:03 AM, Mike Lissner wrote:
>
>     The above seems to be the crux of the problem, how did NULL get into
>     the
>     column data?
>
>
> I agree. My queries are generated by Django (so I never write SQL
> myself), but:
>
>   - the column has always been NOT NULL for its entire lifetime

The lifetime being since the migration did this?:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

"Lifetime" meaning that there was never a time when this column allowed nulls.
 
Also does the column recap_sequence_number appear in any other tables.
Just wondering if the error was on another table?

Good idea, but no. This column only exists in one table.
 
>   - we don't send *any* SQL commands to the replica yet, so that's not a
> factor (for now it's just a live backup)
>   - the publisher now has a NOT NULL constraint on that column. I never
> had to clear out null values to put it in place. I assume that if that

This part confuses me. You seem to imply that the column existed before
the migration and you just added a NOT NULL constraint. The migration
shows the column being created with a NOT NULL constraint.

Sorry, what I mean is that if *somehow* the master had null values in that column at some point, which I don't know how would even be possible because it only came into existence with the command above — if somehow that happened, I'd know, because I wouldn't have been able to add a NULL constraint without first fixing the data in that column, which I never did. 

My contention is that for all these reasons, there should *never* have been a null value in that column on master.
 

> column ever had a null value and I tried to run a DDL to add a null
> constraint, the DDL would have failed, right?
>
> Something feels wrong here, the more I think about it.

A start would be to figure out what generated?:

failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)

Yes, I completely agree. I can't think of any way that that should have ever been created.
 


--
Adrian Klaver
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Adrian Klaver-4
On 12/9/18 10:47 AM, Mike Lissner wrote:
>

> My contention is that for all these reasons, there should *never* have
> been a null value in that column on master.
>
>
>      > column ever had a null value and I tried to run a DDL to add a null
>      > constraint, the DDL would have failed, right?
>      >
>      > Something feels wrong here, the more I think about it.
>
>     A start would be to figure out what generated?:
>
>     failing row contains (48064261, 2018-12-07 04:48:40.388377+00,
>     2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null)
>
>
> Yes, I completely agree. I can't think of any way that that should have
> ever been created.

1) Using psql have you verified that NOT NULL is set on that column on
the publisher?

2) And that the row that failed in the subscriber is in the publisher table.

3) That there are no NULL values in the publisher column?

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

a) If not then you could leave things as they are.

b) If so then you:
       
        1) Have to figure out what is sending NULL values to the column.
       
            Maybe a model that has null=True set when it shouldn't be?
       
            A Form/ModelForm that is allowing None/Null?

            Some code that is operating outside the ORM e.g. doing a
           direct query using from django.db import connection.

         2) Clean up the NULL values in the column in the subscriber
            and/or publisher.

>
>
>
>     --
>     Adrian Klaver
>     [hidden email] <mailto:[hidden email]>
>


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner
On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver <[hidden email]> wrote:

1) Using psql have you verified that NOT NULL is set on that column on
the publisher?

Yes, on the publisher and the subscriber. That was my first step when I saw the log lines about this.

2) And that the row that failed in the subscriber is in the publisher table.

Yep, it's there (though it doesn't show a null for that column, and I don't know how it ever could have).
 
3) That there are no NULL values in the publisher column?

This on the publisher:

select * from search_docketentry where recap_sequence_number is null;

returns zero rows, so yeah, no nulls in there (which makes sense since they're not allowed).

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

Yes, and indeed that's how it always has been.

a) If not then you could leave things as they are.

Well, I was able to fix this by briefly allowing nulls on the subscriber, letting it catch up with the publisher, setting all nulls to empty strings (a Django convention), and then disallowing nulls again. After letting it catch up, there were 118 nulls on the subscriber in this column:


That shouldn't be possible since nulls were never allowed in this column on the publisher.
 
b) If so then you:

        1) Have to figure out what is sending NULL values to the column.

            Maybe a model that has null=True set when it shouldn't be?

Nope, never had that. I'm 100% certain.
 
            A Form/ModelForm that is allowing None/Null?

Even if that was the case, the error wouldn't have shown up on the subscriber since that null would have never been allowed in the publisher. But anyway, I don't use any forms with this column.
 
            Some code that is operating outside the ORM e.g. doing a
           direct query using from django.db import connection.

That's an idea, but like I said, nothing sends SQL to the subscriber (not even read requests), and this shouldn't have been possible in the publisher due to the NOT NULL constraint that has *always* been on that column.

         2) Clean up the NULL values in the column in the subscriber
            and/or publisher.

There were only NULL values in the subscriber, never in the publisher. Something is amiss here.

I appreciate all the responses. I'm scared to say so, but I think this is a bug in logical replication. Somehow a null value appeared at the subscriber that was never in the publisher.

I also still have this question/suggestion from my first email:

> Is the process for schema migrations documented somewhere beyond the above?

Thank you again,

Mike

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner
Reupping this since it was over the weekend and looks like a bug in logical replication. My problems are solved, but some very weird things happened when doing a schema migration.

On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner <[hidden email]> wrote:
On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver <[hidden email]> wrote:

1) Using psql have you verified that NOT NULL is set on that column on
the publisher?

Yes, on the publisher and the subscriber. That was my first step when I saw the log lines about this.

2) And that the row that failed in the subscriber is in the publisher table.

Yep, it's there (though it doesn't show a null for that column, and I don't know how it ever could have).
 
3) That there are no NULL values in the publisher column?

This on the publisher:

select * from search_docketentry where recap_sequence_number is null;

returns zero rows, so yeah, no nulls in there (which makes sense since they're not allowed).

Whatever the answers to 1), 2) and 3) are the next question is:

4) Do you want/need recap_sequence_number to be NOT NULL.

Yes, and indeed that's how it always has been.

a) If not then you could leave things as they are.

Well, I was able to fix this by briefly allowing nulls on the subscriber, letting it catch up with the publisher, setting all nulls to empty strings (a Django convention), and then disallowing nulls again. After letting it catch up, there were 118 nulls on the subscriber in this column:


That shouldn't be possible since nulls were never allowed in this column on the publisher.
 
b) If so then you:

        1) Have to figure out what is sending NULL values to the column.

            Maybe a model that has null=True set when it shouldn't be?

Nope, never had that. I'm 100% certain.
 
            A Form/ModelForm that is allowing None/Null?

Even if that was the case, the error wouldn't have shown up on the subscriber since that null would have never been allowed in the publisher. But anyway, I don't use any forms with this column.
 
            Some code that is operating outside the ORM e.g. doing a
           direct query using from django.db import connection.

That's an idea, but like I said, nothing sends SQL to the subscriber (not even read requests), and this shouldn't have been possible in the publisher due to the NOT NULL constraint that has *always* been on that column.

         2) Clean up the NULL values in the column in the subscriber
            and/or publisher.

There were only NULL values in the subscriber, never in the publisher. Something is amiss here.

I appreciate all the responses. I'm scared to say so, but I think this is a bug in logical replication. Somehow a null value appeared at the subscriber that was never in the publisher.

I also still have this question/suggestion from my first email:

> Is the process for schema migrations documented somewhere beyond the above?

Thank you again,

Mike

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Adrian Klaver-4
On 12/11/18 2:21 PM, Mike Lissner wrote:

> Reupping this since it was over the weekend and looks like a bug in
> logical replication. My problems are solved, but some very weird things
> happened when doing a schema migration.
>
> On Sun, Dec 9, 2018 at 5:48 PM Mike Lissner
> <[hidden email] <mailto:[hidden email]>>
> wrote:
>
>     On Sun, Dec 9, 2018 at 12:42 PM Adrian Klaver
>     <[hidden email] <mailto:[hidden email]>> wrote:
>
>
>         1) Using psql have you verified that NOT NULL is set on that
>         column on
>         the publisher?
>
>
>     Yes, on the publisher and the subscriber. That was my first step
>     when I saw the log lines about this.
>
>         2) And that the row that failed in the subscriber is in the
>         publisher table.
>
>
>     Yep, it's there (though it doesn't show a null for that column, and
>     I don't know how it ever could have).
>
>         3) That there are no NULL values in the publisher column?
>
>
>     This on the publisher:
>
>     select * from search_docketentry where recap_sequence_number is null;
>
>     returns zero rows, so yeah, no nulls in there (which makes sense
>     since they're not allowed).
>
>         Whatever the answers to 1), 2) and 3) are the next question is:
>
>         4) Do you want/need recap_sequence_number to be NOT NULL.
>
>
>     Yes, and indeed that's how it always has been.
>
>         a) If not then you could leave things as they are.
>
>
>     Well, I was able to fix this by briefly allowing nulls on the
>     subscriber, letting it catch up with the publisher, setting all
>     nulls to empty strings (a Django convention), and then disallowing
>     nulls again. After letting it catch up, there were 118 nulls on the
>     subscriber in this column:

So recap_sequence_number is not actually a number, it is a code?

>
>     I appreciate all the responses. I'm scared to say so, but I think
>     this is a bug in logical replication. Somehow a null value appeared
>     at the subscriber that was never in the publisher.
>
>     I also still have this question/suggestion from my first email:
>
>      > Is the process for schema migrations documented somewhere beyond
>     the above?

Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is Django
I am assuming some combination of migrate, makemigrations and/or sqlmigrate.

>
>     Thank you again,
>
>     Mike
>


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner


On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <[hidden email]> wrote:

>     Well, I was able to fix this by briefly allowing nulls on the
>     subscriber, letting it catch up with the publisher, setting all
>     nulls to empty strings (a Django convention), and then disallowing
>     nulls again. After letting it catch up, there were 118 nulls on the
>     subscriber in this column:

So recap_sequence_number is not actually a number, it is a code?

It has sequential values, but they're not necessarily numbers.
 

>
>     I appreciate all the responses. I'm scared to say so, but I think
>     this is a bug in logical replication. Somehow a null value appeared
>     at the subscriber that was never in the publisher.
>
>     I also still have this question/suggestion from my first email:
>
>      > Is the process for schema migrations documented somewhere beyond
>     the above?

Not that I know of. It might help, if possible, to detail the steps in
the migration. Also what program you used to do it. Given that is Django
I am assuming some combination of migrate, makemigrations and/or sqlmigrate.

Pretty simple/standard, I think:
 - Changed the code.
 - Generated the migration using manage.py makemigration
 - Generated the SQL using sqlmigrate
 - Ran the migration using manage.py migrate on the master and using psql on the replica
 
Mike
Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Adrian Klaver-4
On 12/12/18 12:15 AM, Mike Lissner wrote:

>
>
> On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>      >     Well, I was able to fix this by briefly allowing nulls on the
>      >     subscriber, letting it catch up with the publisher, setting all
>      >     nulls to empty strings (a Django convention), and then
>     disallowing
>      >     nulls again. After letting it catch up, there were 118 nulls
>     on the
>      >     subscriber in this column:
>
>     So recap_sequence_number is not actually a number, it is a code?
>
>
> It has sequential values, but they're not necessarily numbers.
>
>
>      >
>      >     I appreciate all the responses. I'm scared to say so, but I think
>      >     this is a bug in logical replication. Somehow a null value
>     appeared
>      >     at the subscriber that was never in the publisher.
>      >
>      >     I also still have this question/suggestion from my first email:
>      >
>      >      > Is the process for schema migrations documented somewhere
>     beyond
>      >     the above?
>
>     Not that I know of. It might help, if possible, to detail the steps in
>     the migration. Also what program you used to do it. Given that is
>     Django
>     I am assuming some combination of migrate, makemigrations and/or
>     sqlmigrate.
>
>
> Pretty simple/standard, I think:
>   - Changed the code.
>   - Generated the migration using manage.py makemigration
>   - Generated the SQL using sqlmigrate
>   - Ran the migration using manage.py migrate on the master and using
> psql on the replica

The only thing I can think of involves this sequence on the subscriber:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

and then this:

https://www.postgresql.org/docs/11/logical-replication-subscription.html

"Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."

https://www.postgresql.org/docs/10/sql-createtable.html

"If there is no default for a column, then the default is null."

So the subscriber finished the migration first, as alluded to in an
earlier post. There is no data for recap_sequence_number coming from the
provider so Postgres place holds the data with NULL until such time as
the migration on the provider finishes and actual data for
recap_sequence_number starts flowing.

Going forward, options I see:

1) Making sure there is a DEFAULT other then NULL for a NOT NULL column.

2) Stop the replication and do the migration scripts on both provider
and subscriber until they both complete and then start replication again.




> Mike


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner
This sounds *very* plausible. So I think there are a few takeaways:

1. Should the docs mention that additive changes with NOT NULL constraints are bad?

2. Is there a way this could work without completely breaking replication? For example, should Postgresql realize replication can't work in this instance and then stop it until schemas are back in sync, like it does with other incompatible schema changes? That'd be better than failing in this way and is what I'd expect to happen.

3. Are there other edge cases like this that aren't well documented that we can expect to creep up on us? If so, should we try to spell out exactly *which* additive changes *are* OK?

This feels like a major "gotcha" to me, and I'm trying to avoid those. I feel like the docs are pretty lacking here and that others will find themselves in similarly bad positions.

Better schema migration docs would surely help, too.

Mike


On Wed, Dec 12, 2018 at 7:11 AM Adrian Klaver <[hidden email]> wrote:
On 12/12/18 12:15 AM, Mike Lissner wrote:
>
>
> On Tue, Dec 11, 2018 at 3:10 PM Adrian Klaver <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>      >     Well, I was able to fix this by briefly allowing nulls on the
>      >     subscriber, letting it catch up with the publisher, setting all
>      >     nulls to empty strings (a Django convention), and then
>     disallowing
>      >     nulls again. After letting it catch up, there were 118 nulls
>     on the
>      >     subscriber in this column:
>
>     So recap_sequence_number is not actually a number, it is a code?
>
>
> It has sequential values, but they're not necessarily numbers.
>
>
>      >
>      >     I appreciate all the responses. I'm scared to say so, but I think
>      >     this is a bug in logical replication. Somehow a null value
>     appeared
>      >     at the subscriber that was never in the publisher.
>      >
>      >     I also still have this question/suggestion from my first email:
>      >
>      >      > Is the process for schema migrations documented somewhere
>     beyond
>      >     the above?
>
>     Not that I know of. It might help, if possible, to detail the steps in
>     the migration. Also what program you used to do it. Given that is
>     Django
>     I am assuming some combination of migrate, makemigrations and/or
>     sqlmigrate.
>
>
> Pretty simple/standard, I think:
>   - Changed the code.
>   - Generated the migration using manage.py makemigration
>   - Generated the SQL using sqlmigrate
>   - Ran the migration using manage.py migrate on the master and using
> psql on the replica

The only thing I can think of involves this sequence on the subscriber:

ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;

and then this:

https://www.postgresql.org/docs/11/logical-replication-subscription.html

"Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."

https://www.postgresql.org/docs/10/sql-createtable.html

"If there is no default for a column, then the default is null."

So the subscriber finished the migration first, as alluded to in an
earlier post. There is no data for recap_sequence_number coming from the
provider so Postgres place holds the data with NULL until such time as
the migration on the provider finishes and actual data for
recap_sequence_number starts flowing.

Going forward, options I see:

1) Making sure there is a DEFAULT other then NULL for a NOT NULL column.

2) Stop the replication and do the migration scripts on both provider
and subscriber until they both complete and then start replication again.




> Mike


--
Adrian Klaver
[hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Adrian Klaver-4
On 12/12/18 3:19 PM, Mike Lissner wrote:
> This sounds *very* plausible. So I think there are a few takeaways:
>
> 1. Should the docs mention that additive changes with NOT NULL
> constraints are bad?

It's not the NOT NULL it's the lack of a DEFAULT. In general a column
with a NOT NULL and no DEFAULT is going to to bite you sooner or later:)
At this point I have gathered enough of those bite marks to just make it
my policy to always provide a DEFAULT for a NOT NULL column.

>
> 2. Is there a way this could work without completely breaking
> replication? For example, should Postgresql realize replication can't
> work in this instance and then stop it until schemas are back in sync,
> like it does with other incompatible schema changes? That'd be better
> than failing in this way and is what I'd expect to happen.

Not sure as there is no requirement that a column has a specified
DEFAULT. This is unlike PK and FK constraint violations where the
relationship is spelled out. Trying to parse all the possible ways a
user could get into trouble would require something on the order of an
AI and I don't see that happening anytime soon.

>
> 3. Are there other edge cases like this that aren't well documented that
> we can expect to creep up on us? If so, should we try to spell out
> exactly *which* additive changes *are* OK?

Not that I know of. By their nature edge cases are rare and often are
dealt with in the moment and not pushed out to everybody. The only
solution I know of is pretesting your schema change/replication setup on
a dev installation.

>
> This feels like a major "gotcha" to me, and I'm trying to avoid those. I
> feel like the docs are pretty lacking here and that others will find
> themselves in similarly bad positions.

Logical replication in core(not the pglogical extension) appeared for
the first time in version 10. On the crawl/walk/run spectrum it is
moving from crawl to walk. The docs will take some time to be more
complete. Just for the record my previous post was sketching out a
possible scenario not an ironclad answer. If you think the answer is
plausible and a 'gotcha' I would file a bug:

https://www.postgresql.org/account/login/?next=/account/submitbug/

>
> Better schema migration docs would surely help, too.
>
> Mike
>
>


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Errors with schema migration and logical replication — expected?

Mike Lissner
Thanks Adrian for all the help. I filed this as bug #15549. I hope this all helps get logical replication into the "Running" stage.

On Wed, Dec 12, 2018 at 5:06 PM Adrian Klaver <[hidden email]> wrote:
On 12/12/18 3:19 PM, Mike Lissner wrote:
> This sounds *very* plausible. So I think there are a few takeaways:
>
> 1. Should the docs mention that additive changes with NOT NULL
> constraints are bad?

It's not the NOT NULL it's the lack of a DEFAULT. In general a column
with a NOT NULL and no DEFAULT is going to to bite you sooner or later:)
At this point I have gathered enough of those bite marks to just make it
my policy to always provide a DEFAULT for a NOT NULL column.

>
> 2. Is there a way this could work without completely breaking
> replication? For example, should Postgresql realize replication can't
> work in this instance and then stop it until schemas are back in sync,
> like it does with other incompatible schema changes? That'd be better
> than failing in this way and is what I'd expect to happen.

Not sure as there is no requirement that a column has a specified
DEFAULT. This is unlike PK and FK constraint violations where the
relationship is spelled out. Trying to parse all the possible ways a
user could get into trouble would require something on the order of an
AI and I don't see that happening anytime soon.

>
> 3. Are there other edge cases like this that aren't well documented that
> we can expect to creep up on us? If so, should we try to spell out
> exactly *which* additive changes *are* OK?

Not that I know of. By their nature edge cases are rare and often are
dealt with in the moment and not pushed out to everybody. The only
solution I know of is pretesting your schema change/replication setup on
a dev installation.

>
> This feels like a major "gotcha" to me, and I'm trying to avoid those. I
> feel like the docs are pretty lacking here and that others will find
> themselves in similarly bad positions.

Logical replication in core(not the pglogical extension) appeared for
the first time in version 10. On the crawl/walk/run spectrum it is
moving from crawl to walk. The docs will take some time to be more
complete. Just for the record my previous post was sketching out a
possible scenario not an ironclad answer. If you think the answer is
plausible and a 'gotcha' I would file a bug:

https://www.postgresql.org/account/login/?next=/account/submitbug/

>
> Better schema migration docs would surely help, too.
>
> Mike
>
>


--
Adrian Klaver
[hidden email]