please help! losing my subscriber node

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

please help! losing my subscriber node

Achilleas Mantzios

Hello

my topology/environment is :

test linux , pgsql 10.13 (subscription) <--- prod linux, pgsql 10.13 (publication)

In order to connect to our accounting system (ms sql) , I needed to do cross-db queries, so I setup freetds (1.1.42) and tds_fdw-2.0.1 and tested them on the test system (subscription node). This involved the IMPORT of some 1000 ms sql tables in a dedicated schema, plus a few others in a second dedicated schema. All went fine, most tests worked nice, so I thought of installing tds_fdw in the prod host. Needless to say, the new mssql dedicated schemas did not involve any logical replication . After importing the 1000 tables in prod with,

IMPORT FOREIGN SCHEMA dbo EXCEPT ("....") FROM SERVER mssql_bdynacom_srv INTO mssql_bdynacom OPTIONS (import_default 'true');

after 15 mins (wal_sender_timeout / wal_receiver_timeout) I started getting in the test node :

 [120394] 5ee085dc.1d64a 2020-06-16 17:10:26.450 EEST  @ line:5 ERROR:  terminating logical replication worker due to timeout
 [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.467 EEST  @ line:1 LOG:  logical replication apply worker for subscription "data_for_testsmadb_pub" has started
 [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.545 EEST  @ line:2 ERROR:  could not start WAL streaming: ERROR:  replication slot "data_for_testsmadb_pub" is active for PID 10991

.....


in the meantime the "wal sender" on the prod node is idle. And seems stuck. Apparently the addition of those 1000 foreign tables messed up with the replication slot. I tried to move forward the replication slot, but even right after restarting the two systems , I can't seem to find any handle to call : pg_replication_origin_advance   

dynacom=# select * from pg_replication_origin;
 roident | roname
---------+--------
(0 rows)

dynacom=# select * from pg_replication_origin_status ;
 local_id | external_id | remote_lsn | local_lsn
----------+-------------+------------+-----------
(0 rows)

dynacom=#
 

So, I am pretty much at loss here . Go the guarantied way : drop the two mssql schemas (i.e. revert to pre-incident state) and re-doing the whole subscription node initialization and setup from scratch seems like a possible scenario but I'd like to avoid that, logical replication works fine for two years, and it will take maybe days to catch up, I am not at the office, I work from home and this makes things worse. Another option is to leave the tds_fdw and the foreign table definitions and re-do the logical replication setup. Please help!

Reply | Threaded
Open this post in threaded view
|

Re: please help! losing my subscriber node

Achilleas Mantzios

just an update to this.

On 16/6/20 7:04 μ.μ., Achilleas Mantzios wrote:

Hello

my topology/environment is :

test linux , pgsql 10.13 (subscription) <--- prod linux, pgsql 10.13 (publication)

In order to connect to our accounting system (ms sql) , I needed to do cross-db queries, so I setup freetds (1.1.42) and tds_fdw-2.0.1 and tested them on the test system (subscription node). This involved the IMPORT of some 1000 ms sql tables in a dedicated schema, plus a few others in a second dedicated schema. All went fine, most tests worked nice, so I thought of installing tds_fdw in the prod host. Needless to say, the new mssql dedicated schemas did not involve any logical replication . After importing the 1000 tables in prod with,

IMPORT FOREIGN SCHEMA dbo EXCEPT ("....") FROM SERVER mssql_bdynacom_srv INTO mssql_bdynacom OPTIONS (import_default 'true');

after 15 mins (wal_sender_timeout / wal_receiver_timeout) I started getting in the test node :

 [120394] 5ee085dc.1d64a 2020-06-16 17:10:26.450 EEST  @ line:5 ERROR:  terminating logical replication worker due to timeout
 [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.467 EEST  @ line:1 LOG:  logical replication apply worker for subscription "data_for_testsmadb_pub" has started
 [87028] 5ee8d2d2.153f4 2020-06-16 17:10:26.545 EEST  @ line:2 ERROR:  could not start WAL streaming: ERROR:  replication slot "data_for_testsmadb_pub" is active for PID 10991

.....


in the meantime the "wal sender" on the prod node is idle. And seems stuck. Apparently the addition of those 1000 foreign tables messed up with the replication slot. I tried to move forward the replication slot, but even right after restarting the two systems , I can't seem to find any handle to call : pg_replication_origin_advance   

dynacom=# select * from pg_replication_origin;
 roident | roname
---------+--------
(0 rows)

dynacom=# select * from pg_replication_origin_status ;
 local_id | external_id | remote_lsn | local_lsn
----------+-------------+------------+-----------
(0 rows)

dynacom=#
 

apparently I was looking in the wrong node arghhh. It should have been the subscriber node. Anyways.

So, I am pretty much at loss here . Go the guarantied way : drop the two mssql schemas (i.e. revert to pre-incident state) and re-doing the whole subscription node initialization and setup from scratch seems like a possible scenario but I'd like to avoid that, logical replication works fine for two years, and it will take maybe days to catch up, I am not at the office, I work from home and this makes things worse. Another option is to leave the tds_fdw and the foreign table definitions and re-do the logical replication setup. Please help!


I checked the number of tables the IMPORT FOREIGN SCHEMA defined : over 13000 . So in one transaction, I got 13000+ implicit CREATE FOREIGN TABLE definitions. I overcame this by altering the subscriber to use a new replication slot, created by hand the new replication slot on the publisher side, and had to kill -9 the old wal sender , (otherwise the db would not restart, and of course the old repl slot would not be dropped).  Then I was able to drop the old replication slot, and finally see the space in data/pg_wal decreasing at last. So the 1000 tables figure was a lie, because \d lied, this is what it reported, this is what I naively reported. The true figure was about 13000+ tables.

So all in all I lost about a little less than a day worth of data (not the best experience I gotta say).

Bottom line :  this is clearly a bug.  Also wal_sender_timeout didn't seem to work either. This wal sender should have exited (after 15 mins) as no receiver was connected to it. I didn't have the resources to test how it would behave in a more generic foreign table conf (e.g. pgsql_fdw), to test if it would still give problems.