Aurora PostgreSQL RDS DB Latency

github kran
Hi Postgres Team,
We are using the AWS Aurora RDS PostgreSQL 9.6.8 engine. We recently deleted a few million rows from the database and ran into an issue in one of our dev accounts where the DB was not behaving normally after the deletion. We reindexed and vacuumed the entire database but couldn't bring it back to its earlier state. So, as a next step, we deleted the database and recreated it by copying a snapshot from a production instance, then vacuumed and reindexed it again.
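
(For reference, the maintenance we ran was along these lines, shown here for one of the weekly partition tables:)

   VACUUM VERBOSE location_data_2018_12_part4;
   REINDEX TABLE location_data_2018_12_part4;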

The dev database now seems to be in a better state than before, but a few of our DB calls take more than 1 minute to fetch data. We observed that this is because the query plan uses a hash join, whereas a similar query on the prod instance does not use a hash join and returns faster.
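
(We compared the plans on the two instances with something like the following, where the SELECT stands in for our actual fetch query, which is omitted here:)

   EXPLAIN (ANALYZE, BUFFERS)
   SELECT ... FROM location_data_2018_12_part4 WHERE ...;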

Also, we did not want to experiment with modifying DB settings, such as setting enable_hash_join to off or random_page_cost to 1, since we don't have those settings changed on the prod instance.

Note:
Our partition tables range between 40 GB and 75 GB, which is our normal size range; we create a new partition table every 7 days.

We'd appreciate your ideas on what we could be missing and what we can correct here to reduce the query latency.

Thanks 
githubKran

Re: Aurora PostgreSQL RDS DB Latency

Michael Lewis
Are the default statistics targets the same on both the prod and dev instances? Have you analyzed all tables used in this query to ensure the stats are current? If the optimizer is choosing a different plan, then the stats must be different, IMO.
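
(A quick sanity check on each instance; the table name is a placeholder:)

   SHOW default_statistics_target;
   ANALYZE VERBOSE tableName;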


Michael Lewis |  Software Engineer
Entrata



Re: Aurora PostgreSQL RDS DB Latency

github kran


Thanks for your reply. I have verified a few of the tables and their default statistics targets, and they appear to be the same, but is there anything in particular you want me to look at to differentiate the prod and non-prod databases? (Also, the DB instance size is the same, but there is a little more data in the non-prod Aurora RDS instance than in the prod instance.)

Query used:
   select * from pg_stats where tablename = 'tableName';
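
(A more targeted comparison of the per-column statistics on the two instances might look like the following; the columns shown are just a sample of the pg_stats fields worth diffing:)

   SELECT attname, n_distinct, null_frac, correlation
   FROM pg_stats
   WHERE tablename = 'tableName'
   ORDER BY attname;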



Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

Michael Lewis
Did you update the stats by running ANALYZE on the tables involved, or perhaps the entire database on the 'Non prod Aurora RDS instance'? Can you share the two execution plans?

Michael Lewis  |  Software Engineer
Entrata
c: 619.370.8697



Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

github kran


Here are the plans for both DB instances.


Attachments: prodInstance_working.txt (110K), hashJoin_nonProd.txt (110K)

Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

github kran


Michael - did you get a chance to look at this issue? We also see a sequential scan being done instead of an index scan.



Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

Michael Lewis
I didn't see your email yesterday, sorry about that. Index scans instead of sequential scans, and nested loops instead of hash joins, mean that you have bad row count estimates on the non-prod Aurora RDS instance, as far as I can figure. Have you run commands like:

analyze asset_info_2019_2_part4;
analyze asset_info_2019_2_part2;

etc.? If the data are very similar, the indexes all exist, and default_statistics_target is the same, then you should be getting the same plans.
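
(One way to confirm there are no per-table or per-column statistics overrides hiding on either instance; a sketch, substitute your own partition names:)

   SELECT relname, reloptions
   FROM pg_class
   WHERE relname LIKE 'asset_info_2019_2_part%';

   SELECT attname, attstattarget
   FROM pg_attribute
   WHERE attrelid = 'asset_info_2019_2_part4'::regclass AND attnum > 0;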


Michael Lewis  |  Software Engineer
Entrata
c: 619.370.8697



Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

github kran


1) Yes, we ran VACUUM on all the tables (asset_info_2019_2_part1, part2, part4, and also the location_info_xxx tables) to remove any dead tuples, and also rebuilt the indexes.
2) REINDEX TABLE location_data_2018_12_part4; (likewise for all the tables, along with VACUUM on all of them).
3) There is more data in the non-prod instance. One thing to mention: when we built the non-prod instance we copied a SNAPSHOT from the prod instance, inserted about 100 million more rows on top of that, and then ran VACUUM and reindexed the tables.

I can't think of anything else we can do here, but let us know if you need any more details on this problem. I am happy to share more details.


Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

Michael Lewis
How many total rows are in these tables? I am assuming these are partitions and those 100 million rows got distributed across them. If the data difference is significant, then you aren't guaranteed similar performance. You may want to follow more of the suggested steps.



Michael Lewis  |  Software Engineer
Entrata
c: 619.370.8697



Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

github kran


Michael - Yes, correct: the 100 million rows are distributed across all the partitions.
Finally, I feel we have reached a conclusion: after we changed random_page_cost from 4 to 2 on the non-prod instance, we see improvements in the query, which used to take 1 minute and now runs in 1-2 seconds.
That's correct, we have around 490 million rows in a few of our partition tables. The partition tables are created every 7 days.

We ran our API test, which hits the backend Aurora RDS PostgreSQL database, and our query response times and requests/sec are better than before. Do you recommend this setting on a production instance? Right now we are planning to implement random_page_cost = 2. (We know it also works if we turn off enable_hash_join, but we would rather use random_page_cost = 2 than make that change.)
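
(For reference, the change can be tried per session before being persisted; the database name below is a placeholder, and on RDS/Aurora the parameter can also be set through the DB parameter group:)

   SET random_page_cost = 2;                      -- session-level, for testing
   ALTER DATABASE mydb SET random_page_cost = 2;  -- persists for new connections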

Questions:
1) What is your recommendation on this? Can we make this change on the prod instance, which is performing better today, or should we keep it only on the non-prod instance? (We do want to implement this change on the non-prod instance.)
2) Are there any downsides to changing this value?
3) Another question: we have 64 TB of storage and are running on an r4.16xlarge EC2 instance. A few of our partitions are bigger and a few are smaller. We have data from the last 9 months and plan to keep it for close to 1 year, until May. Do you see any concerns with this? Eventually we are thinking of archiving this data in the next 2 months by dropping the older partitions.
4) What could be the problems with keeping the data longer, given regular maintenance such as VACUUM and other maintenance activities?

Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

Michael Lewis
This is beyond my expertise, except to say that if your storage is SSDs in AWS, then you definitely want random_page_cost close to seq_page_cost (1 by default), assuming your data is likely to be in cache, as discussed in the documentation. As it says, "Reducing this value relative to seq_page_cost will cause the system to prefer index scans", as you saw. Changing the value on production would again depend on the storage type used and how good the cache hit rate is.
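
(One rough way to gauge the cache hit rate from the server's own counters; a sketch using the standard statistics views:)

   SELECT sum(blks_hit) * 100.0 / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_pct
   FROM pg_stat_database;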

As far as I know, dropping old partitions should not be significantly impactful to the system other than no longer needing to store that data (cost, time for full backups, etc).

Again, as I understand things, there is not a big impact from having old unused tables in terms of maintenance. They should be ignored by normal processes.

Glad you got your issue resolved.


Michael Lewis


Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

github kran


Thanks for the feedback. You have been giving your thoughts and suggestions since the beginning of this case, and it has been helpful. Based on your suggestion to increase the default statistics target from 100, I realized later that it was not applied correctly at first: I had set it at the session level without setting it on the partition tables themselves. As a next step I set the statistics target to 1000 on all of the partition tables we use, from Nov 2018 to Feb 2019, and then ran ANALYZE. I am currently running a load test to see how the DB is performing now, and so far it is running better than before. (I have reset all the previous changes except the statistics change.) I will keep you posted after the test finishes.
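
(The per-table change was along these lines; the statistics target is set per column, and the column name here is hypothetical:)

   ALTER TABLE location_data_2018_12_part4
       ALTER COLUMN event_time SET STATISTICS 1000;  -- event_time is a hypothetical column
   ANALYZE location_data_2018_12_part4;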

Questions:
1) Can I further increase the setting to 3000 and see how the system behaves? How do I determine the best value for my database in terms of the sampling limit for the statistics setting?
2) Apart from analyzing the tables, do I need to make any other changes related to the statistics setting?
3) Also, work_mem is currently set to 4 MB and we haven't played with this value so far. Can I increase work_mem for future needs?

Appreciate your reply.

Thanks 



Re: PostgreSQL RDS DB Latency Choosing Hash Join Plan

Michael Lewis
1) You can increase it as much as you want, but (auto)analyze will take longer since it examines default_statistics_target * 300 rows and computes the most common values and their frequencies. How much variation does your data actually have? If your data only has 50 distinct values with a fairly even distribution, then there is no need to increase it from 100. Also, deciding on the best query plan will take the optimizer a little more time, since it will be examining a bigger collection of stats on the tables where you have increased the target.
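
For example, at the default target of 100, ANALYZE samples 300 * 100 = 30,000 rows per table; at a target of 1000 it samples 300,000 rows, and the most-common-values and histogram arrays it stores can grow to up to 1000 entries per column.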

2) Not that I am aware of.

3) I am not aware of anything about your application, so I can't recommend a number outright, but 4MB for work_mem definitely seems low to me assuming you have 16GB or more memory available, unless you have very high concurrency. It will depend on how many sorts per statement, how many users, etc. If you spill over to disk on routine operations, then things are definitely going to be much slower than if you are able to keep things in memory. You could run EXPLAIN ANALYZE and verify that you are keeping things in memory. You could also turn on automatic gathering of EXPLAIN ANALYZE plans in production (e.g., via the auto_explain module) if you have the room for logging and can tolerate just a little latency.
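For instance, a minimal sketch of both checks — the table name appears earlier in this thread, but the column names (asset_id, created_at) are invented for illustration:

   -- raise the statistics target for one skewed column and refresh its stats;
   -- unlike a session-level SET default_statistics_target, this persists
   -- on the table and is picked up by autoanalyze
   ALTER TABLE asset_info_2019_2_part4 ALTER COLUMN asset_id SET STATISTICS 1000;
   ANALYZE asset_info_2019_2_part4;

   -- check whether a sort spills to disk: look for "Sort Method: external merge  Disk: ..."
   -- (spilled) versus "Sort Method: quicksort  Memory: ..." (stayed in memory)
   EXPLAIN (ANALYZE, BUFFERS)
   SELECT * FROM asset_info_2019_2_part4 ORDER BY created_at;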


I'm glad your performance has improved in the AWS/dev environment. It can be a big hassle to test things in an environment that performs significantly differently.

Michael Lewis


On Sun, Feb 17, 2019 at 10:01 AM github kran <[hidden email]> wrote:


On Thu, Feb 14, 2019 at 4:58 PM Michael Lewis <[hidden email]> wrote:
This is beyond my expertise except to say that if your storage is SSDs in AWS, then you definitely want random_page_cost close to the same as seq_page_cost (1 by default) assuming your data is likely to be in cache as discussed in the documentation. As it says- "Reducing this value relative to seq_page_cost will cause the system to prefer index scans" as you saw. Changing the value on production would again depend on the storage type used, and how good the cache hit rate is.

As far as I know, dropping old partitions should not be significantly impactful to the system other than no longer needing to store that data (cost, time for full backups, etc).

Again, as I understand things, there is not a big impact from having old unused tables in terms of maintenance. They should be ignored by normal processes.

Glad you got your issue resolved.


Michael Lewis

Thanks for the feedback. You have been giving your thoughts and suggestions since the beginning of this case, and it has been helpful. Based on your suggestion to increase the default statistics target from 100, I realized it was not set correctly initially: I had set it at the session level without setting it on the partition tables. As a next step I set the statistics target to 1000 on all of the partition tables we use, from Nov 2018 to Feb 2019, and then ran ANALYZE. I am currently running a load test to see how the DB performs now, and so far it is running better than before. (I have reset all the previous changes except the statistics change.) I will keep you posted after the test finishes.

Questions.
1) Can I further increase the setting to 3000 and see how the system behaves? How do I know the best value for my database, in terms of the sampling limit, with the default statistics setting?
2) Apart from analyzing the tables, do I need to make any other changes with the statistics setting?
3) Also, the current work_mem is set to 4 MB and we haven't touched this value so far. For future needs, can I increase the work_mem setting?

Appreciate your reply.

Thanks


On Thu, Feb 14, 2019 at 3:11 PM github kran <[hidden email]> wrote:


On Thu, Feb 14, 2019 at 12:43 PM Michael Lewis <[hidden email]> wrote:
How many total rows are in these tables? I am assuming these are partitions and those 100 million rows got distributed. If the data difference is significant, then you aren't guaranteed similar performance. You may want to follow more of the suggested steps.



Michael Lewis  |  Software Engineer
Entrata
c: <a href="tel:619-370-8697" value="+18018108697" style="color:rgb(17,85,204)" target="_blank">619.370.8697

Michael - Yes, correct, the data of 100 million rows is distributed across all the partitions.
Finally I feel we have come to a conclusion: after we changed random_page_cost from 4 to 2 in the Non prod instance, we see improvements in the query, which used to take 1 minute and now takes 1-2 seconds.
That's correct, we have around 490 million rows in a few of our partition tables. The partition tables are created every 7 days.

We ran our API test, which hits the backend Aurora RDS PostgreSQL database, and our query response times and requests/sec are better than before. Do you recommend this setting on a Production instance? Right now we are planning to implement this option of random_page_cost = 2. (We know it also works if we turn off hash_join, but we don't want to make that change; we would rather set random_page_cost to 2.)

Questions.
1) What is your recommendation on this? Can we make this change on the Prod instance, which is performing better today, or keep it only on the Non prod instance? (Actually we want to implement this change on the Non prod instance.)
2) Are there any downsides to changing this value?
3) Also, we have 64TB of storage and run on an r4.16xlarge EC2 instance. A few of our partitions are bigger and a few are smaller. We have data from the last 9 months and plan to keep it for close to 1 year, until May. Do you see any concerns with this? Eventually we are thinking to archive this data in the next 2 months by dropping older partitions.
4) What could be the problems of keeping the data longer, given regular maintenance such as VACUUM and other activities?
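If it helps, here is a minimal sketch of how such a change can be tested per session first and then persisted per database rather than cluster-wide; the WHERE clause, column, and database name are placeholders, not from this thread:

   SHOW random_page_cost;                         -- 4.0 is the stock default
   SET random_page_cost = 2;                      -- session-only, safe to experiment
   EXPLAIN (ANALYZE, BUFFERS)
   SELECT * FROM location_data_2018_12_part4
   WHERE asset_id = 42;                           -- hypothetical column/value
   -- once satisfied, persist it for one database instead of editing the cluster default:
   ALTER DATABASE mydb SET random_page_cost = 2;  -- 'mydb' is a placeholder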

 

Reply | Threaded
Open this post in threaded view
|

Re: Postgresql RDS DB Latency Choosing Hash join Plan

github kran
Thanks for the feedback. 

Reply | Threaded
Open this post in threaded view
|

Re: Postgresql RDS DB Latency Choosing Hash join Plan

github kran
In reply to this post by Michael Lewis

Hello Team,

 

Hope everyone is doing great!


Background

We use PostgreSQL version 10.6 and rely on it heavily for our day-to-day reads and writes. We have 2 clusters running the PostgreSQL engine: one cluster keeps data up to 60 days, and the other retains data beyond 1 year. The data is partitioned roughly weekly (~every 5 days per partition), giving around 5 partitions per month per table; we have 2 primary tables, so that is about 10 partition tables per month. Cluster-1 currently has around 20 partitions and cluster-2 around 160 (data from 2018). We also want to keep up to 2 years of data in cluster-2 to serve customer needs, and so far we have reached about 1 year.
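The thread doesn't show the actual DDL, but a ~5-day range partition using PostgreSQL 10 declarative partitioning might look like the sketch below; all column names here are invented:

   CREATE TABLE location_data (
       asset_id   bigint,
       event_time timestamptz NOT NULL,
       payload    jsonb
   ) PARTITION BY RANGE (event_time);

   -- one ~5-day slice; a new one is added each week
   CREATE TABLE location_data_2019_11_part1
       PARTITION OF location_data
       FOR VALUES FROM ('2019-11-01') TO ('2019-11-06');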

 

Current activity

We have a custom weekly migration DB script job that moves data from one cluster to the other. It does the following (a rough sketch follows the list):

1) A COPY command copies the data out of cluster-1 and splits it into binary files

2) The binary data is written into the cluster-2 table

3) Indexes are created after the data is copied
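Roughly, steps 1-3 per binary file could look like the psql sketch below; the file, table, and index names are made up, since the actual script isn't shown here:

   -- run against cluster-1: export one weekly partition in binary format
   \copy location_data_2019_11_part1 TO 'part1.bin' WITH (FORMAT binary)

   -- run against cluster-2: load the same file
   \copy location_data_2019_11_part1 FROM 'part1.bin' WITH (FORMAT binary)

   -- step 3, after all files are loaded: recreate the indexes
   CREATE INDEX location_data_2019_11_part1_asset_idx
       ON location_data_2019_11_part1 (asset_id);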

 

Problem we have right now

Each of the past 2 times the weekly migration activity ran, we saw the cluster's read replica instance restart because it had fallen behind the master (writer) instance. Everything afterwards worked seamlessly, but we want to avoid the replica restarting. To avoid the restarts we started producing smaller binary files and copying those to cluster-2 instead of writing 1 big file of 450 million records. The most recent migration was successful in that the reader instance didn't restart after we split the 1 big file into multiple files, but it did restart after the indexes were created on the new table, as that step can be write intensive.

 

DB parameters set on migration job

work_mem set to 8 GB and maintenance_work_mem = 32 GB.

Indexes per table = 3

total indexes for 2 tables = 5
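One option (an assumption on my part, not something discussed in this thread) is to scope the large memory settings to the migration session instead of raising the cluster defaults; the index and column names below are illustrative:

   BEGIN;
   -- SET LOCAL applies only inside this transaction, so cluster defaults stay untouched
   SET LOCAL maintenance_work_mem = '16GB';
   SET LOCAL work_mem = '1GB';
   CREATE INDEX location_data_2019_11_part1_time_idx
       ON location_data_2019_11_part1 (event_time);
   COMMIT;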

 

DB size

Cluster-2 = 8.6 TB

Cluster-1 = 3.6 TB

Peak table rows = 400-480 million

Average table rows = 300-350 million

Per-table size = 90-95 GB; per-table index size ≈ 45 GB

 

Questions

1) Can we decrease maintenance_work_mem to 16 GB? Would that slow down writes to the cluster so that the reader instance can stay in sync?

2) Based on the above use case, what are your recommendations for keeping the data for up to 2 years?

3) What other recommendations do you have?

 

 

Appreciate your replies.


Thanks
githubkran
Reply | Threaded
Open this post in threaded view
|

PostGreSQL Replication and question on maintenance

github kran
Sorry, changing the subject line.

Reply | Threaded
Open this post in threaded view
|

Re: PostGreSQL Replication and question on maintenance

Pavel Stehule


On Fri, Nov 15, 2019 at 6:26 AM github kran <[hidden email]> wrote:

DB parameters set on migration job

work_mem set to 8 GB and maintenance_work_mem = 32 GB.


These numbers look crazy high - how much memory does your server have? More than 1 TB?

