zabbix on postgresql - very slow delete of events

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

zabbix on postgresql - very slow delete of events

Kristian Ejvind

Hi

 

This will be a rather lengthy post, just to give the full (I hope) picture. We're using Zabbix for monitoring and I'm having problems

understanding why the deletion of rows in the events table is so slow.

 

Zabbix: 4.2 (never mind the name of the db - it is 4.2)

new values per second: ~400

hosts: ~600

items: ~45000

 

OS: CentOS Linux release 7.6.1810 (Core)

Postgresql was installed from the yum repo on postgresql.org

 

zabbix_34=> select version();

                                                 version

---------------------------------------------------------------------------------------------------------

PostgreSQL 10.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36), 64-bit

(1 row)

 

The database is analyzed + vacuumed nightly. The server runs Zabbix and the database, has 16 GB memory, 4 vCPUs (modern hardware).

Some parameters:

 

shared_buffers = 3GB

work_mem = 10MB    (I also tested with work_mem = 128MB - no difference)       

effective_cache_size = 6 GB

effective_io_concurrency = 40

checkpoint_timeout = 5 min (default)

max_wal_size = 1 GB (default)

checkpoint_completion_target = 0.8

 

pg_wal is already on a separate device.

 

events table: ~25 million rows / 2.9 GB

event_recovery table: ~12 million rows / 550 MB

alerts table: ~600000 rows / 530 MB

 

Generally the database is quite snappy and shows no indication of problems. But now I've seen that housekeeping of events is

very slow - a single (normally hourly) run can take more than one day to finish, so events keep stacking up in the table. A typical slow

delete statement, from the postgres log:

 

postgresql-10-20190717-031404.log:2019-07-17 03:37:43 CEST [80965]: [4-1] user=zabbix,db=zabbix_34,app=[unknown],client=[local]: LOG: duration: 27298798.930 ms statement: delete from events where (eventid between 5580621 and 5580681 or eventid between 5580689 and 5580762 or eventid between 5580769 and 5580844 or eventid between 5580851 and 5580867 or eventid between 5580869 and 5580926 or eventid between 5580933 and 5580949 or eventid between 5580963 and 5581024

--- 8< --- a lot of similar eventids snipped away -----

or eventid between 5586799 and 5586839 or eventid in (5581385,5581389,5581561,5581563,5581564,5581580,5 581582,5581584,5581585,5581635))

 

I've analyzed the deletion of a single row in events. First, some table information:

 

 

zabbix_34=> \d events

                                 Table "zabbix.events"

    Column    |          Type           | Collation | Nullable |        Default

--------------+-------------------------+-----------+----------+-----------------------

eventid      | numeric                 |           | not null |

source       | bigint                  |           | not null | '0'::bigint

object       | bigint                  |           | not null | '0'::bigint

objectid     | numeric                 |           | not null | '0'::numeric

clock        | bigint                  |           | not null | '0'::bigint

value        | bigint                  |           | not null | '0'::bigint

acknowledged | bigint                  |           | not null | '0'::bigint

ns           | bigint                  |           | not null | '0'::bigint

name         | character varying(2048) |           | not null | ''::character varying

severity     | integer                 |           | not null | 0

Indexes:

    "idx_29337_primary" PRIMARY KEY, btree (eventid)

    "events_1" btree (source, object, objectid, clock)

    "events_2" btree (source, object, clock)

    "events_clk_3" btree (clock)

Referenced by:

    TABLE "acknowledges" CONSTRAINT "c_acknowledges_2" FOREIGN KEY (eventid) REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE

    TABLE "alerts" CONSTRAINT "c_alerts_2" FOREIGN KEY (eventid) REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE

    TABLE "alerts" CONSTRAINT "c_alerts_5" FOREIGN KEY (p_eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "event_recovery" CONSTRAINT "c_event_recovery_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "event_recovery" CONSTRAINT "c_event_recovery_2" FOREIGN KEY (r_eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "event_recovery" CONSTRAINT "c_event_recovery_3" FOREIGN KEY (c_eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "event_suppress" CONSTRAINT "c_event_suppress_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "event_tag" CONSTRAINT "c_event_tag_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "problem" CONSTRAINT "c_problem_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE

    TABLE "problem" CONSTRAINT "c_problem_2" FOREIGN KEY (r_eventid) REFERENCES events(eventid) ON DELETE CASCADE

 

 

 

zabbix_34=> \d event_recovery

              Table "zabbix.event_recovery"

    Column     |  Type  | Collation | Nullable | Default

---------------+--------+-----------+----------+---------

eventid       | bigint |           | not null |

r_eventid     | bigint |           | not null |

c_eventid     | bigint |           |          |

correlationid | bigint |           |          |

userid        | bigint |           |          |

Indexes:

    "event_recovery_pkey" PRIMARY KEY, btree (eventid)

    "event_recovery_1" btree (r_eventid)

    "event_recovery_2" btree (c_eventid)

Foreign-key constraints:

    "c_event_recovery_1" FOREIGN KEY (eventid) REFERENCES events(eventid) ON DELETE CASCADE

    "c_event_recovery_2" FOREIGN KEY (r_eventid) REFERENCES events(eventid) ON DELETE CASCADE

    "c_event_recovery_3" FOREIGN KEY (c_eventid) REFERENCES events(eventid) ON DELETE CASCADE

 

 

 

zabbix_34=> \d alerts

                                 Table "zabbix.alerts"

    Column     |          Type           | Collation | Nullable |        Default

---------------+-------------------------+-----------+----------+-----------------------

alertid       | numeric                 |           | not null |

actionid      | numeric                 |           | not null |

eventid       | numeric                 |           | not null |

userid        | numeric                 |           |          |

clock         | bigint                  |           | not null | '0'::bigint

mediatypeid   | numeric                 |           |          |

sendto        | character varying(1024) |           | not null | ''::character varying

subject       | character varying(255)  |           | not null | ''::character varying

message       | text                    |           | not null | ''::text

status        | bigint                  |           | not null | '0'::bigint

retries       | bigint                  |           | not null | '0'::bigint

error         | character varying(2048) |           | not null | ''::character varying

esc_step      | bigint                  |           | not null | '0'::bigint

alerttype     | bigint                  |           | not null | '0'::bigint

p_eventid     | bigint                  |           |          |

acknowledgeid | bigint                  |           |          |

Indexes:

    "idx_29120_primary" PRIMARY KEY, btree (alertid)

    "alerts_1" btree (actionid)

    "alerts_2" btree (clock)

    "alerts_3" btree (eventid)

    "alerts_4" btree (status)

    "alerts_5" btree (mediatypeid)

    "alerts_6" btree (userid)

    "alerts_7" btree (p_eventid)

Foreign-key constraints:

    "c_alerts_1" FOREIGN KEY (actionid) REFERENCES actions(actionid) ON UPDATE RESTRICT ON DELETE CASCADE

    "c_alerts_2" FOREIGN KEY (eventid) REFERENCES events(eventid) ON UPDATE RESTRICT ON DELETE CASCADE

    "c_alerts_3" FOREIGN KEY (userid) REFERENCES users(userid) ON UPDATE RESTRICT ON DELETE CASCADE

    "c_alerts_4" FOREIGN KEY (mediatypeid) REFERENCES media_type(mediatypeid) ON UPDATE RESTRICT ON DELETE CASCADE

    "c_alerts_5" FOREIGN KEY (p_eventid) REFERENCES events(eventid) ON DELETE CASCADE

    "c_alerts_6" FOREIGN KEY (acknowledgeid) REFERENCES acknowledges(acknowledgeid) ON DELETE CASCADE

 

 

Let's look at what's in the tables for event 7123123:

 

zabbix_34=> select * from events where eventid=7123123;

eventid | source | object | objectid |   clock    | value | acknowledged |   ns    |                 name                 | severity

---------+--------+--------+----------+------------+-------+--------------+---------+--------------------------------------+----------

7123123 |      3 |      0 |    27562 | 1525264196 |     1 |            0 | 1980875 | Cannot calculate trigger expression. |        0

(1 row)

 

zabbix_34=> select * from event_recovery where eventid=7123123;

eventid | r_eventid | c_eventid | correlationid | userid

---------+-----------+-----------+---------------+--------

7123123 |   7124371 |           |               |

(1 row)

 

zabbix_34=> select * from alerts where eventid=7123123;

alertid | actionid | eventid | userid | clock | mediatypeid | sendto | subject | message | status | retries | error | esc_step | aler

ttype | p_eventid | acknowledgeid

---------+----------+---------+--------+-------+-------------+--------+---------+---------+--------+---------+-------+----------+-----

------+-----------+---------------

(0 rows)

 

 

All these queries execute well below 1 ms, using indexes.

 

Let's delete one row. See explain results here: https://explain.depesz.com/s/aycf . 5 seconds to delete a single row, wow!

This shows that it is the foreign key constraints on event_recovery and alerts that take a lot of time.

But why? I far as I can see, the delete is fully CPU bound during execution.

 

Deleting the corresponding row directly from event_recovery or alerts executes in less than 0.1 ms.  

 

Any ideas?

 

I've observed that alerts and event_recovery tables both have more than one foreign key that references events, if that matters.

 

Regards

Kristian Ejvind

 

 

 

Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

 

Resurs Bank
Ekslingan 8
Box 222 09, SE-25467 Helsingborg

 

Mobil: +46 728571483
Växel: +46 42 38 20 00
E-post: [hidden email]
Webb: www.resursbank.se

 

Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Kenneth Marshall-3
On Tue, Jul 23, 2019 at 08:07:55AM +0000, Kristian Ejvind wrote:

> Hi
>
> This will be a rather lengthy post, just to give the full (I hope) picture. We're using Zabbix for monitoring and I'm having problems
> understanding why the deletion of rows in the events table is so slow.
>
> Zabbix: 4.2 (never mind the name of the db - it is 4.2)
> new values per second: ~400
> hosts: ~600
> items: ~45000
>

Hi Kristian,

Time series databases like Zabbix work poorly with the Housekeeper
service. We had many similar sorts of problems as our Zabbix usage
grew. Once we partitioned the big tables, turned off the Housekeeper,
and cleaned up by dropping partitions instead everything worked much,
much, much better. When we started using partitioning, we used the
old inheiritance style. Now you can use the native partitioning.

Regards,
Ken


Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Kristian Ejvind
Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
and that has been running fine for a year. Performance was vastly improved. But since you
can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.

Regards
Kristian


?On 2019-07-23, 14:58, "Kenneth Marshall" <[hidden email]> wrote:

    Hi Kristian,

    Time series databases like Zabbix work poorly with the Housekeeper
    service. We had many similar sorts of problems as our Zabbix usage
    grew. Once we partitioned the big tables, turned off the Housekeeper,
    and cleaned up by dropping partitions instead everything worked much,
    much, much better. When we started using partitioning, we used the
    old inheiritance style. Now you can use the native partitioning.

    Regards,
    Ken






Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

Ekslingan 8
Box 222 09, SE-25467 Helsingborg

Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: [hidden email]
Webb: http://www.resursbank.se



Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Kenneth Marshall-3
On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote:
> Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
> and that has been running fine for a year. Performance was vastly improved. But since you
> can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.
>
> Regards
> Kristian

Hi Kristian,

Why are you not partitioning the events and alerts tables as well? That
would eliminate this problem and you already have the infrastructure in
place to support the management since you are using it for the history
and trends tables.

Regards,
Ken


Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Kristian Ejvind
Hi.

Well, the events table has both a primary key and foreign keys referencing it, which is not possible
on a partitioned table in postgresql 10. How did you work around this issue?

On the other hand, if we can get the deletion of rows from the events table run at normal speed, I
can't imagine we would have a problem with it in a long time. After all, although our Zabbix installation
definitely is larger than "small", it's still far from "large".

I think I would need assistance with debugging why postgresql behaves like it does.
Is there a defect with deleting data from a table that has multiple foreign keys referencing it from a  certain table?
Is there a problem with the query optimizer that chooses the wrong plan when working on the foreign key constraints?
How do I inspect how the db works on the deletion of rows from the referencing tables?

Regards
Kristian



?On 2019-07-23, 16:33, "Kenneth Marshall" <[hidden email]> wrote:

    On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote:
    > Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
    > and that has been running fine for a year. Performance was vastly improved. But since you
    > can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.
    >
    > Regards
    > Kristian

    Hi Kristian,

    Why are you not partitioning the events and alerts tables as well? That
    would eliminate this problem and you already have the infrastructure in
    place to support the management since you are using it for the history
    and trends tables.

    Regards,
    Ken






Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

Ekslingan 8
Box 222 09, SE-25467 Helsingborg

Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: [hidden email]
Webb: http://www.resursbank.se



Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Maxim Boguk-2
Hi Kristian,

If you look for explain analyze results for delete,
you will see that 99% of time query spent on the foreign key triggers checks.
In the same time the database have indexes on foreign key side in place.


I recommend try this:

\timing on
BEGIN;
delete from zabbix.events where eventid = [some testing id];
select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc;
ABORT;

And provide result of the last query and how long delete runs.
It might help us understand whats going on.

Currently I have 3 ideas:
1)very very slow and overloaded IO subsystem
2)a lot of stuff being delete by ON DELETE CASCADE 
3)some locking prevent foreign key checks run fast



On Wed, Jul 24, 2019 at 11:12 AM Kristian Ejvind <[hidden email]> wrote:
Hi.

Well, the events table has both a primary key and foreign keys referencing it, which is not possible
on a partitioned table in postgresql 10. How did you work around this issue?

On the other hand, if we can get the deletion of rows from the events table run at normal speed, I
can't imagine we would have a problem with it in a long time. After all, although our Zabbix installation
definitely is larger than "small", it's still far from "large".

I think I would need assistance with debugging why postgresql behaves like it does.
Is there a defect with deleting data from a table that has multiple foreign keys referencing it from a  certain table?
Is there a problem with the query optimizer that chooses the wrong plan when working on the foreign key constraints?
How do I inspect how the db works on the deletion of rows from the referencing tables?

Regards
Kristian



?On 2019-07-23, 16:33, "Kenneth Marshall" <[hidden email]> wrote:

    On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote:
    > Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
    > and that has been running fine for a year. Performance was vastly improved. But since you
    > can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.
    >
    > Regards
    > Kristian

    Hi Kristian,

    Why are you not partitioning the events and alerts tables as well? That
    would eliminate this problem and you already have the infrastructure in
    place to support the management since you are using it for the history
    and trends tables.

    Regards,
    Ken






Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

Ekslingan 8
Box 222 09, SE-25467 Helsingborg

Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: [hidden email]
Webb: http://www.resursbank.se





--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Maxim Boguk-2
In reply to this post by Kristian Ejvind

 

All these queries execute well below 1 ms, using indexes.

 

Let's delete one row. See explain results here: https://explain.depesz.com/s/aycf . 5 seconds to delete a single row, wow!

This shows that it is the foreign key constraints on event_recovery and alerts that take a lot of time.

But why? I far as I can see, the delete is fully CPU bound during execution.

 

Deleting the corresponding row directly from event_recovery or alerts executes in less than 0.1 ms.  

 

Any ideas?

 

I've observed that alerts and event_recovery tables both have more than one foreign key that references events, if that matters.



Hi Kristian,

After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.
Why type of events.eventid had been changed from default bigint to numeric?

I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.
Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).

Kind Regards,
Maxim

 
--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"


image8741bc.PNG (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Kristian Ejvind
In reply to this post by Maxim Boguk-2

Hi Maxim

 

Thanks for your advice, and let me start with your second email, which I'll copy here:

 

=====

Hi Kristian,

 

After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.

Why type of events.eventid had been changed from default bigint to numeric?

 

I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.

Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).

 

Kind Regards,

Maxim

=====

 

Well spotted! On closer examination it seems that data types are wrong in several places. I suspect that this comes

from the time when our Zabbix ran on a MySQL database, which was converted over to PostgreSQL a few years

ago. I agree this discrepancy is suspicious and I will continue to examine it.

 

Regarding your ideas in the email below, I can say that 1) is not valid, disk latency is in the range of a few ms.

This is the output from your recommended query, which seems to verify your suspicions.  

 

zabbix_34=# begin; delete from zabbix.events where eventid = 7123123; select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc; rollback;

Time: 0.113 ms

Time: 4798.189 ms (00:04.798)

relid  | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd

--------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------

  41940 | zabbix     | event_recovery |        3 |     35495224 |        0 |             0 |         0 |         0 |         1 |             0

  41675 | zabbix     | alerts         |        1 |       544966 |        1 |             0 |         0 |         0 |         0 |             0

  42573 | zabbix     | problem        |        2 |        13896 |        0 |             0 |         0 |         0 |         0 |             0

  41943 | zabbix     | event_tag      |        1 |        22004 |        0 |             0 |         0 |         0 |         0 |             0

  41649 | zabbix     | acknowledges   |        1 |           47 |        0 |             0 |         0 |         0 |         0 |             0

  41951 | zabbix     | events         |        0 |            0 |        1 |             1 |         0 |         0 |         1 |             0

260215 | zabbix     | event_suppress |        1 |            0 |        0 |             0 |         0 |         0 |         0 |             0

(7 rows)

 

Time: 2.857 ms

Time: 0.162 ms

 

Regards

Kristian

 

 

 

 

 

Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

 

Resurs Bank
Ekslingan 8
Box 222 09, SE-25467 Helsingborg

 

Mobil: +46 728571483
Växel: +46 42 38 20 00
E-post: [hidden email]
Webb: www.resursbank.se

 

From: Maxim Boguk <[hidden email]>
Date: Wednesday, 24 July 2019 at 15:55
To: Kristian Ejvind <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: zabbix on postgresql - very slow delete of events

 

Hi Kristian,

 

If you look for explain analyze results for delete,

you will see that 99% of time query spent on the foreign key triggers checks.

In the same time the database have indexes on foreign key side in place.

 

 

I recommend try this:

 

\timing on

BEGIN;

delete from zabbix.events where eventid = [some testing id];

select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc;

ABORT;

 

And provide result of the last query and how long delete runs.

It might help us understand whats going on.

 

Currently I have 3 ideas:

1)very very slow and overloaded IO subsystem

2)a lot of stuff being delete by ON DELETE CASCADE 

3)some locking prevent foreign key checks run fast

 

 

 

On Wed, Jul 24, 2019 at 11:12 AM Kristian Ejvind <[hidden email]> wrote:

Hi.

Well, the events table has both a primary key and foreign keys referencing it, which is not possible
on a partitioned table in postgresql 10. How did you work around this issue?

On the other hand, if we can get the deletion of rows from the events table run at normal speed, I
can't imagine we would have a problem with it in a long time. After all, although our Zabbix installation
definitely is larger than "small", it's still far from "large".

I think I would need assistance with debugging why postgresql behaves like it does.
Is there a defect with deleting data from a table that has multiple foreign keys referencing it from a  certain table?
Is there a problem with the query optimizer that chooses the wrong plan when working on the foreign key constraints?
How do I inspect how the db works on the deletion of rows from the referencing tables?

Regards
Kristian



?On 2019-07-23, 16:33, "Kenneth Marshall" <[hidden email]> wrote:

    On Tue, Jul 23, 2019 at 01:41:53PM +0000, Kristian Ejvind wrote:
    > Thanks Kenneth. In fact we've already partitioned the largest history* and trends* tables
    > and that has been running fine for a year. Performance was vastly improved. But since you
    > can't have a unique index on a partitioned table in postgres 10, we haven't worked on that.
    >
    > Regards
    > Kristian

    Hi Kristian,

    Why are you not partitioning the events and alerts tables as well? That
    would eliminate this problem and you already have the infrastructure in
    place to support the management since you are using it for the history
    and trends tables.

    Regards,
    Ken






Resurs Bank AB
Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

Ekslingan 8
Box 222 09, SE-25467 Helsingborg

Direkt Tfn:
Mobil: +46 728571483
Vxl: +46 42 382000
Fax:
E-post: [hidden email]
Webb: http://www.resursbank.se



 

--

Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"??????, ?? ??? ?????????? ??? ?? ??????, ?? ?????? ??? ??-???????? ?????? ????? ? ??? ????? ??? ????"

Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Maxim Boguk-2


On Wed, Jul 24, 2019 at 6:12 PM Kristian Ejvind <[hidden email]> wrote:

Hi Maxim

 

Thanks for your advice, and let me start with your second email, which I'll copy here:

 

=====

Hi Kristian,

 

After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.

Why type of events.eventid had been changed from default bigint to numeric?

 

I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.

Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).

 

Kind Regards,

Maxim

=====

 

Well spotted! On closer examination it seems that data types are wrong in several places. I suspect that this comes

from the time when our Zabbix ran on a MySQL database, which was converted over to PostgreSQL a few years

ago. I agree this discrepancy is suspicious and I will continue to examine it.

 

Regarding your ideas in the email below, I can say that 1) is not valid, disk latency is in the range of a few ms.

This is the output from your recommended query, which seems to verify your suspicions.  

 

zabbix_34=# begin; delete from zabbix.events where eventid = 7123123; select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc; rollback;

Time: 0.113 ms

Time: 4798.189 ms (00:04.798)

relid  | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd

--------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------

  41940 | zabbix     | event_recovery |        3 |     35495224 |        0 |             0 |         0 |         0 |         1 |             0

  41675 | zabbix     | alerts         |        1 |       544966 |        1 |             0 |         0 |         0 |         0 |             0

  42573 | zabbix     | problem        |        2 |        13896 |        0 |             0 |         0 |         0 |         0 |             0

  41943 | zabbix     | event_tag      |        1 |        22004 |        0 |             0 |         0 |         0 |         0 |             0

  41649 | zabbix     | acknowledges   |        1 |           47 |        0 |             0 |         0 |         0 |         0 |             0

  41951 | zabbix     | events         |        0 |            0 |        1 |             1 |         0 |         0 |         1 |             0

260215 | zabbix     | event_suppress |        1 |            0 |        0 |             0 |         0 |         0 |         0 |             0


Hi Kristian,

This result definitely proves that indexes not used during foreign key checks (see that non-zero seq_scan counters for linked tables).
Only possible reason (IMHO) that wrong usage numeric in place of bigint.
I recommend change types of events.eventid (and any other similar fields) to bigint.
It should resolve your performance issues with deletes on events table (as additional bonus - bigint a lot faster and compact type than numeric).

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Reply | Threaded
Open this post in threaded view
|

Re: zabbix on postgresql - very slow delete of events

Kristian Ejvind

Hi.

 

Just a short message, confirming that after we've altered the tables to have matching

types, deletes now take 1 ms, instead of 5 sec. Indexes are being used now.

 

Thanks for assistance.

 

Regards

Kristian

 

ps. would be nice with some warnings or indications in analyze output when this happens.

 

 

 

Kristian Ejvind
Linux System Administrator
IT Operations | Technical Operations

 

Resurs Bank
Ekslingan 8
Box 222 09, SE-25467 Helsingborg

 

Mobil: +46 728571483
Växel: +46 42 38 20 00
E-post: [hidden email]
Webb: www.resursbank.se

 

From: Maxim Boguk <[hidden email]>
Date: Wednesday, 24 July 2019 at 19:17
To: Kristian Ejvind <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: zabbix on postgresql - very slow delete of events

 

 

 

On Wed, Jul 24, 2019 at 6:12 PM Kristian Ejvind <[hidden email]> wrote:

Hi Maxim

 

Thanks for your advice, and let me start with your second email, which I'll copy here:

 

=====

Hi Kristian,

 

After comparing structure of zabbix tables with same in my zabbix installation I found one very weird difference.

Why type of events.eventid had been changed from default bigint to numeric?

 

I suspect that the difference between events.eventid (numeric) type and event_recovery.*_eventid (bigint) types might lead to inability of use index during foreign key checks.

Anyway it will be clearly visible on the pg_stat_xact_user_tables results (I now expect to see 3 sequential scan on event_recovery and may be on some other tables as well).

 

Kind Regards,

Maxim

=====

 

Well spotted! On closer examination it seems that data types are wrong in several places. I suspect that this comes

from the time when our Zabbix ran on a MySQL database, which was converted over to PostgreSQL a few years

ago. I agree this discrepancy is suspicious and I will continue to examine it.

 

Regarding your ideas in the email below, I can say that 1) is not valid, disk latency is in the range of a few ms.

This is the output from your recommended query, which seems to verify your suspicions.  

 

zabbix_34=# begin; delete from zabbix.events where eventid = 7123123; select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc; rollback;

Time: 0.113 ms

Time: 4798.189 ms (00:04.798)

relid  | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd

--------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------

  41940 | zabbix     | event_recovery |        3 |     35495224 |        0 |             0 |         0 |         0 |         1 |             0

  41675 | zabbix     | alerts         |        1 |       544966 |        1 |             0 |         0 |         0 |         0 |             0

  42573 | zabbix     | problem        |        2 |        13896 |        0 |             0 |         0 |         0 |         0 |             0

  41943 | zabbix     | event_tag      |        1 |        22004 |        0 |             0 |         0 |         0 |         0 |             0

  41649 | zabbix     | acknowledges   |        1 |           47 |        0 |             0 |         0 |         0 |         0 |             0

  41951 | zabbix     | events         |        0 |            0 |        1 |             1 |         0 |         0 |         1 |             0

260215 | zabbix     | event_suppress |        1 |            0 |        0 |             0 |         0 |         0 |         0 |             0

 

Hi Kristian,

 

This result definitely proves that indexes not used during foreign key checks (see that non-zero seq_scan counters for linked tables).

Only possible reason (IMHO) that wrong usage numeric in place of bigint.

I recommend change types of events.eventid (and any other similar fields) to bigint.

It should resolve your performance issues with deletes on events table (as additional bonus - bigint a lot faster and compact type than numeric).

 

-- 

Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"??????, ?? ??? ?????????? ??? ?? ??????, ?? ?????? ??? ??-???????? ?????? ????? ? ??? ????? ??? ????"