[Proposal] Add accumulated statistics for wait event

classic Classic list List threaded Threaded
39 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [Proposal] Add accumulated statistics for wait event

legrand legrand
Bertrand DROUVOT wrote

> Hello Guys,
>
> As you mentioned Oracle like active session history sampling in this
> thread, I just want to let you know that I am working on a brand new
> extension to provide this feature.
>
> You can find the extension here: https://github.com/pgsentinel/pgsentinel
>
> [...]
>
> If you want to have a look, give your thoughts, you are welcome.
>
> Bertrand


+1!

just a regret: ash sampling interval can not be smaller than a second ;o(

Cordialement
PAscal




--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics for wait event

Yotsunaga, Naoki
In reply to this post by Yotsunaga, Naoki

On Thu, Oct 4, 2018 at 8:22 PM, Yotsunaga Naoki wrote:

 

Hi, I understood and thought of your statistic comment once again. In the case of sampling, is there enough statistic to investigate?

In the case of a long SQL, I think that it is possible to obtain a sufficient sampling number.

 

However, in the case of about 1 minute of SQL, only 60 samples can be obtained at most.

#Because legards comment.

https://www.postgresql.org/message-id/1539158356795-0.post%40n3.nabble.com

 

Does this sampling number of 60 give information that I really want?

Perhaps it is not to miss the real problem part?

---------------------------------------

Naoki, Yotsunaga.

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics for wait event

legrand legrand
Hello,
You are right, sampling has to be "tuned" regarding the event(s) you want to
catch.

Sampling of 1 second interval is good with treatments that take hours, and
not enough for a minute or a second analysis.

May I invite you to try it, using PASH-viewer (github) with pgsentinel
(github).

Changing pgsentiel.c sampling from 1 second

rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
ash_sampling_period * 1000L,PG_WAIT_EXTENSION);

to 1/10 second
rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
ash_sampling_period * 100L,PG_WAIT_EXTENSION);

seems the good balance for me (for analysis periods from a few seconds to
minutes).

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics for wait event

Phil Florent
In reply to this post by Yotsunaga, Naoki
Hi,

Is DBA really able to solve bottlenecks with sampling?

What I would like to say is that if we have information on the number of wait events and the wait time(like other DB), we can investigate more easily.


Yes you will be able to solve bottlenecks with sampling. In interactive mode, a 1s interval is probably too large. I use 0s1 - 0s01 with my tool and it is normally OK. In batch mode I use 1s=>10s. If you want to visualize the results it's easy to use a dedicated tool and bottlenecks will clearly appear .
Since grafana is now able to connect directly to a postgresql source, I use it to display the information collected from pg_stat_activity and psutil ( e.g https://pgphil.ovh/traqueur_dashboard_02.php page is written in french but panels are in english)

Other DB have accumulated statistics but you can notice that sampling is also their most modern method.
E.g Oracle DB : 20 years ago you already had tools like "utlbstat/utlestat" . Then you had "statspack". Those tools were based on accumulated statistics and the reports were based on differences between 2 points. It was useful to solve major problems but it was limited and not precise enough in many cases.

The preferred feature to identify bottlenecks in the Oracle world is now ASH (active session history). It can help with major problems, specific problems AND it can identify short blockages.
Too bad it is licensed as an option of their Enterprise Edition but similar tools exist and they are also based on sampling of the activity.

With the "official" ASH, sampling and archiving are done internally and you have a circular memory zone dedicated to the feature. Hence the overhead is lower but that's all.

The most advanced interactive tool is called "snapper" and it is also based on sampling.

Best regards
Phil



De : Yotsunaga, Naoki <[hidden email]>
Envoyé : lundi 29 octobre 2018 02:20
À : 'Phil Florent'; 'Michael Paquier'
Cc : 'Tomas Vondra'; '[hidden email]'
Objet : RE: [Proposal] Add accumulated statistics for wait event
 

On Thu, Oct 4, 2018 at 8:22 PM, Yotsunaga Naoki wrote:

 

Hi, I understood and thought of your statistic comment once again. In the case of sampling, is there enough statistic to investigate?

In the case of a long SQL, I think that it is possible to obtain a sufficient sampling number.

 

However, in the case of about 1 minute of SQL, only 60 samples can be obtained at most.

#Because legards comment.

https://www.postgresql.org/message-id/1539158356795-0.post%40n3.nabble.com

 

Does this sampling number of 60 give information that I really want?

Perhaps it is not to miss the real problem part?

---------------------------------------

Naoki, Yotsunaga.

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics for wait event

Yotsunaga, Naoki

On Mon, Oct 29, 2018 at 1:52 AM, Phil Florent wrote:

 

Hi, thank you for comments.

 

>Yes you will be able to solve bottlenecks with sampling. In interactive mode, a 1s interval is probably too large. I use 0s1 - 0s01 with my tool and it is normally OK.

 

With the tool you are using, can you sample at intervals shorter than 1 second?

If you can, you can get enough sampling number and you can also acquire short events.

 

>Since grafana is now able to connect directly to a postgresql source, I use it to display the information collected from pg_stat_activity and psutil ( e.g https://pgphil.ovh/traqueur_dashboard_02.php page is written in french but panels are in english)

 

It is wonderful to visualize.

Especially for beginners like me.

 

 

>Other DB have accumulated statistics but you can notice that sampling is also their most modern method.

>E.g Oracle DB : 20 years ago you already had tools like "utlbstat/utlestat" . Then you had "statspack". Those tools were based on accumulated statistics and the reports were based on differences between 2 points. It was useful to solve major problems but it was limited and not precise enough in many cases.

 

>The preferred feature to identify bottlenecks in the Oracle world is now ASH (active session history). It can help with major problems, specific problems AND it can identify short blockages.

>Too bad it is licensed as an option of their Enterprise Edition but similar tools exist and they are also based on sampling of the activity.

 

>With the "official" ASH, sampling and archiving are done internally and you have a circular memory zone dedicated to the feature. Hence the overhead is lower but that's all.

 

>The most advanced interactive tool is called "snapper" and it is also based on sampling.

 

Thanks. I will check it.

 

The current bottleneck survey method, from sampling, I can know the number (ratio) of waiting events.

Then, investigate from those with a high number of times (ratio).

Do you agree with this recognition?

 

---------------------------------------

Naoki, Yotsunaga.

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Yotsunaga, Naoki

On Sat, Nov 3, 2018 at 1:28 AM, Phil Florent wrote:

 

>2) it consumes system resources

While the system is running, you are always sampling system information, do not you? Like Oracle ASH.

If so, does sampling have no significant impact on performance? Even if the interval is 0.01 s or more.

 

>The main interest of sampling is to discard negligible activity to allow the DBA to work on meaningful queries and events.

In other words, do you mean narrowing down candidate of problems?

 

---------------------------

Naoki Yotsunaga

 

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Yotsunaga, Naoki

On Mon, Nov 5, 2018 at 4:26 PM, Naoki Yotsunaga wrote:

 

>>2) it consumes system resources

>While the system is running, you are always sampling system information, do not you? Like Oracle ASH.

 

I don’t understand well how sampling is used.

In which scene do you use the sampling? Or is it both scenes? Or is it a different scene?

A)  Perform sampling in order to obtain information on the entire DB while DB is in operation.

B)  I know roughly the process to be late. When investigating the reason for delaying the processing, sampling is performed on the processing.

 

---------------------------

Naoki Yotsunaga

 

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics for wait event

Yotsunaga, Naoki
In reply to this post by legrand legrand
On Sun, Oct 28, 2018 at 6:39 PM, legrand legrand wrote:

Hi, Thanks for comments.
I had overlooked the reply from you.

>You are right, sampling has to be "tuned" regarding the event(s) you want to catch.
 I see. For tuning, you need to know the length of processing you want to sample?
 
> May I invite you to try it, using PASH-viewer (github) with pgsentinel (github).
 I'll check and try it.

------------------
Naoki Yotsunaga



Reply | Threaded
Open this post in threaded view
|

Re: [Proposal] Add accumulated statistics

Bruce Momjian
In reply to this post by Yotsunaga, Naoki
On Tue, Nov  6, 2018 at 04:26:03AM +0000, Yotsunaga, Naoki wrote:

> On Sat, Nov 3, 2018 at 1:28 AM, Phil Florent wrote:
>
>  
>
> >2) it consumes system resources
>
> While the system is running, you are always sampling system information, do not
> you? Like Oracle ASH.
>
> If so, does sampling have no significant impact on performance? Even if the
> interval is 0.01 s or more.

I am replying late, but one of the reasons that sampling is used is that
decreasing the sampling interval increases to overhead of the sampling
process, but doesn't affect the running backends.

--
  Bruce Momjian  <[hidden email]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Yotsunaga, Naoki
On Wed, Nov 21, 2018 at 9:27 PM, Bruce Momjian wrote:

Hi, thank you for the information.
I understood that sampling is effective for investigation of waiting events.

By the way, you can see the number of wait events with "LWLOCK_STATS", right?
Is this function implemented because it is necessary to know the number of waiting events for investigation?
If so, is not that the number of wait events is useful information?
Now, I need to rebuild to get this information and I feel inconvenience.

So, how about checking the number of wait events in the view?
Also, I think that it will be useful if you know the waiting time.
I think that it is easy to investigate when it is clearly known how long waiting time is occupied with processing time.

--
Naoki Yotsunaga


Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Tsunakawa, Takayuki
From: Yotsunaga, Naoki [mailto:[hidden email]]

> By the way, you can see the number of wait events with "LWLOCK_STATS", right?
> Is this function implemented because it is necessary to know the number
> of waiting events for investigation?
> If so, is not that the number of wait events is useful information?
> Now, I need to rebuild to get this information and I feel inconvenience.
>
> So, how about checking the number of wait events in the view?
> Also, I think that it will be useful if you know the waiting time.
> I think that it is easy to investigate when it is clearly known how long
> waiting time is occupied with processing time.

That's interesting.  It should be convenient for PG developers to be able to see wait events of each session and the whole instance to improve PostgreSQL.  I developed a flight recorder feature for a proprietary database (not based on PostgreSQL), which recorded various events, including waits, in the ring buffer for each session.  It could dump the ring buffers into external files in CSV format, so that we can load them into a database or a spreadsheet to analyze the time model of SQL execution.  That helped us a lot to eliminate many bottlenecks.  MySQL says a similar thing about its Performance Schema:

https://dev.mysql.com/doc/refman/8.0/en/performance-schema-examples.html

"Tune the code (this applies to storage engine or server developers only)."



Regards
Takayuki Tsunakawa



Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Tsunakawa, Takayuki
In reply to this post by Yotsunaga, Naoki
Hi all,

I think sampling like Oracle ASH should work for the DBA to find probable bottlenecks in many cases (, so I hope PostgreSQL will incorporate it...)  On the other hand, it seems to have the following disadvantages, some of which others have already pointed out:

1. Doesn't provide precise data
Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints.  This might make it difficult or impossible to solve transient performance problems, such as infrequent 100 millisecond response times while the normal response time is a few milliseconds.
The proposed wait event collection doesn't miss anything.

2. Overuses resources
We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems.  However, the sampled data of active sessions become voluminous in memory and storage.  It would take longer to analyze those samples.  Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter and walwriter tries to avoid by hibernation.
The proposed wait event collection just records what actually happened.  No waste.  Would it use many resources if waits happen frequently?  That leads to our motivation to reduce waits.

3. Cannot determine the impact or illness of waits just by sampling or counting without time
As the following MySQL and Oracle manual articles describe, precise measurement of wait count and time helps to judge the impact and justify the remedy.  They can measure the whole SQL execution and its various processing steps (parse, plan, sort, etc.) as well as waits, so that the most significant areas can be determined.
Also, sampling cannot tell if a single wait took long or the same waits occurred repeatedly in succession (at least easily.)  Do the sampled waits indicate an abnormal I/O (which took 2 ms while the normal time is 50 us)?


[MySQL]

Chapter 26 MySQL Performance Schema
https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html
--------------------------------------------------
The Performance Schema monitors server events. An “event” is anything the server does that takes time and has been instrumented so that timing information can be collected. In general, an event could be a function call, a wait for the operating system, a stage of an SQL statement execution such as parsing or sorting, or an entire statement or group of statements. Event collection provides access to information about synchronization calls (such as for mutexes) file and table I/O, table locks, and so forth for the server and for several storage engines.

Current events are available, as well as event histories and summaries. This enables you to determine how many times instrumented activities were performed and how much time they took. Event information is available to show the activities of specific threads, or activity associated with particular objects such as a mutex or file.
--------------------------------------------------


[Oracle]

https://docs.oracle.com/en/database/oracle/oracle-database/18/tdppt/automatic-database-performance-monitoring.html#GUID-32E92AEC-AF1A-4602-B998-3250920CD3BE
--------------------------------------------------
The goal of database performance tuning is to reduce the DB time of the system for a given workload. By reducing DB time, the database can support more user requests by using the same or fewer resources. ADDM reports system resources that are using a significant portion of DB time as problem areas and sorts them in descending order by the amount of related DB time spent.
--------------------------------------------------


Instance Tuning Using Performance Views
https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/instance-tuning-using-performance-views.html#GUID-07982549-507F-4465-8843-7F753BCF8F99
--------------------------------------------------
Wait event statistics include the number of times an event was waited for and the time waited for the event to complete. If the initialization parameter TIMED_STATISTICS is set to true, then you can also see how long each resource was waited for.
To minimize user response time, reduce the time spent by server processes waiting for event completion. Not all wait events have the same wait time. Therefore, it is more important to examine events with the most total time waited rather than wait events with a high number of occurrences. Usually, it is best to set the dynamic parameter TIMED_STATISTICS to true at least while monitoring performance.
--------------------------------------------------


https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F
5.2.3 Using Wait Events without Timed Statistics
--------------------------------------------------
If TIMED_STATISTICS is set to FALSE, then the amount of time spent waiting for an event is not available. Therefore, it is only possible to order wait events by the number of times each event was waited for. Although the events with the largest number of waits might indicate a potential bottleneck, they might not be the main bottleneck. This situation can happen when an event is waited for a large number of times, but the total time waited for that event is small. Conversely, an event with fewer waits might be a bigger bottleneck if the wait time accounts for a significant proportion of the total wait time. Without the wait times to use for comparison, it is difficult to determine whether a wait event is worth investigating.
--------------------------------------------------


10.2.2 Using Wait Event Statistics to Drill Down to Bottlenecks
--------------------------------------------------
The most effective way to use wait event data is to order the events by the wait time. This is only possible if TIMED_STATISTICS is set to true. Otherwise, the wait events can only be ranked by the number of times waited, which is often not the ordering that best represents the problem.

To get an indication of where time is spent, follow these steps:

1. Examine the data collection for V$SYSTEM_EVENT. The events of interest should be ranked by wait time.
Identify the wait events that have the most significant percentage of wait time. ...
Alternatively, look at the Top 5 Timed Events section at the beginning of the Automatic Workload Repository report. This section automatically orders the wait events (omitting idle events), and calculates the relative percentage:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                % Total
Event                                         Waits    Time (s) Call Time
-------------------------------------- ------------ ----------- ---------
CPU time                                                    559     88.80
log file parallel write                       2,181          28      4.42
SQL*Net more data from client               516,611          27      4.24
db file parallel write                       13,383          13      2.04
db file sequential read                         563           2       .27

2. Look at the number of waits for these events, and the average wait time. For example, for I/O related events, the average time might help identify whether the I/O system is slow. The following example of this data is taken from the Wait Event section of the AWR report:

                                                             Avg
                                                Total Wait   wait     Waits
Event                           Waits  Timeouts   Time (s)   (ms)      /txn
--------------------------- --------- --------- ---------- ------ ---------
log file parallel write         2,181         0         28     13      41.2
SQL*Net more data from clie   516,611         0         27      0   9,747.4
db file parallel write         13,383         0         13      1     252.5

3. The top wait events identify the next places to investigate. A table of common wait events is listed in Table 10-1. It is usually a good idea to also have quick look at high-load SQL.

4. Examine the related data indicated by the wait events to see what other information this data provides. Determine whether this information is consistent with the wait event data. In most situations, there is enough data to begin developing a theory about the potential causes of the performance bottleneck.

5. To determine whether this theory is valid, cross-check data you have examined with other statistics available for consistency. The appropriate statistics vary depending on the problem, but usually include load profile-related data in V$SYSSTAT, operating system statistics, and so on. Perform cross-checks with other data to confirm or refute the developing theory.
--------------------------------------------------



So, why don't we have the proposed wait event count/time data?  I hope we can nurture this to become a database profiling tool like MySQL and Oracle.  This is the first step.  I think it would be useful to have both sampling and precise statistics.  Oracle has both, and MySQL has the latter (I don't know why MySQL doesn't provide sampling, because the Performance Schema should probably have been developed after Oracle's ASH.)

What would make us conservative about doing this?  Skimming the old thread, the remaining concern is the timer overhead.  As the following article suggests, some lightweight timers seem to be available.  We can turn the timing off by default if they aren't light enough.


Performance Schema Timers
https://dev.mysql.com/doc/refman/8.0/en/performance-schema-timing.html


Regards
Takayuki Tsunakawa



Reply | Threaded
Open this post in threaded view
|

Re: [Proposal] Add accumulated statistics

Adrien Nayrat-2
On 1/7/19 6:34 AM, Tsunakawa, Takayuki wrote:
> 1. Doesn't provide precise data
> Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints.  This might make it difficult or impossible to solve transient performance problems, such as infrequent 100 millisecond response times while the normal response time is a few milliseconds.
> The proposed wait event collection doesn't miss anything.
>
> 2. Overuses resources
> We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems.  However, the sampled data of active sessions become voluminous in memory and storage.  It would take longer to analyze those samples.  Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter and walwriter tries to avoid by hibernation.
> The proposed wait event collection just records what actually happened.  No waste.  Would it use many resources if waits happen frequently?  That leads to our motivation to reduce waits.

FIY, wait events have been added in PoWA by using pg_wait_sampling  
extension :  
https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-powa.html

pg_wait_sampling sample the wait events in shared memory and PoWA store  
them.

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Tsunakawa, Takayuki
From: Adrien NAYRAT [mailto:[hidden email]]
> FIY, wait events have been added in PoWA by using pg_wait_sampling
> extension :
> https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-
> powa.html
>
> pg_wait_sampling sample the wait events in shared memory and PoWA store
> them.

Great.  Also FYI, Amazon RDS/Aurora Performance Insights already provides a load profiling feature based on the wait events collected from pg_stat_activity, which samples once a second.  An intuitive 5 minute video for introduction is here:

Using Amazon RDS Performance Insights
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html

Although I cannot see the graphics as I'm almost blind, the explanation  sounds like it mimics Oracle Enterprise Manager.  Cool.

Aren't you thinking of incorporating your work into PostgreSQL as a contrib like pg_stat_statements?


Regards
Takayuki Tsunakawa




Reply | Threaded
Open this post in threaded view
|

Re: [Proposal] Add accumulated statistics

Robert Haas
In reply to this post by Yotsunaga, Naoki
On Thu, Dec 20, 2018 at 8:48 PM Yotsunaga, Naoki
<[hidden email]> wrote:
> If so, is not that the number of wait events is useful information?

My theory is that the number of wait events is NOT useful information,
or at least not nearly as useful the results of a sampling approach.
The data that LWLOCK_STATS produce are downright misleading -- they
lead you to think that the bottlenecks are in different places than
they really are, because the locks that produce the most waiting can
be 5th or 10th in terms of the number of wait events.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Yotsunaga, Naoki
On Thu, Jan 10, 2019 at 8:42 PM, Robert Hass wrote:

Thanks for comments.

>or at least not nearly as useful the results of a sampling approach.
I agree with your opinion.
Because it can't be asserted that the wait event is a bottleneck just because the number of wait event is large.
The same thing is mentioned in Oracle.
It also suggests that it is important to acquire waiting time for that.
----
https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F
5.2.3 Using Wait Events without Timed Statistics
----

>The data that LWLOCK_STATS produce are downright misleading
Is that so?
I do not know the need for this function.

---
Naoki Yotsunaga
Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Tsunakawa, Takayuki
In reply to this post by Robert Haas
From: Robert Haas [mailto:[hidden email]]
> My theory is that the number of wait events is NOT useful information,
> or at least not nearly as useful the results of a sampling approach.
> The data that LWLOCK_STATS produce are downright misleading -- they
> lead you to think that the bottlenecks are in different places than
> they really are, because the locks that produce the most waiting can
> be 5th or 10th in terms of the number of wait events.

I understood you're saying that the number of waits alone does not necessarily indicate the bottleneck, because a wait with fewer counts but longer time can take a large portion of the entire SQL execution time.  So, wait time is also useful.  I think that's why Oracle describes and MySQL provides precise count and time without sampling.

Haven't LOCK_STATS been helpful for PG developers?  IIRC, it was used to pinpoint the bottleneck and evaluate the patch to improve shared buffers, WAL buffers, ProcArray, etc.


Regards
Takayuki Tsunakawa




Reply | Threaded
Open this post in threaded view
|

Re: [Proposal] Add accumulated statistics

Pavel Stehule


pá 11. 1. 2019 v 2:10 odesílatel Tsunakawa, Takayuki <[hidden email]> napsal:
From: Robert Haas [mailto:[hidden email]]
> My theory is that the number of wait events is NOT useful information,
> or at least not nearly as useful the results of a sampling approach.
> The data that LWLOCK_STATS produce are downright misleading -- they
> lead you to think that the bottlenecks are in different places than
> they really are, because the locks that produce the most waiting can
> be 5th or 10th in terms of the number of wait events.

I understood you're saying that the number of waits alone does not necessarily indicate the bottleneck, because a wait with fewer counts but longer time can take a large portion of the entire SQL execution time.  So, wait time is also useful.  I think that's why Oracle describes and MySQL provides precise count and time without sampling.

the cumulated lock statistics maybe doesn't help with debugging - but it is very good indicator of database (in production usage) health.

Regards

Pavel



Reply | Threaded
Open this post in threaded view
|

RE: [Proposal] Add accumulated statistics

Tsunakawa, Takayuki
From: Pavel Stehule [mailto:[hidden email]]
> the cumulated lock statistics maybe doesn't help with debugging - but it
> is very good indicator of database (in production usage) health.

I think it will help both.  But I don't think the sampling won't be as helpful as the precise lock statistics accumulation, because the sampling doesn't give us exactly how effective our improvements to PostgreSQL code are.  I remember PG developers used LOCK_STATS to see how many (or ratio of) lwlock waits decreased by applying patches.


We can use the cumulated lock stats like:

1. SELECT * FROM pg_session_waits;
2. Run a benchmark.
3. SELECT * FROM pg_session_waits;
4. Calculate the difference between 1 and 3.

Or, reset the wait stats before the benchmark run and just use the stats as-is.

I'd like to know why you thought the cumulated wait stats isn't helpful for debugging.


Regards
Takayuki Tsunakawa


12