shared-memory based stats collector

Re: shared-memory based stats collector

Andres Freund
Hi,

Thomas, could you look at the first two patches here, and my review
questions?


General comments about this series:
- A lot of the review comments feel like I've written them before, a
  year or more ago. I feel this patch ought to be in a much better
  state. There's a lot of IMO fairly obvious stuff here, and things that
  have been mentioned multiple times previously.
- There's a *lot* of typos in here. I realize being an ESL speaker is hard, but
  a lot of these can be found with the simplest spellchecker.  That's
  one thing for a patch that just has been hacked up as a POC, but this
  is a multi year thread?
- There's some odd formatting. Consider using pgindent more regularly.

More detailed comments below.

I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.


On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <[hidden email]>
> Date: Fri, 29 Jun 2018 16:41:04 +0900
> Subject: [PATCH 1/5] sequential scan for dshash
>
> Add sequential scan feature to dshash.


>   dsa_pointer item_pointer = hash_table->buckets[i];
> @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
>   LW_EXCLUSIVE));
>
>   delete_item(hash_table, item);
> - hash_table->find_locked = false;
> - hash_table->find_exclusively_locked = false;
> - LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +
> + /* We need to keep partition lock while sequential scan */
> + if (!hash_table->seqscan_running)
> + {
> + hash_table->find_locked = false;
> + hash_table->find_exclusively_locked = false;
> + LWLockRelease(PARTITION_LOCK(hash_table, partition));
> + }
>  }

This seems like a failure prone API.

>  /*
> @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
>   Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
>   hash_table->find_exclusively_locked
>   ? LW_EXCLUSIVE : LW_SHARED));
> + /* lock is under control of sequential scan */
> + Assert(!hash_table->seqscan_running);
>
>   hash_table->find_locked = false;
>   hash_table->find_exclusively_locked = false;
> @@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
>   return tag_hash(v, size);
>  }
>
> +/*
> + * dshash_seq_init/_next/_term
> + *           Sequentially scan trhough dshash table and return all the
> + *           elements one by one, return NULL when no more.

s/trhough/through/

This uses a different comment style than the other functions in this
file. Why?


> + * dshash_seq_term should be called for incomplete scans and otherwise
> + * shoudln't. Finished scans are cleaned up automatically.

s/shoudln't/shouldn't/

I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure prone. dynahash isn't
an example of good postgres code, the opposite, I'd say. It's a lot
easier to unconditionally have a terminate call if we need that.
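
I.e. every caller would always do something like (sketch only, not tested):

```c
dshash_seq_status status;
void       *entry;

dshash_seq_init(&status, hash_table, false, false);
while ((entry = dshash_seq_next(&status)) != NULL)
{
    /* ... process entry, possibly breaking out early ... */
}
dshash_seq_term(&status);   /* unconditionally, even after a full scan */
```

That way there's exactly one cleanup path, instead of behavior that
depends on whether the scan happened to run to completion.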


> + * Returned elements are locked as is the case with dshash_find.  However, the
> + * caller must not release the lock.
> + *
> + * Same as dynanash, the caller may delete returned elements midst of a scan.

I think it's a bad idea to refer to dynahash here. That's just going to
get out of date. Also, code should be documented on its own.


> + * If consistent is set for dshash_seq_init, the all hash table partitions are
> + * locked in the requested mode (as determined by the exclusive flag) during
> + * the scan.  Otherwise partitions are locked in one-at-a-time way during the
> + * scan.

Yet delete unconditionally retains locks?


> + */
> +void
> +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> + bool consistent, bool exclusive)
> +{

Why does this patch add the consistent mode? There's no users currently?
Without it, it's not clear that we need a separate _term function, I think?

I think we also can get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
or such.
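
Roughly like this - just a sketch, and the status field names
(curpartition, exclusive) are placeholders for whatever the scan state
actually ends up storing:

```c
/*
 * Delete the entry returned by the last dshash_seq_next() call.  The
 * scan already holds the partition lock in exclusive mode, so no lock
 * state needs to change here - which is what makes the
 * dshash_delete_entry() changes unnecessary.
 */
void
dshash_delete_current(dshash_seq_status *status, void *entry)
{
    dshash_table *hash_table = status->hash_table;

    Assert(status->exclusive);
    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table,
                                               status->curpartition),
                                LW_EXCLUSIVE));

    delete_item(hash_table, ITEM_FROM_ENTRY(entry));
}
```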


> @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
>  extern void dshash_detach(dshash_table *hash_table);
>  extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
>  extern void dshash_destroy(dshash_table *hash_table);
> -
>  /* Finding, creating, deleting entries. */
>  extern void *dshash_find(dshash_table *hash_table,
>   const void *key, bool
>  exclusive);

There's a number of spurious changes like this.



> From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <[hidden email]>
> Date: Thu, 27 Sep 2018 11:15:19 +0900
> Subject: [PATCH 2/5] Add conditional lock feature to dshash
>
> Dshash currently waits for lock unconditionally. This commit adds new
> interfaces for dshash_find and dshash_find_or_insert. The new
> interfaces have an extra parameter "nowait" taht commands not to wait
> for lock.

s/taht/that/

There should be at least a sentence or two explaining why these are
useful.


> +/*
> + * The version of dshash_find, which is allowed to return immediately on lock
> + * failure. Lock status is set to *lock_failed in that case.
> + */

Hm. Not sure I like the *lock_acquired API.

> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> + bool exclusive, bool nowait, bool *lock_acquired)
>  {
>   dshash_hash hash;
>   size_t partition;
>   dshash_table_item *item;
>
> + /*
> + * No need to return lock resut when !nowait. Otherwise the caller may
> + * omit the lock result when NULL is returned.
> + */
> + Assert(nowait || !lock_acquired);
> +
>   hash = hash_key(hash_table, key);
>   partition = PARTITION_FOR_HASH(hash);
>
>   Assert(hash_table->control->magic == DSHASH_MAGIC);
>   Assert(!hash_table->find_locked);
>
> - LWLockAcquire(PARTITION_LOCK(hash_table, partition),
> -  exclusive ? LW_EXCLUSIVE : LW_SHARED);
> + if (nowait)
> + {
> + if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
> +  exclusive ? LW_EXCLUSIVE : LW_SHARED))
> + {
> + if (lock_acquired)
> + *lock_acquired = false;

Why is the test for lock_acquired needed here? I don't think it's
possible to use nowait correctly without passing in lock_acquired?

Think it'd make sense to document & assert that nowait = true implies
lock_acquired set, and nowait = false implies lock_acquired not being
set.

But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?


I'm however inclined to think it's better to just have a separate
function for the nowait case, rather than an extended version supporting
both (with an internal helper doing most of the work).
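
I.e. something like the following, with both entry points being thin
wrappers around one internal routine (sketch; dshash_find_internal is a
hypothetical helper):

```c
/* blocking variant, behavior unchanged */
void *
dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
{
    return dshash_find_internal(hash_table, key, exclusive, false);
}

/*
 * Non-blocking variant.  Returns NULL both when the lock couldn't be
 * acquired and when the entry doesn't exist - in the latter case the
 * partition lock is released again, so no lock_acquired out-parameter
 * is needed.
 */
void *
dshash_find_nowait(dshash_table *hash_table, const void *key, bool exclusive)
{
    return dshash_find_internal(hash_table, key, exclusive, true);
}
```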


> +/*
> + * The version of dshash_find_or_insert, which is allowed to return immediately
> + * on lock failure.
> + *
> + * Notes above dshash_find_extended() regarding locking and error handling
> + * equally apply here.

They don't, there's no lock_acquired parameter.

> + */
> +void *
> +dshash_find_or_insert_extended(dshash_table *hash_table,
> +   const void *key,
> +   bool *found,
> +   bool nowait)

I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're
extended they should also be able to specify whether the entry will get
created.
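
I.e. collapse the four functions into something along these lines
(sketch):

```c
/*
 * One extended search function replacing dshash_find_extended and
 * dshash_find_or_insert_extended: 'insert' selects whether a missing
 * entry is created, 'nowait' whether lock acquisition may block, and
 * 'found' reports (when inserting) whether the entry already existed.
 */
void *
dshash_find_extended(dshash_table *hash_table, const void *key,
                     bool exclusive, bool nowait, bool insert, bool *found);
```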


> From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <[hidden email]>
> Date: Wed, 7 Nov 2018 16:53:49 +0900
> Subject: [PATCH 3/5] Make archiver process an auxiliary process
>
> This is a preliminary patch for shared-memory based stats collector.
> Archiver process must be a auxiliary process since it uses shared
> memory after stats data wes moved onto shared-memory. Make the process

s/wes/was/ s/onto/into/

> an auxiliary process in order to make it work.

>

> @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
>   StartupProcessMain();
>   proc_exit(1); /* should never return */
>
> + case ArchiverProcess:
> + /* don't set signals, archiver has its own agenda */
> + PgArchiverMain();
> + proc_exit(1); /* should never return */
> +
>   case BgWriterProcess:
>   /* don't set signals, bgwriter has its own agenda */
>   BackgroundWriterMain();

I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.


>  /* ------------------------------------------------------------
>   * Local functions called by archiver follow
>   * ------------------------------------------------------------
> @@ -219,8 +148,8 @@ pgarch_forkexec(void)
>   * The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
>   * since we don't use 'em, it hardly matters...
>   */
> -NON_EXEC_STATIC void
> -PgArchiverMain(int argc, char *argv[])
> +void
> +PgArchiverMain(void)
>  {
>   /*
>   * Ignore all signals usually bound to some action in the postmaster,
> @@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
>  static void
>  pgarch_exit(SIGNAL_ARGS)
>  {
> - /* SIGQUIT means curl up and die ... */
> - exit(1);
> + PG_SETMASK(&BlockSig);
> +
> + /*
> + * We DO NOT want to run proc_exit() callbacks -- we're here because
> + * shared memory may be corrupted, so we don't want to try to clean up our
> + * transaction.  Just nail the windows shut and get out of town.  Now that
> + * there's an atexit callback to prevent third-party code from breaking
> + * things by calling exit() directly, we have to reset the callbacks
> + * explicitly to make this work as intended.
> + */
> + on_exit_reset();
> +
> + /*
> + * Note we do exit(2) not exit(0).  This is to force the postmaster into a
> + * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
> + * process.  This is necessary precisely because we don't clean up our
> + * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
> + * should ensure the postmaster sees this as a crash, too, but no harm in
> + * being doubly sure.)
> + */
> + exit(2);
>  }
>

This seems to be a copy of code & comments from other signal handlers that predates

commit 8e19a82640d3fa2350db146ec72916856dd02f0a
Author: Heikki Linnakangas <[hidden email]>
Date:   2018-08-08 19:08:10 +0300

    Don't run atexit callbacks in quickdie signal handlers.


I think this just should use SignalHandlerForCrashExit().
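
I.e. pgarch_exit() can be deleted entirely, and the signal setup
reduced to (sketch):

```c
/* in the archiver's signal setup, replacing the hand-rolled pgarch_exit() */
pqsignal(SIGQUIT, SignalHandlerForCrashExit);
```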


I think we can even commit that separately - there's not really a reason
to not do that today, as far as I can tell?


>  /* SIGUSR1 signal handler for archiver process */

Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.

We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.
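
The main loop would then look roughly like this (sketch - the wait
event name is a placeholder, and shutdown handling would use the
standard interrupt flags rather than the current hand-rolled ones):

```c
for (;;)
{
    ResetLatch(MyLatch);

    if (ShutdownRequestPending)
        break;

    /* archive whatever's ready */
    pgarch_ArchiverCopyLoop();

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     PGARCH_AUTOWAKE_INTERVAL * 1000L,
                     WAIT_EVENT_ARCHIVER_MAIN);
}
```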

> @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
>
>   switch (backendType)
>   {
> + case B_ARCHIVER:
> + backendDesc = "archiver";
> + break;

should imo include 'WAL' or such.



> From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <[hidden email]>
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 4/5] Shared-memory based stats collector
>
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

This spends a fair bit describing the old state, but very little
describing the new state.


> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 0bfd6151c4..a6b0bdec12 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
>  postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
>  postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
>  postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
> -postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
>  postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
>  postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
>  postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
> @@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
>     master server process.  The command arguments
>     shown for it are the same ones used when it was launched.  The next five
>     processes are background worker processes automatically launched by the
> -   master process.  (The <quote>stats collector</quote> process will not be present
> -   if you have set the system not to start the statistics collector; likewise
> -   the <quote>autovacuum launcher</quote> process can be disabled.)
> +   master process.  (The <quote>autovacuum launcher</quote> process will not
> +   be present if you have set the system not to start it.)
>     Each of the remaining
>     processes is a server process handling one client connection.  Each such
>     process sets its command line display in the form

There's more references to the stats collector than this... E.g. in
catalogs.sgml

   <xref linkend="view-table"/> lists the system views described here.
   More detailed documentation of each view follows below.
   There are some additional views that provide access to the results of
   the statistics collector; they are described in <xref
   linkend="monitoring-stats-views-table"/>.
  </para>


> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 6d1f28c327..8dcb0fb7f7 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1956,15 +1956,15 @@ do_autovacuum(void)
>    ALLOCSET_DEFAULT_SIZES);
>   MemoryContextSwitchTo(AutovacMemCxt);
>
> + /* Start a transaction so our commands have one to play into. */
> + StartTransactionCommand();
> +
>   /*
>   * may be NULL if we couldn't find an entry (only happens if we are
>   * forcing a vacuum for anti-wrap purposes).
>   */
>   dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
>
> - /* Start a transaction so our commands have one to play into. */
> - StartTransactionCommand();
> -
>   /*
>   * Clean up any dead statistics collector entries for this DB. We always
>   * want to do this exactly once per DB-processing cycle, even if we find
> @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
>   if (isshared)
>   {
>   if (PointerIsValid(shared))
> - tabentry = hash_search(shared->tables, &relid,
> -   HASH_FIND, NULL);
> + tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
>   }
>   else if (PointerIsValid(dbentry))
> - tabentry = hash_search(dbentry->tables, &relid,
> -   HASH_FIND, NULL);
> + tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
>
>   return tabentry;
>  }

Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside
the stats subsystem there is exactly one caller for the non-extended
version, as far as I can see. That's index_concurrently_swap() - and imo
that's code that should live in the stats subsystem, rather than open
coded in index.c.



> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index ca5c6376e5..1ffe073a1f 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -1,15 +1,23 @@
>  /* ----------
>   * pgstat.c
>   *
> - * All the statistics collector stuff hacked up in one big, ugly file.
> + * Statistics collector facility.
>   *
> - * TODO: - Separate collector, postmaster and backend stuff
> - *  into different files.
> + *  Collects per-table and per-function usage statistics of all backends on
> + *  shared memory. pg_count_*() and friends are the interface to locally store
> + *  backend activities during a transaction. Then pgstat_flush_stat() is called
> + *  at the end of a transaction to pulish the local stats on shared memory.
>   *

I'd rather not exhaustively list the different objects this handles -
it'll either be annoying to maintain, or just get out of date.


> - * - Add some automatic call for pgstat vacuuming.
> + *  To avoid congestion on the shared memory, we update shared stats no more
> + *  often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> + *  all the local numbers cannot be flushed immediately, we postpone updates
> + *  and try the next chance after the interval of
> + *  PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> + *  PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.


> +/*
> + * Operation mode and return code of pgstat_get_db_entry.
> + */
> +#define PGSTAT_SHARED 0

This is unreferenced.


> +#define PGSTAT_EXCLUSIVE 1
> +#define PGSTAT_NOWAIT 2

And these should imo rather be parameters.


> +typedef enum PgStat_TableLookupResult
> +{
> + NOT_FOUND,
> + FOUND,
> + LOCK_FAILED
> +} PgStat_TableLookupResult;

This seems like a seriously bad idea to me. These are very generic
names. There's also basically no references except setting them to the
first two?

> +#define StatsLock (&StatsShmem->StatsMainLock)
>
> -static time_t last_pgstat_start_time;
> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct
> +{
> + LWLock StatsMainLock; /* lock to protect this struct */
> + dsa_handle stats_dsa_handle; /* DSA handle for stats data */
> + dshash_table_handle db_hash_handle;
> + dsa_pointer global_stats;
> + dsa_pointer archiver_stats;
> + int refcount;
> +} StatsShmemStruct;

Why isn't this an lwlock in lwlocknames.h, rather than
allocated here?


> +/*
> + * BgWriter global statistics counters. The name cntains a remnant from the
> + * time when the stats collector was a dedicate process, which used sockets to
> + * send it.
> + */
> +PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.


> +/* common header of snapshot entry in reader snapshot hash */
> +typedef struct PgStat_snapshot
> +{
> + Oid key;
> + bool negative;
> + void   *body; /* end of header part: to keep alignment */
> +} PgStat_snapshot;


> +/* context struct for snapshot_statentry */
> +typedef struct pgstat_snapshot_param
> +{
> + char   *hash_name; /* name of the snapshot hash */
> + int hash_entsize; /* element size of hash entry */
> + dshash_table_handle dsh_handle; /* dsh handle to attach */
> + const dshash_parameters *dsh_params;/* dshash params */
> + HTAB  **hash; /* points to variable to hold hash */
> + dshash_table  **dshash; /* ditto for dshash */
> +} pgstat_snapshot_param;

Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static
on function scope isn't really the same as global static).

If we want it, I think we should separate the naming more
meaningfully. The important difference between 'hash' and 'dshash' isn't
the hashing module, it's that one is a local copy, the other a shared
hashtable!


> +/*
> + * Backends store various database-wide info that's waiting to be flushed out
> + * to shared memory in these variables.
> + *
> + * checksum_failures is the exception in that it is cluster-wide value.
> + */
> +typedef struct BackendDBStats
> +{
> + int n_conflict_tablespace;
> + int n_conflict_lock;
> + int n_conflict_snapshot;
> + int n_conflict_bufferpin;
> + int n_conflict_startup_deadlock;
> + int n_deadlocks;
> + size_t n_tmpfiles;
> + size_t tmpfilesize;
> + HTAB *checksum_failures;
> +} BackendDBStats;

Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
these fields in multiple places.


> + if (StatsShmem->refcount > 0)
> + StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.


To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via

void
pgstat_reset_all(void)
{

        /*
         * We could directly remove files and recreate the shared memory area. But
         * detach then attach for simplicity.
         */
        pgstat_detach_shared_stats(false); /* Don't write */
        pgstat_attach_shared_stats();

Without that you'd not need the complexity of attaching, detaching to
the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Nor would the dynamic re-creation of the db dshash table be needed.


> +/* ----------
> + * pgstat_report_stat() -
> + *
> + * Must be called by processes that performs DML: tcop/postgres.c, logical
> + * receiver processes, SPI worker, etc. to apply the so far collected
> + * per-table and function usage statistics to the shared statistics hashes.
> + *
> + *  Updates are applied not more frequent than the interval of
> + *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + *  failure if force is false and there's no pending updates longer than
> + *  PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + *  succeeding calls of this function.
> + *
> + * Returns the time until the next timing when updates are applied in
> + * milliseconds if there are no updates holded for more than
> + * PGSTAT_STAT_MIN_INTERVAL milliseconds.
> + *
> + * Note that this is called only out of a transaction, so it is fine to use
> + * transaction stop time as an approximation of current time.
> + * ----------
> + */

Inconsistent indentation.

> +long
> +pgstat_report_stat(bool force)
>  {

> + /* Flush out table stats */
> + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> + pending_stats = true;
> +
> + /* Flush out function stats */
> + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> + pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
being confusing while reading the code, it also made the diff much
harder to read.


> - snprintf(fname, sizeof(fname), "%s/%s", directory,
> - entry->d_name);
> - unlink(fname);
> + /* Flush out database-wide stats */
> + if (HAVE_PENDING_DBSTATS())
> + {
> + if (!pgstat_flush_dbstats(&cxt, !force))
> + pending_stats = true;
>   }

Linearly checking a number of stats doesn't seem like the right way
going forward. Also seems fairly omission prone.

Why does this code check live in pgstat_report_stat(), rather than
pgstat_flush_dbstats()?


> /*
>  * snapshot_statentry() - Common routine for functions
>  * pgstat_fetch_stat_*entry()
>  *

Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?


>  *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
>  *  not found. Returned snapshots are stable during the current transaction or
>  *  until pgstat_clear_snapshot() is called.
>  *
>  *  The snapshots are stored in a hash, pointer to which is stored in the
>  *  *HTAB variable pointed by cxt->hash. If not created yet, it is created
>  *  using hash_name, hash_entsize in cxt.
>  *
>  *  cxt->dshash points to dshash_table for dbstat entries. If not yet
>  *  attached, it is attached using cxt->dsh_handle.

Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:


>   /*
>    * We don't want so frequent update of stats snapshot. Keep it at least
>    * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
>    */
>   if (clear_snapshot)
>   {
>       clear_snapshot = false;
>
>       if (pgStatSnapshotContext &&
>           snapshot_globalStats.stats_timestamp <
>           GetCurrentStatementStartTimestamp() -
>           PGSTAT_STAT_MIN_INTERVAL * 1000)
>       {
>           MemoryContextReset(pgStatSnapshotContext);
>
>           /* Reset variables */
>           global_snapshot_is_valid = false;
>           pgStatSnapshotContext = NULL;
>           pgStatLocalHash = NULL;
>
>           pgstat_setup_memcxt();
>       }
>   }

I think we should just remove this entire local caching snapshot layer
for lookups.


> /*
>  * pgstat_flush_stat: Flushes table stats out to shared statistics.
>  *

Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
or such? Given that the code for dealing with an individual table's
entry is named pgstat_flush_tabstat() that's very confusing.



>  *  If nowait is true, returns false if required lock was not acquired
>  *  immediately. In that case, unapplied table stats updates are left alone in
>  *  TabStatusArray to wait for the next chance. cxt holds some dshash related
>  *  values that we want to carry around while updating shared stats.
>  *
>  *  Returns true if all stats info are flushed. Caller must detach dshashes
>  *  stored in cxt after use.
>  */
> static bool
> pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
> {
>   static const PgStat_TableCounts all_zeroes;
>   TabStatusArray *tsa;
>   HTAB           *new_tsa_hash = NULL;
>   TabStatusArray *dest_tsa = pgStatTabList;
>   int             dest_elem = 0;
>   int             i;
>
>   /* nothing to do, just return */
>   if (pgStatTabHash == NULL)
>       return true;
>
>   /*
>    * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
>    * entries it points to.
>    */
>   hash_destroy(pgStatTabHash);
>   pgStatTabHash = NULL;
>
>   /*
>    * Scan through the TabStatusArray struct(s) to find tables that actually
>    * have counts, and try flushing it out to shared stats. We may fail on
>    * some entries in the array. Leaving the entries being packed at the
>    * beginning of the array.
>    */
>   for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
>   {

It seems odd that there's a tabstat specific code in pgstat_flush_stat
(also note singular while it's processing all stats, whereas you're
below treating pgstat_flush_tabstat as only affecting one table).


>       for (i = 0; i < tsa->tsa_used; i++)
>       {
>           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
>
>           /* Shouldn't have any pending transaction-dependent counts */
>           Assert(entry->trans == NULL);
>
>           /*
>            * Ignore entries that didn't accumulate any actual counts, such
>            * as indexes that were opened by the planner but not used.
>            */
>           if (memcmp(&entry->t_counts, &all_zeroes,
>                      sizeof(PgStat_TableCounts)) == 0)
>               continue;
>
>           /* try to apply the tab stats */
>           if (!pgstat_flush_tabstat(cxt, nowait, entry))
>           {
>               /*
>                * Failed. Move it to the beginning in TabStatusArray and
>                * leave it.
>                */
>               TabStatHashEntry *hash_entry;
>               bool found;
>
>               if (new_tsa_hash == NULL)
>                   new_tsa_hash = create_tabstat_hash();
>
>               /* Create hash entry for this entry */
>               hash_entry = hash_search(new_tsa_hash, &entry->t_id,
>                                        HASH_ENTER, &found);
>               Assert(!found);
>
>               /*
>                * Move insertion pointer to the next segment if the segment
>                * is filled up.
>                */
>               if (dest_elem >= TABSTAT_QUANTUM)
>               {
>                   Assert(dest_tsa->tsa_next != NULL);
>                   dest_tsa = dest_tsa->tsa_next;
>                   dest_elem = 0;
>               }
>
>               /*
>                * Pack the entry at the begining of the array. Do nothing if
>                * no need to be moved.
>                */
>               if (tsa != dest_tsa || i != dest_elem)
>               {
>                   PgStat_TableStatus *new_entry;
>                   new_entry = &dest_tsa->tsa_entries[dest_elem];
>                   *new_entry = *entry;
>
>                   /* use new_entry as entry hereafter */
>                   entry = new_entry;
>               }
>
>               hash_entry->tsa_entry = entry;
>               dest_elem++;
>           }

This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.

One goal of this project, as I understand it, is to make it easier to
add additional stats. As is, this seems to make it harder from the code
level.


> bool
> pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
>                    PgStat_TableStatus *entry)
> {
>   Oid     dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
>   int     table_mode = PGSTAT_EXCLUSIVE;
>   bool    updated = false;
>   dshash_table *tabhash;
>   PgStat_StatDBEntry *dbent;
>   int     generation;
>
>   if (nowait)
>       table_mode |= PGSTAT_NOWAIT;
>
>   /* Attach required table hash if not yet. */
>   if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
>   {
>       /*
>        *  Return if we don't have corresponding dbentry. It would've been
>        *  removed.
>        */
>       dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
>       if (!dbent)
>           return false;
>
>       /*
>        * We don't hold lock on the dbentry since it cannot be dropped while
>        * we are working on it.
>        */
>       generation = pin_hashes(dbent);
>       tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.


>       if (entry->t_shared)
>       {
>           cxt->shgeneration = generation;
>           cxt->shdbentry = dbent;
>           cxt->shdb_tabhash = tabhash;
>       }
>       else
>       {
>           cxt->mygeneration = generation;
>           cxt->mydbentry = dbent;
>           cxt->mydb_tabhash = tabhash;
>
>           /*
>            * We come here once per database. Take the chance to update
>            * database-wide stats
>            */
>           LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
>           dbent->n_xact_commit += pgStatXactCommit;
>           dbent->n_xact_rollback += pgStatXactRollback;
>           dbent->n_block_read_time += pgStatBlockReadTime;
>           dbent->n_block_write_time += pgStatBlockWriteTime;
>           LWLockRelease(&dbent->lock);
>           pgStatXactCommit = 0;
>           pgStatXactRollback = 0;
>           pgStatBlockReadTime = 0;
>           pgStatBlockWriteTime = 0;
>       }
>   }
>   else if (entry->t_shared)
>   {
>       dbent = cxt->shdbentry;
>       tabhash = cxt->shdb_tabhash;
>   }
>   else
>   {
>       dbent = cxt->mydbentry;
>       tabhash = cxt->mydb_tabhash;
>   }
>
>
>   /*
>    * Local table stats should be applied to both dbentry and tabentry at
>    * once. Update dbentry only if we could update tabentry.
>    */
>   if (pgstat_update_tabentry(tabhash, entry, nowait))
>   {
>       pgstat_update_dbentry(dbent, entry);
>       updated = true;
>   }

At this point we're very deeply nested. pgstat_report_stat() ->
pgstat_flush_stat() -> pgstat_flush_tabstat() ->
pgstat_update_tabentry().

That's way over the top imo.


I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that
entry? It seems to be much more sensible to instead have
pgstat_flush_stat() transfer the stats it reported to the pending
database wide counters, and then report that to shared memory *once* per
pgstat_report_stat() with pgstat_flush_dbstats()?


> /*
>  * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
>  *
>  *  If nowait is true, returns with false on lock failure on dbentry.
>  *
>  *  Returns true if all stats are flushed out.
>  */
> static bool
> pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
> {
>   /* get dbentry if not yet */
>   if (cxt->mydbentry == NULL)
>   {
>       int op = PGSTAT_EXCLUSIVE;
>       if (nowait)
>           op |= PGSTAT_NOWAIT;
>
>       cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
>
>       /* return if lock failed. */
>       if (cxt->mydbentry == NULL)
>           return false;
>
>       /* we use this generation of table /function stats in this turn */
>       cxt->mygeneration = pin_hashes(cxt->mydbentry);
>   }
>
>   LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
>   if (HAVE_PENDING_CONFLICTS())
>       pgstat_flush_recovery_conflict(cxt->mydbentry);
>   if (BeDBStats.n_deadlocks != 0)
>       pgstat_flush_deadlock(cxt->mydbentry);
>   if (BeDBStats.n_tmpfiles != 0)
>       pgstat_flush_tempfile(cxt->mydbentry);
>   if (BeDBStats.checksum_failures != NULL)
>       pgstat_flush_checksum_failure(cxt->mydbentry);
>   LWLockRelease(&cxt->mydbentry->lock);

What's the point of having all these sub-functions? I see that you, for
an undocumented reason, have pgstat_report_recovery_conflict() flush
conflict stats immediately:

>   dbentry = pgstat_get_db_entry(MyDatabaseId,
>                                 PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
>                                 &status);
>
>   if (status == LOCK_FAILED)
>       return;
>
>   /* We had a chance to flush immediately */
>   pgstat_flush_recovery_conflict(dbentry);
>
>   dshash_release_lock(pgStatDBHash, dbentry);

But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.



> /* ----------
>  * pgstat_vacuum_stat() -
>  *
>  *    Remove objects we can get rid of.
>  * ----------
>  */
> void
> pgstat_vacuum_stat(void)
> {
>   HTAB       *oidtab;
>   dshash_seq_status dshstat;
>   PgStat_StatDBEntry *dbentry;
>
>   /* we don't collect stats under standalone mode */
>   if (!IsUnderPostmaster)
>       return;
>
>   /*
>    * Read pg_database and make a list of OIDs of all existing databases
>    */
>   oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
>
>   /*
>    * Search the database hash table for dead databases and drop them
>    * from the hash.
>    */
>
>   dshash_seq_init(&dshstat, pgStatDBHash, false, true);
>   while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
>   {
>       Oid         dbid = dbentry->databaseid;
>
>       CHECK_FOR_INTERRUPTS();
>
>       /* the DB entry for shared tables (with InvalidOid) is never dropped */
>       if (OidIsValid(dbid) &&
>           hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
>           pgstat_drop_database(dbid);
>   }
>
>   /* Clean up */
>   hash_destroy(oidtab);

So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:

        /* We need to keep partition lock while sequential scan */
        if (!hash_table->seqscan_running)
        {
                hash_table->find_locked = false;
                hash_table->find_exclusively_locked = false;
                LWLockRelease(PARTITION_LOCK(hash_table, partition));
        }

to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?


>   /*
>    * Similarly to above, make a list of all known relations in this DB.
>    */
>   oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
>
>   /*
>    * Check for all tables listed in stats hashtable if they still exist.
>    * Stats cache is useless here so directly search the shared hash.
>    */
>   pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
>
>   /*
>    * Repeat the above but we needn't bother in the common case where no
>    * function stats are being collected.
>    */
>   if (dbentry->functions != DSM_HANDLE_INVALID)
>   {
>       oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
>
>       pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
>                                     oidtab);
>   }
>   dshash_release_lock(pgStatDBHash, dbentry);

Wait, why are we holding the database partition lock across all this?
Again without any comments explaining why?


> +void
> +pgstat_send_archiver(const char *xlog, bool failed)

Why do we still have functions named pgstat_send*?


Greetings,

Andres Freund




Re: shared-memory based stats collector

Kyotaro Horiguchi-4
Thank you very much!!

At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <[hidden email]> wrote in

> Hi,
>
> Thomas, could you look at the first two patches here, and my review
> questions?
>
>
> General comments about this series:
> - A lot of the review comments feel like I've written them before, a
>   year or more ago. I feel this patch ought to be in a much better
>   state. There's a lot of IMO fairly obvious stuff here, and things that
>   have been mentioned multiple times previously.

I apologize for all the obvious issues and the things that have already
been mentioned. I'll address them.

> - There's a *lot* of typos in here. I realize being an ESL is hard, but
>   a lot of these can be found with the simplest spellchecker.  That's
>   one thing for a patch that just has been hacked up as a POC, but this
>   is a multi year thread?

I'll review all the changed parts again.  I used ispell, but I must have
failed to check many of the changes.

> - There's some odd formatting. Consider using pgindent more regularly.

I'll do so.

> More detailed comments below.

Thank you very much for the intensive review, I'm going to revise the
patch according to them.

> I'm considering rewriting the parts of the patchset that I don't like -
> but it'll look quite different afterwards.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

Andres Freund
Hi,

On 2020-03-13 16:34:50 +0900, Kyotaro Horiguchi wrote:

> Thank you very much!!
>
> At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <[hidden email]> wrote in
> > Hi,
> >
> > Thomas, could you look at the first two patches here, and my review
> > questions?
> >
> >
> > General comments about this series:
> > - A lot of the review comments feel like I've written them before, a
> >   year or more ago. I feel this patch ought to be in a much better
> >   state. There's a lot of IMO fairly obvious stuff here, and things that
> >   have been mentioned multiple times previously.
>
> I apologize for all of the obvious stuff or things that have been
> mentioned..  I'll address them.
>
> > - There's a *lot* of typos in here. I realize being an ESL is hard, but
> >   a lot of these can be found with the simplest spellchecker.  That's
> >   one thing for a patch that just has been hacked up as a POC, but this
> >   is a multi year thread?
>
> I'll review all changed part again.  I used ispell but I should have
> failed to check much of the changes.
>
> > - There's some odd formatting. Consider using pgindent more regularly.
>
> I'll do so.
>
> > More detailed comments below.
>
> Thank you very much for the intensive review, I'm going to revise the
> patch according to them.
>
> > I'm considering rewriting the parts of the patchset that I don't like -
> > but it'll look quite different afterwards.

I take your response to mean that you'd prefer to evolve the patch
largely on your own? I'm mainly asking because I think there's some
chance that we could still get this into v13, but if so we'll have to go
for it now.

Greetings,

Andres Freund



Re: shared-memory based stats collector

Thomas Munro-5
Hi Horiguchi-san, Andres,

I tried to rebase this (see attached, no intentional changes beyond
rebasing).  Some feedback:

On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <[hidden email]> wrote:
> Thomas, could you look at the first two patches here, and my review
> questions?

Ack.

> >               dsa_pointer item_pointer = hash_table->buckets[i];
> > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> >                                                               LW_EXCLUSIVE));
> >
> >       delete_item(hash_table, item);
> > -     hash_table->find_locked = false;
> > -     hash_table->find_exclusively_locked = false;
> > -     LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +
> > +     /* We need to keep partition lock while sequential scan */
> > +     if (!hash_table->seqscan_running)
> > +     {
> > +             hash_table->find_locked = false;
> > +             hash_table->find_exclusively_locked = false;
> > +             LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +     }
> >  }
>
> This seems like a failure prone API.

If I understand correctly, the only purpose of the seqscan_running
variable is to control that behaviour ^^^.  That is, to make
dshash_delete_entry() keep the partition lock if you delete an entry
while doing a seq scan.  Why not get rid of that, and provide a
separate interface for deleting while scanning?
dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
would be most common to want to delete the "current" item in the seq
scan, but it could allow you to delete anything in the same partition,
or any entry if using the "consistent" mode.  Oh, I see that Andres
said the same thing later.

> [Andres complaining about comments and language stuff]

I would be happy to proof read and maybe extend the comments (writing
new comments will also help me understand and review the code!), and
maybe some code changes to move this forward.  Horiguchi-san, are you
working on another version now?  If so I'll wait for it before I do
that.

> > + */
> > +void
> > +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> > +                             bool consistent, bool exclusive)
> > +{
>
> Why does this patch add the consistent mode? There's no users currently?
> Without it's not clear that we need a seperate _term function, I think?

+1, let's not do that if we don't need it!

> The fact that you're locking the per-database entry unconditionally once
> for each table almost guarantees contention - and you're not using the
> 'conditional lock' approach for that. I don't understand.

Right, I also noticed that:

    /*
     * Local table stats should be applied to both dbentry and tabentry at
     * once. Update dbentry only if we could update tabentry.
     */
    if (pgstat_update_tabentry(tabhash, entry, nowait))
    {
        pgstat_update_dbentry(dbent, entry);
        updated = true;
    }

So pgstat_update_tabentry() goes to great trouble to take locks
conditionally, but then pgstat_update_dbentry() immediately does:

    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
    LWLockRelease(&dbentry->lock);

Why can't we be "lazy" with the dbentry stats too?  Is it really
important for the table stats and DB stats to agree with each other?
Even if it were, your current coding doesn't achieve that: the table
stats are updated before the DB stat under different locks, so I'm not
sure why it can't wait longer.

Hmm.  Even if you change the above code use a conditional lock, I am
wondering (admittedly entirely without data) if this approach is still
too clunky: even trying and failing to acquire the lock creates
contention, just a bit less.  I wonder if it would make sense to make
readers do more work, so that writers can avoid contention.  For
example, maybe PgStat_StatDBEntry could hold an array of N sets of
counters, and readers have to add them all up.  An advanced version of
this idea would use a reasonably fresh copy of something like
sched_getcpu() and numa_node_of_cpu() to select a partition to
minimise contention and cross-node traffic, with a portable fallback
based on PID or something.  CPU core/node awareness is something I
haven't looked into too seriously, but it's been on my mind to solve
some other problems.

v25-0001-Add-sequential-scan-capability-to-dshash.patch (14K)
v25-0002-Add-conditional-lock-facility-to-dshash.patch (6K)
v25-0003-Make-archiver-process-an-auxiliary-process.patch (16K)
v25-0004-Shared-memory-based-statistics-collector.patch (278K)
v25-0005-Remove-the-GUC-stats_temp_directory.patch (14K)

Re: shared-memory based stats collector

Kyotaro Horiguchi-4
In reply to this post by Andres Freund
Thank you for the comment.

The new version is attached.

At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <[hidden email]> wrote in

> General comments about this series:
> - A lot of the review comments feel like I've written them before, a
>   year or more ago. I feel this patch ought to be in a much better
>   state. There's a lot of IMO fairly obvious stuff here, and things that
>   have been mentioned multiple times previously.
> - There's a *lot* of typos in here. I realize being an ESL is hard, but
>   a lot of these can be found with the simplest spellchecker.  That's
>   one thing for a patch that just has been hacked up as a POC, but this
>   is a multi year thread?
> - There's some odd formatting. Consider using pgindent more regularly.
>
> More detailed comments below.
>
> I'm considering rewriting the parts of the patchset that I don't like -
> but it'll look quite different afterwards.
>
> On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> > From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <[hidden email]>
> > Date: Fri, 29 Jun 2018 16:41:04 +0900
> > Subject: [PATCH 1/5] sequential scan for dshash
> >
> > Add sequential scan feature to dshash.
>
>
> >   dsa_pointer item_pointer = hash_table->buckets[i];
> > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> >   LW_EXCLUSIVE));
> >
> >   delete_item(hash_table, item);
> > - hash_table->find_locked = false;
> > - hash_table->find_exclusively_locked = false;
> > - LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +
> > + /* We need to keep partition lock while sequential scan */
> > + if (!hash_table->seqscan_running)
> > + {
> > + hash_table->find_locked = false;
> > + hash_table->find_exclusively_locked = false;
> > + LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > + }
> >  }
>
> This seems like a failure prone API.
[001] (Fixed)
As a result of the fix in [044], it's gone now.

> >  /*
> > @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entr> > + * dshash_seq_init/_next/_term
> > + *           Sequentially scan trhough dshash table and return all the
> > + *           elements one by one, return NULL when no more.
>
> s/trhough/through/

[002] (Fixed)

> This uses a different comment style that the other functions in this
> file. Why?

[003] (Fixed)

It was following the equivalent comment in dynahash.c.  I rewrote it in a
different way.

> > + * dshash_seq_term should be called for incomplete scans and otherwise
> > + * shoudln't. Finished scans are cleaned up automatically.
>
> s/shoudln't/shouldn't/

[004] (Fixed)

> I find the "cleaned up automatically" API terrible. I know you copied it
> from dynahash, but I find it to be really failure prone. dynahash isn't
> an example of good postgres code, the opposite, I'd say. It's a lot
> easier to unconditionally have a terminate call if we need that.

[005] (Fixed)
OK, I remember I had a similar thought on this. Fixed, along with all
the corresponding call sites.

> > + * Returned elements are locked as is the case with dshash_find.  However, the
> > + * caller must not release the lock.
> > + *
> > + * Same as dynanash, the caller may delete returned elements midst of a scan.
>
> I think it's a bad idea to refer to dynahash here. That's just going to
> get out of date. Also, code should be documented on its own.

[006] (Fixed)
Understood. I fixed it as follows:

 * Returned elements are locked and the caller must not explicitly release
 * the lock.

> > + * If consistent is set for dshash_seq_init, the all hash table partitions are
> > + * locked in the requested mode (as determined by the exclusive flag) during
> > + * the scan.  Otherwise partitions are locked in one-at-a-time way during the
> > + * scan.
>
> Yet delete unconditionally retains locks?

[007] (Not fixed)
Yes. If we released the lock on the current partition, a hash resize
could break concurrent scans.

> > + */
> > +void
> > +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> > + bool consistent, bool exclusive)
> > +{
>
> Why does this patch add the consistent mode? There's no users currently?
> Without it's not clear that we need a seperate _term function, I think?

[008] (Fixed)
I remember it was used in an early stage of development. I left it in
for API completeness, but actually it is not used. _term is another
matter: we need to release the lock and clean up some dshash state if
we allow a seq scan to exit before it reaches the end.

I removed "consistent" from dshash_seq_init and reverted
dshash_seq_term.

> I think we also can get rid of the dshash_delete changes, by instead
> adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> or such.

[009] (Fixed)
I'm not sure about the point of having two interfaces that are hard to
distinguish.  Maybe dshash_delete_current(dshash_seq_status *status) is
enough. I also reverted dshash_delete().


> > @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
> >  extern void dshash_detach(dshash_table *hash_table);
> >  extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
> >  extern void dshash_destroy(dshash_table *hash_table);
> > -
> >  /* Finding, creating, deleting entries. */
> >  extern void *dshash_find(dshash_table *hash_table,
> >   const void *key, bool
> >  exclusive);
>
> There's a number of spurious changes like this.
[010] (Fixed)
I found such isolated line insertions and removals: two in 0001 and
eight in 0004.

> > From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <[hidden email]>
> > Date: Thu, 27 Sep 2018 11:15:19 +0900
> > Subject: [PATCH 2/5] Add conditional lock feature to dshash
> >
> > Dshash currently waits for lock unconditionally. This commit adds new
> > interfaces for dshash_find and dshash_find_or_insert. The new
> > interfaces have an extra parameter "nowait" taht commands not to wait
> > for lock.
>
> s/taht/that/
[011] (Fixed)
I applied ispell to all the commit messages.

> There should be at least a sentence or two explaining why these are
> useful.

[011] (Fixed)
Sounds reasonable. I rewrote it that way.

> > +/*
> > + * The version of dshash_find, which is allowed to return immediately on lock
> > + * failure. Lock status is set to *lock_failed in that case.
> > + */
>
> Hm. Not sure I like the *lock_acquired API.
>
> > +void *
> > +dshash_find_extended(dshash_table *hash_table, const void *key,
> > + bool exclusive, bool nowait, bool *lock_acquired)
...
> > + Assert(nowait || !lock_acquired);
...

> > + if (lock_acquired)
> > + *lock_acquired = false;
>
> Why is the test for lock_acquired needed here? I don't think it's
> possible to use nowait correctly without passing in lock_acquired?
>
> Think it'd make sense to document & assert that nowait = true implies
> lock_acquired set, and nowait = false implies lock_acquired not being
> set.
>
> But, uh, why do we even need the lock_acquired parameter? If we couldn't
> find an entry, then we should just release the lock, no?
[012] (Fixed) (related to [013], [014])
The name was confusing. In this version the old dshash_find_extended
and dshash_find_or_insert_extended are merged into a new
dshash_find_extended, which covers all the functionality of dshash_find
and dshash_find_or_insert; in addition, insertion while holding a shared
lock is now allowed.


> I'm however inclined to think it's better to just have a separate
> function for the nowait case, rather than an extended version supporting
> both (with an internal helper doing most of the work).

[013] (Fixed) (related to [012], [014])
After some thought, nowait is no longer a source of complexity.
In the end I did as described in [012].

> > +/*
> > + * The version of dshash_find_or_insert, which is allowed to return immediately
> > + * on lock failure.
> > + *
> > + * Notes above dshash_find_extended() regarding locking and error handling
> > + * equally apply here.
>
> They don't, there's no lock_acquired parameter.
>
> > + */
> > +void *
> > +dshash_find_or_insert_extended(dshash_table *hash_table,
> > +   const void *key,
> > +   bool *found,
> > +   bool nowait)
>
> I think it's absurd to have dshash_find, dshash_find_extended,
> dshash_find_or_insert, dshash_find_or_insert_extended. If they're
> extended they should also be able to specify whether the entry will get
> created.
[014] (Fixed) (related to [012], [013])
As mentioned above, this version has the original two functions plus a
single dshash_find_extended().

> > From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <[hidden email]>
> > Date: Wed, 7 Nov 2018 16:53:49 +0900
> > Subject: [PATCH 3/5] Make archiver process an auxiliary process
> >
> > This is a preliminary patch for shared-memory based stats collector.
> > Archiver process must be a auxiliary process since it uses shared
> > memory after stats data wes moved onto shared-memory. Make the process
>
> s/wes/was/ s/onto/into/
[015] (Fixed)

> > an auxiliary process in order to make it work.
>
> >
>
> > @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
> >   StartupProcessMain();
> >   proc_exit(1); /* should never return */
> >
> > + case ArchiverProcess:
> > + /* don't set signals, archiver has its own agenda */
> > + PgArchiverMain();
> > + proc_exit(1); /* should never return */
> > +
> >   case BgWriterProcess:
> >   /* don't set signals, bgwriter has its own agenda */
> >   BackgroundWriterMain();
>
> I think I'd rather remove the two comments that are copied to 6 out of 8
> cases - they don't add anything.
[016] (Fixed)
Agreed. I removed the comments from StartupProcess through
WalReceiverProcess.

> >  pgarch_exit(SIGNAL_ARGS)
> >  {
..
> > + * We DO NOT want to run proc_exit() callbacks -- we're here because
> > + * shared memory may be corrupted, so we don't want to try to clean up our
...
> > + * being doubly sure.)
> > + */
> > + exit(2);
...
> This seems to be a copy of code & comments from other signal handlers that predates
..
> I think this just should use SignalHandlerForCrashExit().
> I think we can even commit that separately - there's not really a reason
> to not do that today, as far as I can tell?

[017] (Fixed, separate patch 0001)
Exactly. Although the on_*_exit_list is empty in this process, SIGQUIT
ought to prevent the process from calling such functions even if there
were any. This changes the archiver's exit status on SIGQUIT from 1 to 2,
but that causes no behavior change (other than the log message).

> >  /* SIGUSR1 signal handler for archiver process */
>
> Hm - this currently doesn't set up a correct sigusr1 handler for a
> shared memory backend - needs to invoke procsignal_sigusr1_handler
> somewhere.
>
> We can probably just convert to using normal latches here, and remove
> the current 'wakened' logic? That'll remove the indirection via
> postmaster too, which is nice.

[018] (Fixed, separate patch 0005)
That seems better. I added it as a separate patch just after the patch
that turns the archiver into an auxiliary process.

> > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> >
> >   switch (backendType)
> >   {
> > + case B_ARCHIVER:
> > + backendDesc = "archiver";
> > + break;
>
> should imo include 'WAL' or such.

[019] (Not Fixed)
It was already named "archiver" by 8e8a0becb3. Should I rename it in this
patch set?

> > From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <[hidden email]>
> > Date: Thu, 21 Feb 2019 12:44:56 +0900
> > Subject: [PATCH 4/5] Shared-memory based stats collector
..
> > megabytes. To deal with larger statistics set, this patch let backends
> > directly share the statistics via shared memory.
>
> This spends a fair bit describing the old state, but very little
> describing the new state.

[020] (Fixed, Maybe)
Ugh.  I got the same comment in the last round, so I have rewritten it
this time.

> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 0bfd6151c4..a6b0bdec12 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
...
> > -   master process.  (The <quote>stats collector</quote> process will not be present
> > -   if you have set the system not to start the statistics collector; likewise
> > +   master process.  (The <quote>autovacuum launcher</quote> process will not
...
> There's more references to the stats collector than this... E.g. in
> catalogs.sgml

[021] (Fixed, separate patch 0007)
Although the "statistics collector process" is gone, I'm not sure the
"statistics collector" as a feature is gone too. Actually the word
"collector" looks a bit odd in some contexts, so I replaced "the results
of the statistics collector" with "the activity statistics". (I'm not sure
"the activity statistics" is a proper subsystem name.) The word
"collect" is replaced with "track".  I didn't change the section IDs
affected by the renaming so that old links still work. I also fixed
the tranche name for LWTRANCHE_STATS from "activity stats" to
"activity_statistics".

> > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> > index 6d1f28c327..8dcb0fb7f7 100644
> > --- a/src/backend/postmaster/autovacuum.c
> > +++ b/src/backend/postmaster/autovacuum.c
...

> > @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
> >   if (isshared)
> >   {
> >   if (PointerIsValid(shared))
> > - tabentry = hash_search(shared->tables, &relid,
> > -   HASH_FIND, NULL);
> > + tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
> >   }
> >   else if (PointerIsValid(dbentry))
> > - tabentry = hash_search(dbentry->tables, &relid,
> > -   HASH_FIND, NULL);
> > + tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
> >
> >   return tabentry;
> >  }
>
> Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside
[022] (Fixed)
The _extended function is not an extended version of the original
function, so I renamed pgstat_fetch_stat_tabentry_extended to
pgstat_fetch_stat_tabentry_snapshot.  pgstat_fetch_funcentry_extended
and pgstat_fetch_dbentry() are renamed accordingly.

> the stats subsystem there are exactly one caller for the non extended
> version, as far as I can see. That's index_concurrently_swap() - and imo
> that's code that should live in the stats subsystem, rather than open
> coded in index.c.

[023] (Fixed)
Agreed. I added a new function, pgstat_copy_index_counters(), and now
pgstat_fetch_stat_tabentry() has no caller sites outside the pgstat
subsystem.

> > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > index ca5c6376e5..1ffe073a1f 100644
> > --- a/src/backend/postmaster/pgstat.c
> > +++ b/src/backend/postmaster/pgstat.c
> > + *  Collects per-table and per-function usage statistics of all backends on
> > + *  shared memory. pg_count_*() and friends are the interface to locally store
> > + *  backend activities during a transaction. Then pgstat_flush_stat() is called
> > + *  at the end of a transaction to pulish the local stats on shared memory.
> >   *
>
> I'd rather not exhaustively list the different objects this handles -
> it'll either be annoying to maintain, or just get out of date.
[024] (Fixed, Maybe)
I'm not sure I understood you correctly, but I rewrote it as follows:

 *  Collects per-table and per-function usage statistics of all backends in
 *  shared memory. The activity numbers are first stored locally, then
 *  written to shared memory at commit time or on idle-timeout.

> > - * - Add some automatic call for pgstat vacuuming.
> > + *  To avoid congestion on the shared memory, we update shared stats no more
> > + *  often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> > + *  all the local numbers cannot be flushed immediately, we postpone updates
> > + *  and try the next chance after the interval of
> > + *  PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> > + *  PGSTAT_STAT_MAX_INTERVAL(1000ms).
>
> I'm not convinced by this backoff logic. The basic interval seems quite
> high for something going through shared memory, and the max retry seems
> pretty low.
[025] (Not Fixed)
Is it just a matter of the intervals? Would (MIN, RETRY, MAX) = (1000,
500, 10000) be reasonable?

> > +/*
> > + * Operation mode and return code of pgstat_get_db_entry.
> > + */
> > +#define PGSTAT_SHARED 0
>
> This is unreferenced.
>
>
> > +#define PGSTAT_EXCLUSIVE 1
> > +#define PGSTAT_NOWAIT 2
>
> And these should imo rather be parameters.
[026] (Fixed)
Mmm. Right. The two symbols convey just two distinct parameters; two
booleans suffice. But I found some confusion here. As a result,
pgstat_get_db_entry now has three boolean parameters: exclusive, nowait
and create.

> > +typedef enum PgStat_TableLookupResult
> > +{
> > + NOT_FOUND,
> > + FOUND,
> > + LOCK_FAILED
> > +} PgStat_TableLookupResult;
>
> This seems like a seriously bad idea to me. These are very generic
> names. There's also basically no references except setting them to the
> first two?
[027] (Fixed)
Considering the related comments above, I decided not to return the lock
status from pgstat_get_db_entry. That makes the enum useless and the
function simpler.

> > +#define StatsLock (&StatsShmem->StatsMainLock)
> >
> > -static time_t last_pgstat_start_time;
> > +/* Shared stats bootstrap information */
> > +typedef struct StatsShmemStruct
> > +{
> > + LWLock StatsMainLock; /* lock to protect this struct */
...
> > +} StatsShmemStruct;
>
> Why isn't this an lwlock in lwlock in lwlocknames.h, rather than
> allocated here?

[028] (Fixed)
The activity stats system already used a dedicated tranche, so I thought
it natural to put this lock in the same tranche. Actually that's not a
firm reason. Moved the lock into the main tranche.

> > +/*
> > + * BgWriter global statistics counters. The name cntains a remnant from the
> > + * time when the stats collector was a dedicate process, which used sockets to
> > + * send it.
> > + */
> > +PgStat_MsgBgWriter BgWriterStats = {0};
>
> I am strongly against keeping the 'Msg' prefix. That seems extremely
> confusing going forward.

[029] (Fixed) (Related  to [046])
Mmm. It's following your old suggestion to avoid unsubstantial
diffs. I'm happy to change it. The functions that have "send" in their
names are for the same reason. I removed the prefix "m_" of the
members of the struct. (The comment above (with a typo) explains that).

> > +/* common header of snapshot entry in reader snapshot hash */
> > +typedef struct PgStat_snapshot
> > +{
> > + Oid key;
> > + bool negative;
> > + void   *body; /* end of header part: to keep alignment */
> > +} PgStat_snapshot;
>
>
> > +/* context struct for snapshot_statentry */
> > +typedef struct pgstat_snapshot_param
> > +{
> > + char   *hash_name; /* name of the snapshot hash */
> > + int hash_entsize; /* element size of hash entry */
> > + dshash_table_handle dsh_handle; /* dsh handle to attach */
> > + const dshash_parameters *dsh_params;/* dshash params */
> > + HTAB  **hash; /* points to variable to hold hash */
> > + dshash_table  **dshash; /* ditto for dshash */
> > +} pgstat_snapshot_param;
>
> Why does this exist? The struct contents are actually constant across
> calls, yet you have declared them inside functions (as static - static
> on function scope isn't really the same as global static).
[030] (Fixed)
IIUC, I didn't want it initialized at every call, and it doesn't need
external linkage, so it was a static variable at function scope.
But, first, the name _param is bogus since it actually contains
context variables. Second, the "context" variables have been moved
elsewhere. I removed the struct and moved the members into the
parameters of snapshot_statentry.

> If we want it, I think we should separate the naming more
> meaningfully. The important difference between 'hash' and 'dshash' isn't
> the hashing module, it's that one is a local copy, the other a shared
> hashtable!

[031] (Fixed)
Definitely. The parameters of snapshot_statentry now have more
meaningful names.

> > +/*
> > + * Backends store various database-wide info that's waiting to be flushed out
> > + * to shared memory in these variables.
> > + *
> > + * checksum_failures is the exception in that it is cluster-wide value.
> > + */
> > +typedef struct BackendDBStats
> > +{
> > + int n_conflict_tablespace;
> > + int n_conflict_lock;
> > + int n_conflict_snapshot;
> > + int n_conflict_bufferpin;
> > + int n_conflict_startup_deadlock;
> > + int n_deadlocks;
> > + size_t n_tmpfiles;
> > + size_t tmpfilesize;
> > + HTAB *checksum_failures;
> > +} BackendDBStats;
>
> Why is this a separate struct from PgStat_StatDBEntry? We should have
> these fields in multiple places.
[032] (Fixed, Maybe) (Related to [042])
It is almost a subset of PgStat_StatDBEntry, with one exception:
checksum_failures is different between the two.
Anyway, tracking of conflict events doesn't need to be so fast, so they
are now counted directly on the shared hash entries.

Checksum failure is handled in a different way, so only it is left alone.

> > + if (StatsShmem->refcount > 0)
> > + StatsShmem->refcount++;
>
> What prevents us from leaking the refcount here? We could e.g. error out
> while attaching, no? Which'd mean we'd leak the refcount.

[033] (Fixed)
Shared stats are not attached in the postmaster process, so I want to
know the first process that attaches the shared stats and the last one
that detaches it. It's not leaks that I'm considering here.
(continued below)

> To me it looks like there's a lot of added complexity just because you
> want to be able to reset stats via
>
> void
> pgstat_reset_all(void)
> {
>
> /*
> * We could directly remove files and recreate the shared memory area. But
> * detach then attach for simplicity.
> */
> pgstat_detach_shared_stats(false); /* Don't write */
> pgstat_attach_shared_stats();
>
> Without that you'd not need the complexity of attaching, detaching to
> the same degree - every backend could just cache lookup data during
> initialization, instead of having to constantly re-compute that.
Mmm. I don't get that (or I failed to read the intended meaning). The
function is assumed to be called only from StartupXLOG().
(continued)

> Nor would the dynamic re-creation of the db dshash table be needed.

Maybe you are mentioning the complexity of reset_dbentry_counters? It
is actually complex.  The shared stats dshash cannot be destroyed (nor
can a dshash entry be removed) while someone is working on it. It
was simpler to wait for another process to end its work, but that could
slow not only the clearing process but also other processes through
frequent resetting of counters.

After some thought, I decided to rip all the "generation" stuff out,
and it gets far simpler. But counter reset may conflict with other
backends somewhat more often, because counter reset needs an
exclusive lock.

> > +/* ----------
> > + * pgstat_report_stat() -
> > + *
> > + * Must be called by processes that performs DML: tcop/postgres.c, logical
> > + * receiver processes, SPI worker, etc. to apply the so far collected
> > + * per-table and function usage statistics to the shared statistics hashes.
> > + *
> > + *  Updates are applied not more frequent than the interval of
> > + *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
...
> Inconsistent indentation.

[034] (Fixed)

> > +long
> > +pgstat_report_stat(bool force)
> >  {
>
> > + /* Flush out table stats */
> > + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > + pending_stats = true;
> > +
> > + /* Flush out function stats */
> > + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > + pending_stats = true;
>
> This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> being confusing while reading the code, it also made the diff much
> harder to read.
[035] (Maybe Fixed)
Is the question whether there is any case where
pgstat_flush_stat/functions leaves some counters unflushed? It skips
tables someone else is working on (or tables in the same dshash
partition). Or is "!force == nowait" the cause of confusion? It is now
spelled "nowait = !force". (Or should the parameter of
pgstat_report_stat be changed from "force" to "nowait"?)

> > - snprintf(fname, sizeof(fname), "%s/%s", directory,
> > - entry->d_name);
> > - unlink(fname);
> > + /* Flush out database-wide stats */
> > + if (HAVE_PENDING_DBSTATS())
> > + {
> > + if (!pgstat_flush_dbstats(&cxt, !force))
> > + pending_stats = true;
> >   }
>
> Linearly checking a number of stats doesn't seem like the right way
> going forward. Also seems fairly omission prone.
>
> Why does this code check live in pgstat_report_stat(), rather than
> pgstat_flush_dbstats()?
[036] (Maybe Fixed) (Related to [041])
It was there to avoid useless calls, but the code no longer exists; it
disappeared with [041].

| /* Flush out individual stats tables */
| pending_stats |= pgstat_flush_stat(&cxt, nowait);
| pending_stats |= pgstat_flush_funcstats(&cxt, nowait);
| pending_stats |= pgstat_flush_checksum_failure(cxt.mydbentry, nowait);


> > /*
> >  * snapshot_statentry() - Common routine for functions
> >  * pgstat_fetch_stat_*entry()
> >  *
>
> Why has this function been added between the closely linked
> pgstat_report_stat() and pgstat_flush_stat() etc?

[037]
It seems to have been left there after some editing. Moved it to just
before the caller functions.

> Why do we still have this? A hashtable lookup is cheap, compared to
> fetching a file - so it's not to save time. Given how infrequent the
> pgstat_fetch_* calls are, it's not to avoid contention either.
>
> At first one could think it's for consistency - but no, that's not it
> either, because snapshot_statentry() refetches the snapshot without
> control from the outside:

[038]
I don't get the second paragraph. When does the function recreate a
snapshot without control from the outside? It keeps snapshots for the
duration of a transaction.  If not, it is broken.
(continued)

> >   /*
> >    * We don't want so frequent update of stats snapshot. Keep it at least
> >    * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> >    */
...
> I think we should just remove this entire local caching snapshot layer
> for lookups.

Currently the behavior is documented as follows and it seems reasonable.

   Another important point is that when a server process is asked to display
   any of these statistics, it first fetches the most recent report emitted by
   the collector process and then continues to use this snapshot for all
   statistical views and functions until the end of its current transaction.
   So the statistics will show static information as long as you continue the
   current transaction.  Similarly, information about the current queries of
   all sessions is collected when any such information is first requested
   within a transaction, and the same information will be displayed throughout
   the transaction.
   This is a feature, not a bug, because it allows you to perform several
   queries on the statistics and correlate the results without worrying that
   the numbers are changing underneath you.  But if you want to see new
   results with each query, be sure to do the queries outside any transaction
   block.  Alternatively, you can invoke
   <function>pg_stat_clear_snapshot</function>(), which will discard the
   current transaction's statistics snapshot (if any).  The next use of
   statistical information will cause a new snapshot to be fetched.

> > /*
> >  * pgstat_flush_stat: Flushes table stats out to shared statistics.
> >  *
>
> Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
> or such? Given that the code for dealing with an individual table's
> entry is named pgstat_flush_tabstat() that's very confusing.

[039]
Definitely. The names were changed while addressing [041].

> > static bool
> > pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
...
> It seems odd that there's a tabstat specific code in pgstat_flush_stat
> (also note singular while it's processing all stats, whereas you're
> below treating pgstat_flush_tabstat as only affecting one table).

[039]
The names were changed while addressing [041].


> >       for (i = 0; i < tsa->tsa_used; i++)
> >       {
> >           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> >
<many TableStatsArray code>
> >               hash_entry->tsa_entry = entry;
> >               dest_elem++;
> >           }
>
> This seems like too much code. Why is this entirely different from the
> way funcstats works? The difference was already too big before, but this
> made it *way* worse.

[040]
We don't flush stats until the transaction ends. So isn't the
description of TabStatusArray stale?

 * NOTE: once allocated, TabStatusArray structures are never moved or deleted
 * for the life of the backend.  Also, we zero out the t_id fields of the
 * contained PgStat_TableStatus structs whenever they are not actively in use.
 * This allows relcache pgstat_info pointers to be treated as long-lived data,
 * avoiding repeated searches in pgstat_initstats() when a relation is
 * repeatedly opened during a transaction.
(continued to below)

> One goal of this project, as I understand it, is to make it easier to
> add additional stats. As is, this seems to make it harder from the code
> level.

Indeed. I removed the TabStatsArray. Having said that it lives a long
life, its life actually lasts at most until transaction end. I used a
dynahash entry as the pgstat_info entry. One tricky part is that I had
to clear entry->t_id after removal of the entry so that
pgstat_initstats can detect the removal.  It is actually safe, but we
could add another table id member to the struct for that use.

> > bool
> > pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> >                    PgStat_TableStatus *entry)
> > {
> >   Oid     dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> >   int     table_mode = PGSTAT_EXCLUSIVE;
> >   bool    updated = false;
> >   dshash_table *tabhash;
> >   PgStat_StatDBEntry *dbent;
> >   int     generation;
> >
> >   if (nowait)
> >       table_mode |= PGSTAT_NOWAIT;
> >
> >   /* Attach required table hash if not yet. */
> >   if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> >   {
> >       /*
> >        *  Return if we don't have corresponding dbentry. It would've been
> >        *  removed.
> >        */
> >       dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> >       if (!dbent)
> >           return false;
> >
> >       /*
> >        * We don't hold lock on the dbentry since it cannot be dropped while
> >        * we are working on it.
> >        */
> >       generation = pin_hashes(dbent);
> >       tabhash = attach_table_hash(dbent, generation);
>
> This again is just cost incurred by insisting on destroying hashtables
> instead of keeping them around as long as necessary.
[040]
Maybe you are insisting on the reverse? The pin_hashes complexity is
left in this version. -> [033]

> >   /*
> >    * Local table stats should be applied to both dbentry and tabentry at
> >    * once. Update dbentry only if we could update tabentry.
> >    */
> >   if (pgstat_update_tabentry(tabhash, entry, nowait))
> >   {
> >       pgstat_update_dbentry(dbent, entry);
> >       updated = true;
> >   }
>
> At this point we're very deeply nested. pgstat_report_stat() ->
> pgstat_flush_stat() -> pgstat_flush_tabstat() ->
> pgstat_update_tabentry().
>
> That's way over the top imo.
[041] (Fixed) (Related to [036])
Completely agree. It is a result of my wanting to avoid scanning
pgStatTables twice.
(continued)

> I don't think it makes much sense that pgstat_update_dbentry() is called
> separately for each table. Why would we want to constantly lock that
> entry? It seems to be much more sensible to instead have
> pgstat_flush_stat() transfer the stats it reported to the pending
> database wide counters, and then report that to shared memory *once* per
> pgstat_report_stat() with pgstat_flush_dbstats()?

In the attached it scans PgStat_StatDBEntry twice: once for the tables
of the current database and once for shared tables. That change
simplified the surrounding logic.

pgstat_report_stat()
  pgstat_flush_tabstats(<tables of current database>)
    pgstat_update_tabentry() (at bottom)

> >   LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
> >   if (HAVE_PENDING_CONFLICTS())
> >       pgstat_flush_recovery_conflict(cxt->mydbentry);
> >   if (BeDBStats.n_deadlocks != 0)
> >       pgstat_flush_deadlock(cxt->mydbentry);
..
> What's the point of having all these sub-functions? I see that you, for
> an undocumented reason, have pgstat_report_recovery_conflict() flush
> conflict stats immediately:

[042]
Fixed by [032].

> >   dbentry = pgstat_get_db_entry(MyDatabaseId,
> >                                 PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> >                                 &status);
> >
> >   if (status == LOCK_FAILED)
> >       return;
> >
> >   /* We had a chance to flush immediately */
> >   pgstat_flush_recovery_conflict(dbentry);
> >
> >   dshash_release_lock(pgStatDBHash, dbentry);
>
> But I don't understand why? Nor why we'd not just report all pending
> database wide changes in that case?
>
> The fact that you're locking the per-database entry unconditionally once
> for each table almost guarantees contention - and you're not using the
> 'conditional lock' approach for that. I don't understand.
[043] (Maybe fixed) (Related to [045].)
Vacuum, analyze, DROP DB and reset cannot be delayed, so the
conditional lock is mainly used by
pgstat_report_stat(). dshash_find_or_insert didn't allow a shared
lock. I changed dshash_find_extended to allow a shared lock even when
it is told to create a missing entry. Although it takes an exclusive
lock at the moment of entry creation, in most cases it doesn't need
the exclusive lock. This allows using a shared lock while processing
vacuum or analyze stats.

Previously I thought that we could work on a shared database entry
while no lock is held, but actually there are cases where insertion of
a new database entry causes a rehash (resize). That operation moves
entries, so we need at least a shared lock on the database entry while
we are working on it. So in the attached, most operations work by the
following steps.

- get shared database entry with shared lock
  - attach table/function hash
    - fetch an entry with exclusive lock
      - update entry
        - release the table/function entry
  - detach table/function hash

  if needed
    - take LW_EXCLUSIVE on database entry
      - update database numbers
    - release LWLock
- release shared database entry
 
> > pgstat_vacuum_stat(void)
> > {
...
> >   dshash_seq_init(&dshstat, pgStatDBHash, false, true);
> >   while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
> >           pgstat_drop_database(dbid);
..

> So, uh, pgstat_drop_database() again does a *separate* lookup in the
> dshash, locking the entry. Which only works because you added this dirty
> hack:
>
> /* We need to keep partition lock while sequential scan */
> if (!hash_table->seqscan_running)
> {
> hash_table->find_locked = false;
> hash_table->find_exclusively_locked = false;
> LWLockRelease(PARTITION_LOCK(hash_table, partition));
> }
>
> to dshash_delete_entry(). This seems insane to me. There's not even a
> comment explaining this?
[044]

Following [001] and [009], I added
dshash_delete_current(). pgstat_vacuum_stat() uses it instead of
dshash_delete_entry(). The hack is gone.

(pgstat_vacuum_stat(void))
> >   }
> >   dshash_release_lock(pgStatDBHash, dbentry);
>
> Wait, why are we holding the database partition lock across all this?
> Again without any comments explaining why?

[045] (I'm not sure it is fixed)
The lock is a shared lock in the current version. The database entry is
needed only for attaching the table hash, and now the hashes won't be
removed. So, as you perhaps suggested, the lock can be released earlier in:

 pgstat_report_stat()
 pgstat_flush_funcstats()
 pgstat_vacuum_stat()
 pgstat_reset_single_counter()
 pgstat_report_vacuum()
 pgstat_report_analyze()

The following functions work on the database entry, so the lock needs to be retained until the end of their work.

 pgstat_flush_dbstats()
 pgstat_drop_database()   /* needs exclusive lock */
 pgstat_reset_counters()
 pgstat_report_autovac()
 pgstat_report_recovery_conflict()
 pgstat_report_deadlock()
 pgstat_report_tempfile()
 pgstat_report_checksum_failures_in_db()
 pgstat_flush_checksum_failure() /* repeats short-time lock on each dbs */

> > +void
> > +pgstat_send_archiver(const char *xlog, bool failed)
>
> Why do we still have functions named pgstat_send*?

[046] (Fixed)
Same as [029] and I changed it to pgstat_report_archiver().

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 79aad94b4cf07c1de8e1a085c9b2c1365a78d4be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v25 1/8] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks, for safety. The archiver process should
behave the same way for the same reason. The exit status changes from
1 to 2, but that doesn't make any behavioral difference.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
  pqsignal(SIGHUP, SignalHandlerForConfigReload);
  pqsignal(SIGINT, SIG_IGN);
  pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
- pqsignal(SIGQUIT, pgarch_exit);
+ pqsignal(SIGQUIT, SignalHandlerForCrashExit);
  pqsignal(SIGALRM, SIG_IGN);
  pqsignal(SIGPIPE, SIG_IGN);
  pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
  exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
- /* SIGQUIT means curl up and die ... */
- exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
--
2.18.2


From 269f8966be3fbc958f7df4c505b104527c506fdf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v25 2/8] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both dynahash and the simple dshash search functions. One of the most
significant differences is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when the scan ends. Another is
locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. Such
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 162 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 +++++
 2 files changed, 182 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..2086bdbea9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
  size_t size_log2; /* log2(number of buckets) */
  bool find_locked; /* Is any partition lock held by 'find'? */
  bool find_exclusively_locked; /* ... exclusively? */
+ bool seqscan_running; /* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2) \
  (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2) \
+ (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2) \
  (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2) \
  ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
+ ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash) \
  (hash_table->buckets[ \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
  hash_table->find_locked = false;
  hash_table->find_exclusively_locked = false;
+ hash_table->seqscan_running = false;
 
  /*
  * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
  hash_table->control = dsa_get_address(area, control);
  hash_table->find_locked = false;
  hash_table->find_exclusively_locked = false;
+ hash_table->seqscan_running = false;
  Assert(hash_table->control->magic == DSHASH_MAGIC);
 
  /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
  ensure_valid_bucket_pointers(hash_table);
 
  /* Free all the entries. */
- size = ((size_t) 1) << hash_table->size_log2;
+ size = NUM_BUCKETS(hash_table->size_log2);
  for (i = 0; i < size; ++i)
  {
  dsa_pointer item_pointer = hash_table->buckets[i];
@@ -568,6 +579,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
  Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
  hash_table->find_exclusively_locked
  ? LW_EXCLUSIVE : LW_SHARED));
+ /* lock is under control of sequential scan */
+ Assert(!hash_table->seqscan_running);
 
  hash_table->find_locked = false;
  hash_table->find_exclusively_locked = false;
@@ -592,6 +605,153 @@ dshash_memhash(const void *v, size_t size, void *arg)
  return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one, returning NULL when there are no more.
+ *
+ * dshash_seq_term should always be called when a scan is finished.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). 'exclusive' must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+ bool exclusive)
+{
+ /* allowed at most one scan at once */
+ Assert(!hash_table->seqscan_running);
+
+ status->hash_table = hash_table;
+ status->curbucket = 0;
+ status->nbuckets = 0;
+ status->curitem = NULL;
+ status->pnextitem = InvalidDsaPointer;
+ status->curpartition = -1;
+ status->exclusive = exclusive;
+ hash_table->seqscan_running = true;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them; the lock is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+ dsa_pointer next_item_pointer;
+
+ Assert(status->hash_table->seqscan_running);
+ if (status->curitem == NULL)
+ {
+ int partition;
+
+ Assert (status->curbucket == 0);
+ Assert(!status->hash_table->find_locked);
+
+ /* first shot. grab the first item. */
+ partition =
+ PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+   status->hash_table->size_log2);
+ LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+  status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+ status->curpartition = partition;
+
+ /* resize doesn't happen from now until seq scan ends */
+ status->nbuckets =
+ NUM_BUCKETS(status->hash_table->control->size_log2);
+ ensure_valid_bucket_pointers(status->hash_table);
+
+ next_item_pointer = status->hash_table->buckets[status->curbucket];
+ }
+ else
+ next_item_pointer = status->pnextitem;
+
+ /* Move to the next bucket if we finished the current bucket */
+ while (!DsaPointerIsValid(next_item_pointer))
+ {
+ int next_partition;
+
+ if (++status->curbucket >= status->nbuckets)
+ {
+ /* all buckets have been scanned. finish. */
+ return NULL;
+ }
+
+ /* Also move the partition lock if needed */
+ next_partition =
+ PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+   status->hash_table->size_log2);
+
+ /* Move lock along with partition for the bucket */
+ if (status->curpartition != next_partition)
+ {
+ /*
+ * Lock the next partition and then release the current one, not
+ * in the reverse order, to avoid concurrent resizing. Partitions
+ * are locked in the same order as in resize(), so deadlocks won't
+ * happen.
+ */
+ LWLockAcquire(PARTITION_LOCK(status->hash_table,
+ next_partition),
+  status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+ LWLockRelease(PARTITION_LOCK(status->hash_table,
+ status->curpartition));
+ status->curpartition = next_partition;
+ }
+
+ next_item_pointer = status->hash_table->buckets[status->curbucket];
+ }
+
+ status->curitem =
+ dsa_get_address(status->hash_table->area, next_item_pointer);
+ status->hash_table->find_locked = true;
+ status->hash_table->find_exclusively_locked = status->exclusive;
+
+ /*
+ * The caller may delete the item. Store the next item in case of deletion.
+ */
+ status->pnextitem = status->curitem->next;
+
+ return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+ Assert(status->hash_table->seqscan_running);
+ status->hash_table->find_locked = false;
+ status->hash_table->find_exclusively_locked = false;
+ status->hash_table->seqscan_running = false;
+
+ if (status->curpartition >= 0)
+ LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+ dshash_table   *hash_table = status->hash_table;
+ dshash_table_item  *item = status->curitem;
+ size_t partition = PARTITION_FOR_HASH(item->hash);
+
+ Assert(status->exclusive);
+ Assert(hash_table->control->magic == DSHASH_MAGIC);
+ Assert(hash_table->find_locked);
+ Assert(hash_table->find_exclusively_locked);
+ Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+ LW_EXCLUSIVE));
+
+ delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..81a929b8d9 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size, but it should be considered an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+ dshash_table   *hash_table;
+ int curbucket;
+ int nbuckets;
+ dshash_table_item  *curitem;
+ dsa_pointer pnextitem;
+ int curpartition;
+ bool exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+ bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
--
2.18.2


From e6d9ea8f7ac0ec29e8da064a7fed2d943a57bcec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v25 3/8] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds alternatives to dshash_find and dshash_find_or_insert that allow
an immediate return on lock failure.
---
 src/backend/lib/dshash.c | 117 ++++++++++++++++++++++++---------------
 src/include/lib/dshash.h |   3 +
 2 files changed, 75 insertions(+), 45 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 2086bdbea9..9a9b818d86 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -386,6 +386,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If 'found' is not NULL, *found is set to true if the key was present in
+ * the hash table; otherwise *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -395,36 +399,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
- dshash_hash hash;
- size_t partition;
- dshash_table_item *item;
-
- hash = hash_key(hash_table, key);
- partition = PARTITION_FOR_HASH(hash);
-
- Assert(hash_table->control->magic == DSHASH_MAGIC);
- Assert(!hash_table->find_locked);
-
- LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-  exclusive ? LW_EXCLUSIVE : LW_SHARED);
- ensure_valid_bucket_pointers(hash_table);
-
- /* Search the active bucket. */
- item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
- if (!item)
- {
- /* Not found. */
- LWLockRelease(PARTITION_LOCK(hash_table, partition));
- return NULL;
- }
- else
- {
- /* The caller will free the lock by calling dshash_release_lock. */
- hash_table->find_locked = true;
- hash_table->find_exclusively_locked = exclusive;
- return ENTRY_FROM_ITEM(item);
- }
+ return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -442,30 +417,61 @@ dshash_find_or_insert(dshash_table *hash_table,
   const void *key,
   bool *found)
 {
- dshash_hash hash;
- size_t partition_index;
- dshash_partition *partition;
- dshash_table_item *item;
+ return dshash_find_extended(hash_table, key, true, false, true, found);
+}
 
- hash = hash_key(hash_table, key);
- partition_index = PARTITION_FOR_HASH(hash);
- partition = &hash_table->control->partitions[partition_index];
 
- Assert(hash_table->control->magic == DSHASH_MAGIC);
- Assert(!hash_table->find_locked);
+/*
+ * Find the key in the hash table.
+ *
+ * "insert" selects insert mode: if the key is not found, a new entry is
+ * inserted and *found is set to false; if it is found, *found is set to
+ * true. "found" must be non-NULL in this mode.  "exclusive" may be false
+ * in insert mode, but this function may temporarily take the exclusive
+ * lock when an actual insertion happens.
+ *
+ * If "nowait" is true, the function returns NULL immediately if the
+ * required lock cannot be acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+ bool exclusive, bool nowait, bool insert, bool *found)
+{
+ dshash_hash hash = hash_key(hash_table, key);
+ size_t partidx = PARTITION_FOR_HASH(hash);
+ dshash_partition *partition = &hash_table->control->partitions[partidx];
+ LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
+ dshash_table_item *item;
+ bool inserted = false;
+
+ /* "found" must be non-NULL when insert is allowed */
+ Assert(!insert || found != NULL);
 
 restart:
- LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-  LW_EXCLUSIVE);
+ if (!nowait)
+ LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+ else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+   lockmode))
+ return NULL;
+
  ensure_valid_bucket_pointers(hash_table);
 
  /* Search the active bucket. */
  item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
  if (item)
- *found = true;
+ {
+ if (found)
+ *found = !inserted;
+ }
  else
  {
+ if (!insert)
+ {
+ /* The caller didn't ask to add a new entry. */
+ LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+ return NULL;
+ }
+
  *found = false;
 
  /* Check if we are getting too full. */
@@ -482,26 +488,47 @@ restart:
  * Give up our existing lock first, because resizing needs to
  * reacquire all the locks in the right order to avoid deadlocks.
  */
- LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+ LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
  resize(hash_table, hash_table->size_log2 + 1);
 
  goto restart;
  }
 
+ /* need to upgrade the lock to exclusive mode */
+ if (!exclusive)
+ {
+ LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+ LWLockAcquire(PARTITION_LOCK(hash_table, partidx), LW_EXCLUSIVE);
+ }
+
  /* Finally we can try to insert the new item. */
  item = insert_into_bucket(hash_table, key,
   &BUCKET_FOR_HASH(hash_table, hash));
  item->hash = hash;
  /* Adjust per-lock-partition counter for load factor knowledge. */
  ++partition->count;
+
+ if (!exclusive)
+ {
+ /*
+ * The new entry could be removed by a concurrent process while we
+ * are not holding the lock. Re-find it for safety.
+ */
+ LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+ inserted = true;
+
+ goto restart;
+ }
  }
 
- /* The caller must release the lock with dshash_release_lock. */
+ /* The caller will free the lock by calling dshash_release_lock. */
  hash_table->find_locked = true;
- hash_table->find_exclusively_locked = true;
+ hash_table->find_exclusively_locked = exclusive;
  return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 81a929b8d9..80a896a99b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
  const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+  bool exclusive, bool nowait, bool insert,
+  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
--
2.18.2
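To make the flag semantics above concrete, here is a minimal single-process model of the dshash_find_extended() contract — a sketch only; the fixed-size array and the partition_busy flag are stand-ins for the real partitioned table and its LWLock, and none of this is PostgreSQL code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy single-partition table; 'partition_busy' stands in for the LWLock
 * being held by some other backend. */
#define TBLSIZE 8
int  keys[TBLSIZE];
bool used[TBLSIZE];
bool partition_busy;

/*
 * Mirrors the proposed dshash_find_extended() contract:
 *  - nowait: return NULL instead of blocking when the lock is taken
 *  - insert: create the entry when missing; *found reports which case
 *            happened ("found" must be non-NULL in insert mode)
 */
int *
find_extended(int key, bool nowait, bool insert, bool *found)
{
	if (nowait && partition_busy)
		return NULL;			/* caller must cope with lock failure */

	for (int i = 0; i < TBLSIZE; i++)
	{
		if (used[i] && keys[i] == key)
		{
			if (found)
				*found = true;	/* existing entry */
			return &keys[i];
		}
	}

	if (!insert)
		return NULL;			/* caller didn't ask to add an entry */

	for (int i = 0; i < TBLSIZE; i++)
	{
		if (!used[i])
		{
			used[i] = true;
			keys[i] = key;
			*found = false;		/* a new entry was created */
			return &keys[i];
		}
	}
	return NULL;				/* table full */
}
```

The useful property for the stats patch is the nowait path: a backend trying to flush its local counters can bail out cheaply instead of stalling a transaction on someone else's partition lock.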


From d214960c1f1ed8cde84f9c97e5a9caec1e411c48 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v25 4/8] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process, since it will use
shared memory once the stats data is moved into shared memory. Make it
an auxiliary process so that it continues to work.
---
 src/backend/bootstrap/bootstrap.c   | 22 +++++----
 src/backend/postmaster/pgarch.c     | 75 +----------------------------
 src/backend/postmaster/postmaster.c | 43 +++++++++++------
 src/include/miscadmin.h             |  2 +
 src/include/postmaster/pgarch.h     |  4 +-
 5 files changed, 46 insertions(+), 100 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
  case StartupProcess:
  MyBackendType = B_STARTUP;
  break;
+ case ArchiverProcess:
+ MyBackendType = B_ARCHIVER;
+ break;
  case BgWriterProcess:
  MyBackendType = B_BG_WRITER;
  break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
  proc_exit(1); /* should never return */
 
  case StartupProcess:
- /* don't set signals, startup process has its own agenda */
  StartupProcessMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
+
+ case ArchiverProcess:
+ PgArchiverMain();
+ proc_exit(1);
 
  case BgWriterProcess:
- /* don't set signals, bgwriter has its own agenda */
  BackgroundWriterMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  case CheckpointerProcess:
- /* don't set signals, checkpointer has its own agenda */
  CheckpointerMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  case WalWriterProcess:
- /* don't set signals, walwriter has its own agenda */
  InitXLOGAccess();
  WalWriterMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  case WalReceiverProcess:
- /* don't set signals, walreceiver has its own agenda */
  WalReceiverMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  default:
  elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..4971b3ae42 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -78,7 +78,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,7 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- * Called from postmaster at startup or after an existing archiver
- * died.  Attempt to fire up a fresh archiver process.
- *
- * Returns PID of child process, or 0 if fail.
- *
- * Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
- time_t curtime;
- pid_t pgArchPid;
-
- /*
- * Do nothing if no archiver needed
- */
- if (!XLogArchivingActive())
- return 0;
-
- /*
- * Do nothing if too soon since last archiver start.  This is a safety
- * valve to protect against continuous respawn attempts if the archiver is
- * dying immediately at launch. Note that since we will be re-called from
- * the postmaster main loop, we will get another chance later.
- */
- curtime = time(NULL);
- if ((unsigned int) (curtime - last_pgarch_start_time) <
- (unsigned int) PGARCH_RESTART_INTERVAL)
- return 0;
- last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
- switch ((pgArchPid = pgarch_forkexec()))
-#else
- switch ((pgArchPid = fork_process()))
-#endif
- {
- case -1:
- ereport(LOG,
- (errmsg("could not fork archiver: %m")));
- return 0;
-
-#ifndef EXEC_BACKEND
- case 0:
- /* in postmaster child ... */
- InitPostmasterChild();
-
- /* Close the postmaster's sockets */
- ClosePostmasterPorts(false);
-
- /* Drop our connection to postmaster's shared memory, as well */
- dsm_detach_all();
- PGSharedMemoryDetach();
-
- PgArchiverMain(0, NULL);
- break;
-#endif
-
- default:
- return (int) pgArchPid;
- }
-
- /* shouldn't get here */
- return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -218,8 +147,8 @@ pgarch_forkexec(void)
  * The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  * since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
  /*
  * Ignore all signals usually bound to some action in the postmaster,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2b9ab32293..cab7fb5381 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC 0x0002 /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND 0x0004 /* walsender process */
 #define BACKEND_TYPE_BGWORKER 0x0008 /* bgworker process */
-#define BACKEND_TYPE_ALL 0x000F /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER 0x0010 /* archiver process */
+#define BACKEND_TYPE_ALL 0x001F /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif /* EXEC_BACKEND */
 
 #define StartupDataBase() StartChildProcess(StartupProcess)
+#define StartArchiver() StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer() StartChildProcess(CheckpointerProcess)
 #define StartWalWriter() StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
  /* If we have lost the archiver, try to start a new one. */
  if (PgArchPID == 0 && PgArchStartupAllowed())
- PgArchPID = pgarch_start();
+ PgArchPID = StartArchiver();
 
  /* If we need to signal the autovacuum launcher, do so now */
  if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
  if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
  AutoVacPID = StartAutoVacLauncher();
  if (PgArchStartupAllowed() && PgArchPID == 0)
- PgArchPID = pgarch_start();
+ PgArchPID = StartArchiver();
  if (PgStatPID == 0)
  PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
  }
 
  /*
- * Was it the archiver?  If so, just try to start a new one; no need
- * to force reset of the rest of the system.  (If fail, we'll try
- * again in future cycles of the main loop.).  Unless we were waiting
- * for it to shut down; don't restart it in that case, and
- * PostmasterStateMachine() will advance to the next shutdown step.
+ * Was it the archiver?  Normal exit can be ignored; we'll start a new
+ * one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
  */
  if (pid == PgArchPID)
  {
  PgArchPID = 0;
  if (!EXIT_STATUS_0(exitstatus))
- LogChildExit(LOG, _("archiver process"),
- pid, exitstatus);
- if (PgArchStartupAllowed())
- PgArchPID = pgarch_start();
+ HandleChildCrash(pid, exitstatus,
+ _("archiver process"));
  continue;
  }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver, or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
  signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
  }
 
+ /* Take care of the archiver too */
+ if (pid == PgArchPID)
+ PgArchPID = 0;
+ else if (PgArchPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) PgArchPID)));
+ signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
  /*
  * Force a power-cycle of the pgarch process too.  (This isn't absolutely
  * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
  Assert(CheckpointerPID == 0);
  Assert(WalWriterPID == 0);
  Assert(AutoVacPID == 0);
+ Assert(PgArchPID == 0);
  /* syslogger is not considered here */
  pmState = PM_NO_CHILDREN;
  }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
  */
  Assert(PgArchPID == 0);
  if (XLogArchivingAlways())
- PgArchPID = pgarch_start();
+ PgArchPID = StartArchiver();
 
  /*
  * If we aren't planning to enter hot standby mode later, treat
@@ -5493,6 +5504,10 @@ StartChildProcess(AuxProcType type)
  ereport(LOG,
  (errmsg("could not fork startup process: %m")));
  break;
+ case ArchiverProcess:
+ ereport(LOG,
+ (errmsg("could not fork archiver process: %m")));
+ break;
  case BgWriterProcess:
  ereport(LOG,
  (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
  BootstrapProcess,
  StartupProcess,
  BgWriterProcess,
+ ArchiverProcess,
  CheckpointerProcess,
  WalWriterProcess,
  WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess() (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess() (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif /* _PGARCH_H */
--
2.18.2


From 6de0656c222bbb5e1f8c84c703796fee0e518740 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 16 Mar 2020 22:30:41 +0900
Subject: [PATCH v25 5/8] Use latch instead of SIGUSR1 to wake up archiver

This will be squashed into the preceding archiver patch.
---
 src/backend/access/transam/xlog.c        | 49 ++++++++++++++++++++++++
 src/backend/access/transam/xlogarchive.c |  2 +-
 src/backend/postmaster/pgarch.c          | 27 ++++++-------
 src/backend/postmaster/postmaster.c      | 10 -----
 src/include/access/xlog.h                |  2 +
 src/include/access/xlog_internal.h       |  1 +
 src/include/storage/pmsignal.h           |  1 -
 7 files changed, 65 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fa446ffa4..5c477211e9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -668,6 +668,13 @@ typedef struct XLogCtlData
  */
  Latch recoveryWakeupLatch;
 
+ /*
+ * archiverWakeupLatch is used to wake up the archiver process to process
+ * completed WAL segments, if it is waiting for WAL to arrive.
+ * Protected by info_lck.
+ */
+ Latch   *archiverWakeupLatch;
+
  /*
  * During recovery, we keep a copy of the latest checkpoint record here.
  * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8359,6 +8366,48 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
  return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+ Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ old_latch = XLogCtl->archiverWakeupLatch;
+ XLogCtl->archiverWakeupLatch = MyLatch;
+ SpinLockRelease(&XLogCtl->info_lck);
+ Assert(old_latch == NULL);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->archiverWakeupLatch = NULL;
+ SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up the archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+ Latch *latch;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ latch = XLogCtl->archiverWakeupLatch;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ if (latch)
+ SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 188b73e752..cedf969812 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -535,7 +535,7 @@ XLogArchiveNotify(const char *xlog)
 
  /* Notify archiver that it's got something to do */
  if (IsUnderPostmaster)
- SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+ XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4971b3ae42..6fe7a136ba 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -94,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -141,6 +141,13 @@ pgarch_forkexec(void)
 #endif /* EXEC_BACKEND */
 
 
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+ XLogArchiveWakeupEnd();
+}
+
 /*
  * PgArchiverMain
  *
@@ -160,7 +167,7 @@ PgArchiverMain(void)
  pqsignal(SIGQUIT, SignalHandlerForCrashExit);
  pqsignal(SIGALRM, SIG_IGN);
  pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, pgarch_waken);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
  pqsignal(SIGUSR2, pgarch_waken_stop);
  /* Reset some signals that are accepted by postmaster but not here */
  pqsignal(SIGCHLD, SIG_DFL);
@@ -169,24 +176,14 @@ PgArchiverMain(void)
  MyBackendType = B_ARCHIVER;
  init_ps_display(NULL);
 
+ XLogArchiveWakeupStart();
+ on_shmem_exit(PgArchiverKill, 0);
+
  pgarch_MainLoop();
 
  exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
- int save_errno = errno;
-
- /* set flag that there is work to be done */
- wakened = true;
- SetLatch(MyLatch);
-
- errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index cab7fb5381..fab4a9dd51 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5262,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
  if (StartWorkerNeeded || HaveCrashedWorker)
  maybe_start_bgworkers();
 
- if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
- PgArchPID != 0)
- {
- /*
- * Send SIGUSR1 to archiver process, to wake it up and begin archiving
- * next WAL file.
- */
- signal_child(PgArchPID, SIGUSR1);
- }
-
  /* Tell syslogger to rotate logfile if requested */
  if (SysLoggerPID != 0)
  {
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..59e2f0f95a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -311,6 +311,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..a272d62b1f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -331,6 +331,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
  PMSIGNAL_RECOVERY_STARTED, /* recovery has started */
  PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
- PMSIGNAL_WAKEN_ARCHIVER, /* send a NOTIFY signal to xlog archiver */
  PMSIGNAL_ROTATE_LOGFILE, /* send SIGUSR1 to syslogger to rotate logfile */
  PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
  PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
--
2.18.2
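The register-then-wake protocol this patch adds (archiver publishes its latch pointer under info_lck; XLogArchiveNotify sets the latch only if one is registered) can be sketched in a single-process toy like this. This is an illustration, not PostgreSQL code: the Latch struct and the missing spinlock are stand-ins, and in the real patch all three accesses to archiverWakeupLatch happen under info_lck.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy latch; "setting" it just flips is_set in this single-process model. */
typedef struct Latch { bool is_set; } Latch;

/* Lives in XLogCtl in the real patch, protected by info_lck. */
Latch *archiverWakeupLatch;

/* Run by the archiver at startup/shutdown (under the spinlock in reality). */
void
wakeup_start(Latch *my_latch)
{
	archiverWakeupLatch = my_latch;
}

void
wakeup_end(void)
{
	archiverWakeupLatch = NULL;
}

/*
 * Run by whoever finishes a WAL segment: wake the archiver only if one
 * has registered its latch; otherwise it's a cheap no-op.  Returns
 * whether a wakeup was delivered.
 */
bool
wakeup_archiver(void)
{
	Latch *latch = archiverWakeupLatch;	/* copied under info_lck in reality */

	if (latch == NULL)
		return false;			/* no archiver registered */
	latch->is_set = true;
	return true;
}
```

Compared to the old PMSIGNAL_WAKEN_ARCHIVER route, the wakeup no longer detours through the postmaster's signal handler; the clean-up in wakeup_end() is what the patch's on_shmem_exit callback guarantees, so a dead archiver's latch pointer can never be set.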


From d2a7f51a744b2feca23bef8b95f48cef9ff61acf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v25 6/8] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and
periodically shared among backends through files. Those files can reach
tens of megabytes and are written as often as every 500ms; the stats
collector serializes that large amount of data and every backend
periodically de-serializes it. To avoid that cost, this patch places
the activity statistics data in shared memory. Each backend accumulates
statistics locally, then tries to move them into the shared statistics
at transaction end, but no more often than every 500ms. Locks on the
shared statistics are taken per entry (table, function, and so on), so
collisions are not expected to be frequent. Furthermore, until 1 second
has elapsed since the last flush to shared stats, a lock failure merely
postpones the flush, so lock contention doesn't slow down transactions.
After that, the flush waits for the locks so that the shared statistics
don't become stale.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/catalog/index.c                  |   24 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/bgwriter.c            |    2 +-
 src/backend/postmaster/checkpointer.c        |   12 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4625 +++++++-----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/adt/pgstatfuncs.c          |   53 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  500 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 20 files changed, 1991 insertions(+), 3387 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5c477211e9..4ea29b8997 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8506,9 +8506,9 @@ LogCheckpointEnd(bool restartpoint)
  &sync_secs, &sync_usecs);
 
  /* Accumulate checkpoint timing summary data, in milliseconds. */
- BgWriterStats.m_checkpoint_write_time +=
+ BgWriterStats.checkpoint_write_time +=
  write_secs * 1000 + write_usecs / 1000;
- BgWriterStats.m_checkpoint_sync_time +=
+ BgWriterStats.checkpoint_sync_time +=
  sync_secs * 1000 + sync_usecs / 1000;
 
  /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 76fd938ce3..613cef9282 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1687,28 +1687,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
  /*
  * Copy over statistics from old to new index
+ * The data will be sent by the next pgstat_report_stat() call.
  */
- {
- PgStat_StatTabEntry *tabentry;
-
- tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
- if (tabentry)
- {
- if (newClassRel->pgstat_info)
- {
- newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
- newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
- newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
- newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
- newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
- /*
- * The data will be sent by the next pgstat_report_stat()
- * call.
- */
- }
- }
- }
+ pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
  /* Close relations */
  table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index da75e755f0..333712d3c5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1956,15 +1956,15 @@ do_autovacuum(void)
   ALLOCSET_DEFAULT_SIZES);
  MemoryContextSwitchTo(AutovacMemCxt);
 
+ /* Start a transaction so our commands have one to play into. */
+ StartTransactionCommand();
+
  /*
  * may be NULL if we couldn't find an entry (only happens if we are
  * forcing a vacuum for anti-wrap purposes).
  */
  dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
- /* Start a transaction so our commands have one to play into. */
- StartTransactionCommand();
-
  /*
  * Clean up any dead statistics collector entries for this DB. We always
  * want to do this exactly once per DB-processing cycle, even if we find
@@ -2748,12 +2748,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
  if (isshared)
  {
  if (PointerIsValid(shared))
- tabentry = hash_search(shared->tables, &relid,
-   HASH_FIND, NULL);
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(shared, relid);
  }
  else if (PointerIsValid(dbentry))
- tabentry = hash_search(dbentry->tables, &relid,
-   HASH_FIND, NULL);
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(dbentry, relid);
 
  return tabentry;
 }
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..94bdd664b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -236,7 +236,7 @@ BackgroundWriterMain(void)
  /*
  * Send off activity statistics to the stats collector
  */
- pgstat_send_bgwriter();
+ pgstat_report_bgwriter();
 
  if (FirstCallSinceLastCheckpoint())
  {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..8a2fd0ddb2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
  if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
  {
  do_checkpoint = true;
- BgWriterStats.m_requested_checkpoints++;
+ BgWriterStats.requested_checkpoints++;
  }
 
  /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
  if (elapsed_secs >= CheckPointTimeout)
  {
  if (!do_checkpoint)
- BgWriterStats.m_timed_checkpoints++;
+ BgWriterStats.timed_checkpoints++;
  do_checkpoint = true;
  flags |= CHECKPOINT_CAUSE_TIME;
  }
@@ -492,7 +492,7 @@ CheckpointerMain(void)
  * worth the trouble to split the stats support into two independent
  * stats message types.)
  */
- pgstat_send_bgwriter();
+ pgstat_report_bgwriter();
 
  /*
  * Sleep until we are signaled or it's time for another checkpoint or
@@ -693,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
  /*
  * Report interim activity statistics to the stats collector.
  */
- pgstat_send_bgwriter();
+ pgstat_report_bgwriter();
 
  /*
  * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
  LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
  /* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+ BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+ BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
  CheckpointerShmem->num_backend_writes = 0;
  CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 6fe7a136ba..f0b524ca50 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -401,7 +401,7 @@ pgarch_ArchiverCopyLoop(void)
  * Tell the collector about the WAL file that we successfully
  * archived
  */
- pgstat_send_archiver(xlog, false);
+ pgstat_report_archiver(xlog, false);
 
  break; /* out of inner retry loop */
  }
@@ -411,7 +411,7 @@ pgarch_ArchiverCopyLoop(void)
  * Tell the collector about the WAL file that we failed to
  * archive
  */
- pgstat_send_archiver(xlog, true);
+ pgstat_report_archiver(xlog, true);
 
  if (++failures >= NUM_ARCHIVE_RETRIES)
  {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f9287b7942..34a4005791 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- * All the statistics collector stuff hacked up in one big, ugly file.
+ * Activity Statistics facility.
  *
- * TODO: - Separate collector, postmaster and backend stuff
- *  into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. pg_count_*() and friends are the interface used to store
+ *  backend activity locally during a transaction. pgstat_flush_stat() is then
+ *  called at the end of a transaction to publish the local stats to shared memory.
  *
- * - Add some automatic call for pgstat vacuuming.
+ *  To avoid contention on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). In the case where
+ *  not all of the local numbers can be flushed immediately, we postpone the
+ *  updates and retry after the interval of
+ *  PGSTAT_STAT_RETRY_INTERVAL (100ms), but no update is deferred for longer
+ *  than PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- * - Add a pgstat config column to pg_database, so this
- *  entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any.  The last process at
+ *  shutdown writes the shared stats to the file and then destroys the area.
  *
  * Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +36,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL 500 /* Minimum time between stats file
- * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY 10 /* How long to wait between checks for a
- * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME 10000 /* Maximum time to wait for a stats
- * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL 640 /* How often to ping the collector for a
- * new file; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL 500 /* Minimum interval of stats data
+ * updates; in milliseconds. */
 
-#define PGSTAT_RESTART_INTERVAL 60 /* How often to attempt to restart a
- * failed statistics collector; in
- * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL 100 /* Retry interval after
+ * PGSTAT_STAT_MIN_INTERVAL; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Longest interval of stats data
+ * updates; in milliseconds. */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE 16
-#define PGSTAT_TAB_HASH_SIZE 512
+#define PGSTAT_TABLE_HASH_SIZE 512
 #define PGSTAT_FUNCTION_HASH_SIZE 512
 
 
@@ -116,7 +87,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,76 +101,96 @@ int pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char   *pgstat_stat_directory = NULL;
+
+/* No longer used, but kept until the GUCs that set them are removed */
 char   *pgstat_stat_filename = NULL;
 char   *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
-
-/*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
- */
-#define TABSTAT_QUANTUM 100 /* we alloc this many at a time */
-
-typedef struct TabStatusArray
+/* Shared stats bootstrap information, protected by StatsLock */
+typedef struct StatsShmemStruct
 {
- struct TabStatusArray *tsa_next; /* link to next array, if any */
- int tsa_used; /* # entries currently used */
- PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM]; /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+ dsa_handle stats_dsa_handle; /* DSA handle for stats data */
+ dshash_table_handle db_hash_handle;
+ dsa_pointer global_stats;
+ dsa_pointer archiver_stats;
+ int refcount;
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
+
+
+/* parameters for shared hashes */
+static const dshash_parameters dsh_dbparams = {
+ sizeof(Oid),
+ SHARED_DBENT_SIZE,
+ dshash_memcmp,
+ dshash_memhash,
+ LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+ sizeof(Oid),
+ sizeof(PgStat_StatTabEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+ sizeof(Oid),
+ sizeof(PgStat_StatFuncEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ LWTRANCHE_STATS
+};
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * Backends store per-table info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by table OID).
  */
-typedef struct TabStatHashEntry
-{
- Oid t_id;
- PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+static HTAB *pgStatTables = NULL;
 
 /*
- * Hash table for O(1) t_id -> tsa_entry lookup
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
-static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatFunctions = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store database-wide counters that are waiting to be flushed out
+ * to shared memory.
  */
-static HTAB *pgStatFunctions = NULL;
+static PgStat_TableCounts pgStatMyDatabaseStats = {0};
+static PgStat_TableCounts pgStatSharedDatabaseStats = {0};
 
 /*
  * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * written out to the shared stats.
  */
+static bool have_mydatabase_stats = false;
+static bool have_shdatabase_stats = false;
+static bool have_table_stats = false;
 static bool have_function_stats = false;
 
+/* common header of a snapshot entry in the reader's snapshot hash */
+typedef struct PgStat_snapshot
+{
+ Oid key;
+ bool negative;
+ void   *body; /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* Hash entry struct for the checksum_failures hash table below */
+typedef struct ChecksumFailureEnt
+{
+ Oid dboid;
+ int count;
+} ChecksumFailureEnt;
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -236,11 +226,15 @@ typedef struct TwoPhasePgStatRecord
  bool t_truncated; /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool clear_snapshot = false;
+
+/* Counts of checksum failures for each database */
+HTAB   *checksum_failures = NULL;
+int nchecksum_failures = 0;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +243,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster-wide statistics.
+ *
+ * Contains statistics that are not collected on a per-database or per-table
+ * basis.  shared_* point into shared memory and snapshot_* are backend-local
+ * snapshots. Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -275,33 +267,34 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool exclusive,
+   bool nowait, bool create);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static void pgstat_flush_dbstats(bool shared, bool nowait);
+static bool pgstat_flush_tabstats(Oid dbid, dshash_table_handle tabhandle,
+  bool nowait);
+static bool pgstat_flush_funcstats(dshash_table_handle funchandle, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+   PgStat_TableStatus *stat, bool nowait);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+  const dshash_parameters *dshparams,
+  HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_checksum_failure(bool nowait);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_snapshot(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -309,484 +302,210 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- * Called from postmaster at startup. Create the resources required
- * by the statistics collector process.  If unable to do so, do not
- * fail --- better to let the postmaster start with stats collection
- * disabled.
- * ----------
+/*
+ * StatsShmemSize
+ * Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+ return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
- ACCEPT_TYPE_ARG3 alen;
- struct addrinfo *addrs = NULL,
-   *addr,
- hints;
- int ret;
- fd_set rset;
- struct timeval tv;
- char test_byte;
- int sel_res;
- int tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+ bool found;
 
- /*
- * This static assertion verifies that we didn't mess up the calculations
- * involved in selecting maximum payload sizes for our UDP messages.
- * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
- * be silent performance loss from fragmentation, it seems worth having a
- * compile-time cross-check that we didn't.
- */
- StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
- "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+ StatsShmem = (StatsShmemStruct *)
+ ShmemInitStruct("Stats area", StatsShmemSize(),
+ &found);
 
- /*
- * Create the UDP socket for sending and receiving statistic messages
- */
- hints.ai_flags = AI_PASSIVE;
- hints.ai_family = AF_UNSPEC;
- hints.ai_socktype = SOCK_DGRAM;
- hints.ai_protocol = 0;
- hints.ai_addrlen = 0;
- hints.ai_addr = NULL;
- hints.ai_canonname = NULL;
- hints.ai_next = NULL;
- ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
- if (ret || !addrs)
+ if (!IsUnderPostmaster)
  {
- ereport(LOG,
- (errmsg("could not resolve \"localhost\": %s",
- gai_strerror(ret))));
- goto startup_failed;
- }
+ Assert(!found);
 
- /*
- * On some platforms, pg_getaddrinfo_all() may return multiple addresses
- * only one of which will actually work (eg, both IPv6 and IPv4 addresses
- * when kernel will reject IPv6).  Worse, the failure may occur at the
- * bind() or perhaps even connect() stage.  So we must loop through the
- * results till we find a working combination. We will generate LOG
- * messages, but no error, for bogus combinations.
- */
- for (addr = addrs; addr; addr = addr->ai_next)
- {
-#ifdef HAVE_UNIX_SOCKETS
- /* Ignore AF_UNIX sockets, if any are returned. */
- if (addr->ai_family == AF_UNIX)
- continue;
-#endif
+ StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+ }
+}
 
- if (++tries > 1)
- ereport(LOG,
- (errmsg("trying another address for the statistics collector")));
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ * Attach to the shared stats memory, creating it if necessary.
+ * ----------
+ */
+static void
+pgstat_attach_shared_stats(void)
+{
+ PgStat_StatDBEntry *dbent;
 
- /*
- * Create the socket.
- */
- if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not create socket for statistics collector: %m")));
- continue;
- }
+ MemoryContext oldcontext;
 
- /*
- * Bind it to a kernel assigned port on localhost and get the assigned
- * port via getsockname().
- */
- if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not bind socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ /*
+ * Don't use DSM unless running under the postmaster and tracking counts.
+ */
+ if (!pgstat_track_counts || !IsUnderPostmaster)
+ return;
 
- alen = sizeof(pgStatAddr);
- if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not get address of socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ pgstat_setup_memcxt();
 
- /*
- * Connect the socket to its own address.  This saves a few cycles by
- * not having to respecify the target address on every send. This also
- * provides a kernel-level check that only packets from this same
- * address will be received.
- */
- if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not connect socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ if (area)
+ return;
 
- /*
- * Try to send and receive a one-byte test message on the socket. This
- * is to catch situations where the socket can be created but will not
- * actually pass data (for instance, because kernel packet filtering
- * rules prevent it).
- */
- test_byte = TESTBYTEVAL;
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
- if (send(pgStatSock, &test_byte, 1, 0) != 1)
- {
- if (errno == EINTR)
- goto retry1; /* if interrupted, just retry */
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not send test message on socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
- /*
- * There could possibly be a little delay before the message can be
- * received.  We arbitrarily allow up to half a second before deciding
- * it's broken.
- */
- for (;;) /* need a loop to handle EINTR */
- {
- FD_ZERO(&rset);
- FD_SET(pgStatSock, &rset);
+ if (StatsShmem->refcount > 0)
+ StatsShmem->refcount++;
+ else
+ {
+ /* Need to create shared memory area and load saved stats if any. */
+ Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
- tv.tv_sec = 0;
- tv.tv_usec = 500000;
- sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
- if (sel_res >= 0 || errno != EINTR)
- break;
- }
- if (sel_res < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("select() failed in statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
- if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
- {
- /*
- * This is the case we actually think is likely, so take pains to
- * give a specific message for it.
- *
- * errno will not be set meaningfully here, so don't use it.
- */
- ereport(LOG,
- (errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("test message did not get through on socket for statistics collector")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ /* Initialize shared memory area */
+ area = dsa_create(LWTRANCHE_STATS);
+ pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
- test_byte++; /* just make sure variable is changed */
+ StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+ StatsShmem->global_stats =
+ dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+ StatsShmem->archiver_stats =
+ dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+ StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-retry2:
- if (recv(pgStatSock, &test_byte, 1, 0) != 1)
- {
- if (errno == EINTR)
- goto retry2; /* if interrupted, just retry */
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not receive test message on socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ shared_globalStats = (PgStat_GlobalStats *)
+ dsa_get_address(area, StatsShmem->global_stats);
+ shared_archiverStats = (PgStat_ArchiverStats *)
+ dsa_get_address(area, StatsShmem->archiver_stats);
 
- if (test_byte != TESTBYTEVAL) /* strictly paranoia ... */
- {
- ereport(LOG,
- (errcode(ERRCODE_INTERNAL_ERROR),
- errmsg("incorrect test message transmission on socket for statistics collector")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ /* Load saved data if any. */
+ pgstat_read_statsfiles();
 
- /* If we get here, we have a working socket */
- break;
+ StatsShmem->refcount = 1;
  }
 
- /* Did we find a working address? */
- if (!addr || pgStatSock == PGINVALID_SOCKET)
- goto startup_failed;
+ LWLockRelease(StatsLock);
 
  /*
- * Set the socket to non-blocking IO.  This ensures that if the collector
- * falls behind, statistics messages will be discarded; backends won't
- * block waiting to send messages to the collector.
+ * If we're not the first process, attach to the existing shared stats
+ * area outside StatsLock.
  */
- if (!pg_set_noblock(pgStatSock))
+ if (!area)
  {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not set statistics collector socket to nonblocking mode: %m")));
- goto startup_failed;
- }
-
- /*
- * Try to ensure that the socket's receive buffer is at least
- * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
- * data.  Use of UDP protocol means that we are willing to lose data under
- * heavy load, but we don't want it to happen just because of ridiculously
- * small default buffer sizes (such as 8KB on older Windows versions).
- */
- {
- int old_rcvbuf;
- int new_rcvbuf;
- ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
- if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-   (char *) &old_rcvbuf, &rcvbufsize) < 0)
- {
- elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
- /* if we can't get existing size, always try to set it */
- old_rcvbuf = 0;
- }
-
- new_rcvbuf = PGSTAT_MIN_RCVBUF;
- if (old_rcvbuf < new_rcvbuf)
- {
- if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-   (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
- elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
- }
+ /* Shared area already exists. Just attach to it. */
+ area = dsa_attach(StatsShmem->stats_dsa_handle);
+ pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+ StatsShmem->db_hash_handle, 0);
+
+ /* Setup local variables */
+ pgStatLocalHash = NULL;
+ shared_globalStats = (PgStat_GlobalStats *)
+ dsa_get_address(area, StatsShmem->global_stats);
+ shared_archiverStats = (PgStat_ArchiverStats *)
+ dsa_get_address(area, StatsShmem->archiver_stats);
  }
 
- pg_freeaddrinfo_all(hints.ai_family, addrs);
-
- /* Now that we have a long-lived socket, tell fd.c about it. */
- ReserveExternalFD();
-
- return;
-
-startup_failed:
- ereport(LOG,
- (errmsg("disabling statistics collector for lack of working socket")));
-
- if (addrs)
- pg_freeaddrinfo_all(hints.ai_family, addrs);
-
- if (pgStatSock != PGINVALID_SOCKET)
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
+ MemoryContextSwitchTo(oldcontext);
 
  /*
- * Adjust GUC variables to suppress useless activity, and for debugging
- * purposes (seeing track_counts off is a clue that we failed here). We
- * use PGC_S_OVERRIDE because there is no point in trying to turn it back
- * on from postgresql.conf without a restart.
+ * Create DB entries for the current database and the shared catalogs if
+ * they are not created yet.
  */
- SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+ dbent = pgstat_get_db_entry(MyDatabaseId, false, false, true);
+ Assert(dbent);
+ dshash_release_lock(pgStatDBHash, dbent);
+ dbent = pgstat_get_db_entry(InvalidOid, false, false, true);
+ Assert(dbent);
+ dshash_release_lock(pgStatDBHash, dbent);
+
+ /* don't detach automatically */
+ dsa_pin_mapping(area);
+ global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ * Detach from the shared stats. Write them out to the stats file if we're
+ * the last process and were told to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_detach_shared_stats(bool write_stats)
 {
- DIR   *dir;
- struct dirent *entry;
- char fname[MAXPGPATH * 2];
+ /* return immediately if there is nothing to detach */
+ if (!area || !IsUnderPostmaster)
+ return;
+
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
- dir = AllocateDir(directory);
- while ((entry = ReadDir(dir, directory)) != NULL)
+ /* write out the shared stats to file if needed */
+ if (--StatsShmem->refcount < 1)
  {
- int nchars;
- Oid tmp_oid;
+ if (write_stats)
+ pgstat_write_statsfiles();
 
- /*
- * Skip directory entries that don't match the file names we write.
- * See get_dbstat_filename for the database-specific pattern.
- */
- if (strncmp(entry->d_name, "global.", 7) == 0)
- nchars = 7;
- else
- {
- nchars = 0;
- (void) sscanf(entry->d_name, "db_%u.%n",
-  &tmp_oid, &nchars);
- if (nchars <= 0)
- continue;
- /* %u allows leading whitespace, so reject that */
- if (strchr("0123456789", entry->d_name[3]) == NULL)
- continue;
- }
+ /* We're the last process. Invalidate the dsa area handle. */
+ StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+ }
 
- if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
- strcmp(entry->d_name + nchars, "stat") != 0)
- continue;
+ LWLockRelease(StatsLock);
 
- snprintf(fname, sizeof(fname), "%s/%s", directory,
- entry->d_name);
- unlink(fname);
- }
- FreeDir(dir);
+ /*
+ * Detach from the area.  It is destroyed automatically when the last
+ * process detaches from it.
+ */
+ dsa_detach(area);
+
+ area = NULL;
+ pgStatDBHash = NULL;
+ shared_globalStats = NULL;
+ shared_archiverStats = NULL;
+ pgStatLocalHash = NULL;
+ global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
- pgstat_reset_remove_files(pgstat_stat_directory);
- pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
-
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
- char   *av[10];
- int ac = 0;
-
- av[ac++] = "postgres";
- av[ac++] = "--forkcol";
- av[ac++] = NULL; /* filled in by postmaster_forkexec */
-
- av[ac] = NULL;
- Assert(ac < lengthof(av));
-
- return postmaster_forkexec(ac, av);
-}
-#endif /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- * Called from postmaster at startup or after an existing collector
- * died.  Attempt to fire up a fresh statistics collector.
- *
- * Returns PID of child process, or 0 if fail.
- *
- * Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
- time_t curtime;
- pid_t pgStatPid;
+ /* standalone server doesn't use shared stats */
+ if (!IsUnderPostmaster)
+ return;
 
- /*
- * Check that the socket is there, else pgstat_init failed and we can do
- * nothing useful.
- */
- if (pgStatSock == PGINVALID_SOCKET)
- return 0;
+ /* we must have shared stats attached */
+ Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
- /*
- * Do nothing if too soon since last collector start.  This is a safety
- * valve to protect against continuous respawn attempts if the collector
- * is dying immediately at launch.  Note that since we will be re-called
- * from the postmaster main loop, we will get another chance later.
- */
- curtime = time(NULL);
- if ((unsigned int) (curtime - last_pgstat_start_time) <
- (unsigned int) PGSTAT_RESTART_INTERVAL)
- return 0;
- last_pgstat_start_time = curtime;
+ /* Startup must be the only user of shared stats */
+ Assert(StatsShmem->refcount == 1);
 
  /*
- * Okay, fork off the collector.
+ * We could remove the stats files and recreate the shared memory area
+ * directly, but detaching and re-attaching is simpler.
  */
-#ifdef EXEC_BACKEND
- switch ((pgStatPid = pgstat_forkexec()))
-#else
- switch ((pgStatPid = fork_process()))
-#endif
- {
- case -1:
- ereport(LOG,
- (errmsg("could not fork statistics collector: %m")));
- return 0;
-
-#ifndef EXEC_BACKEND
- case 0:
- /* in postmaster child ... */
- InitPostmasterChild();
-
- /* Close the postmaster's sockets */
- ClosePostmasterPorts(false);
-
- /* Drop our connection to postmaster's shared memory, as well */
- dsm_detach_all();
- PGSharedMemoryDetach();
-
- PgstatCollectorMain(0, NULL);
- break;
-#endif
-
- default:
- return (int) pgStatPid;
- }
-
- /* shouldn't get here */
- return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
- last_pgstat_start_time = 0;
+ pgstat_detach_shared_stats(false); /* Don't write */
+ pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,259 +513,441 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  * Must be called by processes that performs DML: tcop/postgres.c, logical
- * receiver processes, SPI worker, etc. to send the so far collected
- * per-table and function usage statistics to the collector.  Note that this
- * is called only when not within a transaction, so it is fair to use
+ * receiver processes, SPI worker, etc. to apply the so far collected
+ * per-table and function usage statistics to the shared statistics hashes.
+ *
+ * Updates are applied no more frequently than the interval of
+ * PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ * failure, if force is false and no pending update has been waiting longer
+ * than PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried
+ * in succeeding calls of this function.
+ *
+ * Returns the time in milliseconds until updates are next applied, or zero
+ * if all pending updates were flushed.
+ *
+ * Note that this is called only outside a transaction, so it is fine to use
  * transaction stop time as an approximation of current time.
- * ----------
+ * ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
- /* we assume this inits to all zeroes: */
- static const PgStat_TableCounts all_zeroes;
- static TimestampTz last_report = 0;
-
+ static TimestampTz next_flush = 0;
+ static TimestampTz pending_since = 0;
  TimestampTz now;
- PgStat_MsgTabstat regular_msg;
- PgStat_MsgTabstat shared_msg;
- TabStatusArray *tsa;
- int i;
+ PgStat_StatDBEntry *dbent;
+ bool nowait = !force; /* don't use force hereafter */
+ long elapsed;
+ long secs;
+ int usecs;
+ dshash_table_handle tables_handle;
+ dshash_table_handle functions_handle;
+ bool process_shared_tables = false;
 
  /* Don't expend a clock check if nothing to do */
- if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
- pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
- !have_function_stats)
- return;
+ if (area == NULL ||
+ (!have_table_stats && !have_function_stats &&
+ !have_mydatabase_stats && !have_shdatabase_stats &&
+ pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+ checksum_failures == NULL))
+ return 0;
 
- /*
- * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
- * msec since we last sent one, or the caller wants to force stats out.
- */
  now = GetCurrentTransactionStopTimestamp();
- if (!force &&
- !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
- return;
- last_report = now;
-
- /*
- * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
- * entries it points to.  (Should we fail partway through the loop below,
- * it's okay to have removed the hashtable already --- the only
- * consequence is we'd get multiple entries for the same table in the
- * pgStatTabList, and that's safe.)
- */
- if (pgStatTabHash)
- hash_destroy(pgStatTabHash);
- pgStatTabHash = NULL;
 
- /*
- * Scan through the TabStatusArray struct(s) to find tables that actually
- * have counts, and build messages to send.  We have to separate shared
- * relations from regular ones because the databaseid field in the message
- * header has to depend on that.
- */
- regular_msg.m_databaseid = MyDatabaseId;
- shared_msg.m_databaseid = InvalidOid;
- regular_msg.m_nentries = 0;
- shared_msg.m_nentries = 0;
-
- for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+ if (nowait)
  {
- for (i = 0; i < tsa->tsa_used; i++)
+ /*
+ * Don't flush stats unless it's time to do so.  Return the time to
+ * wait in milliseconds.
+ */
+ if (now < next_flush)
  {
- PgStat_TableStatus *entry = &tsa->tsa_entries[i];
- PgStat_MsgTabstat *this_msg;
- PgStat_TableEntry *this_ent;
+ /* Record the time of the oldest pending update, if not done yet. */
+ if (pending_since == 0)
+ pending_since = now;
 
- /* Shouldn't have any pending transaction-dependent counts */
- Assert(entry->trans == NULL);
+ /* now < next_flush here */
+ return (next_flush - now) / 1000;
+ }
 
- /*
- * Ignore entries that didn't accumulate any actual counts, such
- * as indexes that were opened by the planner but not used.
- */
- if (memcmp(&entry->t_counts, &all_zeroes,
-   sizeof(PgStat_TableCounts)) == 0)
- continue;
+ /*
+ * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
+ */
+ if (pending_since > 0)
+ {
+ TimestampDifference(pending_since, now, &secs, &usecs);
+ elapsed = secs * 1000 + usecs / 1000;
 
- /*
- * OK, insert data into the appropriate message, and send if full.
- */
- this_msg = entry->t_shared ? &shared_msg : &regular_msg;
- this_ent = &this_msg->m_entry[this_msg->m_nentries];
- this_ent->t_id = entry->t_id;
- memcpy(&this_ent->t_counts, &entry->t_counts,
-   sizeof(PgStat_TableCounts));
- if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
- {
- pgstat_send_tabstat(this_msg);
- this_msg->m_nentries = 0;
- }
+ if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+ nowait = false;
  }
- /* zero out PgStat_TableStatus structs after use */
- MemSet(tsa->tsa_entries, 0,
-   tsa->tsa_used * sizeof(PgStat_TableStatus));
- tsa->tsa_used = 0;
  }
 
- /*
- * Send partial messages.  Make sure that any pending xact commit/abort
- * gets counted, even if there are no table stats to send.
- */
- if (regular_msg.m_nentries > 0 ||
- pgStatXactCommit > 0 || pgStatXactRollback > 0)
- pgstat_send_tabstat(&regular_msg);
- if (shared_msg.m_nentries > 0)
- pgstat_send_tabstat(&shared_msg);
-
- /* Now, send function statistics */
- pgstat_send_funcstats();
-}
+ /* Flush out individual stats tables */
+ dbent = pgstat_get_db_entry(MyDatabaseId, false, nowait, false);
+ if (dbent)
+ {
+ tables_handle = dbent->tables;
+ functions_handle = dbent->functions;
+ dshash_release_lock(pgStatDBHash, dbent);
+ }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
+ /* dbent must not be dereferenced now; non-NULL just means it was acquired */
+ if (dbent)
+ {
+ process_shared_tables =
+ pgstat_flush_tabstats(MyDatabaseId, tables_handle, nowait);
+ pgstat_flush_funcstats(functions_handle, nowait);
+ }
+ else
+ {
+ /* uncertain whether shared table stats exist; try them */
+ process_shared_tables = true;
+ }
+
+ /* update database-side stats */
+ pgstat_flush_checksum_failure(nowait);
+ pgstat_flush_dbstats(false, nowait); /* MyDatabase */
+
+ if (process_shared_tables)
+ {
+ /* shared tables found, process them */
+ dbent = pgstat_get_db_entry(InvalidOid, false, nowait, false);
+ if (dbent)
+ {
+ tables_handle = dbent->tables;
+ dshash_release_lock(pgStatDBHash, dbent);
+ }
+
+ if (dbent)
+ pgstat_flush_tabstats(InvalidOid, tables_handle, nowait);
+ }
+ pgstat_flush_dbstats(true, nowait); /* Shared tables */
+
+ /* Publish the last flush time */
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ if (shared_globalStats->stats_timestamp < now)
+ shared_globalStats->stats_timestamp = now;
+ LWLockRelease(StatsLock);
+
+ /* Record how long we have been keeping pending updates. */
+ if (have_table_stats || have_function_stats ||
+ have_mydatabase_stats || have_shdatabase_stats ||
+ checksum_failures != NULL)
+ {
+ /* Preserve the first value */
+ if (pending_since == 0)
+ pending_since = now;
+
+ /*
+ * The retry may make us exceed PGSTAT_STAT_MAX_INTERVAL by up to
+ * PGSTAT_STAT_RETRY_INTERVAL.  We don't bother about that small
+ * excess.
+ */
+ return PGSTAT_STAT_RETRY_INTERVAL;
+ }
+
+ /* Set the next time to update stats */
+ next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+ pending_since = 0;
+
+ return 0;
+}
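The min/max-interval dance above is easy to misread, so here is a toy model of the decision (my constants, names, and signature, not the patch's): postpone until PGSTAT_STAT_MIN_INTERVAL has passed, try locks without blocking, and only fall back to blocking once something has been pending longer than PGSTAT_STAT_MAX_INTERVAL.

```c
#include <stdbool.h>

/* Hypothetical constants mirroring the patch, in milliseconds. */
#define PGSTAT_STAT_MIN_INTERVAL   500
#define PGSTAT_STAT_MAX_INTERVAL  1000

/*
 * decide_flush - toy model of pgstat_report_stat()'s timing policy.
 *
 * Returns true when stats should be flushed now.  *nowait is set to
 * true when locks should merely be tried (non-forced path), and is
 * demoted to false once the oldest pending update has waited longer
 * than PGSTAT_STAT_MAX_INTERVAL, so that the flush blocks on locks.
 */
static bool
decide_flush(bool force, long now, long next_flush, long pending_since,
             bool *nowait)
{
    *nowait = !force;

    if (!force)
    {
        /* too early: postpone and let the caller retry later */
        if (now < next_flush)
            return false;

        /* pending too long: stop skipping locked entries */
        if (pending_since > 0 &&
            now - pending_since > PGSTAT_STAT_MAX_INTERVAL)
            *nowait = false;
    }
    return true;
}
```

A forced call always flushes and blocks; a non-forced call before `next_flush` does nothing at all.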
+
+
+/*
+ * pgstat_flush_dbstats: Flushes database stats out to shared statistics.
+ *
+ *  If nowait is true, returns immediately if the required lock could not be
+ *  acquired.
+ */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_flush_dbstats(bool shared, bool nowait)
 {
- int n;
- int len;
+ PgStat_StatDBEntry *dbent;
+ PgStat_TableCounts *s;
 
- /* It's unlikely we'd get here with no socket, but maybe not impossible */
- if (pgStatSock == PGINVALID_SOCKET)
- return;
+ if (shared)
+ {
+ if (!have_shdatabase_stats)
+ return;
+ dbent = pgstat_get_db_entry(InvalidOid, false, nowait, false);
+ if (!dbent)
+ return;
 
- /*
- * Report and reset accumulated xact commit/rollback and I/O timings
- * whenever we send a normal tabstat message
- */
- if (OidIsValid(tsmsg->m_databaseid))
+ s = &pgStatSharedDatabaseStats;
+ have_shdatabase_stats = false;
+ }
+ else
+ {
+ if (!have_mydatabase_stats)
+ return;
+ dbent = pgstat_get_db_entry(MyDatabaseId, false, nowait, false);
+ if (!dbent)
+ return;
+
+ s = &pgStatMyDatabaseStats;
+ have_mydatabase_stats = false;
+ }
+
+ /* We got the database entry, update database-wide stats */
+ LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+ dbent->counts.n_tuples_returned += s->t_tuples_returned;
+ dbent->counts.n_tuples_fetched += s->t_tuples_fetched;
+ dbent->counts.n_tuples_inserted += s->t_tuples_inserted;
+ dbent->counts.n_tuples_updated += s->t_tuples_updated;
+ dbent->counts.n_tuples_deleted += s->t_tuples_deleted;
+ dbent->counts.n_blocks_fetched += s->t_blocks_fetched;
+ dbent->counts.n_blocks_hit += s->t_blocks_hit;
+
+ if (!shared)
  {
- tsmsg->m_xact_commit = pgStatXactCommit;
- tsmsg->m_xact_rollback = pgStatXactRollback;
- tsmsg->m_block_read_time = pgStatBlockReadTime;
- tsmsg->m_block_write_time = pgStatBlockWriteTime;
+ dbent->counts.n_xact_commit += pgStatXactCommit;
+ dbent->counts.n_xact_rollback += pgStatXactRollback;
+ dbent->counts.n_block_read_time += pgStatBlockReadTime;
+ dbent->counts.n_block_write_time += pgStatBlockWriteTime;
  pgStatXactCommit = 0;
  pgStatXactRollback = 0;
  pgStatBlockReadTime = 0;
  pgStatBlockWriteTime = 0;
  }
- else
+ LWLockRelease(&dbent->lock);
+
+ dshash_release_lock(pgStatDBHash, dbent);
+}
+
+/*
+ * pgstat_flush_tabstats: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, entries whose lock cannot be acquired immediately are
+ *  left alone in pgStatTables, to be retried on the next call.
+ *
+ *  Returns true if entries for another database are found in pgStatTables.
+ */
+static bool
+pgstat_flush_tabstats(Oid dbid, dshash_table_handle tabhandle, bool nowait)
+{
+ static const PgStat_TableCounts all_zeroes;
+
+ HASH_SEQ_STATUS scan;
+ PgStat_TableStatus *bestat;
+ dshash_table *tabhash;
+ bool anotherdb_found = false;
+
+ /* nothing to do, just return */
+ if (!have_table_stats)
+ return false;
+
+ have_table_stats = false;
+
+ tabhash = dshash_attach(area, &dsh_tblparams, tabhandle, 0);
+
+ /*
+ * Scan through the pgStatTables to find tables that actually have counts,
+ * and try flushing them out to shared stats.
+ */
+ hash_seq_init(&scan, pgStatTables);
+ while ((bestat = (PgStat_TableStatus *) hash_seq_search(&scan)) != NULL)
  {
- tsmsg->m_xact_commit = 0;
- tsmsg->m_xact_rollback = 0;
- tsmsg->m_block_read_time = 0;
- tsmsg->m_block_write_time = 0;
+ bool remove_entry = false;
+
+ /*
+ * Ignore entries that didn't accumulate any actual counts, such as
+ * indexes that were opened by the planner but not used.
+ */
+ if (memcmp(&bestat->t_counts, &all_zeroes,
+   sizeof(PgStat_TableCounts)) == 0)
+ remove_entry = true;
+ /* Ignore entries of databases other than our current target */
+ else if (dbid != (bestat->t_shared ? InvalidOid : MyDatabaseId))
+ anotherdb_found = true;
+ else if (pgstat_update_tabentry(tabhash, bestat, nowait))
+ {
+ PgStat_TableCounts *s;
+
+ if (bestat->t_shared)
+ {
+ s = &pgStatSharedDatabaseStats;
+ have_shdatabase_stats = true;
+ }
+ else
+ {
+ s = &pgStatMyDatabaseStats;
+ have_mydatabase_stats = true;
+ }
+
+ /* database-wide counts are applied in one go later */
+ s->t_tuples_returned += bestat->t_counts.t_tuples_returned;
+ s->t_tuples_fetched += bestat->t_counts.t_tuples_fetched;
+ s->t_tuples_inserted += bestat->t_counts.t_tuples_inserted;
+ s->t_tuples_updated += bestat->t_counts.t_tuples_updated;
+ s->t_tuples_deleted += bestat->t_counts.t_tuples_deleted;
+ s->t_blocks_fetched += bestat->t_counts.t_blocks_fetched;
+ s->t_blocks_hit += bestat->t_counts.t_blocks_hit;
+
+ remove_entry = true;
+ }
+
+ if (remove_entry)
+ {
+ /*
+ * pgstat_initstats detects reuse of the entry via t_id.  Invalidate
+ * it only after the removal, since the value is used as the hash
+ * key.
+ */
+ hash_search(pgStatTables, &bestat->t_id, HASH_REMOVE, NULL);
+ bestat->t_id = InvalidOid;
+ }
+ else
+ have_table_stats = true;
  }
 
- n = tsmsg->m_nentries;
- len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
- n * sizeof(PgStat_TableEntry);
+ dshash_detach(tabhash);
 
- pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
- pgstat_send(tsmsg, len);
+ return anotherdb_found;
 }
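The zero-entry skip above relies on `memcmp()` against a static all-zeroes struct. A minimal reproduction of that idiom (my struct, not PgStat_TableCounts; it works because static objects are zero-initialized and a struct of uniformly sized members has no padding):

```c
#include <string.h>
#include <stdbool.h>

/* Toy counter struct standing in for PgStat_TableCounts. */
typedef struct counts
{
    long tuples_inserted;
    long tuples_deleted;
    long blocks_fetched;
} counts;

static bool
counts_all_zero(const counts *c)
{
    /* static objects are zero-initialized, so this is a valid template */
    static const counts all_zeroes;

    return memcmp(c, &all_zeroes, sizeof(counts)) == 0;
}
```

Note that this comparison is only safe when the struct has no padding bytes, or when all instances are zero-initialized before use, which is how the patch handles its entries.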
 
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, entries whose lock cannot be acquired immediately are
+ *  left alone in the local hash.
+ *
+ *  Returns true if some entries are left unflushed.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(dshash_table_handle funchandle, bool nowait)
 {
  /* we assume this inits to all zeroes: */
  static const PgStat_FunctionCounts all_zeroes;
+ HASH_SEQ_STATUS scan;
+ PgStat_BackendFunctionEntry *bestat;
+ dshash_table *funchash = NULL;
 
- PgStat_MsgFuncstat msg;
- PgStat_BackendFunctionEntry *entry;
- HASH_SEQ_STATUS fstat;
-
+ /* nothing to do, just return */
  if (pgStatFunctions == NULL)
- return;
+ return false;
+
+ have_function_stats = false;
+
+ /* dshash for function stats is created on-demand */
+ if (funchandle == DSM_HANDLE_INVALID)
+ {
+ PgStat_StatDBEntry *dbent =
+ pgstat_get_db_entry(MyDatabaseId, false, false, false);
+
+ Assert(dbent);
+
+ funchash = dshash_create(area, &dsh_funcparams, 0);
+
+ LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+ if (dbent->functions == DSM_HANDLE_INVALID)
+ funchandle = dbent->functions =
+ dshash_get_hash_table_handle(funchash);
+ else
+ {
+ /* someone else created it concurrently; discard ours. */
+ dshash_destroy(funchash);
+ funchandle = dbent->functions;
+ }
+ LWLockRelease(&dbent->lock);
+
+ /* dbent is no longer needed, release it right now */
+ dshash_release_lock(pgStatDBHash, dbent);
+ }
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
- msg.m_databaseid = MyDatabaseId;
- msg.m_nentries = 0;
+ if (funchash == NULL)
+ funchash = dshash_attach(area, &dsh_funcparams, funchandle, 0);
 
- hash_seq_init(&fstat, pgStatFunctions);
- while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+ /*
+ * Scan through the pgStatFunctions to find functions that actually have
+ * counts, and try flushing them out to shared stats.
+ */
+ hash_seq_init(&scan, pgStatFunctions);
+ while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&scan)) != NULL)
  {
- PgStat_FunctionEntry *m_ent;
+ bool found;
+ PgStat_StatFuncEntry *shstat = NULL;
 
- /* Skip it if no counts accumulated since last time */
- if (memcmp(&entry->f_counts, &all_zeroes,
+ /* Skip it if no counts accumulated for it so far */
+ if (memcmp(&bestat->f_counts, &all_zeroes,
    sizeof(PgStat_FunctionCounts)) == 0)
  continue;
 
- /* need to convert format of time accumulators */
- m_ent = &msg.m_entry[msg.m_nentries];
- m_ent->f_id = entry->f_id;
- m_ent->f_numcalls = entry->f_counts.f_numcalls;
- m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
- m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+ shstat = (PgStat_StatFuncEntry *)
+ dshash_find_extended(funchash, (void *) &(bestat->f_id),
+ true, nowait, true, &found);
 
- if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+ /*
+ * We couldn't acquire lock on the required entry. Leave the local
+ * entry alone.
+ */
+ if (!shstat)
  {
- pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
- msg.m_nentries * sizeof(PgStat_FunctionEntry));
- msg.m_nentries = 0;
+ have_function_stats = true;
+ continue;
  }
 
- /* reset the entry's counts */
- MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
- }
+ /* Initialize the shared entry if it's new; otherwise accumulate into it. */
+ if (!found)
+ {
+ shstat->functionid = bestat->f_id;
+ shstat->f_numcalls = bestat->f_counts.f_numcalls;
+ shstat->f_total_time =
+ INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+ shstat->f_self_time =
+ INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+ }
+ else
+ {
+ shstat->f_numcalls += bestat->f_counts.f_numcalls;
+ shstat->f_total_time +=
+ INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+ shstat->f_self_time +=
+ INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+ }
+ dshash_release_lock(funchash, shstat);
 
- if (msg.m_nentries > 0)
- pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
- msg.m_nentries * sizeof(PgStat_FunctionEntry));
+ /* reset used counts */
+ MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+ }
 
- have_function_stats = false;
+ return have_function_stats;
 }
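The on-demand creation of the function-stats dshash above has a classic create-or-adopt race: build a table optimistically, publish it under the dbentry lock, and discard yours if another backend won. A toy sketch of that pattern (the types, names, and the simulated lock are mine, not the PostgreSQL dshash API):

```c
#include <stdlib.h>

/* Toy stand-ins for dshash_create()/dshash_destroy(). */
typedef struct toy_hash { int unused; } toy_hash;

static toy_hash *published;     /* plays the role of dbent->functions */

static toy_hash *toy_create(void) { return calloc(1, sizeof(toy_hash)); }
static void      toy_destroy(toy_hash *h) { free(h); }

/*
 * attach_function_hash - create the shared hash on demand, racing
 * against other backends: build one optimistically, publish it under
 * the (simulated) dbentry lock, and discard ours if we lost the race.
 */
static toy_hash *
attach_function_hash(void)
{
    toy_hash *mine = toy_create();

    /* ... LWLockAcquire(&dbent->lock, LW_EXCLUSIVE) ... */
    if (published == NULL)
        published = mine;       /* we won: our table becomes shared */
    else
    {
        toy_destroy(mine);      /* someone beat us: use theirs */
        mine = published;
    }
    /* ... LWLockRelease(&dbent->lock) ... */
    return mine;
}
```

Creating before taking the lock keeps the critical section short, at the cost of a wasted create/destroy cycle for the losers of the race.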
 
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- * Will tell the collector about objects he can get rid of.
+ * Drop stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
- HTAB   *htab;
- PgStat_MsgTabpurge msg;
- PgStat_MsgFuncpurge f_msg;
- HASH_SEQ_STATUS hstat;
+ HTAB   *oidtab;
+ dshash_seq_status dshstat;
  PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
- int len;
+ dshash_table_handle tables_handle;
+ dshash_table_handle functions_handle;
 
- if (pgStatSock == PGINVALID_SOCKET)
+ /* we don't collect stats in standalone mode */
+ if (!IsUnderPostmaster)
  return;
 
- /*
- * If not done for this transaction, read the statistics collector stats
- * file into some hash tables.
- */
- backend_read_statsfile();
-
  /*
  * Read pg_database and make a list of OIDs of all existing databases
  */
- htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+ oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
  /*
- * Search the database hash table for dead databases and tell the
- * collector to drop them.
+ * Search the database hash table for dead databases and drop them from
+ * the hash.
  */
- hash_seq_init(&hstat, pgStatDBHash);
- while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+ dshash_seq_init(&dshstat, pgStatDBHash, false);
+ while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
  {
  Oid dbid = dbentry->databaseid;
 
@@ -1054,136 +955,48 @@ pgstat_vacuum_stat(void)
 
  /* the DB entry for shared tables (with InvalidOid) is never dropped */
  if (OidIsValid(dbid) &&
- hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+ hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
  pgstat_drop_database(dbid);
  }
 
  /* Clean up */
- hash_destroy(htab);
+ dshash_seq_term(&dshstat);
+ hash_destroy(oidtab);
 
  /*
  * Lookup our own database entry; if not found, nothing more to do.
+ * The entry for MyDatabaseId cannot be dropped and the dshashes it
+ * points to cannot be destroyed meanwhile, so we can release the lock
+ * right away.
  */
- dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- (void *) &MyDatabaseId,
- HASH_FIND, NULL);
- if (dbentry == NULL || dbentry->tables == NULL)
+ dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+ if (dbentry == NULL)
  return;
+ tables_handle = dbentry->tables;
+ functions_handle = dbentry->functions;
+ dshash_release_lock(pgStatDBHash, dbentry);
 
  /*
  * Similarly to above, make a list of all known relations in this DB.
  */
- htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
- /*
- * Initialize our messages table counter to zero
- */
- msg.m_nentries = 0;
-
- /*
- * Check for all tables listed in stats hashtable if they still exist.
- */
- hash_seq_init(&hstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
- {
- Oid tabid = tabentry->tableid;
-
- CHECK_FOR_INTERRUPTS();
-
- if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
- continue;
-
- /*
- * Not there, so add this table's Oid to the message
- */
- msg.m_tableid[msg.m_nentries++] = tabid;
-
- /*
- * If the message is full, send it out and reinitialize to empty
- */
- if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
- {
- len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
- + msg.m_nentries * sizeof(Oid);
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, len);
-
- msg.m_nentries = 0;
- }
- }
+ oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
  /*
- * Send the rest
+ * Check whether the tables listed in the stats hash table still exist.
+ * The stats cache is useless here, so search the shared hash directly.
  */
- if (msg.m_nentries > 0)
- {
- len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
- + msg.m_nentries * sizeof(Oid);
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, len);
- }
-
- /* Clean up */
- hash_destroy(htab);
+ pgstat_remove_useless_entries(tables_handle, &dsh_tblparams, oidtab);
+ hash_destroy(oidtab);
 
  /*
- * Now repeat the above steps for functions.  However, we needn't bother
- * in the common case where no function stats are being collected.
+ * Repeat the above for functions; we needn't bother in the common case
+ * where no function stats are being collected.
  */
- if (dbentry->functions != NULL &&
- hash_get_num_entries(dbentry->functions) > 0)
+ if (functions_handle != DSM_HANDLE_INVALID)
  {
- htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
- pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
- f_msg.m_databaseid = MyDatabaseId;
- f_msg.m_nentries = 0;
-
- hash_seq_init(&hstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
- {
- Oid funcid = funcentry->functionid;
-
- CHECK_FOR_INTERRUPTS();
-
- if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
- continue;
-
- /*
- * Not there, so add this function's Oid to the message
- */
- f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
- /*
- * If the message is full, send it out and reinitialize to empty
- */
- if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
- {
- len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
- + f_msg.m_nentries * sizeof(Oid);
-
- pgstat_send(&f_msg, len);
-
- f_msg.m_nentries = 0;
- }
- }
-
- /*
- * Send the rest
- */
- if (f_msg.m_nentries > 0)
- {
- len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
- + f_msg.m_nentries * sizeof(Oid);
-
- pgstat_send(&f_msg, len);
- }
-
- hash_destroy(htab);
+ oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+ pgstat_remove_useless_entries(functions_handle, &dsh_funcparams,
+  oidtab);
+ hash_destroy(oidtab);
  }
 }
 
@@ -1212,7 +1025,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
  hash_ctl.entrysize = sizeof(Oid);
  hash_ctl.hcxt = CurrentMemoryContext;
  htab = hash_create("Temporary table of OIDs",
-   PGSTAT_TAB_HASH_SIZE,
+   PGSTAT_TABLE_HASH_SIZE,
    &hash_ctl,
    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1052,96 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
-/* ----------
- * pgstat_drop_database() -
+/*
+ * pgstat_remove_useless_entries - Remove entries that no longer exist from
+ * per-table/function dshashes.
  *
- * Tell the collector that we just dropped a database.
- * (If the message gets lost, we will still clean the dead DB eventually
- * via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *  Scan the dshash specified by dshhandle, removing entries that are not in
+ *  oidtab.  The caller is responsible for destroying oidtab.
  */
 void
-pgstat_drop_database(Oid databaseid)
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+  const dshash_parameters *dshparams,
+  HTAB *oidtab)
 {
- PgStat_MsgDropdb msg;
+ dshash_table *dshtable;
+ dshash_seq_status dshstat;
+ void   *ent;
 
- if (pgStatSock == PGINVALID_SOCKET)
- return;
+ dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+ dshash_seq_init(&dshstat, dshtable, true);
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
- msg.m_databaseid = databaseid;
- pgstat_send(&msg, sizeof(msg));
-}
+ while ((ent = dshash_seq_next(&dshstat)) != NULL)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* The first member of each entry must be its Oid key */
+ if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+ continue;
+
+ /* Not there, so purge this entry */
+ dshash_delete_current(&dshstat);
+ }
 
+ /* clean up */
+ dshash_seq_term(&dshstat);
+ dshash_detach(dshtable);
+}
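The scan-and-prune shape of pgstat_remove_useless_entries() can be checked in isolation. A toy version over plain arrays (my names; the real code deletes via dshash_delete_current() during a sequential scan instead of compacting):

```c
#include <stdbool.h>

/*
 * prune_missing - toy version of pgstat_remove_useless_entries():
 * walk a table of ids and delete every entry whose id is not in the
 * keep-list, compacting in place.
 */
static int
prune_missing(unsigned int *tab, int ntab,
              const unsigned int *keep, int nkeep)
{
    int out = 0;

    for (int i = 0; i < ntab; i++)
    {
        bool found = false;

        for (int k = 0; k < nkeep; k++)
        {
            if (keep[k] == tab[i])
            {
                found = true;
                break;
            }
        }
        if (found)
            tab[out++] = tab[i];    /* keep: the object still exists */
        /* else: dropped, like dshash_delete_current() */
    }
    return out;                     /* new number of entries */
}
```

The real code replaces the inner loop with an O(1) lookup in the oidtab hash, which is why the OID list is collected into a HTAB first.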
 
 /* ----------
- * pgstat_drop_relation() -
+ * pgstat_drop_database() -
  *
- * Tell the collector that we just dropped a relation.
- * (If the message gets lost, we will still clean the dead entry eventually
- * via future invocations of pgstat_vacuum_stat().)
+ * Remove entry for the database that we just dropped.
  *
- * Currently not used for lack of any good place to call it; we rely
- * entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
+ * If some stats are flushed after this, this entry will be re-created but we
+ * will still clean the dead DB eventually via future invocations of
+ * pgstat_vacuum_stat().
  * ----------
  */
-#ifdef NOT_USED
 void
-pgstat_drop_relation(Oid relid)
+pgstat_drop_database(Oid databaseid)
 {
- PgStat_MsgTabpurge msg;
- int len;
+ PgStat_StatDBEntry *dbentry;
+
+ Assert(OidIsValid(databaseid));
+
+ if (!IsUnderPostmaster || !pgStatDBHash)
+ return;
+
+ /*
+ * Look up the database; removal needs an exclusive lock.
+ */
+ dbentry = pgstat_get_db_entry(databaseid, true, false, false);
 
- if (pgStatSock == PGINVALID_SOCKET)
+ if (dbentry == NULL)
  return;
 
- msg.m_tableid[0] = relid;
- msg.m_nentries = 1;
+ /* found, remove it */
 
- len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+ /* Remove table/function stats dshash first. */
+ if (dbentry->tables != DSM_HANDLE_INVALID)
+ {
+ dshash_table *tbl =
+ dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, len);
-}
-#endif /* NOT_USED */
+ dshash_destroy(tbl);
+ }
 
+ if (dbentry->functions != DSM_HANDLE_INVALID)
+ {
+ dshash_table *tbl =
+ dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+ dshash_destroy(tbl);
+ }
+
+ dshash_delete_entry(pgStatDBHash, (void *) dbentry);
+}
 
 /* ----------
  * pgstat_reset_counters() -
  *
- * Tell the statistics collector to reset counters for our database.
+ * Reset counters for our database.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -1306,20 +1150,30 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
- PgStat_MsgResetcounter msg;
+ PgStat_StatDBEntry *dbentry;
+
+ if (!pgStatDBHash)
+ return;
 
- if (pgStatSock == PGINVALID_SOCKET)
+ /*
+ * Look up the database in the hash table.  Nothing to do if not there.
+ * We keep working on the dbentry below, so we cannot release it earlier.
+ */
+ dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+ if (!dbentry)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, sizeof(msg));
+ /* Reset database-level stats. */
+ reset_dbentry_counters(dbentry);
+
+ dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- * Tell the statistics collector to reset cluster-wide shared counters.
+ * Reset cluster-wide shared counters.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -1328,29 +1182,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
- PgStat_MsgResetsharedcounter msg;
-
- if (pgStatSock == PGINVALID_SOCKET)
- return;
-
+ /* Reset the archiver statistics for the cluster. */
  if (strcmp(target, "archiver") == 0)
- msg.m_resettarget = RESET_ARCHIVER;
+ {
+ TimestampTz now = GetCurrentTimestamp();
+
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+ shared_archiverStats->stat_reset_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
+ /* Reset the bgwriter statistics for the cluster. */
  else if (strcmp(target, "bgwriter") == 0)
- msg.m_resettarget = RESET_BGWRITER;
+ {
+ TimestampTz now = GetCurrentTimestamp();
+
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+ shared_globalStats->stat_reset_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
  else
  ereport(ERROR,
  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("unrecognized reset target: \"%s\"", target),
  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
- pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- * Tell the statistics collector to reset a single counter.
+ * Reset a single counter.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -1359,17 +1221,39 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
- PgStat_MsgResetsinglecounter msg;
+ PgStat_StatDBEntry *dbentry;
+ TimestampTz ts;
+ dshash_table *t;
 
- if (pgStatSock == PGINVALID_SOCKET)
- return;
+ dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+ Assert(dbentry);
+
+ /* Set the reset timestamp for the whole database */
+ ts = GetCurrentTimestamp();
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ dbentry->stat_reset_timestamp = ts;
+ LWLockRelease(&dbentry->lock);
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
- msg.m_databaseid = MyDatabaseId;
- msg.m_resettype = type;
- msg.m_objectid = objoid;
+ /*
+ * The entry for MyDatabaseId cannot be dropped and the dshashes it
+ * points to cannot be destroyed meanwhile, so we can release the lock
+ * right now.
+ */
+ dshash_release_lock(pgStatDBHash, dbentry);
+
+ /* Remove object if it exists, ignore if not */
+ if (type == RESET_TABLE)
+ {
+ t = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+ dshash_delete_key(t, (void *) &objoid);
+ dshash_detach(t);
+ }
 
- pgstat_send(&msg, sizeof(msg));
+ if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+ {
+ t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+ dshash_delete_key(t, (void *) &objoid);
+ dshash_detach(t);
+ }
 }
 
 /* ----------
@@ -1383,48 +1267,87 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
- PgStat_MsgAutovacStart msg;
+ PgStat_StatDBEntry *dbentry;
+ TimestampTz ts;
 
- if (pgStatSock == PGINVALID_SOCKET)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
- msg.m_databaseid = dboid;
- msg.m_start_time = GetCurrentTimestamp();
+ ts = GetCurrentTimestamp();
+
+ /*
+ * Store the last autovacuum time in the database's hash table entry.
+ */
+ dbentry = pgstat_get_db_entry(dboid, false, false, true);
+ Assert(dbentry);
+
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ dbentry->last_autovac_time = ts;
+ LWLockRelease(&dbentry->lock);
 
- pgstat_send(&msg, sizeof(msg));
+ dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- * Tell the collector about the table we just vacuumed.
+ * Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
  PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
- PgStat_MsgVacuum msg;
+ Oid dboid;
+ PgStat_StatDBEntry *dbentry;
+ PgStat_StatTabEntry *tabentry;
+ dshash_table_handle table_handle;
+ dshash_table *table;
+ TimestampTz ts;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
- msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
- msg.m_tableoid = tableoid;
- msg.m_autovacuum = IsAutoVacuumWorkerProcess();
- msg.m_vacuumtime = GetCurrentTimestamp();
- msg.m_live_tuples = livetuples;
- msg.m_dead_tuples = deadtuples;
- pgstat_send(&msg, sizeof(msg));
+ dboid = shared ? InvalidOid : MyDatabaseId;
+
+ /*
+ * Store the data in the table's hash table entry. The dshash table cannot
+ * be destroyed meanwhile, so we can release dbentry right now.
+ */
+ dbentry = pgstat_get_db_entry(dboid, false, false, true);
+ Assert(dbentry);
+ table_handle = dbentry->tables;
+ dshash_release_lock(pgStatDBHash, dbentry);
+
+ ts = GetCurrentTimestamp();
+
+ table = dshash_attach(area, &dsh_tblparams, table_handle, 0);
+ tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+ tabentry->n_live_tuples = livetuples;
+ tabentry->n_dead_tuples = deadtuples;
+
+ if (IsAutoVacuumWorkerProcess())
+ {
+ tabentry->autovac_vacuum_timestamp = ts;
+ tabentry->autovac_vacuum_count++;
+ }
+ else
+ {
+ tabentry->vacuum_timestamp = ts;
+ tabentry->vacuum_count++;
+ }
+ dshash_release_lock(table, tabentry);
+ dshash_detach(table);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- * Tell the collector about the table we just analyzed.
+ * Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1358,14 @@ pgstat_report_analyze(Relation rel,
   PgStat_Counter livetuples, PgStat_Counter deadtuples,
   bool resetcounter)
 {
- PgStat_MsgAnalyze msg;
+ Oid dboid;
+ PgStat_StatDBEntry *dbentry;
+ PgStat_StatTabEntry *tabentry;
+ dshash_table_handle table_handle;
+ dshash_table *table;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
  /*
@@ -1445,10 +1373,10 @@ pgstat_report_analyze(Relation rel,
  * already inserted and/or deleted rows in the target table. ANALYZE will
  * have counted such rows as live or dead respectively. Because we will
  * report our counts of such rows at transaction end, we should subtract
- * off these counts from what we send to the collector now, else they'll
- * be double-counted after commit.  (This approach also ensures that the
- * collector ends up with the right numbers if we abort instead of
- * committing.)
+ * off these counts from what is already written to shared stats now, else
+ * they'll be double-counted after commit.  (This approach also ensures
+ * that the shared stats ends up with the right numbers if we abort
+ * instead of committing.)
  */
  if (rel->pgstat_info != NULL)
  {
@@ -1466,84 +1394,125 @@ pgstat_report_analyze(Relation rel,
  deadtuples = Max(deadtuples, 0);
  }
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
- msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
- msg.m_tableoid = RelationGetRelid(rel);
- msg.m_autovacuum = IsAutoVacuumWorkerProcess();
- msg.m_resetcounter = resetcounter;
- msg.m_analyzetime = GetCurrentTimestamp();
- msg.m_live_tuples = livetuples;
- msg.m_dead_tuples = deadtuples;
- pgstat_send(&msg, sizeof(msg));
+ dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+ /*
+ * Store the data in the table's hash table entry. The dshash table cannot
+ * be destroyed meanwhile, so we can release dbentry right now.
+ */
+ dbentry = pgstat_get_db_entry(dboid, false, false, true);
+ Assert(dbentry);
+ table_handle = dbentry->tables;
+ dshash_release_lock(pgStatDBHash, dbentry);
+
+ table = dshash_attach(area, &dsh_tblparams, table_handle, 0);
+ tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+ tabentry->n_live_tuples = livetuples;
+ tabentry->n_dead_tuples = deadtuples;
+
+ /*
+ * If commanded, reset changes_since_analyze to zero.  This forgets any
+ * changes that were committed while the ANALYZE was in progress, but we
+ * have no good way to estimate how many of those there were.
+ */
+ if (resetcounter)
+ tabentry->changes_since_analyze = 0;
+
+ if (IsAutoVacuumWorkerProcess())
+ {
+ tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+ tabentry->autovac_analyze_count++;
+ }
+ else
+ {
+ tabentry->analyze_timestamp = GetCurrentTimestamp();
+ tabentry->analyze_count++;
+ }
+
+ dshash_release_lock(table, tabentry);
+ dshash_detach(table);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- * Tell the collector about a Hot Standby recovery conflict.
+ * Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
- PgStat_MsgRecoveryConflict msg;
+ PgStat_StatDBEntry *dbentry;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
- msg.m_databaseid = MyDatabaseId;
- msg.m_reason = reason;
- pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_deadlock() -
- *
- * Tell the collector about a deadlock detected.
- * --------
- */
-void
-pgstat_report_deadlock(void)
-{
- PgStat_MsgDeadlock msg;
+ dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+ Assert(dbentry);
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
- return;
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ switch (reason)
+ {
+ case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, sizeof(msg));
+ /*
+ * Since we drop the information about the database as soon as it
+ * replicates, there is no point in counting these conflicts.
+ */
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+ dbentry->counts.n_conflict_tablespace++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_LOCK:
+ dbentry->counts.n_conflict_lock++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ dbentry->counts.n_conflict_snapshot++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+ dbentry->counts.n_conflict_bufferpin++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+ dbentry->counts.n_conflict_startup_deadlock++;
+ break;
+ }
+ LWLockRelease(&dbentry->lock);
+ dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_deadlock() -
  *
- * Tell the collector about one or more checksum failures.
+ * Report a detected deadlock.
  * --------
  */
 void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+pgstat_report_deadlock(void)
 {
- PgStat_MsgChecksumFailure msg;
+ PgStat_StatDBEntry *dbentry;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
- msg.m_databaseid = dboid;
- msg.m_failurecount = failurecount;
- msg.m_failure_time = GetCurrentTimestamp();
+ dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+ Assert(dbentry);
 
- pgstat_send(&msg, sizeof(msg));
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ dbentry->counts.n_deadlocks++;
+ LWLockRelease(&dbentry->lock);
+
+ dshash_release_lock(pgStatDBHash, dbentry);
 }
 
+
 /* --------
  * pgstat_report_checksum_failure() -
  *
- * Tell the collector about a checksum failure.
+ * Report a checksum failure.
  * --------
  */
 void
@@ -1555,60 +1524,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- * Tell the collector about a temporary file.
+ * Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
- PgStat_MsgTempFile msg;
+ PgStat_StatDBEntry *dbentry;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
- msg.m_databaseid = MyDatabaseId;
- msg.m_filesize = filesize;
- pgstat_send(&msg, sizeof(msg));
+ dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+ Assert(dbentry);
+
+ if (filesize > 0) /* XXX: can filesize really be 0? */
+ {
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ dbentry->counts.n_temp_bytes += filesize; /* needs overflow check */
+ dbentry->counts.n_temp_files++;
+ LWLockRelease(&dbentry->lock);
+ }
+
+ dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- * Send some junk data to the collector to increase traffic.
- * ----------
+ * Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
- PgStat_MsgDummy msg;
+ PgStat_StatDBEntry *dbentry;
+ ChecksumFailureEnt *failent = NULL;
+
+ /* return if we are not collecting stats */
+ if (!area)
+ return;
+
+ /* add accumulated count to the parameter */
+ if (checksum_failures != NULL)
+ {
+ failent = hash_search(checksum_failures, &dboid, HASH_FIND, NULL);
+ if (failent)
+ failurecount += failent->count;
+ }
+
+ if (failurecount == 0)
+ return;
+
+ dbentry = pgstat_get_db_entry(dboid, false, true, false);
+
+ if (!dbentry)
+ {
+ /* failed to acquire shared entry, store the number locally */
+ if (!failent)
+ {
+ bool found;
+
+ if (!checksum_failures)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(ChecksumFailureEnt);
+ checksum_failures =
+ hash_create("pgstat checksum failure count hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+ }
+
+ failent = hash_search(checksum_failures, &dboid, HASH_ENTER,
+  &found);
 
- if (pgStatSock == PGINVALID_SOCKET)
+ if (!found)
+ nchecksum_failures++;
+ }
+
+ failent->count = failurecount;
  return;
+ }
+
+ /* We can flush immediately */
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ dbentry->counts.n_checksum_failures += failurecount;
+ LWLockRelease(&dbentry->lock);
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
- pgstat_send(&msg, sizeof(msg));
+ dshash_release_lock(pgStatDBHash, dbentry);
+
+ if (checksum_failures)
+ {
+ /* Remove the entry; destroy the hash if it becomes empty */
+ hash_search(checksum_failures, &dboid, HASH_REMOVE, NULL);
+
+ if (failent != NULL && --nchecksum_failures < 1)
+ {
+ hash_destroy(checksum_failures);
+ checksum_failures = NULL;
+ }
+ }
 }
 
-/* ----------
- * pgstat_send_inquiry() -
+/*
+ * Flush the checksum failure counts for all databases.
  *
- * Notify collector that we need fresh data.
- * ----------
+ *  Entries that cannot be flushed immediately are kept for a later call.
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_checksum_failure(bool nowait)
 {
- PgStat_MsgInquiry msg;
+ HASH_SEQ_STATUS stat;
+ ChecksumFailureEnt *ent;
+ PgStat_StatDBEntry *dbentry;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
- msg.clock_time = clock_time;
- msg.cutoff_time = cutoff_time;
- msg.databaseid = databaseid;
- pgstat_send(&msg, sizeof(msg));
-}
+ if (checksum_failures == NULL)
+ return;
+
+ hash_seq_init(&stat, checksum_failures);
+ while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+ {
+ dbentry = pgstat_get_db_entry(ent->dboid, false, nowait, true);
+ if (dbentry)
+ {
+ /* update database stats */
+ LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+ dbentry->counts.n_checksum_failures += ent->count;
+ LWLockRelease(&dbentry->lock);
+
+ hash_search(checksum_failures, &ent->dboid, HASH_REMOVE, NULL);
+
+ dshash_release_lock(pgStatDBHash, dbentry);
+ nchecksum_failures--;
+ }
+ }
+
+ /* If the hash is now empty, destroy it. */
+ if (nchecksum_failures < 1)
+ {
+ hash_destroy(checksum_failures);
+ checksum_failures = NULL;
+ }
 
+}
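The scheme above for checksum failures — park the count in a backend-local hash when the shared entry cannot be locked, then flush accumulated counts on a later call — can be sketched in plain C. This is an illustrative sketch, not PostgreSQL code; the fixed-size table stands in for the dynahash/dshash machinery and all names are invented:

```c
/*
 * Sketch of local buffering of per-database counters (illustrative only).
 * When the shared side is unavailable, counts are parked locally and
 * flushed by a later flush_pending() call.
 */
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

#define MAX_PENDING 8

typedef struct PendingEnt
{
	Oid			dboid;
	int			count;
	bool		used;
} PendingEnt;

static PendingEnt pending[MAX_PENDING];	/* stands in for checksum_failures */
static int	shared_counter[16];			/* stands in for dbentry->counts */

/* Accumulate a failure count; buffer locally if the shared side is busy. */
static void
report_failure(Oid dboid, int n, bool shared_available)
{
	int			i;

	if (shared_available)
	{
		shared_counter[dboid] += n;
		return;
	}
	/* add to an existing pending entry for this database, if any */
	for (i = 0; i < MAX_PENDING; i++)
		if (pending[i].used && pending[i].dboid == dboid)
		{
			pending[i].count += n;
			return;
		}
	/* otherwise create a new pending entry */
	for (i = 0; i < MAX_PENDING; i++)
		if (!pending[i].used)
		{
			pending[i] = (PendingEnt) {dboid, n, true};
			return;
		}
}

/* Flush all buffered counts, like pgstat_flush_checksum_failure(). */
static void
flush_pending(void)
{
	int			i;

	for (i = 0; i < MAX_PENDING; i++)
		if (pending[i].used)
		{
			shared_counter[pending[i].dboid] += pending[i].count;
			pending[i].used = false;
		}
}
```

The point of the pattern is that a report never blocks on the shared entry; unflushable counts simply ride along until the next opportunity.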
 
 /*
  * Initialize function call usage data.
@@ -1739,8 +1801,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  * We assume that a relcache entry's pgstat_info field is zeroed by
  * relcache.c when the relcache entry is made; thereafter it is long-lived
- * data.  We can avoid repeated searches of the TabStatus arrays when the
- * same relation is touched repeatedly within a transaction.
+ * data.
  * ----------
  */
 void
@@ -1760,7 +1821,8 @@ pgstat_initstats(Relation rel)
  return;
  }
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  {
  /* We're not counting at all */
  rel->pgstat_info = NULL;
@@ -1779,85 +1841,45 @@ pgstat_initstats(Relation rel)
  rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
 static PgStat_TableStatus *
 get_tabstat_entry(Oid rel_id, bool isshared)
 {
- TabStatHashEntry *hash_entry;
  PgStat_TableStatus *entry;
- TabStatusArray *tsa;
  bool found;
 
  /*
  * Create hash table if we don't have it already.
  */
- if (pgStatTabHash == NULL)
+ if (pgStatTables == NULL)
  {
  HASHCTL ctl;
 
- memset(&ctl, 0, sizeof(ctl));
+ MemSet(&ctl, 0, sizeof(ctl));
  ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(TabStatHashEntry);
+ ctl.entrysize = sizeof(PgStat_TableStatus);
 
- pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
- TABSTAT_QUANTUM,
- &ctl,
- HASH_ELEM | HASH_BLOBS);
+ pgStatTables = hash_create("Table stat entries",
+   PGSTAT_TABLE_HASH_SIZE,
+   &ctl,
+   HASH_ELEM | HASH_BLOBS);
  }
 
  /*
  * Find an entry or create a new one.
  */
- hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+ entry = hash_search(pgStatTables, &rel_id, HASH_ENTER, &found);
  if (!found)
  {
- /* initialize new entry with null pointer */
- hash_entry->tsa_entry = NULL;
- }
-
- /*
- * If entry is already valid, we're done.
- */
- if (hash_entry->tsa_entry)
- return hash_entry->tsa_entry;
-
- /*
- * Locate the first pgStatTabList entry with free space, making a new list
- * entry if needed.  Note that we could get an OOM failure here, but if so
- * we have left the hashtable and the list in a consistent state.
- */
- if (pgStatTabList == NULL)
- {
- /* Set up first pgStatTabList entry */
- pgStatTabList = (TabStatusArray *)
- MemoryContextAllocZero(TopMemoryContext,
-   sizeof(TabStatusArray));
- }
-
- tsa = pgStatTabList;
- while (tsa->tsa_used >= TABSTAT_QUANTUM)
- {
- if (tsa->tsa_next == NULL)
- tsa->tsa_next = (TabStatusArray *)
- MemoryContextAllocZero(TopMemoryContext,
-   sizeof(TabStatusArray));
- tsa = tsa->tsa_next;
+ entry->t_shared = isshared;
+ entry->trans = NULL;
+ MemSet(&entry->t_counts, 0, sizeof(PgStat_TableCounts));
  }
 
- /*
- * Allocate a PgStat_TableStatus entry within this list entry.  We assume
- * the entry was already zeroed, either at creation or after last use.
- */
- entry = &tsa->tsa_entries[tsa->tsa_used++];
- entry->t_id = rel_id;
- entry->t_shared = isshared;
-
- /*
- * Now we can fill the entry in pgStatTabHash.
- */
- hash_entry->tsa_entry = entry;
+ have_table_stats = true;
 
  return entry;
 }
@@ -1866,26 +1888,16 @@ get_tabstat_entry(Oid rel_id, bool isshared)
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
  * If no entry, return NULL, don't create a new one
- *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
- TabStatHashEntry *hash_entry;
-
- /* If hashtable doesn't exist, there are no entries at all */
- if (!pgStatTabHash)
- return NULL;
-
- hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
- if (!hash_entry)
+ /* If hash table doesn't exist, there are no entries at all */
+ if (!pgStatTables)
  return NULL;
 
- /* Note that this step could also return NULL, but that's correct */
- return hash_entry->tsa_entry;
+ return (PgStat_TableStatus *) hash_search(pgStatTables,
+  &rel_id, HASH_FIND, NULL);
 }
 
 /*
@@ -2315,9 +2327,9 @@ AtPrepare_PgStat(void)
  * Clean up after successful PREPARE.
  *
  * All we need do here is unlink the transaction stats state from the
- * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * nontransactional state.  The nontransactional action counts will be reported
+ * immediately, while the effects on live and dead tuple counts are preserved
+ * in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2415,91 +2427,248 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
-/* ----------
- * pgstat_fetch_stat_dbentry() -
+/*
+ * snapshot_statentry() - Common routine for functions
+ * pgstat_fetch_stat_*entry()
  *
- * Support function for the SQL-callable pgstat* functions. Returns
- * the collected statistics for one database or NULL. NULL doesn't mean
- * that the database doesn't exist, it is just not yet known by the
- * collector, so the caller is better off to report ZERO instead.
- * ----------
+ *  Returns a pointer to a snapshot of the shared entry for the key, or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in HTAB *snapshot_hash. If not created yet, it
+ *  is created using snapshot_hash_name, snapshot_hash_entsize.
+ *
+ *  *table_hash points to a dshash_table. If not yet attached, it is attached
+ *  using table_hash_params and table_hash_handle.
  */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
+static void *
+snapshot_statentry(const Oid key, const char *snapshot_hash_name,
+   const int snapshot_hash_entsize,
+   const dshash_table_handle table_hash_handle,
+   const dshash_parameters *table_hash_params,
+   HTAB **snapshot_hash, dshash_table **table_hash)
 {
+ PgStat_snapshot *lentry = NULL;
+ size_t table_hash_keysize = table_hash_params->key_size;
+ size_t table_hash_entrysize = table_hash_params->entry_size;
+ bool found;
+
+ /*
+ * We don't want to update the stats snapshot too frequently. Keep the
+ * current one for at least PGSTAT_STAT_MIN_INTERVAL ms; rather than
+ * postponing the request, just ignore it.
+ */
+ if (clear_snapshot)
+ {
+ clear_snapshot = false;
+
+ if (pgStatSnapshotContext &&
+ snapshot_globalStats.stats_timestamp <
+ GetCurrentStatementStartTimestamp() -
+ PGSTAT_STAT_MIN_INTERVAL * 1000)
+ {
+ MemoryContextReset(pgStatSnapshotContext);
+
+ /* Reset variables */
+ global_snapshot_is_valid = false;
+ pgStatSnapshotContext = NULL;
+ pgStatLocalHash = NULL;
+
+ pgstat_setup_memcxt();
+ *snapshot_hash = NULL;
+ }
+ }
+
  /*
- * If not done for this transaction, read the statistics collector stats
- * file into some hash tables.
+ * Create new hash, with rather arbitrary initial number of entries since
+ * we don't know how this hash will grow.
  */
- backend_read_statsfile();
+ if (!*snapshot_hash)
+ {
+ HASHCTL ctl;
+
+ /*
+ * Create the hash in the stats context
+ *
+ * The entry is prepended by common header part represented by
+ * PgStat_snapshot.
+ */
+
+ ctl.keysize = table_hash_keysize;
+ ctl.entrysize =
+ offsetof(PgStat_snapshot, body) + snapshot_hash_entsize;
+ ctl.hcxt = pgStatSnapshotContext;
+ *snapshot_hash = hash_create(snapshot_hash_name, 32, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ lentry = hash_search(*snapshot_hash, &key, HASH_ENTER, &found);
 
  /*
- * Lookup the requested database; return NULL if not found
+ * Look up the shared hash if the entry is not found in the local hash.
+ * Outside a transaction we return up-to-date entries, so do the lookup
+ * even if a snapshot entry was found.
  */
- return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-  (void *) &dbid,
-  HASH_FIND, NULL);
+ if (!found || !IsTransactionState())
+ {
+ void   *sentry;
+
+ /* attach the shared hash if not attached yet; keep it for later use */
+ if (!*table_hash)
+ {
+ MemoryContext oldcxt;
+
+ Assert(table_hash_handle != DSM_HANDLE_INVALID);
+ oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+ *table_hash =
+ dshash_attach(area, table_hash_params, table_hash_handle, NULL);
+ MemoryContextSwitchTo(oldcxt);
+ }
+
+ sentry = dshash_find(*table_hash, &key, false);
+
+ if (sentry)
+ {
+ /*
+ * Within a transaction we create local cache entries for consistency;
+ * otherwise we return an up-to-date entry. Either way we need a local
+ * copy, since the dshash entry must be released immediately, and we
+ * reuse the same local hash entry for that purpose.
+ */
+ memcpy(&lentry->body, sentry, table_hash_entrysize);
+ dshash_release_lock(*table_hash, sentry);
+
+ /* then zero out the local additional space if any */
+ if (table_hash_entrysize < snapshot_hash_entsize)
+ MemSet((char *) &lentry->body + table_hash_entrysize, 0,
+   snapshot_hash_entsize - table_hash_entrysize);
+ }
+
+ lentry->negative = !sentry;
+ }
+
+ if (lentry->negative)
+ return NULL;
+
+ return &lentry->body;
 }
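The caching behavior of snapshot_statentry() above — the first in-transaction lookup copies the shared entry (or records a negative entry for a miss) and later lookups return the cached copy, while out-of-transaction lookups always refresh — can be sketched in plain C. This is an illustrative sketch under simplified assumptions (a flat array instead of dshash + dynahash; all names invented), not PostgreSQL code:

```c
/*
 * Sketch of the snapshot/negative-entry caching in snapshot_statentry()
 * (illustrative only).  shared_val/shared_exists stand in for the shared
 * dshash table; snap[] stands in for the backend-local snapshot hash.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NKEYS 8

static int	shared_val[NKEYS];
static bool shared_exists[NKEYS];

typedef struct Snap
{
	bool		cached;			/* local entry exists for this key */
	bool		negative;		/* cached "not found" result */
	int			body;			/* local copy of the shared entry */
} Snap;

static Snap snap[NKEYS];
static bool in_transaction;

/* Return a pointer to the snapshot value for key, or NULL if absent. */
static const int *
snapshot_lookup(int key)
{
	/*
	 * Refresh from "shared memory" on a cache miss, and also whenever we
	 * are outside a transaction (up-to-date values are wanted then).
	 */
	if (!snap[key].cached || !in_transaction)
	{
		snap[key].cached = true;
		snap[key].negative = !shared_exists[key];
		if (shared_exists[key])
			snap[key].body = shared_val[key];	/* take a local copy */
	}
	return snap[key].negative ? NULL : &snap[key].body;
}
```

Note that misses are cached too (the "negative" flag), so a repeated in-transaction lookup of a nonexistent key never touches shared memory again.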
 
 
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ * Find the database stats entry for dbid on backends. The returned entry is
+ * cached until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid)
+{
+ /* should be called from backends */
+ Assert(IsUnderPostmaster);
+
+ /* If not done for this transaction, take a snapshot of global stats */
+ pgstat_snapshot_global_stats();
+
+ /* the caller has no business with snapshot-local members */
+ return (PgStat_StatDBEntry *)
+ snapshot_statentry(dbid,
+   "local database stats", /* snapshot hash name */
+   sizeof(PgStat_StatDBEntry), /* snapshot ent size */
+   DSM_HANDLE_INVALID, /* dshash handle  */
+   &dsh_dbparams, /* dshash params */
+   &pgStatLocalHash, /* snapshot hash */
+   &pgStatDBHash); /* shared hash */
+}
+
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  * Support function for the SQL-callable pgstat* functions. Returns
- * the collected statistics for one table or NULL. NULL doesn't mean
+ * the activity statistics for one table or NULL. NULL doesn't mean
  * that the table doesn't exist, it is just not yet known by the
- * collector, so the caller is better off to report ZERO instead.
+ * activity statistics facilities, so the caller is better off to
+ * report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
- Oid dbid;
  PgStat_StatDBEntry *dbentry;
  PgStat_StatTabEntry *tabentry;
 
- /*
- * If not done for this transaction, read the statistics collector stats
- * file into some hash tables.
- */
- backend_read_statsfile();
+ /* Lookup our database, then look in its table hash table. */
+ dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+ if (dbentry == NULL)
+ return NULL;
 
- /*
- * Lookup our database, then look in its table hash table.
- */
- dbid = MyDatabaseId;
- dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- (void *) &dbid,
- HASH_FIND, NULL);
- if (dbentry != NULL && dbentry->tables != NULL)
- {
- tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-   (void *) &relid,
-   HASH_FIND, NULL);
- if (tabentry)
- return tabentry;
- }
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(dbentry, relid);
+ if (tabentry != NULL)
+ return tabentry;
 
  /*
  * If we didn't find it, maybe it's a shared table.
  */
- dbid = InvalidOid;
- dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- (void *) &dbid,
- HASH_FIND, NULL);
- if (dbentry != NULL && dbentry->tables != NULL)
- {
- tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-   (void *) &relid,
-   HASH_FIND, NULL);
- if (tabentry)
- return tabentry;
- }
+ dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+ if (dbentry == NULL)
+ return NULL;
+
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(dbentry, relid);
+ if (tabentry != NULL)
+ return tabentry;
 
  return NULL;
 }
 
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ * Find the table stats entry for reloid in dbent on backends. The returned
+ * entry is cached until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+ return (PgStat_StatTabEntry *)
+ snapshot_statentry(reloid,
+   "table stats snapshot", /* snapshot hash name */
+   sizeof(PgStat_StatTabEntry), /* snapshot ent size */
+   dbent->tables, /* dshash handle  */
+   &dsh_tblparams, /* dshash params */
+   &dbent->snapshot_tables, /* snapshot hash */
+   &dbent->dshash_tables); /* shared hash */
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ * Support function for index swapping. Copies the index counters for relid
+ * into *dst.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+ PgStat_StatTabEntry *tabentry;
+
+ /* No point fetching tabentry when dst is NULL */
+ if (!dst)
+ return;
+
+ tabentry = pgstat_fetch_stat_tabentry(relid);
+
+ if (!tabentry)
+ return;
+
+ dst->t_counts.t_numscans = tabentry->numscans;
+ dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+ dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+ dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+ dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_funcentry() -
  *
@@ -2513,49 +2682,103 @@ pgstat_fetch_stat_funcentry(Oid func_id)
  PgStat_StatDBEntry *dbentry;
  PgStat_StatFuncEntry *funcentry = NULL;
 
- /* load the stats file if needed */
- backend_read_statsfile();
-
- /* Lookup our database, then find the requested function.  */
+ /* Lookup our database, then find the requested function */
  dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
- if (dbentry != NULL && dbentry->functions != NULL)
- {
- funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
- (void *) &func_id,
- HASH_FIND, NULL);
- }
+ if (dbentry == NULL)
+ return NULL;
+
+ funcentry = pgstat_fetch_stat_funcentry_snapshot(dbentry, func_id);
 
  return funcentry;
 }
 
-
 /* ----------
- * pgstat_fetch_stat_beentry() -
- *
- * Support function for the SQL-callable pgstat* functions. Returns
- * our local copy of the current-activity entry for one backend.
+ * pgstat_fetch_stat_funcentry_snapshot() -
  *
- * NB: caller is responsible for a check if the user is permitted to see
- * this info (especially the querystring).
- * ----------
+ * Find the function stats entry for funcid in dbent on backends. The
+ * returned entry is cached until transaction end or pgstat_clear_snapshot()
+ * is called.
  */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_snapshot(PgStat_StatDBEntry *dbent, Oid funcid)
 {
- pgstat_read_current_status();
+ /* should be called from backends */
+ Assert(IsUnderPostmaster);
 
- if (beid < 1 || beid > localNumBackends)
+ if (dbent->functions == DSM_HANDLE_INVALID)
  return NULL;
 
- return &localBackendStatusTable[beid - 1].backendStatus;
+ return (PgStat_StatFuncEntry *)
+ snapshot_statentry(funcid,
+   "function stats snapshot", /* snapshot hash name */
+   sizeof(PgStat_StatFuncEntry), /* snapshot ent size */
+   dbent->functions, /* dshash handle  */
+   &dsh_funcparams, /* dshash params */
+   &dbent->snapshot_functions, /* snapshot hash */
+   &dbent->dshash_functions); /* shared hash */
 }
 
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- * Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- * xid and xmin values of the backend)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of the global stats if not done yet.  It is kept until a
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+ MemoryContext oldcontext;
+
+ pgstat_attach_shared_stats();
+
+ /* Nothing to do if already done */
+ if (global_snapshot_is_valid)
+ return;
+
+ oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+ LWLockAcquire(StatsLock, LW_SHARED);
+ memcpy(&snapshot_globalStats, shared_globalStats,
+   sizeof(PgStat_GlobalStats));
+
+ memcpy(&snapshot_archiverStats, shared_archiverStats,
+   sizeof(PgStat_ArchiverStats));
+ LWLockRelease(StatsLock);
+
+ global_snapshot_is_valid = true;
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ * Support function for the SQL-callable pgstat* functions. Returns
+ * our local copy of the current-activity entry for one backend.
+ *
+ * NB: caller is responsible for a check if the user is permitted to see
+ * this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+ pgstat_read_current_status();
+
+ if (beid < 1 || beid > localNumBackends)
+ return NULL;
+
+ return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ * Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ * xid and xmin values of the backend)
  *
  * NB: caller is responsible for a check if the user is permitted to see
  * this info (especially the querystring).
@@ -2599,9 +2822,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
- backend_read_statsfile();
+ /* If not done for this transaction, take a stats snapshot */
+ pgstat_snapshot_global_stats();
 
- return &archiverStats;
+ return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2840,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
- backend_read_statsfile();
+ /* If not done for this transaction, take a stats snapshot */
+ pgstat_snapshot_global_stats();
 
- return &globalStats;
+ return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3057,8 @@ pgstat_initialize(void)
  MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
  }
 
- /* Set up a process-exit hook to clean up */
- on_shmem_exit(pgstat_beshutdown_hook, 0);
+ /* needs to be called before dsm shutdown */
+ before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3234,16 @@ pgstat_bestart(void)
  /* Update app name to current GUC setting */
  if (application_name)
  pgstat_report_appname(application_name);
+
+ /* attach shared database stats area */
+ pgstat_attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3256,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
  /*
  * If we got as far as discovering our own database ID, we can report what
- * we did to the collector.  Otherwise, we'd be sending an invalid
+ * we did to the shared stats.  Otherwise, we'd be writing an invalid
  * database ID, so forget it.  (This means that accesses to pg_database
  * during failed backend starts might never get counted.)
  */
@@ -3044,6 +3273,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
  beentry->st_procpid = 0; /* mark invalid */
 
  PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+ pgstat_detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3535,8 @@ pgstat_read_current_status(void)
 #endif
  int i;
 
- Assert(!pgStatRunningInCollector);
+ Assert(IsUnderPostmaster);
+
  if (localBackendStatusTable)
  return; /* already done */
 
@@ -3599,9 +3831,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
  case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
  event_name = "LogicalLauncherMain";
  break;
- case WAIT_EVENT_PGSTAT_MAIN:
- event_name = "PgStatMain";
- break;
  case WAIT_EVENT_RECOVERY_WAL_ALL:
  event_name = "RecoveryWalAll";
  break;
@@ -4221,94 +4450,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
- *
- * Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
- hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- * Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
- int rc;
-
- if (pgStatSock == PGINVALID_SOCKET)
- return;
-
- ((PgStat_MsgHdr *) msg)->m_size = len;
-
- /* We'll retry after EINTR, but ignore all other failures */
- do
- {
- rc = send(pgStatSock, msg, len, 0);
- } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
- /* In debug builds, log send failures ... */
- if (rc < 0)
- elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
+ * pgstat_report_archiver() -
  *
- * Tell the collector about the WAL file that we successfully
- * archived or failed to archive.
+ * Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
- PgStat_MsgArchiver msg;
+ TimestampTz now = GetCurrentTimestamp();
 
- /*
- * Prepare and send the message
- */
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
- msg.m_failed = failed;
- StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
- msg.m_timestamp = GetCurrentTimestamp();
- pgstat_send(&msg, sizeof(msg));
+ if (failed)
+ {
+ /* Failed archival attempt */
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ ++shared_archiverStats->failed_count;
+ StrNCpy(shared_archiverStats->last_failed_wal, xlog,
+   sizeof(shared_archiverStats->last_failed_wal));
+ shared_archiverStats->last_failed_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
+ else
+ {
+ /* Successful archival operation */
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ ++shared_archiverStats->archived_count;
+ StrNCpy(shared_archiverStats->last_archived_wal, xlog,
+   sizeof(shared_archiverStats->last_archived_wal));
+ shared_archiverStats->last_archived_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- * Send bgwriter statistics to the collector
+ * Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
  /* We assume this initializes to zeroes */
- static const PgStat_MsgBgWriter all_zeroes;
+ static const PgStat_BgWriter all_zeroes;
+
+ PgStat_BgWriter *s = &BgWriterStats;
 
  /*
  * This function can be called even if nothing at all has happened. In
- * this case, avoid sending a completely empty message to the stats
- * collector.
+ * this case, avoid taking the lock for a completely empty stats buffer.
  */
- if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+ if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
  return;
 
- /*
- * Prepare and send the message
- */
- pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
- pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+ shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+ shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+ shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+ shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+ shared_globalStats->buf_written_clean += s->buf_written_clean;
+ shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+ shared_globalStats->buf_written_backend += s->buf_written_backend;
+ shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+ shared_globalStats->buf_alloc += s->buf_alloc;
+ LWLockRelease(StatsLock);
 
  /*
  * Clear out the statistics buffer, so it can be re-used.
@@ -4317,422 +4523,162 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- * Start up the statistics collector process.  This is the body of the
- * postmaster child process.
- *
- * The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
 {
- int len;
- PgStat_Msg msg;
- int wr;
-
- /*
- * Ignore all signals usually bound to some action in the postmaster,
- * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
- * support latch operations, because we only use a local latch.
- */
- pqsignal(SIGHUP, SignalHandlerForConfigReload);
- pqsignal(SIGINT, SIG_IGN);
- pqsignal(SIGTERM, SIG_IGN);
- pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
- pqsignal(SIGALRM, SIG_IGN);
- pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, SIG_IGN);
- pqsignal(SIGUSR2, SIG_IGN);
- /* Reset some signals that are accepted by postmaster but not here */
- pqsignal(SIGCHLD, SIG_DFL);
- PG_SETMASK(&UnBlockSig);
-
- MyBackendType = B_STATS_COLLECTOR;
- init_ps_display(NULL);
-
- /*
- * Read in existing stats files or initialize the stats to zero.
- */
- pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
- /*
- * Loop to process messages until we get SIGQUIT or detect ungraceful
- * death of our parent postmaster.
- *
- * For performance reasons, we don't want to do ResetLatch/WaitLatch after
- * every message; instead, do that only after a recv() fails to obtain a
- * message.  (This effectively means that if backends are sending us stuff
- * like mad, we won't notice postmaster death until things slack off a
- * bit; which seems fine.) To do that, we have an inner loop that
- * iterates as long as recv() succeeds.  We do check ConfigReloadPending
- * inside the inner loop, which means that such interrupts will get
- * serviced but the latch won't get cleared until next time there is a
- * break in the action.
- */
- for (;;)
- {
- /* Clear any already-pending wakeups */
- ResetLatch(MyLatch);
-
- /*
- * Quit if we get SIGQUIT from the postmaster.
- */
- if (ShutdownRequestPending)
- break;
-
- /*
- * Inner loop iterates as long as we keep getting messages, or until
- * ShutdownRequestPending becomes set.
- */
- while (!ShutdownRequestPending)
- {
- /*
- * Reload configuration if we got SIGHUP from the postmaster.
- */
- if (ConfigReloadPending)
- {
- ConfigReloadPending = false;
- ProcessConfigFile(PGC_SIGHUP);
- }
-
- /*
- * Write the stats file(s) if a new request has arrived that is
- * not satisfied by existing file(s).
- */
- if (pgstat_write_statsfile_needed())
- pgstat_write_statsfiles(false, false);
-
- /*
- * Try to receive and process a message.  This will not block,
- * since the socket is set to non-blocking mode.
- *
- * XXX On Windows, we have to force pgwin32_recv to cooperate,
- * despite the previous use of pg_set_noblock() on the socket.
- * This is extremely broken and should be fixed someday.
- */
-#ifdef WIN32
- pgwin32_noblock = 1;
-#endif
-
- len = recv(pgStatSock, (char *) &msg,
-   sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
- pgwin32_noblock = 0;
-#endif
-
- if (len < 0)
- {
- if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
- break; /* out of inner loop */
- ereport(ERROR,
- (errcode_for_socket_access(),
- errmsg("could not read statistics message: %m")));
- }
-
- /*
- * We ignore messages that are smaller than our common header
- */
- if (len < sizeof(PgStat_MsgHdr))
- continue;
-
- /*
- * The received length must match the length in the header
- */
- if (msg.msg_hdr.m_size != len)
- continue;
-
- /*
- * O.K. - we accept this message.  Process it.
- */
- switch (msg.msg_hdr.m_type)
- {
- case PGSTAT_MTYPE_DUMMY:
- break;
-
- case PGSTAT_MTYPE_INQUIRY:
- pgstat_recv_inquiry(&msg.msg_inquiry, len);
- break;
-
- case PGSTAT_MTYPE_TABSTAT:
- pgstat_recv_tabstat(&msg.msg_tabstat, len);
- break;
-
- case PGSTAT_MTYPE_TABPURGE:
- pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
- break;
-
- case PGSTAT_MTYPE_DROPDB:
- pgstat_recv_dropdb(&msg.msg_dropdb, len);
- break;
-
- case PGSTAT_MTYPE_RESETCOUNTER:
- pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
- break;
-
- case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
- pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-   len);
- break;
-
- case PGSTAT_MTYPE_RESETSINGLECOUNTER:
- pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-   len);
- break;
-
- case PGSTAT_MTYPE_AUTOVAC_START:
- pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
- break;
-
- case PGSTAT_MTYPE_VACUUM:
- pgstat_recv_vacuum(&msg.msg_vacuum, len);
- break;
-
- case PGSTAT_MTYPE_ANALYZE:
- pgstat_recv_analyze(&msg.msg_analyze, len);
- break;
-
- case PGSTAT_MTYPE_ARCHIVER:
- pgstat_recv_archiver(&msg.msg_archiver, len);
- break;
-
- case PGSTAT_MTYPE_BGWRITER:
- pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
- break;
-
- case PGSTAT_MTYPE_FUNCSTAT:
- pgstat_recv_funcstat(&msg.msg_funcstat, len);
- break;
-
- case PGSTAT_MTYPE_FUNCPURGE:
- pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
- break;
-
- case PGSTAT_MTYPE_RECOVERYCONFLICT:
- pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
- len);
- break;
-
- case PGSTAT_MTYPE_DEADLOCK:
- pgstat_recv_deadlock(&msg.msg_deadlock, len);
- break;
-
- case PGSTAT_MTYPE_TEMPFILE:
- pgstat_recv_tempfile(&msg.msg_tempfile, len);
- break;
-
- case PGSTAT_MTYPE_CHECKSUMFAILURE:
- pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
- len);
- break;
-
- default:
- break;
- }
- } /* end of inner message-processing loop */
-
- /* Sleep until there's something to do */
-#ifndef WIN32
- wr = WaitLatchOrSocket(MyLatch,
-   WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-   pgStatSock, -1L,
-   WAIT_EVENT_PGSTAT_MAIN);
-#else
+ dshash_table *tabhash;
 
- /*
- * Windows, at least in its Windows Server 2003 R2 incarnation,
- * sometimes loses FD_READ events.  Waking up and retrying the recv()
- * fixes that, so don't sleep indefinitely.  This is a crock of the
- * first water, but until somebody wants to debug exactly what's
- * happening there, this is the best we can do.  The two-second
- * timeout matches our pre-9.2 behavior, and needs to be short enough
- * to not provoke "using stale statistics" complaints from
- * backend_read_statsfile.
- */
- wr = WaitLatchOrSocket(MyLatch,
-   WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-   pgStatSock,
-   2 * 1000L /* msec */ ,
-   WAIT_EVENT_PGSTAT_MAIN);
-#endif
+ LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
 
- /*
- * Emergency bailout if postmaster has died.  This is to avoid the
- * necessity for manual cleanup of all postmaster children.
- */
- if (wr & WL_POSTMASTER_DEATH)
- break;
- } /* end of outer loop */
+ dbentry->last_autovac_time = 0;
+ dbentry->last_checksum_failure = 0;
+ dbentry->stat_reset_timestamp = 0;
+ dbentry->stats_timestamp = 0;
+ /* initialize the new shared entry */
+ MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
 
- /*
- * Save the final stats to reuse at next startup.
- */
- pgstat_write_statsfiles(true, true);
+ dbentry->functions = DSM_HANDLE_INVALID;
 
- exit(0);
+ /* dbentry always has the table hash */
+ tabhash = dshash_create(area, &dsh_tblparams, 0);
+ dbentry->tables = dshash_get_hash_table_handle(tabhash);
+ dshash_detach(tabhash);
 }
 
+
 /*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * pointing to a character buffer of length len.
  */
 static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
- HASHCTL hash_ctl;
-
- dbentry->n_xact_commit = 0;
- dbentry->n_xact_rollback = 0;
- dbentry->n_blocks_fetched = 0;
- dbentry->n_blocks_hit = 0;
- dbentry->n_tuples_returned = 0;
- dbentry->n_tuples_fetched = 0;
- dbentry->n_tuples_inserted = 0;
- dbentry->n_tuples_updated = 0;
- dbentry->n_tuples_deleted = 0;
- dbentry->last_autovac_time = 0;
- dbentry->n_conflict_tablespace = 0;
- dbentry->n_conflict_lock = 0;
- dbentry->n_conflict_snapshot = 0;
- dbentry->n_conflict_bufferpin = 0;
- dbentry->n_conflict_startup_deadlock = 0;
- dbentry->n_temp_files = 0;
- dbentry->n_temp_bytes = 0;
- dbentry->n_deadlocks = 0;
- dbentry->n_checksum_failures = 0;
- dbentry->last_checksum_failure = 0;
- dbentry->n_block_read_time = 0;
- dbentry->n_block_write_time = 0;
-
- dbentry->stat_reset_timestamp = GetCurrentTimestamp();
- dbentry->stats_timestamp = 0;
-
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
- dbentry->tables = hash_create("Per-database table",
-  PGSTAT_TAB_HASH_SIZE,
-  &hash_ctl,
-  HASH_ELEM | HASH_BLOBS);
+ int printed;
 
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
- dbentry->functions = hash_create("Per-database function",
- PGSTAT_FUNCTION_HASH_SIZE,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS);
+ /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+ printed = snprintf(filename, len, "%s/db_%u.%s",
+   PGSTAT_STAT_PERMANENT_DIRECTORY,
+   databaseid,
+   tempname ? "tmp" : "stat");
+ if (printed >= len)
+ elog(ERROR, "overlength pgstat path");
 }
 
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+
+static void
+reset_tabcount(PgStat_StatTabEntry *ent)
 {
- PgStat_StatDBEntry *result;
- bool found;
- HASHACTION action = (create ? HASH_ENTER : HASH_FIND);
+ ent->numscans = 0;
+ ent->tuples_returned = 0;
+ ent->tuples_fetched = 0;
+ ent->tuples_inserted = 0;
+ ent->tuples_updated = 0;
+ ent->tuples_deleted = 0;
+ ent->tuples_hot_updated = 0;
+ ent->n_live_tuples = 0;
+ ent->n_dead_tuples = 0;
+ ent->changes_since_analyze = 0;
+ ent->blocks_fetched = 0;
+ ent->blocks_hit = 0;
+ ent->vacuum_count = 0;
+ ent->autovac_vacuum_count = 0;
+ ent->analyze_count = 0;
+ ent->autovac_analyze_count = 0;
+
+ ent->vacuum_timestamp = 0;
+ ent->autovac_vacuum_timestamp = 0;
+ ent->analyze_timestamp = 0;
+ ent->autovac_analyze_timestamp = 0;
+}
 
- /* Lookup or create the hash table entry for this database */
- result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- &databaseid,
- action, &found);
 
- if (!create && !found)
- return NULL;
+static void
+reset_dbcount(PgStat_StatDBEntry *ent)
+{
+ TimestampTz ts = GetCurrentTimestamp();
+
+ LWLockAcquire(&ent->lock, LW_EXCLUSIVE);
+
+ ent->counts.n_tuples_returned = 0;
+ ent->counts.n_tuples_fetched = 0;
+ ent->counts.n_tuples_inserted = 0;
+ ent->counts.n_tuples_updated = 0;
+ ent->counts.n_tuples_deleted = 0;
+ ent->counts.n_blocks_fetched = 0;
+ ent->counts.n_blocks_hit = 0;
+ ent->counts.n_xact_commit = 0;
+ ent->counts.n_xact_rollback = 0;
+ ent->counts.n_block_read_time = 0;
+ ent->counts.n_block_write_time = 0;
+ ent->stat_reset_timestamp = ts;
+
+ LWLockRelease(&ent->lock);
+}
 
- /*
- * If not found, initialize the new one.  This creates empty hash tables
- * for tables and functions, too.
- */
- if (!found)
- reset_dbentry_counters(result);
 
- return result;
+static void
+reset_funccount(PgStat_StatFuncEntry *ent)
+{
+ ent->f_numcalls = 0;
+ ent->f_total_time = 0;
+ ent->f_self_time = 0;
 }
 
 
 /*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Subroutine to clear stats in a database entry
+ *
+ * Reset all counters in the dbentry.
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static void
+reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
- PgStat_StatTabEntry *result;
- bool found;
- HASHACTION action = (create ? HASH_ENTER : HASH_FIND);
-
- /* Lookup or create the hash table entry for this table */
- result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
- &tableoid,
- action, &found);
-
- if (!create && !found)
- return NULL;
-
- /* If not found, initialize the new one. */
- if (!found)
+ dshash_table *tbl;
+ dshash_seq_status dshstat;
+ PgStat_StatTabEntry *tabent;
+ PgStat_StatFuncEntry *funcent;
+
+ tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+ dshash_seq_init(&dshstat, tbl, true);
+ while ((tabent = dshash_seq_next(&dshstat)) != NULL)
+ reset_tabcount(tabent);
+ dshash_seq_term(&dshstat);
+ dshash_detach(tbl);
+
+ if (dbentry->functions != DSM_HANDLE_INVALID)
  {
- result->numscans = 0;
- result->tuples_returned = 0;
- result->tuples_fetched = 0;
- result->tuples_inserted = 0;
- result->tuples_updated = 0;
- result->tuples_deleted = 0;
- result->tuples_hot_updated = 0;
- result->n_live_tuples = 0;
- result->n_dead_tuples = 0;
- result->changes_since_analyze = 0;
- result->blocks_fetched = 0;
- result->blocks_hit = 0;
- result->vacuum_timestamp = 0;
- result->vacuum_count = 0;
- result->autovac_vacuum_timestamp = 0;
- result->autovac_vacuum_count = 0;
- result->analyze_timestamp = 0;
- result->analyze_count = 0;
- result->autovac_analyze_timestamp = 0;
- result->autovac_analyze_count = 0;
+ tbl = dshash_attach(area, &dsh_tblparams, dbentry->functions, 0);
+ dshash_seq_init(&dshstat, tbl, true);
+ while ((funcent = dshash_seq_next(&dshstat)) != NULL)
+ reset_funccount(funcent);
+ dshash_seq_term(&dshstat);
+ dshash_detach(tbl);
  }
 
- return result;
+ reset_dbcount(dbentry);
 }
 
 
 /* ----------
  * pgstat_write_statsfiles() -
- * Write the global statistics file, as well as requested DB files.
- *
- * 'permanent' specifies writing to the permanent files not temporary ones.
- * When true (happens only when the collector is shutting down), also remove
- * the temporary files so that backends starting up under a new postmaster
- * can't read old data before the new collector is ready.
- *
- * When 'allDbs' is false, only the requested databases (listed in
- * pending_write_requests) will be written; otherwise, all databases
- * will be written.
+ * Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
- HASH_SEQ_STATUS hstat;
+ dshash_seq_status hstat;
  PgStat_StatDBEntry *dbentry;
  FILE   *fpout;
  int32 format_id;
- const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+ const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
  int rc;
 
+ /* Stats are not initialized yet; just return. */
+ if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+ return;
+
  elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
  /*
@@ -4751,7 +4697,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  /*
  * Set the timestamp of the stats file.
  */
- globalStats.stats_timestamp = GetCurrentTimestamp();
+ shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
  /*
  * Write the file header --- currently just a format ID.
@@ -4763,32 +4709,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  /*
  * Write global stats struct
  */
- rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+ rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
 
  /*
  * Write archiver stats struct
  */
- rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+ rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
 
  /*
  * Walk through the database table.
  */
- hash_seq_init(&hstat, pgStatDBHash);
- while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+ dshash_seq_init(&hstat, pgStatDBHash, false);
+ while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
  {
  /*
  * Write out the table and function stats for this DB into the
  * appropriate per-DB stat file, if required.
  */
- if (allDbs || pgstat_db_requested(dbentry->databaseid))
- {
- /* Make DB's timestamp consistent with the global stats */
- dbentry->stats_timestamp = globalStats.stats_timestamp;
+ /* Make DB's timestamp consistent with the global stats */
+ dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
- pgstat_write_db_statsfile(dbentry, permanent);
- }
+ pgstat_write_pgStatDBHashfile(dbentry);
 
  /*
  * Write out the DB entry. We don't write the tables or functions
@@ -4798,6 +4741,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
  }
+ dshash_seq_term(&hstat);
 
  /*
  * No more output to be done. Close the temp file and replace the old
@@ -4831,53 +4775,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  tmpfile, statfile)));
  unlink(tmpfile);
  }
-
- if (permanent)
- unlink(pgstat_stat_filename);
-
- /*
- * Now throw away the list of requests.  Note that requests sent after we
- * started the write are still waiting on the network socket.
- */
- list_free(pending_write_requests);
- pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
- char *filename, int len)
-{
- int printed;
-
- /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
- printed = snprintf(filename, len, "%s/db_%u.%s",
-   permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-   pgstat_stat_directory,
-   databaseid,
-   tempname ? "tmp" : "stat");
- if (printed >= len)
- elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
+ * pgstat_write_pgStatDBHashfile() -
  * Write the stat file for a single database.
- *
- * If writing to the permanent file (happens when the collector is
- * shutting down only), remove the temporary file so that backends
- * starting up under a new postmaster can't read the old data before
- * the new collector is ready.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
+ dshash_seq_status tstat;
+ dshash_seq_status fstat;
  PgStat_StatTabEntry *tabentry;
  PgStat_StatFuncEntry *funcentry;
  FILE   *fpout;
@@ -4886,9 +4796,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  int rc;
  char tmpfile[MAXPGPATH];
  char statfile[MAXPGPATH];
+ dshash_table *tbl;
 
- get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
- get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+ get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+ get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
  elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4915,23 +4826,34 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  /*
  * Walk through the database's access stats per table.
  */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ Assert(dbentry->tables != DSM_HANDLE_INVALID);
+
+ tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+ dshash_seq_init(&tstat, tbl, false);
+ while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
  {
  fputc('T', fpout);
  rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
  }
+ dshash_seq_term(&tstat);
+ dshash_detach(tbl);
 
  /*
  * Walk through the database's function stats table.
  */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ if (dbentry->functions != DSM_HANDLE_INVALID)
  {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
+ tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+ dshash_seq_init(&fstat, tbl, false);
+ while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+ dshash_seq_term(&fstat);
+ dshash_detach(tbl);
  }
 
  /*
@@ -4966,94 +4888,56 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  tmpfile, statfile)));
  unlink(tmpfile);
  }
-
- if (permanent)
- {
- get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
- elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
- unlink(statfile);
- }
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- * Reads in some existing statistics collector files and returns the
- * databases hash table that is the top level of the data.
+ * Reads in existing activity statistics files into the shared stats hash.
  *
- * If 'onlydb' is not InvalidOid, it means we only want data for that DB
- * plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- * table for all databases, but we don't bother even creating table/function
- * hash tables for other databases.
- *
- * 'permanent' specifies reading from the permanent files not temporary ones.
- * When true (happens only when the collector is starting up), remove the
- * files after reading; the in-memory status is now authoritative, and the
- * files would be out of date in case somebody else reads them.
- *
- * If a 'deep' read is requested, table/function stats are read, otherwise
- * the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
  PgStat_StatDBEntry *dbentry;
  PgStat_StatDBEntry dbbuf;
- HASHCTL hash_ctl;
- HTAB   *dbhash;
  FILE   *fpin;
  int32 format_id;
  bool found;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
- /*
- * The tables will live in pgStatLocalContext.
- */
- pgstat_setup_memcxt();
+ const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
- /*
- * Create the DB hashtable
- */
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
- hash_ctl.hcxt = pgStatLocalContext;
- dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* shouldn't be called from postmaster */
+ Assert(IsUnderPostmaster);
 
- /*
- * Clear out global and archiver statistics so they start from zero in
- * case we can't load an existing statsfile.
- */
- memset(&globalStats, 0, sizeof(globalStats));
- memset(&archiverStats, 0, sizeof(archiverStats));
+ elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
  /*
  * Set the current timestamp (will be kept only in case we can't load an
  * existing statsfile).
  */
- globalStats.stat_reset_timestamp = GetCurrentTimestamp();
- archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+ shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+ shared_archiverStats->stat_reset_timestamp =
+ shared_globalStats->stat_reset_timestamp;
 
  /*
  * Try to open the stats file. If it doesn't exist, the backends simply
- * return zero for anything and the collector simply starts from scratch
- * with empty counters.
+ * return zero for anything and the activity statistics simply start
+ * from scratch with empty counters.
  *
- * ENOENT is a possibility if the stats collector is not running or has
- * not yet written the stats file the first time.  Any other failure
+ * ENOENT is a possibility if the stats file has not yet been written
+ * for the first time.  Any other failure
  * condition is suspicious.
  */
  if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
  {
  if (errno != ENOENT)
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errcode_for_file_access(),
  errmsg("could not open statistics file \"%s\": %m",
  statfile)));
- return dbhash;
+ return;
  }
 
  /*
@@ -5062,7 +4946,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
  format_id != PGSTAT_FILE_FORMAT_ID)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
  goto done;
  }
@@ -5070,38 +4954,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  /*
  * Read global stats struct
  */
- if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+ if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+ sizeof(*shared_globalStats))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
- memset(&globalStats, 0, sizeof(globalStats));
+ MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
  goto done;
  }
 
- /*
- * In the collector, disregard the timestamp we read from the permanent
- * stats file; we should be willing to write a temp stats file immediately
- * upon the first request from any backend.  This only matters if the old
- * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
- * an unusual scenario.
- */
- if (pgStatRunningInCollector)
- globalStats.stats_timestamp = 0;
-
  /*
  * Read archiver stats struct
  */
- if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+ if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+ sizeof(*shared_archiverStats))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
- memset(&archiverStats, 0, sizeof(archiverStats));
+ MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
  goto done;
  }
 
  /*
- * We found an existing collector stats file. Read it and put all the
- * hashtable entries into place.
+ * We found an existing activity statistics file. Read it and put all the
+ * hash table entries into place.
  */
  for (;;)
  {
@@ -5115,7 +4991,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
   fpin) != offsetof(PgStat_StatDBEntry, tables))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5124,76 +5000,36 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  /*
  * Add to the DB hash
  */
- dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
- (void *) &dbbuf.databaseid,
- HASH_ENTER,
- &found);
+ dbentry = (PgStat_StatDBEntry *)
+ dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+  &found);
+
+ /* don't allow duplicate dbentries */
  if (found)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ dshash_release_lock(pgStatDBHash, dbentry);
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
- memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
- dbentry->tables = NULL;
- dbentry->functions = NULL;
+ init_dbentry(dbentry);
+ memcpy(dbentry, &dbbuf,
+   offsetof(PgStat_StatDBEntry, tables));
 
- /*
- * In the collector, disregard the timestamp we read from the
- * permanent stats file; we should be willing to write a temp
- * stats file immediately upon the first request from any
- * backend.
- */
- if (pgStatRunningInCollector)
- dbentry->stats_timestamp = 0;
-
- /*
- * Don't create tables/functions hashtables for uninteresting
- * databases.
- */
- if (onlydb != InvalidOid)
- {
- if (dbbuf.databaseid != onlydb &&
- dbbuf.databaseid != InvalidOid)
- break;
- }
-
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
- hash_ctl.hcxt = pgStatLocalContext;
- dbentry->tables = hash_create("Per-database table",
-  PGSTAT_TAB_HASH_SIZE,
-  &hash_ctl,
-  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
- hash_ctl.hcxt = pgStatLocalContext;
- dbentry->functions = hash_create("Per-database function",
- PGSTAT_FUNCTION_HASH_SIZE,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /*
- * If requested, read the data from the database-specific
- * file.  Otherwise we just leave the hashtables empty.
- */
- if (deep)
- pgstat_read_db_statsfile(dbentry->databaseid,
- dbentry->tables,
- dbentry->functions,
- permanent);
+ Assert(dbentry->tables != DSM_HANDLE_INVALID);
 
+ /* Read the data from the database-specific file. */
+ pgstat_read_pgStatDBHashfile(dbentry);
+ dshash_release_lock(pgStatDBHash, dbentry);
  break;
 
  case 'E':
  goto done;
 
  default:
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5203,59 +5039,49 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
  FreeFile(fpin);
 
- /* If requested to read the permanent file, also get rid of it. */
- if (permanent)
- {
- elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
- unlink(statfile);
- }
+ elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+ unlink(statfile);
 
- return dbhash;
+ return;
 }
 
 
 /* ----------
- * pgstat_read_db_statsfile() -
- *
- * Reads in the existing statistics collector file for the given database,
- * filling the passed-in tables and functions hash tables.
- *
- * As in pgstat_read_statsfiles, if the permanent file is requested, it is
- * removed after reading.
+ * pgstat_read_pgStatDBHashfile() -
  *
- * Note: this code has the ability to skip storing per-table or per-function
- * data, if NULL is passed for the corresponding hashtable.  That's not used
- * at the moment though.
+ * Reads in the at-rest statistics file and creates shared statistics
+ * tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
- bool permanent)
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
  PgStat_StatTabEntry *tabentry;
  PgStat_StatTabEntry tabbuf;
  PgStat_StatFuncEntry funcbuf;
  PgStat_StatFuncEntry *funcentry;
+ dshash_table *tabhash = NULL;
+ dshash_table *funchash = NULL;
  FILE   *fpin;
  int32 format_id;
  bool found;
  char statfile[MAXPGPATH];
 
- get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+ get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
  /*
  * Try to open the stats file. If it doesn't exist, the backends simply
- * return zero for anything and the collector simply starts from scratch
- * with empty counters.
+ * return zero for anything and the activity statistics simply start
+ * from scratch with empty counters.
  *
- * ENOENT is a possibility if the stats collector is not running or has
- * not yet written the stats file the first time.  Any other failure
+ * ENOENT is a possibility if the activity statistics facility is not
+ * running or has not yet written the stats file the first time.  Any
+ * other failure
  * condition is suspicious.
  */
  if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
  {
  if (errno != ENOENT)
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errcode_for_file_access(),
  errmsg("could not open statistics file \"%s\": %m",
  statfile)));
@@ -5268,14 +5094,17 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
  format_id != PGSTAT_FILE_FORMAT_ID)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
  goto done;
  }
 
+ /* The table stats hash should already have been created */
+ Assert(dbentry->tables != DSM_HANDLE_INVALID);
+
  /*
- * We found an existing collector stats file. Read it and put all the
- * hashtable entries into place.
+ * We found an existing statistics file. Read it and put all the hash
+ * table entries into place.
  */
  for (;;)
  {
@@ -5288,31 +5117,32 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
   fpin) != sizeof(PgStat_StatTabEntry))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
- /*
- * Skip if table data not wanted.
- */
- if (tabhash == NULL)
- break;
+ if (!tabhash)
+ tabhash = dshash_attach(area, &dsh_tblparams,
+ dbentry->tables, 0);
 
- tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-   (void *) &tabbuf.tableid,
-   HASH_ENTER, &found);
+ tabentry = (PgStat_StatTabEntry *)
+ dshash_find_or_insert(tabhash,
+  (void *) &tabbuf.tableid, &found);
 
+ /* don't allow duplicate entries */
  if (found)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ dshash_release_lock(tabhash, tabentry);
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
  memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+ dshash_release_lock(tabhash, tabentry);
  break;
 
  /*
@@ -5322,31 +5152,34 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
   fpin) != sizeof(PgStat_StatFuncEntry))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
- /*
- * Skip if function data not wanted.
- */
  if (funchash == NULL)
- break;
+ {
+ funchash = dshash_create(area, &dsh_tblparams, 0);
+ dbentry->functions =
+ dshash_get_hash_table_handle(funchash);
+ }
 
- funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
- (void *) &funcbuf.functionid,
- HASH_ENTER, &found);
+ funcentry = (PgStat_StatFuncEntry *)
+ dshash_find_or_insert(funchash,
+  (void *) &funcbuf.functionid, &found);
 
  if (found)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ dshash_release_lock(funchash, funcentry);
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
  memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+ dshash_release_lock(funchash, funcentry);
  break;
 
  /*
@@ -5356,7 +5189,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  goto done;
 
  default:
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5364,292 +5197,38 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  }
 
 done:
- FreeFile(fpin);
-
- if (permanent)
- {
- elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
- unlink(statfile);
- }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- * Attempt to determine the timestamp of the last db statfile write.
- * Returns true if successful; the timestamp is stored in *ts.
- *
- * This needs to be careful about handling databases for which no stats file
- * exists, such as databases without a stat entry or those not yet written:
- *
- * - if there's a database entry in the global file, return the corresponding
- * stats_timestamp value.
- *
- * - if there's no db stat entry (e.g. for a new or inactive database),
- * there's no stats_timestamp value, but also nothing to write so we return
- * the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-   TimestampTz *ts)
-{
- PgStat_StatDBEntry dbentry;
- PgStat_GlobalStats myGlobalStats;
- PgStat_ArchiverStats myArchiverStats;
- FILE   *fpin;
- int32 format_id;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
- /*
- * Try to open the stats file.  As above, anything but ENOENT is worthy of
- * complaining about.
- */
- if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
- {
- if (errno != ENOENT)
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errcode_for_file_access(),
- errmsg("could not open statistics file \"%s\": %m",
- statfile)));
- return false;
- }
-
- /*
- * Verify it's of the expected format.
- */
- if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
- format_id != PGSTAT_FILE_FORMAT_ID)
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
- FreeFile(fpin);
- return false;
- }
-
- /*
- * Read global stats struct
- */
- if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-  fpin) != sizeof(myGlobalStats))
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
- FreeFile(fpin);
- return false;
- }
-
- /*
- * Read archiver stats struct
- */
- if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-  fpin) != sizeof(myArchiverStats))
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
- FreeFile(fpin);
- return false;
- }
-
- /* By default, we're going to return the timestamp of the global file. */
- *ts = myGlobalStats.stats_timestamp;
-
- /*
- * We found an existing collector stats file.  Read it and look for a
- * record for the requested database.  If found, use its timestamp.
- */
- for (;;)
- {
- switch (fgetc(fpin))
- {
- /*
- * 'D' A PgStat_StatDBEntry struct describing a database
- * follows.
- */
- case 'D':
- if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-  fpin) != offsetof(PgStat_StatDBEntry, tables))
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"",
- statfile)));
- goto done;
- }
-
- /*
- * If this is the DB we're looking for, save its timestamp and
- * we're done.
- */
- if (dbentry.databaseid == databaseid)
- {
- *ts = dbentry.stats_timestamp;
- goto done;
- }
-
- break;
-
- case 'E':
- goto done;
-
- default:
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"",
- statfile)));
- goto done;
- }
- }
+ if (tabhash)
+ dshash_detach(tabhash);
+ if (funchash)
+ dshash_detach(funchash);
 
-done:
  FreeFile(fpin);
- return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
- TimestampTz min_ts = 0;
- TimestampTz ref_ts = 0;
- Oid inquiry_db;
- int count;
-
- /* already read it? */
- if (pgStatDBHash)
- return;
- Assert(!pgStatRunningInCollector);
-
- /*
- * In a normal backend, we check staleness of the data for our own DB, and
- * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
- * check staleness of the shared-catalog data, and send InvalidOid in
- * inquiry messages so as not to force writing unnecessary data.
- */
- if (IsAutoVacuumLauncherProcess())
- inquiry_db = InvalidOid;
- else
- inquiry_db = MyDatabaseId;
-
- /*
- * Loop until fresh enough stats file is available or we ran out of time.
- * The stats inquiry message is sent repeatedly in case collector drops
- * it; but not every single time, as that just swamps the collector.
- */
- for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
- {
- bool ok;
- TimestampTz file_ts = 0;
- TimestampTz cur_ts;
-
- CHECK_FOR_INTERRUPTS();
-
- ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
- cur_ts = GetCurrentTimestamp();
- /* Calculate min acceptable timestamp, if we didn't already */
- if (count == 0 || cur_ts < ref_ts)
- {
- /*
- * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
- * msec before now.  This indirectly ensures that the collector
- * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
- * an autovacuum worker, however, we want a lower delay to avoid
- * using stale data, so we use PGSTAT_RETRY_DELAY (since the
- * number of workers is low, this shouldn't be a problem).
- *
- * We don't recompute min_ts after sleeping, except in the
- * unlikely case that cur_ts went backwards.  So we might end up
- * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
- * practice that shouldn't happen, though, as long as the sleep
- * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
- * tell the collector that our cutoff time is less than what we'd
- * actually accept.
- */
- ref_ts = cur_ts;
- if (IsAutoVacuumWorkerProcess())
- min_ts = TimestampTzPlusMilliseconds(ref_ts,
- -PGSTAT_RETRY_DELAY);
- else
- min_ts = TimestampTzPlusMilliseconds(ref_ts,
- -PGSTAT_STAT_INTERVAL);
- }
-
- /*
- * If the file timestamp is actually newer than cur_ts, we must have
- * had a clock glitch (system time went backwards) or there is clock
- * skew between our processor and the stats collector's processor.
- * Accept the file, but send an inquiry message anyway to make
- * pgstat_recv_inquiry do a sanity check on the collector's time.
- */
- if (ok && file_ts > cur_ts)
- {
- /*
- * A small amount of clock skew between processors isn't terribly
- * surprising, but a large difference is worth logging.  We
- * arbitrarily define "large" as 1000 msec.
- */
- if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
- {
- char   *filetime;
- char   *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- filetime = pstrdup(timestamptz_to_str(file_ts));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "stats collector's time %s is later than backend local time %s",
- filetime, mytime);
- pfree(filetime);
- pfree(mytime);
- }
-
- pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
- break;
- }
-
- /* Normal acceptance case: file is not older than cutoff time */
- if (ok && file_ts >= min_ts)
- break;
-
- /* Not there or too old, so kick the collector and wait a bit */
- if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
- pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
- }
 
- if (count >= PGSTAT_POLL_LOOP_COUNT)
- ereport(LOG,
- (errmsg("using stale statistics instead of current ones "
- "because stats collector is not responding")));
-
- /*
- * Autovacuum launcher wants stats about all databases, but a shallow read
- * is sufficient.  Regular backends want a deep read for just the tables
- * they can see (MyDatabaseId + shared catalogs).
- */
- if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
- else
- pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+ elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+ unlink(statfile);
 }
 
 
 /* ----------
  * pgstat_setup_memcxt() -
  *
- * Create pgStatLocalContext, if not already done.
+ * Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
  if (!pgStatLocalContext)
- pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-   "Statistics snapshot",
-   ALLOCSET_SMALL_SIZES);
+ pgStatLocalContext =
+ AllocSetContextCreate(TopMemoryContext,
+  "Backend statistics snapshot",
+  ALLOCSET_SMALL_SIZES);
+
+ if (!pgStatSnapshotContext)
+ pgStatSnapshotContext =
+ AllocSetContextCreate(TopMemoryContext,
+  "Database statistics snapshot",
+  ALLOCSET_SMALL_SIZES);
 }
 
 
@@ -5668,739 +5247,185 @@ pgstat_clear_snapshot(void)
 {
  /* Release memory, if any was allocated */
  if (pgStatLocalContext)
+ {
  MemoryContextDelete(pgStatLocalContext);
 
- /* Reset variables */
- pgStatLocalContext = NULL;
- pgStatDBHash = NULL;
- localBackendStatusTable = NULL;
- localNumBackends = 0;
-}
+ /* Reset variables */
+ pgStatLocalContext = NULL;
+ localBackendStatusTable = NULL;
+ localNumBackends = 0;
+ }
 
+ if (pgStatSnapshotContext)
+ clear_snapshot = true;
+}
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- * Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+   bool nowait)
 {
- PgStat_StatDBEntry *dbentry;
+ PgStat_StatTabEntry *tabent;
+ bool found;
 
- elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+ if (tabhash == NULL)
+ return false;
 
- /*
- * If there's already a write request for this DB, there's nothing to do.
- *
- * Note that if a request is found, we return early and skip the below
- * check for clock skew.  This is okay, since the only way for a DB
- * request to be present in the list is that we have been here since the
- * last write round.  It seems sufficient to check for clock skew once per
- * write round.
- */
- if (list_member_oid(pending_write_requests, msg->databaseid))
- return;
+ tabent = (PgStat_StatTabEntry *)
+ dshash_find_extended(tabhash, (void *) &(stat->t_id),
+ true, nowait, true, &found);
 
- /*
- * Check to see if we last wrote this database at a time >= the requested
- * cutoff time.  If so, this is a stale request that was generated before
- * we updated the DB file, and we don't need to do so again.
- *
- * If the requestor's local clock time is older than stats_timestamp, we
- * should suspect a clock glitch, ie system time going backwards; though
- * the more likely explanation is just delayed message receipt.  It is
- * worth expending a GetCurrentTimestamp call to be sure, since a large
- * retreat in the system clock reading could otherwise cause us to neglect
- * to update the stats file for a long time.
- */
- dbentry = pgstat_get_db_entry(msg->databaseid, false);
- if (dbentry == NULL)
+ /* failed to acquire lock */
+ if (tabent == NULL)
+ return false;
+
+ if (!found)
  {
  /*
- * We have no data for this DB.  Enter a write request anyway so that
- * the global stats will get updated.  This is needed to prevent
- * backend_read_statsfile from waiting for data that we cannot supply,
- * in the case of a new DB that nobody has yet reported any stats for.
- * See the behavior of pgstat_read_db_statsfile_timestamp.
+ * If it's a new table entry, initialize counters to the values we
+ * just got.
  */
+ tabent->numscans = stat->t_counts.t_numscans;
+ tabent->tuples_returned = stat->t_counts.t_tuples_returned;
+ tabent->tuples_fetched = stat->t_counts.t_tuples_fetched;
+ tabent->tuples_inserted = stat->t_counts.t_tuples_inserted;
+ tabent->tuples_updated = stat->t_counts.t_tuples_updated;
+ tabent->tuples_deleted = stat->t_counts.t_tuples_deleted;
+ tabent->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+ tabent->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+ tabent->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+ tabent->changes_since_analyze = stat->t_counts.t_changed_tuples;
+ tabent->blocks_fetched = stat->t_counts.t_blocks_fetched;
+ tabent->blocks_hit = stat->t_counts.t_blocks_hit;
+
+ tabent->vacuum_timestamp = 0;
+ tabent->vacuum_count = 0;
+ tabent->autovac_vacuum_timestamp = 0;
+ tabent->autovac_vacuum_count = 0;
+ tabent->analyze_timestamp = 0;
+ tabent->analyze_count = 0;
+ tabent->autovac_analyze_timestamp = 0;
+ tabent->autovac_analyze_count = 0;
  }
- else if (msg->clock_time < dbentry->stats_timestamp)
+ else
  {
- TimestampTz cur_ts = GetCurrentTimestamp();
-
- if (cur_ts < dbentry->stats_timestamp)
- {
- /*
- * Sure enough, time went backwards.  Force a new stats file write
- * to get back in sync; but first, log a complaint.
- */
- char   *writetime;
- char   *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG,
- "stats_timestamp %s is later than collector's time %s for database %u",
- writetime, mytime, dbentry->databaseid);
- pfree(writetime);
- pfree(mytime);
- }
- else
+ /*
+ * Otherwise add the values to the existing entry.
+ */
+ tabent->numscans += stat->t_counts.t_numscans;
+ tabent->tuples_returned += stat->t_counts.t_tuples_returned;
+ tabent->tuples_fetched += stat->t_counts.t_tuples_fetched;
+ tabent->tuples_inserted += stat->t_counts.t_tuples_inserted;
+ tabent->tuples_updated += stat->t_counts.t_tuples_updated;
+ tabent->tuples_deleted += stat->t_counts.t_tuples_deleted;
+ tabent->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+ /* If table was truncated, first reset the live/dead counters */
+ if (stat->t_counts.t_truncated)
  {
- /*
- * Nope, it's just an old request.  Assuming msg's clock_time is
- * >= its cutoff_time, it must be stale, so we can ignore it.
- */
- return;
+ tabent->n_live_tuples = 0;
+ tabent->n_dead_tuples = 0;
  }
- }
- else if (msg->cutoff_time <= dbentry->stats_timestamp)
- {
- /* Stale request, ignore it */
- return;
+ tabent->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+ tabent->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+ tabent->changes_since_analyze += stat->t_counts.t_changed_tuples;
+ tabent->blocks_fetched += stat->t_counts.t_blocks_fetched;
+ tabent->blocks_hit += stat->t_counts.t_blocks_hit;
  }
 
- /*
- * We need to write this DB, so create a request.
- */
- pending_write_requests = lappend_oid(pending_write_requests,
- msg->databaseid);
+ /* Clamp n_live_tuples in case of negative delta_live_tuples */
+ tabent->n_live_tuples = Max(tabent->n_live_tuples, 0);
+ /* Likewise for n_dead_tuples */
+ tabent->n_dead_tuples = Max(tabent->n_dead_tuples, 0);
+
+ dshash_release_lock(tabhash, tabent);
+
+ return true;
 }
 
 
-/* ----------
- * pgstat_recv_tabstat() -
+/*
+ * Look up the shared stats hash table entry for the specified database.
+ * Returns NULL when nowait is true and the required lock cannot be acquired.
  *
- * Count what the backend has done.
- * ----------
  */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, bool exclusive, bool nowait, bool create)
 {
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- int i;
- bool found;
+ PgStat_StatDBEntry *result;
+ bool found = true;
 
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+ if (!IsUnderPostmaster || !pgStatDBHash)
+ return NULL;
 
- /*
- * Update database-wide stats.
- */
- dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
- dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
- dbentry->n_block_read_time += msg->m_block_read_time;
- dbentry->n_block_write_time += msg->m_block_write_time;
-
- /*
- * Process all table entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++)
- {
- PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
- tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-   (void *) &(tabmsg->t_id),
-   HASH_ENTER, &found);
+ /* Lookup or create the hash table entry for this database */
+ result = (PgStat_StatDBEntry *)
+ dshash_find_extended(pgStatDBHash, &databaseid,
+ exclusive, nowait, create, &found);
 
- if (!found)
- {
- /*
- * If it's a new table entry, initialize counters to the values we
- * just got.
- */
- tabentry->numscans = tabmsg->t_counts.t_numscans;
- tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
- tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
- tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
- tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
- tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
- tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
- tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
- tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
- tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
- tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
- tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
- tabentry->vacuum_timestamp = 0;
- tabentry->vacuum_count = 0;
- tabentry->autovac_vacuum_timestamp = 0;
- tabentry->autovac_vacuum_count = 0;
- tabentry->analyze_timestamp = 0;
- tabentry->analyze_count = 0;
- tabentry->autovac_analyze_timestamp = 0;
- tabentry->autovac_analyze_count = 0;
- }
- else
- {
- /*
- * Otherwise add the values to the existing entry.
- */
- tabentry->numscans += tabmsg->t_counts.t_numscans;
- tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
- tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
- tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
- tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
- tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
- tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
- /* If table was truncated, first reset the live/dead counters */
- if (tabmsg->t_counts.t_truncated)
- {
- tabentry->n_live_tuples = 0;
- tabentry->n_dead_tuples = 0;
- }
- tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
- tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
- tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
- tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
- tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
- }
+ if (result == NULL)
+ return NULL;
 
- /* Clamp n_live_tuples in case of negative delta_live_tuples */
- tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
- /* Likewise for n_dead_tuples */
- tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+ if (create && !found)
+ {
 
  /*
- * Add per-table stats to the per-database entry, too.
+ * Initialize the new entry.  This creates the empty hash tables,
+ * too.
  */
- dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
- dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
- dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
- dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
- dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
- dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
- dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
- }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- * Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- int i;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- /*
- * No need to purge if we don't even know the database.
- */
- if (!dbentry || !dbentry->tables)
- return;
-
- /*
- * Process all table entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++)
- {
- /* Remove from hashtable if present; we don't care if it's not. */
- (void) hash_search(dbentry->tables,
-   (void *) &(msg->m_tableid[i]),
-   HASH_REMOVE, NULL);
- }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- * Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
- Oid dbid = msg->m_databaseid;
- PgStat_StatDBEntry *dbentry;
-
- /*
- * Lookup the database in the hashtable.
- */
- dbentry = pgstat_get_db_entry(dbid, false);
-
- /*
- * If found, remove it (along with the db statfile).
- */
- if (dbentry)
- {
- char statfile[MAXPGPATH];
-
- get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
- elog(DEBUG2, "removing stats file \"%s\"", statfile);
- unlink(statfile);
-
- if (dbentry->tables != NULL)
- hash_destroy(dbentry->tables);
- if (dbentry->functions != NULL)
- hash_destroy(dbentry->functions);
-
- if (hash_search(pgStatDBHash,
- (void *) &dbid,
- HASH_REMOVE, NULL) == NULL)
- ereport(ERROR,
- (errmsg("database hash table corrupted during cleanup --- abort")));
- }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- * Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- /*
- * Lookup the database in the hashtable.  Nothing to do if not there.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- if (!dbentry)
- return;
-
- /*
- * We simply throw away all the database's table entries by recreating a
- * new hash table for them.
- */
- if (dbentry->tables != NULL)
- hash_destroy(dbentry->tables);
- if (dbentry->functions != NULL)
- hash_destroy(dbentry->functions);
-
- dbentry->tables = NULL;
- dbentry->functions = NULL;
-
- /*
- * Reset database-level stats, too.  This creates empty hash tables for
- * tables and functions.
- */
- reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- * Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
- if (msg->m_resettarget == RESET_BGWRITER)
- {
- /* Reset the global background writer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
- globalStats.stat_reset_timestamp = GetCurrentTimestamp();
- }
- else if (msg->m_resettarget == RESET_ARCHIVER)
- {
- /* Reset the archiver statistics for the cluster. */
- memset(&archiverStats, 0, sizeof(archiverStats));
- archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
- }
-
- /*
- * Presumably the sender of this message validated the target, don't
- * complain here if it's not valid
- */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- * Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- if (!dbentry)
- return;
-
- /* Set the reset timestamp for the whole database */
- dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
- /* Remove object if it exists, ignore it if not */
- if (msg->m_resettype == RESET_TABLE)
- (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-   HASH_REMOVE, NULL);
- else if (msg->m_resettype == RESET_FUNCTION)
- (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-   HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- * Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- /*
- * Store the last autovacuum time in the database's hashtable entry.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- * Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
-
- /*
- * Store the data in the table's hashtable entry.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
- tabentry->n_live_tuples = msg->m_live_tuples;
- tabentry->n_dead_tuples = msg->m_dead_tuples;
-
- if (msg->m_autovacuum)
- {
- tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
- tabentry->autovac_vacuum_count++;
- }
- else
- {
- tabentry->vacuum_timestamp = msg->m_vacuumtime;
- tabentry->vacuum_count++;
+ init_dbentry(result);
  }
-}
 
-/* ----------
- * pgstat_recv_analyze() -
- *
- * Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
-
- /*
- * Store the data in the table's hashtable entry.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
- tabentry->n_live_tuples = msg->m_live_tuples;
- tabentry->n_dead_tuples = msg->m_dead_tuples;
-
- /*
- * If commanded, reset changes_since_analyze to zero.  This forgets any
- * changes that were committed while the ANALYZE was in progress, but we
- * have no good way to estimate how many of those there were.
- */
- if (msg->m_resetcounter)
- tabentry->changes_since_analyze = 0;
-
- if (msg->m_autovacuum)
- {
- tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
- tabentry->autovac_analyze_count++;
- }
- else
- {
- tabentry->analyze_timestamp = msg->m_analyzetime;
- tabentry->analyze_count++;
- }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- * Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
- if (msg->m_failed)
- {
- /* Failed archival attempt */
- ++archiverStats.failed_count;
- memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-   sizeof(archiverStats.last_failed_wal));
- archiverStats.last_failed_timestamp = msg->m_timestamp;
- }
- else
- {
- /* Successful archival operation */
- ++archiverStats.archived_count;
- memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-   sizeof(archiverStats.last_archived_wal));
- archiverStats.last_archived_timestamp = msg->m_timestamp;
- }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- * Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
- globalStats.timed_checkpoints += msg->m_timed_checkpoints;
- globalStats.requested_checkpoints += msg->m_requested_checkpoints;
- globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
- globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.buf_written_clean += msg->m_buf_written_clean;
- globalStats.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.buf_written_backend += msg->m_buf_written_backend;
- globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
- globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- * Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- switch (msg->m_reason)
- {
- case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
- /*
- * Since we drop the information about the database as soon as it
- * replicates, there is no point in counting these conflicts.
- */
- break;
- case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
- dbentry->n_conflict_tablespace++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_LOCK:
- dbentry->n_conflict_lock++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
- dbentry->n_conflict_snapshot++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
- dbentry->n_conflict_bufferpin++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
- dbentry->n_conflict_startup_deadlock++;
- break;
- }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- * Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- * Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->n_checksum_failures += msg->m_failurecount;
- dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- * Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->n_temp_bytes += msg->m_filesize;
- dbentry->n_temp_files += 1;
+ return result;
 }
 
-/* ----------
- * pgstat_recv_funcstat() -
- *
- * Count what the backend has done.
- * ----------
+/*
+ * Look up the hash table entry for the specified table.  The returned entry
+ * is locked exclusively.
+ * If no entry exists, it is created when create is true; otherwise NULL is
+ * returned.
  */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
- PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
- PgStat_StatDBEntry *dbentry;
- PgStat_StatFuncEntry *funcentry;
- int i;
+ PgStat_StatTabEntry *result;
  bool found;
 
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- /*
- * Process all function entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++, funcmsg++)
- {
- funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
- (void *) &(funcmsg->f_id),
- HASH_ENTER, &found);
-
- if (!found)
- {
- /*
- * If it's a new function entry, initialize counters to the values
- * we just got.
- */
- funcentry->f_numcalls = funcmsg->f_numcalls;
- funcentry->f_total_time = funcmsg->f_total_time;
- funcentry->f_self_time = funcmsg->f_self_time;
- }
- else
- {
- /*
- * Otherwise add the values to the existing entry.
- */
- funcentry->f_numcalls += funcmsg->f_numcalls;
- funcentry->f_total_time += funcmsg->f_total_time;
- funcentry->f_self_time += funcmsg->f_self_time;
- }
- }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- * Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- int i;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+ /* Look up or create the hash table entry for this table */
+ if (create)
+ result = (PgStat_StatTabEntry *)
+ dshash_find_or_insert(table, &tableoid, &found);
+ else
+ {
+ result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, true);
+ found = (result != NULL); /* dshash_find() doesn't set found */
+ }
 
- /*
- * No need to purge if we don't even know the database.
- */
- if (!dbentry || !dbentry->functions)
- return;
+ if (!create && !found)
+ return NULL;
 
- /*
- * Process all function entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++)
+ /* If not found, initialize the new one. */
+ if (!found)
  {
- /* Remove from hashtable if present; we don't care if it's not. */
- (void) hash_search(dbentry->functions,
-   (void *) &(msg->m_functionid[i]),
-   HASH_REMOVE, NULL);
+ result->numscans = 0;
+ result->tuples_returned = 0;
+ result->tuples_fetched = 0;
+ result->tuples_inserted = 0;
+ result->tuples_updated = 0;
+ result->tuples_deleted = 0;
+ result->tuples_hot_updated = 0;
+ result->n_live_tuples = 0;
+ result->n_dead_tuples = 0;
+ result->changes_since_analyze = 0;
+ result->blocks_fetched = 0;
+ result->blocks_hit = 0;
+ result->vacuum_timestamp = 0;
+ result->vacuum_count = 0;
+ result->autovac_vacuum_timestamp = 0;
+ result->autovac_vacuum_count = 0;
+ result->analyze_timestamp = 0;
+ result->analyze_count = 0;
+ result->autovac_analyze_timestamp = 0;
+ result->autovac_analyze_count = 0;
  }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- * Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
- if (pending_write_requests != NIL)
- return true;
-
- /* Everything was written recently */
- return false;
-}
 
-/* ----------
- * pgstat_db_requested() -
- *
- * Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
- /*
- * If any requests are outstanding at all, we should write the stats for
- * shared catalogs (the "database" with OID 0).  This ensures that
- * backends will see up-to-date stats for shared catalogs, even though
- * they send inquiry messages mentioning only their own DB.
- */
- if (databaseid == InvalidOid && pending_write_requests != NIL)
- return true;
-
- /* Search to see if there's an open request to write this database. */
- if (list_member_oid(pending_write_requests, databaseid))
- return true;
-
- return false;
+ return result;
 }
 
 /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fab4a9dd51..d418fe3bd0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
  WalReceiverPID = 0,
  AutoVacPID = 0,
  PgArchPID = 0,
- PgStatPID = 0,
  SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
  PGPROC   *AuxiliaryProcs;
  PGPROC   *PreparedXactProcs;
  PMSignalData *PMSignalState;
- InheritableSocket pgStatSock;
  pid_t PostmasterPid;
  TimestampTz PgStartTime;
  TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
  */
  RemovePgTempFiles();
 
- /*
- * Initialize stats collection subsystem (this does NOT start the
- * collector process!)
- */
- pgstat_init();
-
  /*
  * Initialize the autovacuum subsystem (again, no process start yet)
  */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
  start_autovac_launcher = false; /* signal processed */
  }
 
- /* If we have lost the stats collector, try to start a new one */
- if (PgStatPID == 0 &&
- (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
- PgStatPID = pgstat_start();
-
  /* If we have lost the archiver, try to start a new one. */
  if (PgArchPID == 0 && PgArchStartupAllowed())
  PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
  signal_child(PgArchPID, SIGHUP);
  if (SysLoggerPID != 0)
  signal_child(SysLoggerPID, SIGHUP);
- if (PgStatPID != 0)
- signal_child(PgStatPID, SIGHUP);
 
  /* Reload authentication config files too */
  if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
  AutoVacPID = StartAutoVacLauncher();
  if (PgArchStartupAllowed() && PgArchPID == 0)
  PgArchPID = StartArchiver();
- if (PgStatPID == 0)
- PgStatPID = pgstat_start();
 
  /* workers may be scheduled to start now */
  maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
  SignalChildren(SIGUSR2);
 
  pmState = PM_SHUTDOWN_2;
-
- /*
- * We can also shut down the stats collector now; there's
- * nothing left for it to do.
- */
- if (PgStatPID != 0)
- signal_child(PgStatPID, SIGQUIT);
  }
  else
  {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
  continue;
  }
 
- /*
- * Was it the statistics collector?  If so, just try to start a new
- * one; no need to force reset of the rest of the system.  (If fail,
- * we'll try again in future cycles of the main loop.)
- */
- if (pid == PgStatPID)
- {
- PgStatPID = 0;
- if (!EXIT_STATUS_0(exitstatus))
- LogChildExit(LOG, _("statistics collector process"),
- pid, exitstatus);
- if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
- PgStatPID = pgstat_start();
- continue;
- }
-
  /* Was it the system logger?  If so, try to start a new one */
  if (pid == SysLoggerPID)
  {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
  signal_child(PgArchPID, SIGQUIT);
  }
 
- /*
- * Force a power-cycle of the pgstat process too.  (This isn't absolutely
- * necessary, but it seems like a good idea for robustness, and it
- * simplifies the state-machine logic in the case where a shutdown request
- * arrives during crash processing.)
- */
- if (PgStatPID != 0 && take_action)
- {
- ereport(DEBUG2,
- (errmsg_internal("sending %s to process %d",
- "SIGQUIT",
- (int) PgStatPID)));
- signal_child(PgStatPID, SIGQUIT);
- allow_immediate_pgstat_restart();
- }
-
  /* We do NOT restart the syslogger */
 
  if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
  SignalChildren(SIGQUIT);
  if (PgArchPID != 0)
  signal_child(PgArchPID, SIGQUIT);
- if (PgStatPID != 0)
- signal_child(PgStatPID, SIGQUIT);
  }
  }
  }
@@ -3928,8 +3870,7 @@ PostmasterStateMachine(void)
  * normal state transition leading up to PM_WAIT_DEAD_END, or during
  * FatalError processing.
  */
- if (dlist_is_empty(&BackendList) &&
- PgArchPID == 0 && PgStatPID == 0)
+ if (dlist_is_empty(&BackendList) && PgArchPID == 0)
  {
  /* These other guys should be dead already */
  Assert(StartupPID == 0);
@@ -4130,8 +4071,6 @@ TerminateChildren(int signal)
  signal_child(AutoVacPID, signal);
  if (PgArchPID != 0)
  signal_child(PgArchPID, signal);
- if (PgStatPID != 0)
- signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5048,6 @@ SubPostmasterMain(int argc, char *argv[])
 
  StartBackgroundWorker();
  }
- if (strcmp(argv[1], "--forkarch") == 0)
- {
- /* Do not want to attach to shared memory */
-
- PgArchiverMain(argc, argv); /* does not return */
- }
- if (strcmp(argv[1], "--forkcol") == 0)
- {
- /* Do not want to attach to shared memory */
-
- PgstatCollectorMain(argc, argv); /* does not return */
- }
  if (strcmp(argv[1], "--forklog") == 0)
  {
  /* Do not want to attach to shared memory */
@@ -5239,12 +5166,6 @@ sigusr1_handler(SIGNAL_ARGS)
  if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
  pmState == PM_RECOVERY && Shutdown == NoShutdown)
  {
- /*
- * Likewise, start other special children as needed.
- */
- Assert(PgStatPID == 0);
- PgStatPID = pgstat_start();
-
  ereport(LOG,
  (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6060,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6115,6 @@ save_backend_variables(BackendParameters *param, Port *port,
  param->AuxiliaryProcs = AuxiliaryProcs;
  param->PreparedXactProcs = PreparedXactProcs;
  param->PMSignalState = PMSignalState;
- if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
- return false;
 
  param->PostmasterPid = PostmasterPid;
  param->PgStartTime = PgStartTime;
@@ -6431,7 +6349,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
  AuxiliaryProcs = param->AuxiliaryProcs;
  PreparedXactProcs = param->PreparedXactProcs;
  PMSignalState = param->PMSignalState;
- read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
  PostmasterPid = param->PostmasterPid;
  PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..04445c4c76 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2000,7 +2000,7 @@ BufferSync(int flags)
  if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
  {
  TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- BgWriterStats.m_buf_written_checkpoints++;
+ BgWriterStats.buf_written_checkpoints++;
  num_written++;
  }
  }
@@ -2110,7 +2110,7 @@ BgBufferSync(WritebackContext *wb_context)
  strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
  /* Report buffer alloc counts to pgstat */
- BgWriterStats.m_buf_alloc += recent_alloc;
+ BgWriterStats.buf_alloc += recent_alloc;
 
  /*
  * If we're not running the LRU scan, just stop after doing the stats
@@ -2300,7 +2300,7 @@ BgBufferSync(WritebackContext *wb_context)
  reusable_buffers++;
  if (++num_written >= bgwriter_lru_maxpages)
  {
- BgWriterStats.m_maxwritten_clean++;
+ BgWriterStats.maxwritten_clean++;
  break;
  }
  }
@@ -2308,7 +2308,7 @@ BgBufferSync(WritebackContext *wb_context)
  reusable_buffers++;
  }
 
- BgWriterStats.m_buf_written_clean += num_written;
+ BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
  elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
  size = add_size(size, BTreeShmemSize());
  size = add_size(size, SyncScanShmemSize());
  size = add_size(size, AsyncShmemSize());
+ size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
  size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
  BTreeShmemInit();
  SyncScanShmemInit();
  AsyncShmemInit();
+ StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
  LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
  LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
  LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+ LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
  /* Register named tranches. */
  for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 00c77b66c7..e2998f965e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3189,6 +3189,12 @@ ProcessInterrupts(void)
 
  if (ParallelMessagePending)
  HandleParallelMessages();
+
+ if (IdleStatsUpdateTimeoutPending)
+ {
+ IdleStatsUpdateTimeoutPending = false;
+ pgstat_report_stat(true);
+ }
 }
 
 
@@ -3763,6 +3769,7 @@ PostgresMain(int argc, char *argv[],
  sigjmp_buf local_sigjmp_buf;
  volatile bool send_ready_for_query = true;
  bool disable_idle_in_transaction_timeout = false;
+ bool disable_idle_stats_update_timeout = false;
 
  /* Initialize startup process environment if necessary. */
  if (!IsUnderPostmaster)
@@ -4201,6 +4208,8 @@ PostgresMain(int argc, char *argv[],
  }
  else
  {
+ long stats_timeout;
+
  /* Send out notify signals and transmit self-notifies */
  ProcessCompletedNotifies();
 
@@ -4213,8 +4222,13 @@ PostgresMain(int argc, char *argv[],
  if (notifyInterruptPending)
  ProcessNotifyInterrupt();
 
- pgstat_report_stat(false);
-
+ stats_timeout = pgstat_report_stat(false);
+ if (stats_timeout > 0)
+ {
+ disable_idle_stats_update_timeout = true;
+ enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+ stats_timeout);
+ }
  set_ps_display("idle");
  pgstat_report_activity(STATE_IDLE, NULL);
  }
@@ -4249,7 +4263,7 @@ PostgresMain(int argc, char *argv[],
  DoingCommandRead = false;
 
  /*
- * (5) turn off the idle-in-transaction timeout
+ * (5) turn off the idle-in-transaction timeout and stats update timeout
  */
  if (disable_idle_in_transaction_timeout)
  {
@@ -4257,6 +4271,12 @@ PostgresMain(int argc, char *argv[],
  disable_idle_in_transaction_timeout = false;
  }
 
+ if (disable_idle_stats_update_timeout)
+ {
+ disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+ disable_idle_stats_update_timeout = false;
+ }
+
  /*
  * (6) check for any other interesting events that happened while we
  * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index cea01534a5..a1304dc3ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1244,7 +1241,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_xact_commit);
+ result = (int64) (dbentry->counts.n_xact_commit);
 
  PG_RETURN_INT64(result);
 }
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_xact_rollback);
+ result = (int64) (dbentry->counts.n_xact_rollback);
 
  PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_blocks_fetched);
+ result = (int64) (dbentry->counts.n_blocks_fetched);
 
  PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_blocks_hit);
+ result = (int64) (dbentry->counts.n_blocks_hit);
 
  PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_returned);
+ result = (int64) (dbentry->counts.n_tuples_returned);
 
  PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_fetched);
+ result = (int64) (dbentry->counts.n_tuples_fetched);
 
  PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_inserted);
+ result = (int64) (dbentry->counts.n_tuples_inserted);
 
  PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_updated);
+ result = (int64) (dbentry->counts.n_tuples_updated);
 
  PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_deleted);
+ result = (int64) (dbentry->counts.n_tuples_deleted);
 
  PG_RETURN_INT64(result);
 }
@@ -1405,7 +1402,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = dbentry->n_temp_files;
+ result = dbentry->counts.n_temp_files;
 
  PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = dbentry->n_temp_bytes;
+ result = dbentry->counts.n_temp_bytes;
 
  PG_RETURN_INT64(result);
 }
@@ -1436,7 +1433,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_tablespace);
+ result = (int64) (dbentry->counts.n_conflict_tablespace);
 
  PG_RETURN_INT64(result);
 }
@@ -1451,7 +1448,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_lock);
+ result = (int64) (dbentry->counts.n_conflict_lock);
 
  PG_RETURN_INT64(result);
 }
@@ -1466,7 +1463,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_snapshot);
+ result = (int64) (dbentry->counts.n_conflict_snapshot);
 
  PG_RETURN_INT64(result);
 }
@@ -1481,7 +1478,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_bufferpin);
+ result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
  PG_RETURN_INT64(result);
 }
@@ -1496,7 +1493,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_startup_deadlock);
+ result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
  PG_RETURN_INT64(result);
 }
@@ -1511,11 +1508,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_tablespace +
-  dbentry->n_conflict_lock +
-  dbentry->n_conflict_snapshot +
-  dbentry->n_conflict_bufferpin +
-  dbentry->n_conflict_startup_deadlock);
+ result = (int64) (dbentry->counts.n_conflict_tablespace +
+  dbentry->counts.n_conflict_lock +
+  dbentry->counts.n_conflict_snapshot +
+  dbentry->counts.n_conflict_bufferpin +
+  dbentry->counts.n_conflict_startup_deadlock);
 
  PG_RETURN_INT64(result);
 }
@@ -1530,7 +1527,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_deadlocks);
+ result = (int64) (dbentry->counts.n_deadlocks);
 
  PG_RETURN_INT64(result);
 }
@@ -1548,7 +1545,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_checksum_failures);
+ result = (int64) (dbentry->counts.n_checksum_failures);
 
  PG_RETURN_INT64(result);
 }
@@ -1585,7 +1582,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = ((double) dbentry->n_block_read_time) / 1000.0;
+ result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
  PG_RETURN_FLOAT8(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = ((double) dbentry->n_block_write_time) / 1000.0;
+ result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
  PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
  RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
  RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
  IdleInTransactionSessionTimeoutHandler);
+ RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+ IdleStatsUpdateTimeoutHandler);
  }
 
  /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
  SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+ IdleStatsUpdateTimeoutPending = true;
+ InterruptPending = true;
+ SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
- qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+ qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
  is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..9f1de1e42f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1a19921f80..4e137140bd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  * pgstat.h
  *
- * Definitions for the PostgreSQL statistics collector daemon.
+ * Definitions for the PostgreSQL activity statistics facility.
  *
  * Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
  TRACK_FUNC_ALL
 } TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
- PGSTAT_MTYPE_DUMMY,
- PGSTAT_MTYPE_INQUIRY,
- PGSTAT_MTYPE_TABSTAT,
- PGSTAT_MTYPE_TABPURGE,
- PGSTAT_MTYPE_DROPDB,
- PGSTAT_MTYPE_RESETCOUNTER,
- PGSTAT_MTYPE_RESETSHAREDCOUNTER,
- PGSTAT_MTYPE_RESETSINGLECOUNTER,
- PGSTAT_MTYPE_AUTOVAC_START,
- PGSTAT_MTYPE_VACUUM,
- PGSTAT_MTYPE_ANALYZE,
- PGSTAT_MTYPE_ARCHIVER,
- PGSTAT_MTYPE_BGWRITER,
- PGSTAT_MTYPE_FUNCSTAT,
- PGSTAT_MTYPE_FUNCPURGE,
- PGSTAT_MTYPE_RECOVERYCONFLICT,
- PGSTAT_MTYPE_TEMPFILE,
- PGSTAT_MTYPE_DEADLOCK,
- PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -116,13 +90,6 @@ typedef struct PgStat_TableCounts
  PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
- RESET_ARCHIVER,
- RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -181,280 +148,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
- StatMsgType m_type;
- int m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
- PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry Sent by a backend to ask the collector
- * to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
- PgStat_MsgHdr m_hdr;
- TimestampTz clock_time; /* observed local clock time */
- TimestampTz cutoff_time; /* minimum acceptable file timestamp */
- Oid databaseid; /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
- Oid t_id;
- PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat Sent by the backend to report table
- * and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter)) \
- / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- int m_xact_commit;
- int m_xact_rollback;
- PgStat_Counter m_block_read_time; /* times in microseconds */
- PgStat_Counter m_block_write_time;
- PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge Sent by the backend to tell the collector
- * about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
- / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- Oid m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb Sent by the backend to tell the collector
- * about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter Sent by the backend to tell the collector
- * to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
- PgStat_MsgHdr m_hdr;
- PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- * to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- PgStat_Single_Reset_Type m_resettype;
- Oid m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart Sent by the autovacuum daemon to signal
- * that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum Sent by the backend or autovacuum daemon
- * after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- Oid m_tableoid;
- bool m_autovacuum;
- TimestampTz m_vacuumtime;
- PgStat_Counter m_live_tuples;
- PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze Sent by the backend or autovacuum daemon
- * after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- Oid m_tableoid;
- bool m_autovacuum;
- bool m_resetcounter;
- TimestampTz m_analyzetime;
- PgStat_Counter m_live_tuples;
- PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
- PgStat_MsgHdr m_hdr;
- bool m_failed; /* Failed attempt */
- char m_xlog[MAX_XFN_CHARS + 1];
- TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
 /* ----------
- * PgStat_MsgBgWriter Sent by the bgwriter to update statistics.
+ * PgStat_BgWriter bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgBgWriter
+typedef struct PgStat_BgWriter
 {
- PgStat_MsgHdr m_hdr;
-
- PgStat_Counter m_timed_checkpoints;
- PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_clean;
- PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
- PgStat_Counter m_buf_alloc;
- PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
- PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
- PgStat_MsgHdr m_hdr;
-
- Oid m_databaseid;
- int m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
- PgStat_MsgHdr m_hdr;
-
- Oid m_databaseid;
- size_t m_filesize;
-} PgStat_MsgTempFile;
+ PgStat_Counter timed_checkpoints;
+ PgStat_Counter requested_checkpoints;
+ PgStat_Counter buf_written_checkpoints;
+ PgStat_Counter buf_written_clean;
+ PgStat_Counter maxwritten_clean;
+ PgStat_Counter buf_written_backend;
+ PgStat_Counter buf_fsync_backend;
+ PgStat_Counter buf_alloc;
+ PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+ PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +205,8 @@ typedef struct PgStat_FunctionEntry
  PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat Sent by the backend to report function
- * usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
- / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge Sent by the backend to tell the collector
- * about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
- / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- Oid m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock Sent by the backend to tell the collector
- * about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure Sent by the backend to tell the collector
- * about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_failurecount;
- TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
- PgStat_MsgHdr msg_hdr;
- PgStat_MsgDummy msg_dummy;
- PgStat_MsgInquiry msg_inquiry;
- PgStat_MsgTabstat msg_tabstat;
- PgStat_MsgTabpurge msg_tabpurge;
- PgStat_MsgDropdb msg_dropdb;
- PgStat_MsgResetcounter msg_resetcounter;
- PgStat_MsgResetsharedcounter msg_resetsharedcounter;
- PgStat_MsgResetsinglecounter msg_resetsinglecounter;
- PgStat_MsgAutovacStart msg_autovacuum_start;
- PgStat_MsgVacuum msg_vacuum;
- PgStat_MsgAnalyze msg_analyze;
- PgStat_MsgArchiver msg_archiver;
- PgStat_MsgBgWriter msg_bgwriter;
- PgStat_MsgFuncstat msg_funcstat;
- PgStat_MsgFuncpurge msg_funcpurge;
- PgStat_MsgRecoveryConflict msg_recoveryconflict;
- PgStat_MsgDeadlock msg_deadlock;
- PgStat_MsgTempFile msg_tempfile;
- PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures (on file and in shared memory) follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +215,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
- Oid databaseid;
  PgStat_Counter n_xact_commit;
  PgStat_Counter n_xact_rollback;
  PgStat_Counter n_blocks_fetched;
@@ -600,7 +227,6 @@ typedef struct PgStat_StatDBEntry
  PgStat_Counter n_tuples_inserted;
  PgStat_Counter n_tuples_updated;
  PgStat_Counter n_tuples_deleted;
- TimestampTz last_autovac_time;
  PgStat_Counter n_conflict_tablespace;
  PgStat_Counter n_conflict_lock;
  PgStat_Counter n_conflict_snapshot;
@@ -610,29 +236,52 @@ typedef struct PgStat_StatDBEntry
  PgStat_Counter n_temp_bytes;
  PgStat_Counter n_deadlocks;
  PgStat_Counter n_checksum_failures;
- TimestampTz last_checksum_failure;
  PgStat_Counter n_block_read_time; /* times in microseconds */
  PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+ Oid databaseid;
+ TimestampTz last_autovac_time;
+ TimestampTz last_checksum_failure;
  TimestampTz stat_reset_timestamp;
- TimestampTz stats_timestamp; /* time of db stats file update */
+ TimestampTz stats_timestamp; /* time of db stats update */
+
+ PgStat_StatDBCounts counts;
 
  /*
- * tables and functions must be last in the struct, because we don't write
- * the pointers out to the stats file.
+ * The following members must be last in the struct, because we don't write
+ * them out to the stats file.
  */
- HTAB   *tables;
- HTAB   *functions;
+ dshash_table_handle tables; /* current gen tables hash */
+ dshash_table_handle functions; /* current gen functions hash */
+ LWLock lock; /* Lock for the above members */
+
+ /* non-shared members */
+ HTAB   *snapshot_tables; /* table entry snapshot */
+ HTAB   *snapshot_functions; /* function entry snapshot */
+ dshash_table *dshash_tables; /* attached tables dshash */
+ dshash_table *dshash_functions; /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
- * PgStat_StatTabEntry The collector's data per table (or index)
+ * PgStat_StatTabEntry The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
  Oid tableid;
+ TimestampTz vacuum_timestamp; /* user initiated vacuum */
+ TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
+ TimestampTz analyze_timestamp; /* user initiated */
+ TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
 
  PgStat_Counter numscans;
 
@@ -651,19 +300,15 @@ typedef struct PgStat_StatTabEntry
  PgStat_Counter blocks_fetched;
  PgStat_Counter blocks_hit;
 
- TimestampTz vacuum_timestamp; /* user initiated vacuum */
  PgStat_Counter vacuum_count;
- TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
  PgStat_Counter autovac_vacuum_count;
- TimestampTz analyze_timestamp; /* user initiated */
  PgStat_Counter analyze_count;
- TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
  PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry The collector's data per function
+ * PgStat_StatFuncEntry Per-function statistics data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +323,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +339,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -760,7 +405,6 @@ typedef enum
  WAIT_EVENT_CHECKPOINTER_MAIN,
  WAIT_EVENT_LOGICAL_APPLY_MAIN,
  WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
- WAIT_EVENT_PGSTAT_MAIN,
  WAIT_EVENT_RECOVERY_WAL_ALL,
  WAIT_EVENT_RECOVERY_WAL_STREAM,
  WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1001,7 +645,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1198,13 +842,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the corresponding GUCs */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1219,29 +865,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int pgstat_start(void);
-extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void pgstat_reset_all(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1402,8 +1045,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
   void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1412,11 +1055,14 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(PgStat_StatDBEntry *dbent, Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
  LWTRANCHE_TBM,
  LWTRANCHE_PARALLEL_APPEND,
  LWTRANCHE_SXACT,
+ LWTRANCHE_STATS,
  LWTRANCHE_FIRST_USER_DEFINED
 } BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
  STANDBY_TIMEOUT,
  STANDBY_LOCK_TIMEOUT,
  IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+ IDLE_STATS_UPDATE_TIMEOUT,
  /* First user-definable timeout reason */
  USER_TIMEOUT,
  /* Maximum number of timeout reasons */
--
2.18.2


From bfc8b896ed12d29b8185a7053b3ed586b23e2487 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v25 7/8] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c6f95fa688..12c8d19ccb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8135,9 +8135,9 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3cac340f32..8cd86beb9d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6944,11 +6944,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -6964,14 +6964,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7002,9 +7001,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8022,7 +8021,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8035,7 +8034,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics collection. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index bc4d98fe03..d56afa17db 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2357,12 +2357,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on the
+    primary, so replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..9605e0ebd4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> facility is a subsystem that supports tracking and
+   reporting of information about server activity.  Presently, it tracks
+   accesses to tables and indexes in both disk-block and individual-row
+   terms.  It also tracks the total number of rows in each table, and
+   information about vacuum and analyze actions for each table.  It can also
+   track calls to user-defined functions and the total time spent in each
+   one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since activity statistics tracking adds some overhead to query execution,
+   the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics are stored in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, no more frequently
+   than once per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (500 ms
+   unless altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="13"><literal>Activity</literal></entry>
+         <entry morerows="12"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalAll</literal></entry>
          <entry>Waiting for WAL from any kind of source (local, archive or stream) at recovery.</entry>
@@ -4156,9 +4148,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 13bd320b31..52c61d222a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1259,11 +1259,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   tracked by the activity statistics facility.  If this is undesirable, you
+   can set parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
--
2.18.2


From bb7d2f7184169280fa45c3fa6e69776d37a6de4a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v25 8/8] Remove the GUC stats_temp_directory

This GUC used to specify the directory in which to store temporary
statistics files. It is no longer needed by the statistics subsystem, but
is still referenced by the programs in bin and contrib, and possibly by
other extensions. Thus, this patch removes the GUC but leaves some backing
variables and macro definitions in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd86beb9d..7f6056b9e9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7056,25 +7056,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 34a4005791..4cd8530e91 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -96,15 +96,12 @@ bool pgstat_track_counts = false;
 int pgstat_track_functions = TRACK_FUNC_OFF;
 int pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable. It is no longer used in this file, but
+ * is kept, with its default value, for the backward compatibility of
+ * extensions.
  */
-char   *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char   *pgstat_stat_filename = NULL;
-char   *pgstat_stat_tmpname = NULL;
+char   *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap information, protected by StatsLock */
 typedef struct StatsShmemStruct
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 806d013108..c086ab781b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -251,15 +251,12 @@ perform_base_backup(basebackup_options *opt)
  TimeLineID endtli;
  StringInfo labelfile;
  StringInfo tblspc_map_file = NULL;
- int datadirpathlen;
  List   *tablespaces = NIL;
 
  backup_total = 0;
  backup_streamed = 0;
  pgstat_progress_start_command(PROGRESS_COMMAND_BASEBACKUP, InvalidOid);
 
- datadirpathlen = strlen(DataDir);
-
  backup_started_in_recovery = RecoveryInProgress();
 
  labelfile = makeStringInfo();
@@ -291,13 +288,9 @@ perform_base_backup(basebackup_options *opt)
  * Calculate the relative path of temporary statistics directory in
  * order to skip the files which are located in that directory later.
  */
- if (is_absolute_path(pgstat_stat_directory) &&
- strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
- statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
- else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
- statrelpath = psprintf("./%s", pgstat_stat_directory);
- else
- statrelpath = pgstat_stat_directory;
+
+ Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+ statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
  /* Add a node for the base directory at the end */
  ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..417fbbdc5d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4193,17 +4192,6 @@ static struct config_string ConfigureNamesString[] =
  NULL, NULL, NULL
  },
 
- {
- {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
- gettext_noop("Writes temporary statistics files to the specified directory."),
- NULL,
- GUC_SUPERUSER_ONLY
- },
- &pgstat_temp_directory,
- PG_STAT_TMP_DIR,
- check_canonical_path, assign_pgstat_temp_directory, NULL
- },
-
  {
  {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
  gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11489,35 +11477,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
- /* check_canonical_path already canonicalized newval for us */
- char   *dname;
- char   *tname;
- char   *fname;
-
- /* directory */
- dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
- sprintf(dname, "%s", newval);
-
- /* global stats */
- tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
- sprintf(tname, "%s/global.tmp", newval);
- fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
- sprintf(fname, "%s/global.stat", newval);
-
- if (pgstat_stat_directory)
- free(pgstat_stat_directory);
- pgstat_stat_directory = dname;
- if (pgstat_stat_tmpname)
- free(pgstat_stat_tmpname);
- pgstat_stat_tmpname = tname;
- if (pgstat_stat_filename)
- free(pgstat_stat_filename);
- pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..207e042e99 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -573,7 +573,6 @@
 #track_io_timing = off
 #track_functions = none # none, pl, all
 #track_activity_query_size = 1024 # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e137140bd..062f393941 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9575268bd7..f3340f726c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
  print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
   if defined $ENV{TEMP_CONFIG};
 
- # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
- # concurrently must not share a stats_temp_directory.
- print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
  if ($params{allows_streaming})
  {
  if ($params{allows_streaming} eq "logical")
--
2.18.2


Re: shared-memory based stats collector

Andres Freund
Hi,

On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
> > I think we also can get rid of the dshash_delete changes, by instead
> > adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> > or such.
>
> [009] (Fixed)
> I'm not sure about the point of having two interfaces that are hard to
> distinguish.  Maybe dshash_delete_current(dshash_seq_stat *status) is
> enough. I also reverted the dshash_delete().

Well, dshash_delete() cannot generally safely be used together with
iteration. It has to be the current element etc. And I think the locking
changes make dshash less robust. By explicitly tying "delete the current
element" to the iterator, most of that can be avoided.



> > >  /* SIGUSR1 signal handler for archiver process */
> >
> > Hm - this currently doesn't set up a correct sigusr1 handler for a
> > shared memory backend - needs to invoke procsignal_sigusr1_handler
> > somewhere.
> >
> > We can probably just convert to using normal latches here, and remove
> > the current 'wakened' logic? That'll remove the indirection via
> > postmaster too, which is nice.
>
> [018] (Fixed, separate patch 0005)
> It seems better. I added it as a separate patch just after the patch
> that turns archiver an auxiliary process.

I don't think it's correct to do it separately, but I can just merge
that on commit.


> > > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> > >
> > >   switch (backendType)
> > >   {
> > > + case B_ARCHIVER:
> > > + backendDesc = "archiver";
> > > + break;
> >
> > should imo include 'WAL' or such.
>
> [019] (Not Fixed)
> It is already named "archiver" by 8e8a0becb3. Do I rename it in this
> patch set?

Oh. No, don't rename it as part of this. Could you reply to the thread
in which Peter made that change, and reference this complaint?


> [021] (Fixed, separate patch 0007)
> However the "statistics collector process" is gone, I'm not sure
> "statistics collector" feature also is gone. But actually the word
> "collector" looks a bit odd in some context. I replaced "the results
> of statistics collector" with "the activity statistics". (I'm not sure
> "the activity statistics" is proper as a subsystem name.) The word
> "collect" is replaced with "track".  I didn't change section IDs
> corresponding to the renaming so that old links can work. I also fixed
> the tranche name for LWTRANCHE_STATS from "activity stats" to
> "activity_statistics"

Without having gone through the changes, that sounds like the correct
direction to me. There's no "collector" anymore, so removing that seems
like the right thing.


> > > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > > index ca5c6376e5..1ffe073a1f 100644
> > > --- a/src/backend/postmaster/pgstat.c
> > > +++ b/src/backend/postmaster/pgstat.c
> > > + *  Collects per-table and per-function usage statistics of all backends on
> > > + *  shared memory. pg_count_*() and friends are the interface to locally store
> > > + *  backend activities during a transaction. Then pgstat_flush_stat() is called
> > > + *  at the end of a transaction to pulish the local stats on shared memory.
> > >   *
> >
> > I'd rather not exhaustively list the different objects this handles -
> > it'll either be annoying to maintain, or just get out of date.
>
> [024] (Fixed, Maybe)
> Although I'm not sure I get you correctly, I rewrote it as follows.
>
>  *  Collects per-table and per-function usage statistics of all backends on
>  *  shared memory. The activity numbers are once stored locally, then written
>  *  to shared memory at commit time or by idle-timeout.

s/backends on/backends in/

I was thinking of something like:
 *  Collects activity statistics, e.g. per-table access statistics, of
 *  all backends in shared memory. The activity numbers are first stored
 *  locally in each process, then flushed to shared memory at commit
 *  time or by idle-timeout.



> > > - * - Add some automatic call for pgstat vacuuming.
> > > + *  To avoid congestion on the shared memory, we update shared stats no more
> > > + *  often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> > > + *  all the local numbers cannot be flushed immediately, we postpone updates
> > > + *  and try the next chance after the interval of
> > > + *  PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> > > + *  PGSTAT_STAT_MAX_INTERVAL(1000ms).
> >
> > I'm not convinced by this backoff logic. The basic interval seems quite
> > high for something going through shared memory, and the max retry seems
> > pretty low.
>
> [025] (Not Fixed)
> Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
> 10000) reasonable?

Partially. I think for access to shared resources we want *increasing*
wait times, rather than shorter retry timeout. The goal should be to be
to make it more likely for all processes to be able to flush their
stats, which can be achieved by flushing less often after hitting
contention.
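An increasing backoff of the kind suggested here could be computed as below. This is only a sketch of the suggestion, not anything in the patch; the constants and the function name are invented and would need tuning:

```c
#include <assert.h>

/* Invented bounds, in milliseconds; actual values would need tuning. */
#define FLUSH_BASE_INTERVAL_MS 500
#define FLUSH_MAX_INTERVAL_MS  60000

/*
 * Double the retry interval each time a flush attempt hits contention, up to
 * a cap, so that backends under contention try to flush less and less often
 * instead of hammering the shared hash with short retries.
 */
static int
next_flush_interval(int prev_ms)
{
    if (prev_ms <= 0)
        return FLUSH_BASE_INTERVAL_MS;
    return prev_ms * 2 > FLUSH_MAX_INTERVAL_MS
        ? FLUSH_MAX_INTERVAL_MS
        : prev_ms * 2;
}
```

On a successful flush the caller would reset its interval back to the base value, so only processes actually experiencing contention back off.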

> > > +/*
> > > + * BgWriter global statistics counters. The name cntains a remnant from the
> > > + * time when the stats collector was a dedicate process, which used sockets to
> > > + * send it.
> > > + */
> > > +PgStat_MsgBgWriter BgWriterStats = {0};
> >
> > I am strongly against keeping the 'Msg' prefix. That seems extremely
> > confusing going forward.
>
> [029] (Fixed) (Related  to [046])
> Mmm. It's following your old suggestion to avoid unsubstantial
> diffs. I'm happy to change it. The functions that have "send" in their
> names are for the same reason. I removed the prefix "m_" of the
> members of the struct. (The comment above (with a typo) explains that).

I don't object to having the rename be a separate patch...


> > > + if (StatsShmem->refcount > 0)
> > > + StatsShmem->refcount++;
> >
> > What prevents us from leaking the refcount here? We could e.g. error out
> > while attaching, no? Which'd mean we'd leak the refcount.
>
> [033] (Fixed)
> We don't attach shared stats on postmaster process, so I want to know
> the first attacher process and the last detacher process of shared
> stats.  It's not leaks that I'm considering here.
> (continued below)
>
> > To me it looks like there's a lot of added complexity just because you
> > want to be able to reset stats via
> >
> > void
> > pgstat_reset_all(void)
> > {
> >
> > /*
> > * We could directly remove files and recreate the shared memory area. But
> > * detach then attach for simplicity.
> > */
> > pgstat_detach_shared_stats(false); /* Don't write */
> > pgstat_attach_shared_stats();
> >
> > Without that you'd not need the complexity of attaching, detaching to
> > the same degree - every backend could just cache lookup data during
> > initialization, instead of having to constantly re-compute that.
>
> Mmm. I don't get that (or I failed to read clear meaning). The
> function is assumed be called only from StartupXLOG().
> (continued)

Oh? I didn't get that you're only using it for that purpose - there's
very little documentation about what it's trying to do.

I don't see why that means we don't need to accurately track the
refcount? Otherwise we'll forget to write out the stats.
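One way to keep such a refcount accurate against mid-attach errors is to bump it only after initialization has fully succeeded. The following is a standalone sketch of that ordering, not the patch's code: the names are invented, and setjmp/longjmp stand in for PostgreSQL's elog(ERROR) non-local exit:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

static jmp_buf errenv;      /* stand-in for PostgreSQL's error-recovery jump */
static int refcount = 0;    /* stand-in for StatsShmem->refcount */

static void simulated_elog_error(void) { longjmp(errenv, 1); }

/*
 * Attach to the shared stats area.  The refcount is incremented only once
 * initialization has fully succeeded, so an error thrown while attaching
 * cannot leak a reference (which would make the last detacher miscount and
 * skip writing out the stats file).
 */
static bool attach_shared_stats(bool fail_during_init)
{
    if (setjmp(errenv) != 0)
        return false;           /* error path: refcount left untouched */

    /* ... dsa/dshash attach work that may error out ... */
    if (fail_during_init)
        simulated_elog_error();

    refcount++;                 /* only reachable after full success */
    return true;
}
```

An alternative with the same effect is PostgreSQL's PG_ENSURE_ERROR_CLEANUP-style pattern: increment first, but register a cleanup that decrements on error.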


> > Nor would the dynamic re-creation of the db dshash table be needed.
>
> Maybe you are mentioning the complexity of reset_dbentry_counters? It
> is actually complex.  Shared stats dshash cannot be destroyed (or
> dshash entry cannot be removed) during someone is working on it. It
> was simpler to wait for another process to end its work but that could
> slow not only the clearing process but also other processes by
> frequent resetting of counters.

I was referring to the fact that the last version of the patch
attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
attach_table_hash, attach_function_hash etc.


> After some thoughts, I decided to rip the all "generation" stuff off
> and it gets far simpler. But counter reset may conflict with other
> backends with a litter higher degree because counter reset needs
> exclusive lock.

That seems harmless to me - stats reset should never happen at a high
enough frequency for the contention it causes to be problematic. There's
also an argument to be made that it makes sense for the reset to be atomic.


> > > + /* Flush out table stats */
> > > + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > > + pending_stats = true;
> > > +
> > > + /* Flush out function stats */
> > > + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > > + pending_stats = true;
> >
> > This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> > on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> > being confusing while reading the code, it also made the diff much
> > harder to read.
>
> [035] (Maybe Fixed)
> Is the question that, is there any case where
> pgstat_flush_stat/functions leaves some counters unflushed?

No, the point is that there's knowledge about
pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
namely the pgStatTabList, pgStatFunctions lists.


> > Why do we still have this? A hashtable lookup is cheap, compared to
> > fetching a file - so it's not to save time. Given how infrequent the
> > pgstat_fetch_* calls are, it's not to avoid contention either.
> >
> > At first one could think it's for consistency - but no, that's not it
> > either, because snapshot_statentry() refetches the snapshot without
> > control from the outside:
>
> [038]
> I don't get the second paragraph. When the function re*create*s a
> snapshot without control from the outside? It keeps snapshots during a
> transaction.  If not, it is broken.
> (continued)

Maybe I just misunderstood the code flow - partially due to the global
clear_snapshot variable. I just had read the
+     * We don't want so frequent update of stats snapshot. Keep it at least
+     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.

comment, and took it to mean that you're unconditionally updating the
snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
have consistent snapshot across all fetches.

(partially this might have been due to the diff:
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want so frequent update of stats snapshot. Keep it at least
+     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
)

But I think my question remains: Why do we need the whole snapshot thing
now? Previously we needed to avoid reading a potentially large file -
but that's not a concern anymore?


> > >   /*
> > >    * We don't want so frequent update of stats snapshot. Keep it at least
> > >    * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> > >    */
> ...
> > I think we should just remove this entire local caching snapshot layer
> > for lookups.
>
> Currently the behavior is documented as the follows and it seems reasonable.
>
>    Another important point is that when a server process is asked to display
>    any of these statistics, it first fetches the most recent report emitted by
>    the collector process and then continues to use this snapshot for all
>    statistical views and functions until the end of its current transaction.
>    So the statistics will show static information as long as you continue the
>    current transaction.  Similarly, information about the current queries of
>    all sessions is collected when any such information is first requested
>    within a transaction, and the same information will be displayed throughout
>    the transaction.
>    This is a feature, not a bug, because it allows you to perform several
>    queries on the statistics and correlate the results without worrying that
>    the numbers are changing underneath you.  But if you want to see new
>    results with each query, be sure to do the queries outside any transaction
>    block.  Alternatively, you can invoke
>    <function>pg_stat_clear_snapshot</function>(), which will discard the
>    current transaction's statistics snapshot (if any).  The next use of
>    statistical information will cause a new snapshot to be fetched.

I am very unconvinced this is worth the cost. Especially because plenty
of other stats related parts of the system do *NOT* behave this way. How
is a user supposed to understand that pg_stat_database behaves one way,
pg_stat_activity, another, pg_stat_statements a third,
pg_stat_progress_* ...

Perhaps it's best to not touch the semantics here, but I'm also very
wary of introducing significant complications and overhead just to have
this "feature".


> > >       for (i = 0; i < tsa->tsa_used; i++)
> > >       {
> > >           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> > >
> <many TableStatsArray code>
> > >               hash_entry->tsa_entry = entry;
> > >               dest_elem++;
> > >           }
> >
> > This seems like too much code. Why is this entirely different from the
> > way funcstats works? The difference was already too big before, but this
> > made it *way* worse.
>
> [040]
> We don't flush stats until transaction ends. So the description about
> TabStatusArray is stale?

How is your comment related to my comment above?


> > > bool
> > > pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> > >                    PgStat_TableStatus *entry)
> > > {
> > >   Oid     dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> > >   int     table_mode = PGSTAT_EXCLUSIVE;
> > >   bool    updated = false;
> > >   dshash_table *tabhash;
> > >   PgStat_StatDBEntry *dbent;
> > >   int     generation;
> > >
> > >   if (nowait)
> > >       table_mode |= PGSTAT_NOWAIT;
> > >
> > >   /* Attach required table hash if not yet. */
> > >   if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> > >   {
> > >       /*
> > >        *  Return if we don't have corresponding dbentry. It would've been
> > >        *  removed.
> > >        */
> > >       dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> > >       if (!dbent)
> > >           return false;
> > >
> > >       /*
> > >        * We don't hold lock on the dbentry since it cannot be dropped while
> > >        * we are working on it.
> > >        */
> > >       generation = pin_hashes(dbent);
> > >       tabhash = attach_table_hash(dbent, generation);
> >
> > This again is just cost incurred by insisting on destroying hashtables
> > instead of keeping them around as long as necessary.
>
> [040]
> Maybe you are insisting on the reverse? The pin_hash complexity is left
> in this version. -> [033]

What do you mean? What I'm saying is that we should never end up in a
situation where there's no pgstat entry for the current database. And
that that's trivial, as long as we don't drop the hashtable, but instead
reset counters to 0.


> > >   dbentry = pgstat_get_db_entry(MyDatabaseId,
> > >                                 PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> > >                                 &status);
> > >
> > >   if (status == LOCK_FAILED)
> > >       return;
> > >
> > >   /* We had a chance to flush immediately */
> > >   pgstat_flush_recovery_conflict(dbentry);
> > >
> > >   dshash_release_lock(pgStatDBHash, dbentry);
> >
> > But I don't understand why? Nor why we'd not just report all pending
> > database wide changes in that case?
> >
> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
>
> [043] (Maybe fixed) (Related to [045].)
> Vacuum, analyze, DROP DB and reset cannot be delayed. So the
> conditional lock is mainly used by
> pgstat_report_stat().

You're saying "cannot be delayed" - but you're not explaining *why* that
is.

Even if true, I don't see why that necessitates doing the flushing and
locking once for each of these functions?


> dshash_find_or_insert didn't allow a shared lock. I changed
> dshash_find_extended to allow a shared lock even if it is told to create
> a missing entry. Although it takes an exclusive lock at the moment of
> entry creation, in most cases it doesn't need the exclusive lock. This
> allows using a shared lock while processing vacuum or analyze stats.

Huh?


> Previously I thought that we could work on a shared database entry while
> the lock is not held, but actually there are cases where insertion of a
> new database entry causes a rehash (resize). The operation moves entries,
> so we need at least a shared lock on the database entry while we are working
> on it.  So in the attached version, most operations basically work by the
> following steps.
> - get shared database entry with shared lock
>   - attach table/function hash
>     - fetch an entry with exclusive lock
>       - update entry
> - release the table/function entry
>   - detach table/function hash
>   if needed
>     - take LW_EXCLUSIVE on database entry
>       - update database numbers
>     - release LWLock
> - release shared database entry

Just to be crystal clear: I am exceedingly unlikely to commit this with
any sort of short term attach/detach operations. Both because of the
runtime overhead/contention it causes is significant, and because of the
code complexity implied by it.


Leaving attach/detach aside: I think it's a complete no-go to acquire
database wide locks at this frequency, and then to hold them over other
operations that are a) not cheap b) can block. The contention due to
that would be *terrible* for scalability, even if it's just a shared
lock.

The way this *should* work is that:
1.1) At backend startup, attach to the database wide hashtable
1.2) At backend startup, attach to the various per-database hashtables
  (including ones for shared tables)
2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
3.1) When shutting down backend, detach from all hashtables


That way we never need to hold onto the database-wide hashtables for
long, and we can do it with conditional locks (trylock above), unless we
need to force flushing.

It might be worthwhile to merge per-table, per-function, per-database
hashes into a single hash. Where the key is either something like
{hashkind, objoid} (referenced from a per-database hashtable), or even
{hashkind, dboid, objoid} (one global hashtable).


I think the contents of the hashtable should likely just be a single
dsa_pointer (plus some bookkeeping). Several reasons for that:

1) Since one goal of this is to make the stats system more extensible,
  it seems important that we can make the set of stats kept
  runtime configurable. Otherwise everyone will continue to have to pay
  the price for every potential stat that we have an option to track.

2) Having hashtable resizes move fairly large stat entries around is
   expensive. Whereas just moving key + dsa_pointer around is pretty
   cheap. I don't think the cost of a pointer dereference matters in
   *this* case.

3) If the stats contents aren't moved around, there's no need to worry
   about hashtable resizes. Therefore the stats can be referenced
   without holding dshash partition locks.

4) If the stats entries aren't moved around by hashtable resizes, we can
   use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
   not generally correct/safe to have dshash resize to move those
   around.


All of that would be addressed if we instead allocate the stats data
separately from the dshash entry.

Greetings,

Andres Freund



Re: shared-memory based stats collector

Andres Freund
In reply to this post by Thomas Munro-5
Hi,

On 2020-03-19 16:51:59 +1300, Thomas Munro wrote:
> On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <[hidden email]> wrote:
> > Thomas, could you look at the first two patches here, and my review
> > questions?
>
> Ack.

Thanks!


> > >               dsa_pointer item_pointer = hash_table->buckets[i];
> > > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> > >                                                               LW_EXCLUSIVE));
> > >
> > >       delete_item(hash_table, item);
> > > -     hash_table->find_locked = false;
> > > -     hash_table->find_exclusively_locked = false;
> > > -     LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > > +
> > > +     /* We need to keep partition lock while sequential scan */
> > > +     if (!hash_table->seqscan_running)
> > > +     {
> > > +             hash_table->find_locked = false;
> > > +             hash_table->find_exclusively_locked = false;
> > > +             LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > > +     }
> > >  }
> >
> > This seems like a failure prone API.
>
> If I understand correctly, the only purpose of the seqscan_running
> variable is to control that behaviour ^^^.  That is, to make
> dshash_delete_entry() keep the partition lock if you delete an entry
> while doing a seq scan.  Why not get rid of that, and provide a
> separate interface for deleting while scanning?
> dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
> would be most common to want to delete the "current" item in the seq
> scan, but it could allow you to delete anything in the same partition,
> or any entry if using the "consistent" mode.  Oh, I see that Andres
> said the same thing later.


> > [Andres complaining about comments and language stuff]
>
> I would be happy to proof read and maybe extend the comments (writing
> new comments will also help me understand and review the code!), and
> maybe some code changes to move this forward.  Horiguchi-san, are you
> working on another version now?  If so I'll wait for it before I do
> that.

Cool! Being ESL myself and mildly dyslexic to boot, that'd be
helpful. But I'd hold off for a moment, because I think there'll need to
be some open heart surgery on this patch (see the bottom of my last email in
this thread, from minutes ago (don't yet have a message id, sorry)).


> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
>
> Right, I also noticed that:
>
>     /*
>      * Local table stats should be applied to both dbentry and tabentry at
>      * once. Update dbentry only if we could update tabentry.
>      */
>     if (pgstat_update_tabentry(tabhash, entry, nowait))
>     {
>         pgstat_update_dbentry(dbent, entry);
>         updated = true;
>     }
>
> So pgstat_update_tabentry() goes to great trouble to take locks
> conditionally, but then pgstat_update_dbentry() immediately does:
>
>     LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
>     dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
>     dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
>     dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
>     dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
>     dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
>     dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
>     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
>     LWLockRelease(&dbentry->lock);
>
> Why can't we be "lazy" with the dbentry stats too?  Is it really
> important for the table stats and DB stats to agree with each other?

We *need* to be lazy here, I think.


> Hmm.  Even if you change the above code use a conditional lock, I am
> wondering (admittedly entirely without data) if this approach is still
> too clunky: even trying and failing to acquire the lock creates
> contention, just a bit less.  I wonder if it would make sense to make
> readers do more work, so that writers can avoid contention.  For
> example, maybe PgStat_StatDBEntry could hold an array of N sets of
> counters, and readers have to add them all up.  An advanced version of
> this idea would use a reasonably fresh copy of something like
> sched_getcpu() and numa_node_of_cpu() to select a partition to
> minimise contention and cross-node traffic, with a portable fallback
> based on PID or something.  CPU core/node awareness is something I
> haven't looked into too seriously, but it's been on my mind to solve
> some other problems.

I don't think we really need that for the per-object stats. The easier
way to address that is to instead reduce the rate of flushing to the
shared table. There's not really a problem with the shared state of the
stats lagging by a few hundred ms or so.

The amount of code complexity a scheme like the one you describe implies
doesn't seem worth it to me without very clear evidence it's needed. If we
didn't need to handle the case where the "static" slots are insufficient to
hold all the stats, it'd be different. But given the number of tables etc.
that can exist in systems, I don't think that's achievable.


I think we should go for per-backend counters for other parts of the
system though. I think it should basically be the default for cluster
wide stats like IO (even if we additionally flush it to per table
stats). Currently we have more complicated schemes for those. But that's
imo a separate patch.


Thanks!

Andres



Re: shared-memory based stats collector

Kyotaro Horiguchi-4
In reply to this post by Thomas Munro-5
Thank you for looking this.

At Thu, 19 Mar 2020 16:51:59 +1300, Thomas Munro <[hidden email]> wrote in

> > This seems like a failure prone API.
>
> If I understand correctly, the only purpose of the seqscan_running
> variable is to control that behaviour ^^^.  That is, to make
> dshash_delete_entry() keep the partition lock if you delete an entry
> while doing a seq scan.  Why not get rid of that, and provide a
> separate interface for deleting while scanning?
> dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
> would be most common to want to delete the "current" item in the seq
> scan, but it could allow you to delete anything in the same partition,
> or any entry if using the "consistent" mode.  Oh, I see that Andres
> said the same thing later.

The attached v25 in [1] is the new version.

> > Why does this patch add the consistent mode? There's no users currently?
> > Without it's not clear that we need a seperate _term function, I think?
>
> +1, let's not do that if we don't need it!

Yes, it is removed.

> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
>
> Right, I also noticed that:

I think I fixed all cases except drops and the like, which need an
exclusive lock.

> So pgstat_update_tabentry() goes to great trouble to take locks
> conditionally, but then pgstat_update_dbentry() immediately does:
>
>     LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
>     dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
>     dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
>     dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
>     dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
>     dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
>     dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
>     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
>     LWLockRelease(&dbentry->lock);
>
> Why can't we be "lazy" with the dbentry stats too?  Is it really
> important for the table stats and DB stats to agree with each other?
> Even if it were, your current coding doesn't achieve that: the table
> stats are updated before the DB stat under different locks, so I'm not
> sure why it can't wait longer.

It is done the lazy way now.

> Hmm.  Even if you change the above code use a conditional lock, I am
> wondering (admittedly entirely without data) if this approach is still
> too clunky: even trying and failing to acquire the lock creates
> contention, just a bit less.  I wonder if it would make sense to make
> readers do more work, so that writers can avoid contention.  For
> example, maybe PgStat_StatDBEntry could hold an array of N sets of
> counters, and readers have to add them all up.  An advanced version of

I considered that kind of solution, but it needs more memory, multiplied
by the number of backends. If the contention turns out not to be negligible,
we can go back to a stats collector process connected via sockets that
shares the result in shared memory. The original motivation was to avoid
the file I/O when backends read stats.

> this idea would use a reasonably fresh copy of something like
> sched_getcpu() and numa_node_of_cpu() to select a partition to
> minimise contention and cross-node traffic, with a portable fallback
> based on PID or something.  CPU core/node awareness is something I
> haven't looked into too seriously, but it's been on my mind to solve
> some other problems.

I have been asked about CPU core/node awareness several times.  There
might be a certain degree of need for it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

Kyotaro Horiguchi-4
In reply to this post by Andres Freund
Hello.

At Thu, 19 Mar 2020 12:54:10 -0700, Andres Freund <[hidden email]> wrote in

> Hi,
>
> On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
> > > I think we also can get rid of the dshash_delete changes, by instead
> > > adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> > > or such.
> >
> > [009] (Fixed)
> > I'm not sure about the point of having two interfaces that are hard to
> > distinguish.  Maybe dshash_delete_current(dshash_seq_stat *status) is
> > enough. I also reverted the dshash_delete() changes.
>
> Well, dshash_delete() cannot generally safely be used together with
> iteration. It has to be the current element etc. And I think the locking
> changes make dshash less robust. By explicitly tying "delete the current
> element" to the iterator, most of that can be avoided.
Sure.  By the way, I forgot to remove the seqscan_running stuff. Removed.

> > > >  /* SIGUSR1 signal handler for archiver process */
> > >
> > > Hm - this currently doesn't set up a correct sigusr1 handler for a
> > > shared memory backend - needs to invoke procsignal_sigusr1_handler
> > > somewhere.
> > >
> > > We can probably just convert to using normal latches here, and remove
> > > the current 'wakened' logic? That'll remove the indirection via
> > > postmaster too, which is nice.
> >
> > [018] (Fixed, separate patch 0005)
> > It seems better. I added it as a separate patch just after the patch
> > that turns archiver an auxiliary process.
>
> I don't think it's correct to do it separately, but I can just merge
> that on commit.
Yes, it's just for the convenience of reviewing. Merged.

> > > > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> > > >
> > > >   switch (backendType)
> > > >   {
> > > > + case B_ARCHIVER:
> > > > + backendDesc = "archiver";
> > > > + break;
> > >
> > > should imo include 'WAL' or such.
> >
> > [019] (Not Fixed)
> > It is already named "archiver" by 8e8a0becb3. Do I rename it in this
> > patch set?
>
> Oh. No, don't rename it as part of this. Could you reply to the thread
> in which Peter made that change, and reference this complaint?
I sent a mail like that.

https://www.postgresql.org/message-id/20200327.163007.128069746774242774.horikyota.ntt%40gmail.com

> > [021] (Fixed, separate patch 0007)
> > However the "statistics collector process" is gone, I'm not sure
> > "statistics collector" feature also is gone. But actually the word
> > "collector" looks a bit odd in some context. I replaced "the results
> > of statistics collector" with "the activity statistics". (I'm not sure
> > "the activity statistics" is proper as a subsystem name.) The word
> > "collect" is replaced with "track".  I didn't change section IDs
> > corresponding to the renaming so that old links can work. I also fixed
> > the tranche name for LWTRANCHE_STATS from "activity stats" to
> > "activity_statistics"
>
> Without having gone through the changes, that sounds like the correct
> direction to me. There's no "collector" anymore, so removing that seems
> like the right thing.
Thanks.

> > > > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > > > index ca5c6376e5..1ffe073a1f 100644
> > > > --- a/src/backend/postmaster/pgstat.c
> > > > +++ b/src/backend/postmaster/pgstat.c
...
> > [024] (Fixed, Maybe)
> > Although I'm not sure I get you correctly, I rewrote it as follows.
..
> I was thinking of something like:
>  *  Collects activity statistics, e.g. per-table access statistics, of
>  *  all backends in shared memory. The activity numbers are first stored
>  *  locally in each process, then flushed to shared memory at commit
>  *  time or by idle-timeout.

Looks fine. Replaced it with the above.

> > [025] (Not Fixed)
> > Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
> > 10000) reasonable?
>
> Partially. I think for access to shared resources we want *increasing*
> wait times, rather than shorter retry timeout. The goal should be to be
> to make it more likely for all processes to be able to flush their
> stats, which can be achieved by flushing less often after hitting
> contention.

Ah! Indeed. The attached works as follows.

 * To avoid congestion on the shared memory, shared stats are updated no more
 * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
 * remain unflushed due to lock failure, we retry at intervals starting at
 * PGSTAT_RETRY_MIN_INTERVAL (250ms) and doubling at every retry. Finally we
 * force an update after PGSTAT_MAX_INTERVAL (10000ms) from the first attempt.

Concretely the interval changes as:

      elapsed     interval
 ------------+-------------
         0ms     (1000ms)
      1000ms       250ms
      1250ms       500ms
      1750ms      1000ms
      2750ms      2000ms
      4750ms      5250ms (not 4000ms)
     10000ms
While fixing that, I found and fixed several silly bugs:
  - pgstat_report_stat accessed dbent even if it is NULL.
  - pgstat_flush_tabstats set have_(sh|my)database_stats wrongly.

> > [029] (Fixed) (Related  to [046])
> > Mmm. It's following your old suggestion to avoid unsubstantial
> > diffs. I'm happy to change it. The functions that have "send" in their
> > names are for the same reason. I removed the prefix "m_" of the
> > members of the struct. (The comment above (with a typo) explains that).
>
> I don't object to having the rename be a separate patch...

Nope. I don't want to make it a separate patch.

> > > > + if (StatsShmem->refcount > 0)
> > > > + StatsShmem->refcount++;
> > >
> > > What prevents us from leaking the refcount here? We could e.g. error out
> > > while attaching, no? Which'd mean we'd leak the refcount.
> >
> > [033] (Fixed)
> > We don't attach shared stats in the postmaster process, so I want to know
> > which process attaches to shared stats first and which detaches last.
> > It's not leaks that I'm considering here.
> > (continued below)
> >
> > > To me it looks like there's a lot of added complexity just because you
> > > want to be able to reset stats via
...

> > > Without that you'd not need the complexity of attaching, detaching to
> > > the same degree - every backend could just cache lookup data during
> > > initialization, instead of having to constantly re-compute that.
> >
> > Mmm. I don't get that (or I failed to read clear meaning). The
> > function is assumed be called only from StartupXLOG().
> > (continued)
>
> Oh? I didn't get that you're only using it for that purpose - there's
> very little documentation about what it's trying to do.
Ugg..

> I don't see why that means we don't need to accurately track the
> refcount? Otherwise we'll forget to write out the stats.

Exactly, and I added comments for that.

|  * refcount is used to know whether a process going to detach shared stats is
|  * the last process or not. The last process writes out the stats files.
|  */
| typedef struct StatsShmemStruct

| if (--StatsShmem->refcount < 1)
| {
| /*
| * The process is the last one that is attaching the shared stats
| * memory. Write out the stats files if requested.

> > > Nor would the dynamic re-creation of the db dshash table be needed.
..
> I was referring to the fact that the last version of the patch
> attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
> attach_table_hash, attach_function_hash etc.

pin/unpin is gone. Now there is only one dshash and it is attached for
the lifetime of the process.

> > After some thought, I decided to rip all the "generation" stuff out,
> > and it gets far simpler. But counter reset may conflict with other
> > backends to a slightly higher degree because counter reset needs an
> > exclusive lock.
>
> That seems harmless to me - stats reset should never happen at a high
> enough frequency to make contention it causes problematic. There's also
> an argument to be made that it makes sense for the reset to be atomic.

Agreed.

> > > > + /* Flush out table stats */
> > > > + if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > > > + pending_stats = true;
> > > > +
> > > > + /* Flush out function stats */
> > > > + if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > > > + pending_stats = true;
> > >
> > > This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> > > on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> > > being confusing while reading the code, it also made the diff much
> > > harder to read.
> >
> > [035] (Maybe Fixed)
> > Is the question whether there is any case where
> > pgstat_flush_stat/functions leaves some counters unflushed?
>
> No, the point is that there's knowledge about
> pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
> namely the pgStatTabList, pgStatFunctions lists.
Mmm. Anyway, that code has largely changed in this version.

> > > Why do we still have this? A hashtable lookup is cheap, compared to
> > > fetching a file - so it's not to save time. Given how infrequent the
> > > pgstat_fetch_* calls are, it's not to avoid contention either.
> > >
> > > At first one could think it's for consistency - but no, that's not it
> > > either, because snapshot_statentry() refetches the snapshot without
> > > control from the outside:
> >
> > [038]
> > I don't get the second paragraph. When does the function re*create* a
> > snapshot without control from the outside? It keeps snapshots during a
> > transaction.  If not, it is broken.
> > (continued)
>
> Maybe I just misunderstood the code flow - partially due to the global
> clear_snapshot variable. I just had read the
> +     * We don't want so frequent update of stats snapshot. Keep it at least
> +     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
>
> comment, and took it to mean that you're unconditionally updating the
> snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
> have consistent snapshot across all fetches.
>
> (partially this might have been due to the diff:
>      /*
> -     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> -     * msec since we last sent one, or the caller wants to force stats out.
> +     * We don't want so frequent update of stats snapshot. Keep it at least
> +     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
>       */
Wow.. I tried "git config --global diff.algorithm patience" and it
seems to work well.

> But I think my question remains: Why do we need the whole snapshot thing
> now? Previously we needed to avoid reading a potentially large file -
> but that's not a concern anymore?
...
> > Currently the behavior is documented as follows and it seems reasonable.
> >
...
> I am very unconvinced this is worth the cost. Especially because plenty
> of other stats related parts of the system do *NOT* behave this way. How
> is a user supposed to understand that pg_stat_database behaves one way,
> pg_stat_activity, another, pg_stat_statements a third,
> pg_stat_progress_* ...
>
> Perhaps it's best to not touch the semantics here, but I'm also very
> wary of introducing significant complications and overhead just to have
> this "feature".

As a compromise, I removed the "clear_snapshot" stuff.  Snapshots still
work, but now clear_snapshot() clears them immediately. It works the
same way as pg_stat_activity.

> > > >       for (i = 0; i < tsa->tsa_used; i++)
> > > >       {
> > > >           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> > > >
> > <many TableStatsArray code>
> > > >               hash_entry->tsa_entry = entry;
> > > >               dest_elem++;
> > > >           }
> > >
> > > This seems like too much code. Why is this entirely different from the
> > > way funcstats works? The difference was already too big before, but this
> > > made it *way* worse.
> >
> > [040]
> > We don't flush stats until transaction ends. So the description about
> > TabStatusArray is stale?
>
> How is your comment related to my comment above?
Hmm, it looks like my reply was truncated. The TableStatsArray is removed;
all kinds of local stats (except global stats) are now stored directly in
pgStatLocalHashEntry. The code gets far simpler.

> > > >       generation = pin_hashes(dbent);
> > > >       tabhash = attach_table_hash(dbent, generation);
> > >
> > > This again is just cost incurred by insisting on destroying hashtables
> > > instead of keeping them around as long as necessary.
> >
> > [040]
> > Maybe you are insisting on the reverse? The pin_hash complexity is left
> > in this version. -> [033]
>
> What do you mean? What I'm saying is that we should never end up in a
> situation where there's no pgstat entry for the current database. And
> that that's trivial, as long as we don't drop the hashtable, but instead
> reset counters to 0.
In a previous version (not sent to the ML), attach/detach happened only
at process start/end. But in this version the table/function dshashes
are gone entirely.

> > > The fact that you're locking the per-database entry unconditionally once
> > > for each table almost guarantees contention - and you're not using the
> > > 'conditional lock' approach for that. I don't understand.
> >
> > [043] (Maybe fixed) (Related to [045].)
> > Vacuum, analyze, DROP DB and reset cannot be delayed. So the
> > conditional lock is mainly used by
> > pgstat_report_stat().
>
> You're saying "cannot be delayed" - but you're not explaining *why* that
> is.
>
> Even if true, I don't see why that necessitates doing the flushing and
> locking once for each of these functions?
Sorry, that was wrong.  We can just skip removal on lock failure
during pgstat_vacuum_stat(); it will be retried next time.  The other
database stats (deadlocks, checksum failures, tmpfiles and conflicts)
are now collected locally and then flushed.

> > dshash_find_or_insert didn't allow a shared lock. I changed
> > dshash_find_extended to allow a shared lock even if it is told to create
> > a missing entry. Although it takes an exclusive lock at the moment of
> > entry creation, in most cases it doesn't need the exclusive lock. This
> > allows using a shared lock while processing vacuum or analyze stats.
>
> Huh?

Well, anyway, the shared-insert mode of dshash_find_extended is no
longer needed, so I removed it in this version.

> Just to be crystal clear: I am exceedingly unlikely to commit this with
> any sort of short term attach/detach operations. Both because of the
> runtime overhead/contention it causes is significant, and because of the
> code complexity implied by it.

I think it is addressed in this version.

> Leaving attach/detach aside: I think it's a complete no-go to acquire
> database wide locks at this frequency, and then to hold them over other
> operations that are a) not cheap b) can block. The contention due to
> that would be *terrible* for scalability, even if it's just a shared
> lock.

> The way this *should* work is that:
> 1.1) At backend startup, attach to the database wide hashtable
> 1.2) At backend startup, attach to the various per-database hashtables
>   (including ones for shared tables)
> 2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
> 2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
> 2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
> 2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
> 3.1) When shutting down backend, detach from all hashtables
>
>
> That way we never need to hold onto the database-wide hashtables for
> long, and we can do it with conditional locks (trylock above), unless we
> need to force flushing.
I think the attached works in a similar way. Table/function stats are
processed together, then the database stats are processed.

> It might be worthwhile to merge per-table, per-function, per-database
> hashes into a single hash. Where the key is either something like
> {hashkind, objoid} (referenced from a per-database hashtable), or even
> {hashkind, dboid, objoid} (one global hashtable).
>
> I think the contents of the hashtable should likely just be a single
> dsa_pointer (plus some bookkeeping). Several reasons for that:
>
> 1) Since one goal of this is to make the stats system more extensible,
>   it seems important that we can make the set of stats kept
>   runtime configurable. Otherwise everyone will continue to have to pay
>   the price for every potential stat that we have an option to track.
>
> 2) Having hashtable resizes move fairly large stat entries around is
>    expensive. Whereas just moving key + dsa_pointer around is pretty
>    cheap. I don't think the cost of a pointer dereference matters in
>    *this* case.
>
> 3) If the stats contents aren't moved around, there's no need to worry
>    about hashtable resizes. Therefore the stats can be referenced
>    without holding dshash partition locks.
>
> 4) If the stats entries aren't moved around by hashtable resizes, we can
>    use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
>    not generally correct/safe to have dshash resize to move those
>    around.
>
>
> All of that would be addressed if we instead allocate the stats data
> separately from the dshash entry.
OK, I'm convinced by that (and I like it). The attached v27 is largely
changed from the previous version following the suggestion.

1) DB, table and function stats are stored in one hash keyed by (type,
   dbid, objectid) and handled in a unified way. Now pgstat_report_stat
   flushes stats as follows.

    while (hash_seq_search on local stats hash)
    {
        switch (ent->stats->type)
        {
            case PGSTAT_TYPE_DB:       ...
            case PGSTAT_TYPE_TABLE:    ...
            case PGSTAT_TYPE_FUNCTION: ...
        }
    }
2, 3) There's only one dshash table, pgStatSharedHash.  Its entry is
   defined as follows.

   +typedef struct PgStatHashEntry
   +{
   +    PgStatHashEntryKey key; /* hash key */
   +    dsa_pointer stats; /* pointer to shared stats entry in DSA */
   +} PgStatHashEntry;

   The key is (type, databaseid, objectid).

   To handle entries of different types in a common way, the hash entry
   points to the following struct stored in DSA memory.

   +typedef struct PgStatEntry
   +{
   +    PgStatTypes type; /* statistics entry type */
   +    size_t len; /* length of body, fixed per type. */
   +    LWLock lock; /* lightweight lock to protect body */
   +    char body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */
   +} PgStatEntry;

   The body stores the existing PgStat_Stat*Entry structs.

   To match the shared stats, locally-stored stats entries are changed
   in a similar way.

4) As shown above, I'm using LWLock in this version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From c6baa406e0efb15504049cfbd33602f1c1d65b42 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v27 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that they do not run atexit callbacks, for safety. The archiver
process should behave the same way for the same reason. The exit status
changes from 1 to 2, but that doesn't make any behavioral difference.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
  pqsignal(SIGHUP, SignalHandlerForConfigReload);
  pqsignal(SIGINT, SIG_IGN);
  pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
- pqsignal(SIGQUIT, pgarch_exit);
+ pqsignal(SIGQUIT, SignalHandlerForCrashExit);
  pqsignal(SIGALRM, SIG_IGN);
  pqsignal(SIGPIPE, SIG_IGN);
  pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
  exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
- /* SIGQUIT means curl up and die ... */
- exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
--
2.18.2


From 1592ccaf8fd790a18768dc301338efef99a99af5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v27 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both dynahash and the simple dshash search functions. One of the most
significant differences is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when the scan ends. Another is
locking: dshash holds a partition lock when returning an entry, and
dshash_seq_next() also holds the lock when returning an entry, but
callers shouldn't release it, since the lock is essential to continue
the scan. The seqscan interface allows entry deletion during a scan;
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 150 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..fb7e23c4cb 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2) \
  (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2) \
+ (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2) \
  (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2) \
  ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
+ ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash) \
  (hash_table->buckets[ \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
  ensure_valid_bucket_pointers(hash_table);
 
  /* Free all the entries. */
- size = ((size_t) 1) << hash_table->size_log2;
+ size = NUM_BUCKETS(hash_table->size_log2);
  for (i = 0; i < size; ++i)
  {
  dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,146 @@ dshash_memhash(const void *v, size_t size, void *arg)
  return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through the dshash table and return all
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should always be called when a scan is finished.
+ * The caller may delete returned elements in mid-scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+ bool exclusive)
+{
+ status->hash_table = hash_table;
+ status->curbucket = 0;
+ status->nbuckets = 0;
+ status->curitem = NULL;
+ status->pnextitem = InvalidDsaPointer;
+ status->curpartition = -1;
+ status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * The returned element is locked and the caller must not explicitly
+ * release the lock; it is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+ dsa_pointer next_item_pointer;
+
+ if (status->curitem == NULL)
+ {
+ int partition;
+
+ Assert(status->curbucket == 0);
+ Assert(!status->hash_table->find_locked);
+
+ /* first shot. grab the first item. */
+ partition =
+ PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+   status->hash_table->size_log2);
+ LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+  status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+ status->curpartition = partition;
+
+ /* a resize cannot happen from now until the seq scan ends */
+ status->nbuckets =
+ NUM_BUCKETS(status->hash_table->control->size_log2);
+ ensure_valid_bucket_pointers(status->hash_table);
+
+ next_item_pointer = status->hash_table->buckets[status->curbucket];
+ }
+ else
+ next_item_pointer = status->pnextitem;
+
+ /* Move to the next bucket if we finished the current bucket */
+ while (!DsaPointerIsValid(next_item_pointer))
+ {
+ int next_partition;
+
+ if (++status->curbucket >= status->nbuckets)
+ {
+ /* all buckets have been scanned. finish. */
+ return NULL;
+ }
+
+ /* Also move the partition lock if needed */
+ next_partition =
+ PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+   status->hash_table->size_log2);
+
+ /* Move lock along with partition for the bucket */
+ if (status->curpartition != next_partition)
+ {
+ /*
+ * Lock the next partition, then release the current one; not
+ * the other way around, to avoid concurrent resizing.
+ * Partitions are locked in the same order as in resize(), so
+ * deadlocks won't happen.
+ */
+ LWLockAcquire(PARTITION_LOCK(status->hash_table,
+ next_partition),
+  status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+ LWLockRelease(PARTITION_LOCK(status->hash_table,
+ status->curpartition));
+ status->curpartition = next_partition;
+ }
+
+ next_item_pointer = status->hash_table->buckets[status->curbucket];
+ }
+
+ status->curitem =
+ dsa_get_address(status->hash_table->area, next_item_pointer);
+ status->hash_table->find_locked = true;
+ status->hash_table->find_exclusively_locked = status->exclusive;
+
+ /*
+ * The caller may delete the item. Store the next item in case of deletion.
+ */
+ status->pnextitem = status->curitem->next;
+
+ return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+ status->hash_table->find_locked = false;
+ status->hash_table->find_exclusively_locked = false;
+
+ if (status->curpartition >= 0)
+ LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+ dshash_table   *hash_table = status->hash_table;
+ dshash_table_item  *item = status->curitem;
+ size_t partition = PARTITION_FOR_HASH(item->hash);
+
+ Assert(status->exclusive);
+ Assert(hash_table->control->magic == DSHASH_MAGIC);
+ Assert(hash_table->find_locked);
+ Assert(hash_table->find_exclusively_locked);
+ Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+ LW_EXCLUSIVE));
+
+ delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..81a929b8d9 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The members are exposed to let users know the
+ * storage size, but callers should treat it as an opaque type.
+ */
+typedef struct dshash_seq_status
+{
+ dshash_table   *hash_table;
+ int curbucket;
+ int nbuckets;
+ dshash_table_item  *curitem;
+ dsa_pointer pnextitem;
+ int curpartition;
+ bool exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+ bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
--
2.18.2


From 9d7d47040b64e32ef58ba4acac381bd5362f9615 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v27 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit adds
alternatives to dshash_find and dshash_find_or_insert that return
immediately on lock failure.
---
 src/backend/lib/dshash.c | 99 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 56 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index fb7e23c4cb..b4dc8e1ece 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
- dshash_hash hash;
- size_t partition;
- dshash_table_item *item;
-
- hash = hash_key(hash_table, key);
- partition = PARTITION_FOR_HASH(hash);
-
- Assert(hash_table->control->magic == DSHASH_MAGIC);
- Assert(!hash_table->find_locked);
-
- LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-  exclusive ? LW_EXCLUSIVE : LW_SHARED);
- ensure_valid_bucket_pointers(hash_table);
-
- /* Search the active bucket. */
- item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
- if (!item)
- {
- /* Not found. */
- LWLockRelease(PARTITION_LOCK(hash_table, partition));
- return NULL;
- }
- else
- {
- /* The caller will free the lock by calling dshash_release_lock. */
- hash_table->find_locked = true;
- hash_table->find_exclusively_locked = exclusive;
- return ENTRY_FROM_ITEM(item);
- }
+ return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,61 @@ dshash_find_or_insert(dshash_table *hash_table,
   const void *key,
   bool *found)
 {
- dshash_hash hash;
- size_t partition_index;
- dshash_partition *partition;
+ return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "insert" indicates insert mode. In this mode a new entry is inserted if
+ * the key is not found, and *found is set to false; *found is set to true
+ * if the key is found. "found" must be non-NULL in this mode.
+ *
+ * If nowait is true, the function returns NULL immediately if the required
+ * lock cannot be acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+ bool exclusive, bool nowait, bool insert, bool *found)
+{
+ dshash_hash hash = hash_key(hash_table, key);
+ size_t partidx = PARTITION_FOR_HASH(hash);
+ dshash_partition *partition = &hash_table->control->partitions[partidx];
+ LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
  dshash_table_item *item;
 
- hash = hash_key(hash_table, key);
- partition_index = PARTITION_FOR_HASH(hash);
- partition = &hash_table->control->partitions[partition_index];
-
- Assert(hash_table->control->magic == DSHASH_MAGIC);
- Assert(!hash_table->find_locked);
+ /* must be exclusive when insert allowed */
+ Assert(!insert || (exclusive && found != NULL));
 
 restart:
- LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-  LW_EXCLUSIVE);
+ if (!nowait)
+ LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+ else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+   lockmode))
+ return NULL;
+
  ensure_valid_bucket_pointers(hash_table);
 
  /* Search the active bucket. */
  item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
  if (item)
- *found = true;
+ {
+ if (found)
+ *found = true;
+ }
  else
  {
- *found = false;
+ if (found)
+ *found = false;
+
+ if (!insert)
+ {
+ /* The caller didn't ask us to add a new entry. */
+ LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+ return NULL;
+ }
 
  /* Check if we are getting too full. */
  if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +484,8 @@ restart:
  * Give up our existing lock first, because resizing needs to
  * reacquire all the locks in the right order to avoid deadlocks.
  */
- LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+ LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
  resize(hash_table, hash_table->size_log2 + 1);
 
  goto restart;
@@ -493,12 +499,13 @@ restart:
  ++partition->count;
  }
 
- /* The caller must release the lock with dshash_release_lock. */
+ /* The caller will free the lock by calling dshash_release_lock. */
  hash_table->find_locked = true;
- hash_table->find_exclusively_locked = true;
+ hash_table->find_exclusively_locked = exclusive;
  return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 81a929b8d9..80a896a99b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
  const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+  bool exclusive, bool nowait, bool insert,
+  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
--
2.18.2


From 6d52d83a00e9b0e8849f5adfe3871099d602485e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v27 4/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process, since it will use
shared memory once the stats data has been moved there. Make it an
auxiliary process so that this works.
---
 src/backend/access/transam/xlog.c        |  49 +++++++++++
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++---
 src/backend/postmaster/pgarch.c          | 102 ++++-------------------
 src/backend/postmaster/postmaster.c      |  53 ++++++------
 src/include/access/xlog.h                |   2 +
 src/include/access/xlog_internal.h       |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 10 files changed, 111 insertions(+), 127 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7621fc05e2..4da7ed3657 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -680,6 +680,13 @@ typedef struct XLogCtlData
  */
  Latch recoveryWakeupLatch;
 
+ /*
+ * archiverWakeupLatch is used to wake up the archiver process to process
+ * completed WAL segments, if it is waiting for WAL to arrive.
+ * Protected by info_lck.
+ */
+ Latch   *archiverWakeupLatch;
+
  /*
  * During recovery, we keep a copy of the latest checkpoint record here.
  * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8381,6 +8388,48 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
  return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+ Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ old_latch = XLogCtl->archiverWakeupLatch;
+ XLogCtl->archiverWakeupLatch = MyLatch;
+ SpinLockRelease(&XLogCtl->info_lck);
+ Assert(old_latch == NULL);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->archiverWakeupLatch = NULL;
+ SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up the archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+ Latch *latch;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ latch = XLogCtl->archiverWakeupLatch;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ if (latch)
+ SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 914ad340ea..47c2b4a373 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -489,7 +489,7 @@ XLogArchiveNotify(const char *xlog)
 
  /* Notify archiver that it's got something to do */
  if (IsUnderPostmaster)
- SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+ XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
  case StartupProcess:
  MyBackendType = B_STARTUP;
  break;
+ case ArchiverProcess:
+ MyBackendType = B_ARCHIVER;
+ break;
  case BgWriterProcess:
  MyBackendType = B_BG_WRITER;
  break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
  proc_exit(1); /* should never return */
 
  case StartupProcess:
- /* don't set signals, startup process has its own agenda */
  StartupProcessMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
+
+ case ArchiverProcess:
+ PgArchiverMain();
+ proc_exit(1);
 
  case BgWriterProcess:
- /* don't set signals, bgwriter has its own agenda */
  BackgroundWriterMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  case CheckpointerProcess:
- /* don't set signals, checkpointer has its own agenda */
  CheckpointerMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  case WalWriterProcess:
- /* don't set signals, walwriter has its own agenda */
  InitXLOGAccess();
  WalWriterMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  case WalReceiverProcess:
- /* don't set signals, walreceiver has its own agenda */
  WalReceiverMain();
- proc_exit(1); /* should never return */
+ proc_exit(1);
 
  default:
  elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..6fe7a136ba 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,7 +79,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,8 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- * Called from postmaster at startup or after an existing archiver
- * died.  Attempt to fire up a fresh archiver process.
- *
- * Returns PID of child process, or 0 if fail.
- *
- * Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
- time_t curtime;
- pid_t pgArchPid;
-
- /*
- * Do nothing if no archiver needed
- */
- if (!XLogArchivingActive())
- return 0;
-
- /*
- * Do nothing if too soon since last archiver start.  This is a safety
- * valve to protect against continuous respawn attempts if the archiver is
- * dying immediately at launch. Note that since we will be re-called from
- * the postmaster main loop, we will get another chance later.
- */
- curtime = time(NULL);
- if ((unsigned int) (curtime - last_pgarch_start_time) <
- (unsigned int) PGARCH_RESTART_INTERVAL)
- return 0;
- last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
- switch ((pgArchPid = pgarch_forkexec()))
-#else
- switch ((pgArchPid = fork_process()))
-#endif
- {
- case -1:
- ereport(LOG,
- (errmsg("could not fork archiver: %m")));
- return 0;
-
-#ifndef EXEC_BACKEND
- case 0:
- /* in postmaster child ... */
- InitPostmasterChild();
-
- /* Close the postmaster's sockets */
- ClosePostmasterPorts(false);
-
- /* Drop our connection to postmaster's shared memory, as well */
- dsm_detach_all();
- PGSharedMemoryDetach();
-
- PgArchiverMain(0, NULL);
- break;
-#endif
-
- default:
- return (int) pgArchPid;
- }
-
- /* shouldn't get here */
- return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +141,21 @@ pgarch_forkexec(void)
 #endif /* EXEC_BACKEND */
 
 
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+ XLogArchiveWakeupEnd();
+}
+
 /*
  * PgArchiverMain
  *
  * The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  * since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
  /*
  * Ignore all signals usually bound to some action in the postmaster,
@@ -231,7 +167,7 @@ PgArchiverMain(int argc, char *argv[])
  pqsignal(SIGQUIT, SignalHandlerForCrashExit);
  pqsignal(SIGALRM, SIG_IGN);
  pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, pgarch_waken);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
  pqsignal(SIGUSR2, pgarch_waken_stop);
  /* Reset some signals that are accepted by postmaster but not here */
  pqsignal(SIGCHLD, SIG_DFL);
@@ -240,24 +176,14 @@ PgArchiverMain(int argc, char *argv[])
  MyBackendType = B_ARCHIVER;
  init_ps_display(NULL);
 
+ XLogArchiveWakeupStart();
+ on_shmem_exit(PgArchiverKill, 0);
+
  pgarch_MainLoop();
 
  exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
- int save_errno = errno;
-
- /* set flag that there is work to be done */
- wakened = true;
- SetLatch(MyLatch);
-
- errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2b9ab32293..fab4a9dd51 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC 0x0002 /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND 0x0004 /* walsender process */
 #define BACKEND_TYPE_BGWORKER 0x0008 /* bgworker process */
-#define BACKEND_TYPE_ALL 0x000F /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER 0x0010 /* archiver process */
+#define BACKEND_TYPE_ALL 0x001F /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif /* EXEC_BACKEND */
 
 #define StartupDataBase() StartChildProcess(StartupProcess)
+#define StartArchiver() StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer() StartChildProcess(CheckpointerProcess)
 #define StartWalWriter() StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
  /* If we have lost the archiver, try to start a new one. */
  if (PgArchPID == 0 && PgArchStartupAllowed())
- PgArchPID = pgarch_start();
+ PgArchPID = StartArchiver();
 
  /* If we need to signal the autovacuum launcher, do so now */
  if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
  if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
  AutoVacPID = StartAutoVacLauncher();
  if (PgArchStartupAllowed() && PgArchPID == 0)
- PgArchPID = pgarch_start();
+ PgArchPID = StartArchiver();
  if (PgStatPID == 0)
  PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
  }
 
  /*
- * Was it the archiver?  If so, just try to start a new one; no need
- * to force reset of the rest of the system.  (If fail, we'll try
- * again in future cycles of the main loop.).  Unless we were waiting
- * for it to shut down; don't restart it in that case, and
- * PostmasterStateMachine() will advance to the next shutdown step.
+ * Was it the archiver?  Normal exit can be ignored; we'll start a new
+ * one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
  */
  if (pid == PgArchPID)
  {
  PgArchPID = 0;
  if (!EXIT_STATUS_0(exitstatus))
- LogChildExit(LOG, _("archiver process"),
- pid, exitstatus);
- if (PgArchStartupAllowed())
- PgArchPID = pgarch_start();
+ HandleChildCrash(pid, exitstatus,
+ _("archiver process"));
  continue;
  }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
  signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
  }
 
+ /* Take care of the archiver too */
+ if (pid == PgArchPID)
+ PgArchPID = 0;
+ else if (PgArchPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) PgArchPID)));
+ signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
  /*
  * Force a power-cycle of the pgarch process too.  (This isn't absolutely
  * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
  Assert(CheckpointerPID == 0);
  Assert(WalWriterPID == 0);
  Assert(AutoVacPID == 0);
+ Assert(PgArchPID == 0);
  /* syslogger is not considered here */
  pmState = PM_NO_CHILDREN;
  }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
  */
  Assert(PgArchPID == 0);
  if (XLogArchivingAlways())
- PgArchPID = pgarch_start();
+ PgArchPID = StartArchiver();
 
  /*
  * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
  if (StartWorkerNeeded || HaveCrashedWorker)
  maybe_start_bgworkers();
 
- if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
- PgArchPID != 0)
- {
- /*
- * Send SIGUSR1 to archiver process, to wake it up and begin archiving
- * next WAL file.
- */
- signal_child(PgArchPID, SIGUSR1);
- }
-
  /* Tell syslogger to rotate logfile if requested */
  if (SysLoggerPID != 0)
  {
@@ -5493,6 +5494,10 @@ StartChildProcess(AuxProcType type)
  ereport(LOG,
  (errmsg("could not fork startup process: %m")));
  break;
+ case ArchiverProcess:
+ ereport(LOG,
+ (errmsg("could not fork archiver process: %m")));
+ break;
  case BgWriterProcess:
  ereport(LOG,
  (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 331497bcfb..f38eaee092 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -311,6 +311,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..a272d62b1f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -331,6 +331,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
  BootstrapProcess,
  StartupProcess,
  BgWriterProcess,
+ ArchiverProcess,
  CheckpointerProcess,
  WalWriterProcess,
  WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess() (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess() (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess() (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
  PMSIGNAL_RECOVERY_STARTED, /* recovery has started */
  PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
- PMSIGNAL_WAKEN_ARCHIVER, /* send a NOTIFY signal to xlog archiver */
  PMSIGNAL_ROTATE_LOGFILE, /* send SIGUSR1 to syslogger to rotate logfile */
  PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
  PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
--
2.18.2


From 0d409e7384f01a6a69374e90731efcb357905b4f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v27 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via a socket and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are written as often as every 500ms, and that
large data is serialized by the stats collector then de-serialized by
every backend repeatedly. To avoid that cost, this patch places the
activity statistics data in shared memory. Each backend accumulates
statistics locally, then tries to move them into the shared statistics
at every transaction end, but at intervals not shorter than 500ms.
Locks on the shared statistics are acquired per object (such as a table
or function), so the expected chance of collision is not high.
Furthermore, until 1 second has elapsed since the last flush to shared
stats, a lock failure postpones the flush so that lock contention
doesn't slow down transactions. After that, the flush waits for the
locks so that the shared statistics don't get stale.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/catalog/index.c                  |   24 +-
 src/backend/postmaster/autovacuum.c          |   54 +-
 src/backend/postmaster/bgwriter.c            |    2 +-
 src/backend/postmaster/checkpointer.c        |   12 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4843 +++++++-----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/storage/lmgr/lwlocknames.txt     |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/adt/pgstatfuncs.c          |   53 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  514 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 21 files changed, 2055 insertions(+), 3598 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4da7ed3657..cee0572367 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8528,9 +8528,9 @@ LogCheckpointEnd(bool restartpoint)
  &sync_secs, &sync_usecs);
 
  /* Accumulate checkpoint timing summary data, in milliseconds. */
- BgWriterStats.m_checkpoint_write_time +=
+ BgWriterStats.checkpoint_write_time +=
  write_secs * 1000 + write_usecs / 1000;
- BgWriterStats.m_checkpoint_sync_time +=
+ BgWriterStats.checkpoint_sync_time +=
  sync_secs * 1000 + sync_usecs / 1000;
 
  /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2d81bc3cbc..4de574ae00 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1687,28 +1687,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
  /*
  * Copy over statistics from old to new index
+ * The data will be sent by the next pgstat_report_stat()
+ * call.
  */
- {
- PgStat_StatTabEntry *tabentry;
-
- tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
- if (tabentry)
- {
- if (newClassRel->pgstat_info)
- {
- newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
- newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
- newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
- newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
- newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
- /*
- * The data will be sent by the next pgstat_report_stat()
- * call.
- */
- }
- }
- }
+ pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
  /* Close relations */
  table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index da75e755f0..c00b04a624 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -336,9 +336,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
   BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
  TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-  PgStat_StatDBEntry *shared,
-  PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1936,8 +1933,6 @@ do_autovacuum(void)
  HASHCTL ctl;
  HTAB   *table_toast_map;
  ListCell   *volatile cell;
- PgStat_StatDBEntry *shared;
- PgStat_StatDBEntry *dbentry;
  BufferAccessStrategy bstrategy;
  ScanKeyData key;
  TupleDesc pg_class_desc;
@@ -1956,12 +1951,6 @@ do_autovacuum(void)
   ALLOCSET_DEFAULT_SIZES);
  MemoryContextSwitchTo(AutovacMemCxt);
 
- /*
- * may be NULL if we couldn't find an entry (only happens if we are
- * forcing a vacuum for anti-wrap purposes).
- */
- dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
  /* Start a transaction so our commands have one to play into. */
  StartTransactionCommand();
 
@@ -2009,9 +1998,6 @@ do_autovacuum(void)
  /* StartTransactionCommand changed elsewhere */
  MemoryContextSwitchTo(AutovacMemCxt);
 
- /* The database hash where pgstat keeps shared relations */
- shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
  classRel = table_open(RelationRelationId, AccessShareLock);
 
  /* create a copy so we can use it after closing pg_class */
@@ -2090,8 +2076,8 @@ do_autovacuum(void)
 
  /* Fetch reloptions and the pgstat entry for this table */
  relopts = extract_autovac_opts(tuple, pg_class_desc);
- tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
- shared, dbentry);
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+   relid);
 
  /* Check if it needs vacuum or analyze */
  relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2174,8 +2160,8 @@ do_autovacuum(void)
  }
 
  /* Fetch the pgstat entry for this table */
- tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
- shared, dbentry);
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+   relid);
 
  relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
   effective_multixact_freeze_max_age,
@@ -2734,29 +2720,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
  return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-  PgStat_StatDBEntry *dbentry)
-{
- PgStat_StatTabEntry *tabentry = NULL;
-
- if (isshared)
- {
- if (PointerIsValid(shared))
- tabentry = hash_search(shared->tables, &relid,
-   HASH_FIND, NULL);
- }
- else if (PointerIsValid(dbentry))
- tabentry = hash_search(dbentry->tables, &relid,
-   HASH_FIND, NULL);
-
- return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2777,17 +2740,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  bool doanalyze;
  autovac_table *tab = NULL;
  PgStat_StatTabEntry *tabentry;
- PgStat_StatDBEntry *shared;
- PgStat_StatDBEntry *dbentry;
  bool wraparound;
  AutoVacOpts *avopts;
 
  /* use fresh stats */
  autovac_refresh_stats();
 
- shared = pgstat_fetch_stat_dbentry(InvalidOid);
- dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
  /* fetch the relation's relcache entry */
  classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
  if (!HeapTupleIsValid(classTup))
@@ -2811,8 +2769,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  }
 
  /* fetch the pgstat table entry */
- tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
- shared, dbentry);
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+   relid);
 
  relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
   effective_multixact_freeze_max_age,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..94bdd664b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -236,7 +236,7 @@ BackgroundWriterMain(void)
  /*
  * Send off activity statistics to the stats collector
  */
- pgstat_send_bgwriter();
+ pgstat_report_bgwriter();
 
  if (FirstCallSinceLastCheckpoint())
  {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..8a2fd0ddb2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
  if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
  {
  do_checkpoint = true;
- BgWriterStats.m_requested_checkpoints++;
+ BgWriterStats.requested_checkpoints++;
  }
 
  /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
  if (elapsed_secs >= CheckPointTimeout)
  {
  if (!do_checkpoint)
- BgWriterStats.m_timed_checkpoints++;
+ BgWriterStats.timed_checkpoints++;
  do_checkpoint = true;
  flags |= CHECKPOINT_CAUSE_TIME;
  }
@@ -492,7 +492,7 @@ CheckpointerMain(void)
  * worth the trouble to split the stats support into two independent
  * stats message types.)
  */
- pgstat_send_bgwriter();
+ pgstat_report_bgwriter();
 
  /*
  * Sleep until we are signaled or it's time for another checkpoint or
@@ -693,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
  /*
  * Report interim activity statistics to the stats collector.
  */
- pgstat_send_bgwriter();
+ pgstat_report_bgwriter();
 
  /*
  * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
  LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
  /* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+ BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+ BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
  CheckpointerShmem->num_backend_writes = 0;
  CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 6fe7a136ba..f0b524ca50 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -401,7 +401,7 @@ pgarch_ArchiverCopyLoop(void)
  * Tell the collector about the WAL file that we successfully
  * archived
  */
- pgstat_send_archiver(xlog, false);
+ pgstat_report_archiver(xlog, false);
 
  break; /* out of inner retry loop */
  }
@@ -411,7 +411,7 @@ pgarch_ArchiverCopyLoop(void)
  * Tell the collector about the WAL file that we failed to
  * archive
  */
- pgstat_send_archiver(xlog, true);
+ pgstat_report_archiver(xlog, true);
 
  if (++failures >= NUM_ARCHIVE_RETRIES)
  {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4763c24be9..c0760854f4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- * All the statistics collector stuff hacked up in one big, ugly file.
+ * Activity Statistics facility.
  *
- * TODO: - Separate collector, postmaster and backend stuff
- *  into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- * - Add some automatic call for pgstat vacuuming.
+ * To avoid contention on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at intervals starting at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and doubling at every retry, and force
+ * the update once PGSTAT_MAX_INTERVAL (10000ms) has passed since the first try.
  *
- * - Add a pgstat config column to pg_database, so this
- *  entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file if any; the last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  * Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL 500 /* Minimum time between stats file
- * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL 1000 /* Minimum interval of stats data
+ * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY 10 /* How long to wait between checks for a
- * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME 10000 /* Maximum time to wait for a stats
- * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL 640 /* How often to ping the collector for a
- * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60 /* How often to attempt to restart a
- * failed statistics collector; in
- * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL 250 /* Initial retry interval after
+ * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL 10000 /* Longest interval of stats data
+ * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE 16
-#define PGSTAT_TAB_HASH_SIZE 512
+#define PGSTAT_TABLE_HASH_SIZE 512
 #define PGSTAT_FUNCTION_HASH_SIZE 512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,75 +100,162 @@ int pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char   *pgstat_stat_directory = NULL;
+
+/* No longer used; kept only until the corresponding GUCs are removed */
 char   *pgstat_stat_filename = NULL;
 char   *pgstat_stat_tmpname = NULL;
 
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
+ * Shared stats bootstrap information, protected by StatsLock.
+ *
+ * refcount is used to tell whether a process detaching from the shared stats
+ * is the last one; the last process writes out the stats files.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+typedef struct StatsShmemStruct
+{
+ dsa_handle stats_dsa_handle; /* handle for stats data area */
+ dshash_table_handle hash_handle; /* shared dbstat hash */
+ dsa_pointer global_stats; /* DSA pointer to global stats */
+ dsa_pointer archiver_stats; /* Ditto for archiver stats */
+ int refcount; /* # of processes attached to the shared
+ * stats memory */
+} StatsShmemStruct;
 
-static time_t last_pgstat_start_time;
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storages */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * The statistics entry for each object is stored in its own DSA-allocated
+ * memory, pointed to from the dshash pgStatSharedHash via a dsa_pointer.
+ * This way, object-stats entries are not moved by dshash resizing, and the
+ * dshash partition lock can be released sooner during stats updates. It also
+ * reduces interference among write-locks on individual stat entries, since
+ * they do not rely on the dshash partition locks. PgStatLocalHashEntry is
+ * the local counterpart of PgStatHashEntry.
+ *
+ * Each stat entry is wrapped in a PgStatEnvelope, which stores attributes
+ * common to all kinds of statistics and an LWLock protecting the body.
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dshash pgStatLocalHash
+ *    -> PgStatLocalHashEntry (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM 100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
- struct TabStatusArray *tsa_next; /* link to next array, if any */
- int tsa_used; /* # entries currently used */
- PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM]; /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+ PGSTAT_TYPE_ALL, /* Not a real type; wildcard parameter
+ * for pgstat_collect_stat_entries */
+ PGSTAT_TYPE_DB, /* database-wide statistics */
+ PGSTAT_TYPE_TABLE, /* per-table statistics */
+ PGSTAT_TYPE_FUNCTION, /* per-function statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * Entry size lookup table for shared statistics entries, indexed by
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
- Oid t_id;
- PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+ 0, /* PGSTAT_TYPE_ALL: not an entry */
+ sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+ sizeof(PgStat_StatTabEntry), /* PGSTAT_TYPE_TABLE */
+ sizeof(PgStat_StatFuncEntry) /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+ 0, /* PGSTAT_TYPE_ALL: not an entry */
+ sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+ sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+ sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+ PgStatTypes type; /* statistics entry type */
+ Oid databaseid; /* database ID. InvalidOid for shared objects. */
+ Oid objectid; /* object OID */
+} PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held
+ * in pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+ PgStatHashEntryKey key; /* hash key */
+ dsa_pointer env; /* pointer to shared stats envelope in DSA */
+} PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+ PgStatTypes type; /* statistics entry type */
+ Oid databaseid; /* databaseid */
+ Oid objectid; /* objectid */
+ size_t len; /* length of body, fixed per type. */
+ LWLock lock; /* lightweight lock to protect body */
+ int body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */
+} PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+ (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+ PgStatHashEntryKey key; /* hash key */
+ PgStatEnvelope *env; /* pointer to stats envelope in heap */
+} PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry copied locally to provide stable values
+ * throughout a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+ PgStatHashEntryKey key;
+ bool negative;
+ int body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */
+} PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen) \
+ (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+ sizeof(PgStatHashEntryKey),
+ sizeof(PgStatHashEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS() \
+ (pgStatLocalHash != NULL || \
+ pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -236,11 +292,10 @@ typedef struct TwoPhasePgStatRecord
  bool t_truncated; /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +304,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per-database nor per-table.
+ * The shared_* variables point into shared memory and the snapshot_* ones
+ * are backend-local snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -270,523 +323,269 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes get access to them. get_stat_entry() invokes this for the purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
- Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+   bool nowait,
+   entry_initializer initfunc, bool *found);
+static PgStatEnvelope **collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+
+static void init_dbentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_tabentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+  bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+ bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- * Called from postmaster at startup. Create the resources required
- * by the statistics collector process.  If unable to do so, do not
- * fail --- better to let the postmaster start with stats collection
- * disabled.
- * ----------
+/*
+ * StatsShmemSize
+ * Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+ return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
- ACCEPT_TYPE_ARG3 alen;
- struct addrinfo *addrs = NULL,
-   *addr,
- hints;
- int ret;
- fd_set rset;
- struct timeval tv;
- char test_byte;
- int sel_res;
- int tries = 0;
+ bool found;
 
-#define TESTBYTEVAL ((char) 199)
+ StatsShmem = (StatsShmemStruct *)
+ ShmemInitStruct("Stats area", StatsShmemSize(),
+ &found);
 
- /*
- * This static assertion verifies that we didn't mess up the calculations
- * involved in selecting maximum payload sizes for our UDP messages.
- * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
- * be silent performance loss from fragmentation, it seems worth having a
- * compile-time cross-check that we didn't.
- */
- StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
- "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
- /*
- * Create the UDP socket for sending and receiving statistic messages
- */
- hints.ai_flags = AI_PASSIVE;
- hints.ai_family = AF_UNSPEC;
- hints.ai_socktype = SOCK_DGRAM;
- hints.ai_protocol = 0;
- hints.ai_addrlen = 0;
- hints.ai_addr = NULL;
- hints.ai_canonname = NULL;
- hints.ai_next = NULL;
- ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
- if (ret || !addrs)
+ if (!IsUnderPostmaster)
  {
- ereport(LOG,
- (errmsg("could not resolve \"localhost\": %s",
- gai_strerror(ret))));
- goto startup_failed;
+ Assert(!found);
+
+ StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
  }
+}
 
- /*
- * On some platforms, pg_getaddrinfo_all() may return multiple addresses
- * only one of which will actually work (eg, both IPv6 and IPv4 addresses
- * when kernel will reject IPv6).  Worse, the failure may occur at the
- * bind() or perhaps even connect() stage.  So we must loop through the
- * results till we find a working combination. We will generate LOG
- * messages, but no error, for bogus combinations.
- */
- for (addr = addrs; addr; addr = addr->ai_next)
- {
-#ifdef HAVE_UNIX_SOCKETS
- /* Ignore AF_UNIX sockets, if any are returned. */
- if (addr->ai_family == AF_UNIX)
- continue;
-#endif
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ * Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+ if (!pgStatLocalContext)
+ pgStatLocalContext =
+ AllocSetContextCreate(TopMemoryContext,
+  "Backend statistics snapshot",
+  ALLOCSET_SMALL_SIZES);
+
+ if (!pgStatSnapshotContext)
+ pgStatSnapshotContext =
+ AllocSetContextCreate(TopMemoryContext,
+  "Database statistics snapshot",
+  ALLOCSET_SMALL_SIZES);
+}
 
- if (++tries > 1)
- ereport(LOG,
- (errmsg("trying another address for the statistics collector")));
 
- /*
- * Create the socket.
- */
- if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not create socket for statistics collector: %m")));
- continue;
- }
+/* ----------
+ * attach_shared_stats() -
+ *
+ * Attach to the shared stats memory, creating it if it does not yet exist.
+ * If we are the first process to use the activity stats system, read the
+ * saved statistics files, if any.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+ MemoryContext oldcontext;
 
- /*
- * Bind it to a kernel assigned port on localhost and get the assigned
- * port via getsockname().
- */
- if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not bind socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ /*
+ * Don't use DSM when not under postmaster, or when not tracking counts.
+ */
+ if (!pgstat_track_counts || !IsUnderPostmaster)
+ return;
 
- alen = sizeof(pgStatAddr);
- if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not get address of socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ pgstat_setup_memcxt();
 
- /*
- * Connect the socket to its own address.  This saves a few cycles by
- * not having to respecify the target address on every send. This also
- * provides a kernel-level check that only packets from this same
- * address will be received.
- */
- if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not connect socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ if (area)
+ return;
 
- /*
- * Try to send and receive a one-byte test message on the socket. This
- * is to catch situations where the socket can be created but will not
- * actually pass data (for instance, because kernel packet filtering
- * rules prevent it).
- */
- test_byte = TESTBYTEVAL;
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
- if (send(pgStatSock, &test_byte, 1, 0) != 1)
- {
- if (errno == EINTR)
- goto retry1; /* if interrupted, just retry */
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not send test message on socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
- /*
- * There could possibly be a little delay before the message can be
- * received.  We arbitrarily allow up to half a second before deciding
- * it's broken.
- */
- for (;;) /* need a loop to handle EINTR */
- {
- FD_ZERO(&rset);
- FD_SET(pgStatSock, &rset);
+ /*
+ * The last process is responsible for writing out the stats files at
+ * exit.  Maintain a refcount so that a process going to exit can tell
+ * whether it is the last one.
+ */
+ if (StatsShmem->refcount > 0)
+ StatsShmem->refcount++;
+ else
+ {
+ /* We're the first process to attach to the shared stats memory */
+ Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
- tv.tv_sec = 0;
- tv.tv_usec = 500000;
- sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
- if (sel_res >= 0 || errno != EINTR)
- break;
- }
- if (sel_res < 0)
- {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("select() failed in statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
- if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
- {
- /*
- * This is the case we actually think is likely, so take pains to
- * give a specific message for it.
- *
- * errno will not be set meaningfully here, so don't use it.
- */
- ereport(LOG,
- (errcode(ERRCODE_CONNECTION_FAILURE),
- errmsg("test message did not get through on socket for statistics collector")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ /* Initialize shared memory area */
+ area = dsa_create(LWTRANCHE_STATS);
+ pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
 
- test_byte++; /* just make sure variable is changed */
+ StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+ StatsShmem->global_stats =
+ dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+ StatsShmem->archiver_stats =
+ dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+ StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
 
-retry2:
- if (recv(pgStatSock, &test_byte, 1, 0) != 1)
- {
- if (errno == EINTR)
- goto retry2; /* if interrupted, just retry */
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not receive test message on socket for statistics collector: %m")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ shared_globalStats = (PgStat_GlobalStats *)
+ dsa_get_address(area, StatsShmem->global_stats);
+ shared_archiverStats = (PgStat_ArchiverStats *)
+ dsa_get_address(area, StatsShmem->archiver_stats);
 
- if (test_byte != TESTBYTEVAL) /* strictly paranoia ... */
- {
- ereport(LOG,
- (errcode(ERRCODE_INTERNAL_ERROR),
- errmsg("incorrect test message transmission on socket for statistics collector")));
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
- continue;
- }
+ /* Load saved data if any. */
+ pgstat_read_statsfiles();
 
- /* If we get here, we have a working socket */
- break;
+ StatsShmem->refcount = 1;
  }
 
- /* Did we find a working address? */
- if (!addr || pgStatSock == PGINVALID_SOCKET)
- goto startup_failed;
+ LWLockRelease(StatsLock);
 
  /*
- * Set the socket to non-blocking IO.  This ensures that if the collector
- * falls behind, statistics messages will be discarded; backends won't
- * block waiting to send messages to the collector.
+ * If we're not the first process, attach to the existing shared stats area
+ * outside the StatsLock section.
  */
- if (!pg_set_noblock(pgStatSock))
+ if (!area)
  {
- ereport(LOG,
- (errcode_for_socket_access(),
- errmsg("could not set statistics collector socket to nonblocking mode: %m")));
- goto startup_failed;
+ /* Attach shared area. */
+ area = dsa_attach(StatsShmem->stats_dsa_handle);
+ pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+ StatsShmem->hash_handle, 0);
+
+ /* Setup local variables */
+ pgStatLocalHash = NULL;
+ shared_globalStats = (PgStat_GlobalStats *)
+ dsa_get_address(area, StatsShmem->global_stats);
+ shared_archiverStats = (PgStat_ArchiverStats *)
+ dsa_get_address(area, StatsShmem->archiver_stats);
  }
 
- /*
- * Try to ensure that the socket's receive buffer is at least
- * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
- * data.  Use of UDP protocol means that we are willing to lose data under
- * heavy load, but we don't want it to happen just because of ridiculously
- * small default buffer sizes (such as 8KB on older Windows versions).
- */
- {
- int old_rcvbuf;
- int new_rcvbuf;
- ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
- if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-   (char *) &old_rcvbuf, &rcvbufsize) < 0)
- {
- elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
- /* if we can't get existing size, always try to set it */
- old_rcvbuf = 0;
- }
-
- new_rcvbuf = PGSTAT_MIN_RCVBUF;
- if (old_rcvbuf < new_rcvbuf)
- {
- if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-   (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
- elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
- }
- }
-
- pg_freeaddrinfo_all(hints.ai_family, addrs);
-
- /* Now that we have a long-lived socket, tell fd.c about it. */
- ReserveExternalFD();
+ MemoryContextSwitchTo(oldcontext);
 
- return;
-
-startup_failed:
- ereport(LOG,
- (errmsg("disabling statistics collector for lack of working socket")));
-
- if (addrs)
- pg_freeaddrinfo_all(hints.ai_family, addrs);
-
- if (pgStatSock != PGINVALID_SOCKET)
- closesocket(pgStatSock);
- pgStatSock = PGINVALID_SOCKET;
-
- /*
- * Adjust GUC variables to suppress useless activity, and for debugging
- * purposes (seeing track_counts off is a clue that we failed here). We
- * use PGC_S_OVERRIDE because there is no point in trying to turn it back
- * on from postgresql.conf without a restart.
- */
- SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+ /* don't detach automatically */
+ dsa_pin_mapping(area);
+ global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ * Detach from the shared stats. If we're the last process and told to do
+ * so, write the stats out to files.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
- DIR   *dir;
- struct dirent *entry;
- char fname[MAXPGPATH * 2];
+ /* Return immediately if we are not attached */
+ if (!area || !IsUnderPostmaster)
+ return;
 
- dir = AllocateDir(directory);
- while ((entry = ReadDir(dir, directory)) != NULL)
- {
- int nchars;
- Oid tmp_oid;
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+ if (--StatsShmem->refcount < 1)
+ {
  /*
- * Skip directory entries that don't match the file names we write.
- * See get_dbstat_filename for the database-specific pattern.
+ * We are the last process attached to the shared stats memory. Write
+ * out the stats files if requested.
  */
- if (strncmp(entry->d_name, "global.", 7) == 0)
- nchars = 7;
- else
- {
- nchars = 0;
- (void) sscanf(entry->d_name, "db_%u.%n",
-  &tmp_oid, &nchars);
- if (nchars <= 0)
- continue;
- /* %u allows leading whitespace, so reject that */
- if (strchr("0123456789", entry->d_name[3]) == NULL)
- continue;
- }
-
- if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
- strcmp(entry->d_name + nchars, "stat") != 0)
- continue;
+ if (write_stats)
+ pgstat_write_statsfiles();
 
- snprintf(fname, sizeof(fname), "%s/%s", directory,
- entry->d_name);
- unlink(fname);
+ /* No one is using the area. */
+ StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
  }
- FreeDir(dir);
+
+ LWLockRelease(StatsLock);
+
+ /*
+ * Detach the area.  It is automatically destroyed when the last process
+ * detaches from it.
+ */
+ dsa_detach(area);
+
+ area = NULL;
+ pgStatSharedHash = NULL;
+ shared_globalStats = NULL;
+ shared_archiverStats = NULL;
+ pgStatLocalHash = NULL;
+ global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
- pgstat_reset_remove_files(pgstat_stat_directory);
- pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+ /* standalone server doesn't use shared stats */
+ if (!IsUnderPostmaster)
+ return;
 
-#ifdef EXEC_BACKEND
+ /* we must have shared stats attached */
+ Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
- char   *av[10];
- int ac = 0;
-
- av[ac++] = "postgres";
- av[ac++] = "--forkcol";
- av[ac++] = NULL; /* filled in by postmaster_forkexec */
-
- av[ac] = NULL;
- Assert(ac < lengthof(av));
-
- return postmaster_forkexec(ac, av);
-}
-#endif /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- * Called from postmaster at startup or after an existing collector
- * died.  Attempt to fire up a fresh statistics collector.
- *
- * Returns PID of child process, or 0 if fail.
- *
- * Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
- time_t curtime;
- pid_t pgStatPid;
-
- /*
- * Check that the socket is there, else pgstat_init failed and we can do
- * nothing useful.
- */
- if (pgStatSock == PGINVALID_SOCKET)
- return 0;
-
- /*
- * Do nothing if too soon since last collector start.  This is a safety
- * valve to protect against continuous respawn attempts if the collector
- * is dying immediately at launch.  Note that since we will be re-called
- * from the postmaster main loop, we will get another chance later.
- */
- curtime = time(NULL);
- if ((unsigned int) (curtime - last_pgstat_start_time) <
- (unsigned int) PGSTAT_RESTART_INTERVAL)
- return 0;
- last_pgstat_start_time = curtime;
+ /* Startup must be the only user of shared stats */
+ Assert(StatsShmem->refcount == 1);
 
  /*
- * Okay, fork off the collector.
+ * We could remove the files and recreate the shared memory area directly,
+ * but for simplicity just discard it and then create it anew.
  */
-#ifdef EXEC_BACKEND
- switch ((pgStatPid = pgstat_forkexec()))
-#else
- switch ((pgStatPid = fork_process()))
-#endif
- {
- case -1:
- ereport(LOG,
- (errmsg("could not fork statistics collector: %m")));
- return 0;
-
-#ifndef EXEC_BACKEND
- case 0:
- /* in postmaster child ... */
- InitPostmasterChild();
-
- /* Close the postmaster's sockets */
- ClosePostmasterPorts(false);
-
- /* Drop our connection to postmaster's shared memory, as well */
- dsm_detach_all();
- PGSharedMemoryDetach();
-
- PgstatCollectorMain(0, NULL);
- break;
-#endif
-
- default:
- return (int) pgStatPid;
- }
-
- /* shouldn't get here */
- return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
- last_pgstat_start_time = 0;
+ detach_shared_stats(false); /* Don't write files. */
+ attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,144 +593,479 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  * Must be called by processes that performs DML: tcop/postgres.c, logical
- * receiver processes, SPI worker, etc. to send the so far collected
- * per-table and function usage statistics to the collector.  Note that this
- * is called only when not within a transaction, so it is fair to use
+ * receiver processes, SPI worker, etc. to apply the so far collected
+ * per-table and function usage statistics to the shared statistics hashes.
+ *
+ * Updates are applied no more frequently than the interval of
+ * PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ * failure if force is false, unless updates have been pending for longer
+ * than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried on
+ * subsequent calls of this function.
+ *
+ * Returns the time in milliseconds until the next time updates can be
+ * applied, or zero if there are no postponed updates pending.
+ *
+ * Note that this is called only outside a transaction, so it is fine to use
  * transaction stop time as an approximation of current time.
- * ----------
+ * ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
- /* we assume this inits to all zeroes: */
- static const PgStat_TableCounts all_zeroes;
- static TimestampTz last_report = 0;
-
+ static TimestampTz next_flush = 0;
+ static TimestampTz pending_since = 0;
+ static long retry_interval = 0;
  TimestampTz now;
- PgStat_MsgTabstat regular_msg;
- PgStat_MsgTabstat shared_msg;
- TabStatusArray *tsa;
+ bool nowait = !force; /* Don't use "force" below this point */
+ HASH_SEQ_STATUS scan;
+ PgStatLocalHashEntry *lent;
+ PgStatLocalHashEntry **dbentlist;
+ int dbentlistlen = 8;
+ int ndbentries = 0;
+ int remains = 0;
  int i;
 
  /* Don't expend a clock check if nothing to do */
- if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
- pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
- !have_function_stats)
- return;
+ if (area == NULL || !HAVE_ANY_PENDING_STATS())
+ return 0;
+
+ dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
- /*
- * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
- * msec since we last sent one, or the caller wants to force stats out.
- */
  now = GetCurrentTransactionStopTimestamp();
- if (!force &&
- !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
- return;
- last_report = now;
 
- /*
- * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
- * entries it points to.  (Should we fail partway through the loop below,
- * it's okay to have removed the hashtable already --- the only
- * consequence is we'd get multiple entries for the same table in the
- * pgStatTabList, and that's safe.)
- */
- if (pgStatTabHash)
- hash_destroy(pgStatTabHash);
- pgStatTabHash = NULL;
+ if (nowait)
+ {
+ /*
+ * Don't flush stats too frequently.  Return the time to the next
+ * flush.
+ */
+ if (now < next_flush)
+ {
+ /* Remember when the updates first became pending. */
+ if (pending_since == 0)
+ pending_since = now;
+
+ return (next_flush - now) / 1000;
+ }
+
+ /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+ if (pending_since > 0 &&
+ TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+ nowait = false;
+ }
 
  /*
- * Scan through the TabStatusArray struct(s) to find tables that actually
- * have counts, and build messages to send.  We have to separate shared
- * relations from regular ones because the databaseid field in the message
- * header has to depend on that.
+ * flush_tabstat() folds some of the flushed table stats into the local
+ * database stats, so flush out the database stats last.
  */
- regular_msg.m_databaseid = MyDatabaseId;
- shared_msg.m_databaseid = InvalidOid;
- regular_msg.m_nentries = 0;
- shared_msg.m_nentries = 0;
-
- for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+ if (pgStatLocalHash)
  {
- for (i = 0; i < tsa->tsa_used; i++)
+ /* Step 1: flush out other than database stats */
+ hash_seq_init(&scan, pgStatLocalHash);
+ while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
  {
- PgStat_TableStatus *entry = &tsa->tsa_entries[i];
- PgStat_MsgTabstat *this_msg;
- PgStat_TableEntry *this_ent;
+ bool remove = false;
 
- /* Shouldn't have any pending transaction-dependent counts */
- Assert(entry->trans == NULL);
+ switch (lent->env->type)
+ {
+ case PGSTAT_TYPE_DB:
+ if (ndbentries >= dbentlistlen)
+ {
+ dbentlistlen *= 2;
+ dbentlist = repalloc(dbentlist,
+ sizeof(PgStatLocalHashEntry *) *
+ dbentlistlen);
+ }
+ dbentlist[ndbentries++] = lent;
+ break;
+ case PGSTAT_TYPE_TABLE:
+ if (flush_tabstat(lent->env, nowait))
+ remove = true;
+ break;
+ case PGSTAT_TYPE_FUNCTION:
+ if (flush_funcstat(lent->env, nowait))
+ remove = true;
+ break;
+ default:
+ Assert(false);
+ }
 
- /*
- * Ignore entries that didn't accumulate any actual counts, such
- * as indexes that were opened by the planner but not used.
- */
- if (memcmp(&entry->t_counts, &all_zeroes,
-   sizeof(PgStat_TableCounts)) == 0)
+ if (!remove)
+ {
+ remains++;
  continue;
+ }
 
- /*
- * OK, insert data into the appropriate message, and send if full.
- */
- this_msg = entry->t_shared ? &shared_msg : &regular_msg;
- this_ent = &this_msg->m_entry[this_msg->m_nentries];
- this_ent->t_id = entry->t_id;
- memcpy(&this_ent->t_counts, &entry->t_counts,
-   sizeof(PgStat_TableCounts));
- if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+ /* Remove the successfully flushed entry */
+ pfree(lent->env);
+ hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+ }
+
+ /* Step 2: flush out database stats */
+ for (i = 0; i < ndbentries; i++)
+ {
+ PgStatLocalHashEntry *lent = dbentlist[i];
+
+ if (flush_dbstat(lent->env, nowait))
  {
- pgstat_send_tabstat(this_msg);
- this_msg->m_nentries = 0;
+ remains--;
+ /* Remove the successfully flushed entry */
+ pfree(lent->env);
+ hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
  }
  }
- /* zero out PgStat_TableStatus structs after use */
- MemSet(tsa->tsa_entries, 0,
-   tsa->tsa_used * sizeof(PgStat_TableStatus));
- tsa->tsa_used = 0;
+ pfree(dbentlist);
+
+ if (remains <= 0)
+ {
+ hash_destroy(pgStatLocalHash);
+ pgStatLocalHash = NULL;
+ }
+ }
+
+ /* Publish the last flush time */
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ if (shared_globalStats->stats_timestamp < now)
+ shared_globalStats->stats_timestamp = now;
+ LWLockRelease(StatsLock);
+
+ /*
+ * If we have pending local stats, let the caller know the retry interval.
+ */
+ if (HAVE_ANY_PENDING_STATS())
+ {
+ /* Remember when the updates first became pending. */
+ if (pending_since == 0)
+ pending_since = now;
+
+ /* The interval is doubled at every retry. */
+ if (retry_interval == 0)
+ retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+ else
+ retry_interval = retry_interval * 2;
+
+ /*
+ * Determine the next retry time, making sure it doesn't get shorter
+ * than the previous interval while not keeping updates pending longer
+ * than PGSTAT_MAX_INTERVAL.
+ */
+ if (!TimestampDifferenceExceeds(pending_since,
+ now + 2 * retry_interval,
+ PGSTAT_MAX_INTERVAL))
+ next_flush = now + retry_interval;
+ else
+ {
+ next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+ retry_interval = next_flush - now;
+ }
+
+ return retry_interval / 1000;
  }
 
+ /* Set the next time to update stats */
+ next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+ retry_interval = 0;
+ pending_since = 0;
+
+ return 0;
+}
+
+/*
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are added to the local database stats entry
+ * after a successful flush.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+ static const PgStat_TableCounts all_zeroes;
+ Oid dboid; /* database OID of the table */
+ PgStat_TableStatus *lstats; /* local stats entry  */
+ PgStatEnvelope *shenv; /* shared stats envelope */
+ PgStat_StatTabEntry *shtabstats; /* table entry of shared stats */
+ PgStat_StatDBEntry *ldbstats; /* local database entry */
+ bool found;
+
+ Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+ lstats = (PgStat_TableStatus *) &lenv->body;
+ dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+ /*
+ * Ignore entries that didn't accumulate any actual counts, such as
+ * indexes that were opened by the planner but not used.
+ */
+ if (memcmp(&lstats->t_counts, &all_zeroes,
+   sizeof(PgStat_TableCounts)) == 0)
+ return true;
+
+ /* find shared table stats entry corresponding to the local entry */
+ shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+   nowait, init_tabentry, &found);
+
+ /* skip if dshash failed to acquire lock */
+ if (shenv == NULL)
+ return false;
+
+ /* retrieve the shared table stats entry from the envelope */
+ shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+ /* lock the shared entry to protect the content, skip if failed */
+ if (!nowait)
+ LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+ return false;
+
+ /* add the values to the shared entry. */
+ shtabstats->numscans += lstats->t_counts.t_numscans;
+ shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+ shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+ shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+ shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+ shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+ shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+ /*
+ * If the table was truncated or vacuum/analyze has run, first reset the
+ * live/dead counters.
+ */
+ if (lstats->t_counts.t_truncated ||
+ lstats->t_counts.vacuum_count > 0 ||
+ lstats->t_counts.analyze_count > 0 ||
+ lstats->t_counts.autovac_vacuum_count > 0 ||
+ lstats->t_counts.autovac_analyze_count > 0)
+ {
+ shtabstats->n_live_tuples = 0;
+ shtabstats->n_dead_tuples = 0;
+ }
+
+ /* clear the change counter if requested */
+ if (lstats->t_counts.reset_changed_tuples)
+ shtabstats->changes_since_analyze = 0;
+
+ shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+ shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+ shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+ shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+ shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+ /*
+ * Update vacuum/analyze timestamps and counters, taking care that the
+ * values never go backwards.
+ */
+ if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+ shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+ shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+ if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+ shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+ shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+ if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+ shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+ shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+ if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+ shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+ shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+ /* Clamp n_live_tuples in case of negative delta_live_tuples */
+ shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+ /* Likewise for n_dead_tuples */
+ shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+ LWLockRelease(&shenv->lock);
+
+ /* The entry was successfully flushed, so add the same numbers to the database stats */
+ ldbstats = get_local_dbstat_entry(dboid);
+ ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+ ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+ ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+ ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+ ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+ ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+ ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+ return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * Initializes a table stats entry.  This is also used as the
+ * initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_tabentry(PgStatEnvelope * env)
+{
+ PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
  /*
- * Send partial messages.  Make sure that any pending xact commit/abort
- * gets counted, even if there are no table stats to send.
+ * Initialize the newly created table entry; all counters start at
+ * zero.
  */
- if (regular_msg.m_nentries > 0 ||
- pgStatXactCommit > 0 || pgStatXactRollback > 0)
- pgstat_send_tabstat(&regular_msg);
- if (shared_msg.m_nentries > 0)
- pgstat_send_tabstat(&shared_msg);
-
- /* Now, send function statistics */
- pgstat_send_funcstats();
+ Assert(env->type == PGSTAT_TYPE_TABLE);
+ tabent->tableid = env->objectid;
+ tabent->numscans = 0;
+ tabent->tuples_returned = 0;
+ tabent->tuples_fetched = 0;
+ tabent->tuples_inserted = 0;
+ tabent->tuples_updated = 0;
+ tabent->tuples_deleted = 0;
+ tabent->tuples_hot_updated = 0;
+ tabent->n_live_tuples = 0;
+ tabent->n_dead_tuples = 0;
+ tabent->changes_since_analyze = 0;
+ tabent->blocks_fetched = 0;
+ tabent->blocks_hit = 0;
+
+ tabent->vacuum_timestamp = 0;
+ tabent->vacuum_count = 0;
+ tabent->autovac_vacuum_timestamp = 0;
+ tabent->autovac_vacuum_count = 0;
+ tabent->analyze_timestamp = 0;
+ tabent->analyze_count = 0;
+ tabent->autovac_analyze_timestamp = 0;
+ tabent->autovac_analyze_count = 0;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+ /* we assume this inits to all zeroes: */
+ static const PgStat_FunctionCounts all_zeroes;
+ PgStat_BackendFunctionEntry *localent; /* local stats entry */
+ PgStatEnvelope *shenv; /* shared stats envelope */
+ PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+ bool found;
+
+ Assert(env->type == PGSTAT_TYPE_FUNCTION);
+ localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+ /* Skip it if no counts accumulated for it so far */
+ if (memcmp(&localent->f_counts, &all_zeroes,
+   sizeof(PgStat_FunctionCounts)) == 0)
+ return true;
+
+ /* find shared function stats entry corresponding to the local entry */
+ shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+   nowait, init_funcentry, &found);
+ /* skip if dshash failed to acquire lock */
+ if (shenv == NULL)
+ return false;
+
+ /* retrieve the shared table stats entry from the envelope */
+ sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+ /* lock the shared entry to protect the content, skip if failed */
+ if (!nowait)
+ LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+ return false; /* failed to acquire lock, skip */
+
+ sharedent->f_numcalls += localent->f_counts.f_numcalls;
+ sharedent->f_total_time +=
+ INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+ sharedent->f_self_time +=
+ INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+ LWLockRelease(&shenv->lock);
+
+ return true;
+}
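Outside the patch, the locking discipline that all of these flush_* functions share can be sketched in standalone C, with a pthread mutex standing in for the entry's LWLock (all names here are hypothetical, not from the patch): when nowait is false we block on the lock, otherwise we trylock and report failure so the caller can retry the entry on a later flush cycle.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for a shared stats entry protected by a per-entry lock. */
typedef struct SharedEntry
{
	pthread_mutex_t lock;
	long		numcalls;
} SharedEntry;

/*
 * Accumulate a local counter into the shared entry.  Blocks on the lock
 * when nowait is false; otherwise trylocks and returns false on failure
 * so the caller can keep the local counts and retry later.
 */
bool
flush_counts(SharedEntry *shent, long local_numcalls, bool nowait)
{
	if (!nowait)
		pthread_mutex_lock(&shent->lock);
	else if (pthread_mutex_trylock(&shent->lock) != 0)
		return false;			/* failed to acquire lock, skip */

	shent->numcalls += local_numcalls;
	pthread_mutex_unlock(&shent->lock);
	return true;
}
```

The important property is that a nowait failure leaves the local entry untouched, so nothing is lost; the counts simply ride along until the next flush attempt.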
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * Initializes a function stats entry.  This is also used as the
+ * initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_funcentry(PgStatEnvelope * env)
 {
- int n;
- int len;
+ PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+ Assert(env->type == PGSTAT_TYPE_FUNCTION);
+ shstat->functionid = env->objectid;
+ shstat->f_numcalls = 0;
+ shstat->f_total_time = 0;
+ shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+ PgStat_StatDBEntry *localent;
+ PgStatEnvelope *shenv;
+ PgStat_StatDBEntry *sharedent;
+
+ Assert(env->type == PGSTAT_TYPE_DB);
+
+ localent = (PgStat_StatDBEntry *) &env->body;
+
+ /* find shared database stats entry corresponding to the local entry */
+ shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+   nowait, init_dbentry, NULL);
+
+ /* skip if dshash failed to acquire lock */
+ if (!shenv)
+ return false;
+
+ /* retrieve the shared stats entry from the envelope */
+ sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+ /* lock the shared entry to protect the content, skip if failed */
+ if (!nowait)
+ LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+ else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+ return false;
+
+ sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+ sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+ sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+ sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+ sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+ sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+ sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
- /* It's unlikely we'd get here with no socket, but maybe not impossible */
- if (pgStatSock == PGINVALID_SOCKET)
- return;
+ sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+ sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+ sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+ sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
  /*
- * Report and reset accumulated xact commit/rollback and I/O timings
- * whenever we send a normal tabstat message
+ * Accumulate xact commit/rollback and I/O timings to stats entry of the
+ * current database.
  */
- if (OidIsValid(tsmsg->m_databaseid))
+ if (OidIsValid(localent->databaseid))
  {
- tsmsg->m_xact_commit = pgStatXactCommit;
- tsmsg->m_xact_rollback = pgStatXactRollback;
- tsmsg->m_block_read_time = pgStatBlockReadTime;
- tsmsg->m_block_write_time = pgStatBlockWriteTime;
+ sharedent->counts.n_xact_commit += pgStatXactCommit;
+ sharedent->counts.n_xact_rollback += pgStatXactRollback;
+ sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+ sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
  pgStatXactCommit = 0;
  pgStatXactRollback = 0;
  pgStatBlockReadTime = 0;
@@ -939,257 +1073,102 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
  }
  else
  {
- tsmsg->m_xact_commit = 0;
- tsmsg->m_xact_rollback = 0;
- tsmsg->m_block_read_time = 0;
- tsmsg->m_block_write_time = 0;
+ sharedent->counts.n_xact_commit = 0;
+ sharedent->counts.n_xact_rollback = 0;
+ sharedent->counts.n_block_read_time = 0;
+ sharedent->counts.n_block_write_time = 0;
  }
 
- n = tsmsg->m_nentries;
- len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
- n * sizeof(PgStat_TableEntry);
+ LWLockRelease(&shenv->lock);
 
- pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
- pgstat_send(tsmsg, len);
+ return true;
 }
 
+
+/* ----------
+ * init_dbentry() -
+ *
+ * Initializes a database stats entry.  This is also used as the
+ * initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_dbentry(PgStatEnvelope * env)
+{
+ PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+ Assert(env->type == PGSTAT_TYPE_DB);
+ dbentry->databaseid = env->databaseid;
+ dbentry->last_autovac_time = 0;
+ dbentry->last_checksum_failure = 0;
+ dbentry->stat_reset_timestamp = 0;
+ dbentry->stats_timestamp = 0;
+ /* initialize the new shared entry */
+ MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * Create the filename for a DB stat file; filename is an output parameter
+ * pointing to a character buffer of length len.
  */
 static void
-pgstat_send_funcstats(void)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
- /* we assume this inits to all zeroes: */
- static const PgStat_FunctionCounts all_zeroes;
-
- PgStat_MsgFuncstat msg;
- PgStat_BackendFunctionEntry *entry;
- HASH_SEQ_STATUS fstat;
-
- if (pgStatFunctions == NULL)
- return;
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
- msg.m_databaseid = MyDatabaseId;
- msg.m_nentries = 0;
-
- hash_seq_init(&fstat, pgStatFunctions);
- while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
- {
- PgStat_FunctionEntry *m_ent;
-
- /* Skip it if no counts accumulated since last time */
- if (memcmp(&entry->f_counts, &all_zeroes,
-   sizeof(PgStat_FunctionCounts)) == 0)
- continue;
-
- /* need to convert format of time accumulators */
- m_ent = &msg.m_entry[msg.m_nentries];
- m_ent->f_id = entry->f_id;
- m_ent->f_numcalls = entry->f_counts.f_numcalls;
- m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
- m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
- if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
- {
- pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
- msg.m_nentries * sizeof(PgStat_FunctionEntry));
- msg.m_nentries = 0;
- }
-
- /* reset the entry's counts */
- MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
- }
-
- if (msg.m_nentries > 0)
- pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
- msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
- have_function_stats = false;
+ int printed;
+
+ /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+ printed = snprintf(filename, len, "%s/db_%u.%s",
+   PGSTAT_STAT_PERMANENT_DIRECTORY,
+   databaseid,
+   tempname ? "tmp" : "stat");
+ if (printed >= len)
+ elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- * Will tell the collector about objects he can get rid of.
+ * Collect the shared statistics entries specified by type and dbid. Returns a
+ *  NULL-terminated list of pointers to shared statistics entries in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *  database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *  ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope **
+collect_stat_entries(PgStatTypes type, Oid dbid)
 {
- HTAB   *htab;
- PgStat_MsgTabpurge msg;
- PgStat_MsgFuncpurge f_msg;
- HASH_SEQ_STATUS hstat;
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
- int len;
-
- if (pgStatSock == PGINVALID_SOCKET)
- return;
-
- /*
- * If not done for this transaction, read the statistics collector stats
- * file into some hash tables.
- */
- backend_read_statsfile();
-
- /*
- * Read pg_database and make a list of OIDs of all existing databases
- */
- htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
- /*
- * Search the database hash table for dead databases and tell the
- * collector to drop them.
- */
- hash_seq_init(&hstat, pgStatDBHash);
- while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
- {
- Oid dbid = dbentry->databaseid;
-
- CHECK_FOR_INTERRUPTS();
-
- /* the DB entry for shared tables (with InvalidOid) is never dropped */
- if (OidIsValid(dbid) &&
- hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
- pgstat_drop_database(dbid);
- }
-
- /* Clean up */
- hash_destroy(htab);
-
- /*
- * Lookup our own database entry; if not found, nothing more to do.
- */
- dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- (void *) &MyDatabaseId,
- HASH_FIND, NULL);
- if (dbentry == NULL || dbentry->tables == NULL)
- return;
-
- /*
- * Similarly to above, make a list of all known relations in this DB.
- */
- htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
- /*
- * Initialize our messages table counter to zero
- */
- msg.m_nentries = 0;
-
- /*
- * Check for all tables listed in stats hashtable if they still exist.
- */
- hash_seq_init(&hstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+ dshash_seq_status hstat;
+ PgStatHashEntry *p;
+ int listlen = 16;
+ PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+ int n = 0;
+
+ dshash_seq_init(&hstat, pgStatSharedHash, false);
+ while ((p = dshash_seq_next(&hstat)) != NULL)
  {
- Oid tabid = tabentry->tableid;
-
- CHECK_FOR_INTERRUPTS();
-
- if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+ if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+ (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
  continue;
 
- /*
- * Not there, so add this table's Oid to the message
- */
- msg.m_tableid[msg.m_nentries++] = tabid;
-
- /*
- * If the message is full, send it out and reinitialize to empty
- */
- if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+ if (n >= listlen - 1)
  {
- len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
- + msg.m_nentries * sizeof(Oid);
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, len);
-
- msg.m_nentries = 0;
+ listlen *= 2;
+ envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
  }
+ envlist[n++] = dsa_get_address(area, p->env);
  }
+ dshash_seq_term(&hstat);
 
- /*
- * Send the rest
- */
- if (msg.m_nentries > 0)
- {
- len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
- + msg.m_nentries * sizeof(Oid);
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, len);
- }
-
- /* Clean up */
- hash_destroy(htab);
-
- /*
- * Now repeat the above steps for functions.  However, we needn't bother
- * in the common case where no function stats are being collected.
- */
- if (dbentry->functions != NULL &&
- hash_get_num_entries(dbentry->functions) > 0)
- {
- htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
- pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
- f_msg.m_databaseid = MyDatabaseId;
- f_msg.m_nentries = 0;
-
- hash_seq_init(&hstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
- {
- Oid funcid = funcentry->functionid;
-
- CHECK_FOR_INTERRUPTS();
+ envlist[n] = NULL;
 
- if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
- continue;
-
- /*
- * Not there, so add this function's Oid to the message
- */
- f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
- /*
- * If the message is full, send it out and reinitialize to empty
- */
- if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
- {
- len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
- + f_msg.m_nentries * sizeof(Oid);
-
- pgstat_send(&f_msg, len);
-
- f_msg.m_nentries = 0;
- }
- }
-
- /*
- * Send the rest
- */
- if (f_msg.m_nentries > 0)
- {
- len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
- + f_msg.m_nentries * sizeof(Oid);
-
- pgstat_send(&f_msg, len);
- }
-
- hash_destroy(htab);
- }
+ return envlist;
 }
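As a standalone illustration of the list-building loop in collect_stat_entries() above (names hypothetical, malloc/realloc standing in for palloc/repalloc): start with room for 16 pointers, double on demand, and always reserve one slot for the terminating NULL.

```c
#include <stdlib.h>

/*
 * Collect pointers to every matching element of items[] into a
 * NULL-terminated, heap-allocated list, doubling the list as needed.
 */
int **
collect_matching(int *items, int nitems, int wanted)
{
	int			listlen = 16;
	int		  **list = malloc(sizeof(int *) * listlen);
	int			n = 0;

	for (int i = 0; i < nitems; i++)
	{
		if (items[i] != wanted)
			continue;

		/* keep one slot free for the NULL terminator */
		if (n >= listlen - 1)
		{
			listlen *= 2;
			list = realloc(list, listlen * sizeof(int *));
		}
		list[n++] = &items[i];
	}
	list[n] = NULL;
	return list;
}
```

Callers then walk the result with `for (p = list; *p != NULL; p++)`, exactly as pgstat_drop_database() and pgstat_reset_counters() do with the envelope list.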
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  * Collect the OIDs of all objects listed in the specified system catalog
  * into a temporary hash table.  Caller should hash_destroy the result
@@ -1198,7 +1177,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
  HTAB   *htab;
  HASHCTL hash_ctl;
@@ -1212,7 +1191,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
  hash_ctl.entrysize = sizeof(Oid);
  hash_ctl.hcxt = CurrentMemoryContext;
  htab = hash_create("Temporary table of OIDs",
-   PGSTAT_TAB_HASH_SIZE,
+   PGSTAT_TABLE_HASH_SIZE,
    &hash_ctl,
    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1218,184 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect the OIDs of all existing objects of every kind.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for a lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure; they will be deleted by later invocations of this function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+ HTAB   *dbids; /* database ids */
+ HTAB   *relids; /* relation ids in the current database */
+ HTAB   *funcids; /* function ids in the current database */
+ PgStatEnvelope **victims; /* victim entry list */
+ int arraylen = 0; /* storage size of the above */
+ int nvictims = 0; /* # of entries of the above */
+ dshash_seq_status dshstat;
+ PgStatHashEntry *ent;
+ int i;
+
+ /* we don't collect stats under standalone mode */
+ if (!IsUnderPostmaster)
+ return;
+
+ /* collect oids of existent objects */
+ dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+ relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+ funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+ /* collect victims from shared stats */
+ arraylen = 16;
+ victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+ nvictims = 0;
+
+ dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+ while ((ent = dshash_seq_next(&dshstat)) != NULL)
+ {
+ HTAB   *oidtab;
+ Oid   *key;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Skip entries that are not database entries and do not belong to the
+ * current database.
+ */
+ if (ent->key.type != PGSTAT_TYPE_DB &&
+ ent->key.databaseid != MyDatabaseId)
+ continue;
+
+ switch (ent->key.type)
+ {
+ case PGSTAT_TYPE_DB:
+ /* don't remove database entry for shared tables */
+ if (ent->key.databaseid == 0)
+ continue;
+ oidtab = dbids;
+ key = &ent->key.databaseid;
+ break;
+
+ case PGSTAT_TYPE_TABLE:
+ oidtab = relids;
+ key = &ent->key.objectid;
+ break;
+
+ case PGSTAT_TYPE_FUNCTION:
+ oidtab = funcids;
+ key = &ent->key.objectid;
+ break;
+ case PGSTAT_TYPE_ALL:
+ Assert(false); /* cannot be stored in the hash */
+ continue;
+ }
+
+ /* Skip existent objects. */
+ if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+ continue;
+
+ /* extend the list if needed */
+ if (nvictims >= arraylen)
+ {
+ arraylen *= 2;
+ victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+ }
+
+ victims[nvictims++] = dsa_get_address(area, ent->env);
+ }
+ dshash_seq_term(&dshstat);
+ hash_destroy(dbids);
+ hash_destroy(relids);
+ hash_destroy(funcids);
+
+ /* Now try removing the victim entries */
+ for (i = 0; i < nvictims; i++)
+ {
+ PgStatEnvelope *p = victims[i];
+
+ delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+ }
+}
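The three-step scheme in pgstat_vacuum_stat() above can be modeled in plain C, with simple arrays standing in for the OID hash tables and the dshash scan (all names here are illustrative): step 1 snapshots the live OIDs, step 2 nominates victims while only reading the entry table, and step 3 deletes the nominated victims afterwards.

```c
#include <stdbool.h>

/* step 1 stand-in: membership test against the snapshot of live OIDs */
bool
oid_exists(const unsigned *oids, int noids, unsigned oid)
{
	for (int i = 0; i < noids; i++)
		if (oids[i] == oid)
			return true;
	return false;
}

/*
 * Step 2: scan the entry table read-only and nominate every entry whose
 * OID is no longer live.  The caller performs step 3, deleting each
 * victim afterwards (with a conditional lock, in the real code).
 */
int
collect_victims(const unsigned *live, int nlive,
				const unsigned *entries, int nentries,
				unsigned *victims)
{
	int			nvictims = 0;

	for (int i = 0; i < nentries; i++)
		if (!oid_exists(live, nlive, entries[i]))
			victims[nvictims++] = entries[i];
	return nvictims;
}
```

Separating nomination from deletion is what lets the scan hold only a shared lock; exclusive locking is confined to step 3 and is attempted without waiting.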
+
+
+/* ----------
+ * delete_stat_entry -
+ *
+ *  Deletes the specified entry from shared stats hash
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+ PgStatHashEntryKey key;
+ PgStatHashEntry *ent;
+
+ key.type = type;
+ key.databaseid = dbid;
+ key.objectid = objid;
+ ent = dshash_find_extended(pgStatSharedHash, &key,
+   true, nowait, false, NULL);
+
+ if (!ent)
+ return false; /* lock failed or not found */
+
+ /* The entry is exclusively locked, so we can free the chunk first. */
+ dsa_free(area, ent->env);
+ dshash_delete_entry(pgStatSharedHash, ent);
+
+ return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- * Tell the collector that we just dropped a database.
- * (If the message gets lost, we will still clean the dead DB eventually
- * via future invocations of pgstat_vacuum_stat().)
- * ----------
+ * Remove entry for the database that we just dropped.
+ *
+ *  Some entries might be left behind due to lock failure, or some stats
+ * might be flushed after this point, but we will still clean the dead DB
+ * eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
- PgStat_MsgDropdb msg;
+ PgStatEnvelope **envlist;
+ PgStatEnvelope **p;
 
- if (pgStatSock == PGINVALID_SOCKET)
- return;
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
- msg.m_databaseid = databaseid;
- pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- * Tell the collector that we just dropped a relation.
- * (If the message gets lost, we will still clean the dead entry eventually
- * via future invocations of pgstat_vacuum_stat().)
- *
- * Currently not used for lack of any good place to call it; we rely
- * entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
- PgStat_MsgTabpurge msg;
- int len;
+ Assert(OidIsValid(databaseid));
 
- if (pgStatSock == PGINVALID_SOCKET)
+ if (!IsUnderPostmaster || !pgStatSharedHash)
  return;
 
- msg.m_tableid[0] = relid;
- msg.m_nentries = 1;
+ envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
- len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+ for (p = envlist; *p != NULL; p++)
+ delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, len);
+ pfree(envlist);
 }
-#endif /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- * Tell the statistics collector to reset counters for our database.
+ * Reset counters for our database.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -1306,20 +1404,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
- PgStat_MsgResetcounter msg;
+ PgStatEnvelope **envlist;
+ PgStatEnvelope **p;
 
- if (pgStatSock == PGINVALID_SOCKET)
- return;
+ /* Lookup the entries of the current database in the stats hash. */
+ envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+ for (p = envlist; *p != NULL; p++)
+ {
+ PgStatEnvelope *env = *p;
+ PgStat_StatDBEntry *dbstat;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, sizeof(msg));
+ LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+ switch (env->type)
+ {
+ case PGSTAT_TYPE_TABLE:
+ init_tabentry(env);
+ break;
+
+ case PGSTAT_TYPE_FUNCTION:
+ init_funcentry(env);
+ break;
+
+ case PGSTAT_TYPE_DB:
+ init_dbentry(env);
+ dbstat = (PgStat_StatDBEntry *) &env->body;
+ dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+ break;
+ default:
+ Assert(false);
+ }
+
+ LWLockRelease(&env->lock);
+ }
+
+ pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- * Tell the statistics collector to reset cluster-wide shared counters.
+ * Reset cluster-wide shared counters.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -1328,29 +1453,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
- PgStat_MsgResetsharedcounter msg;
-
- if (pgStatSock == PGINVALID_SOCKET)
- return;
-
+ /* Reset the archiver statistics for the cluster. */
  if (strcmp(target, "archiver") == 0)
- msg.m_resettarget = RESET_ARCHIVER;
+ {
+ TimestampTz now = GetCurrentTimestamp();
+
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+ shared_archiverStats->stat_reset_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
+ /* Reset the bgwriter statistics for the cluster. */
  else if (strcmp(target, "bgwriter") == 0)
- msg.m_resettarget = RESET_BGWRITER;
+ {
+ TimestampTz now = GetCurrentTimestamp();
+
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+ shared_globalStats->stat_reset_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
  else
  ereport(ERROR,
  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("unrecognized reset target: \"%s\"", target),
  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
- pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- * Tell the statistics collector to reset a single counter.
+ * Reset a single counter.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -1359,17 +1492,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
- PgStat_MsgResetsinglecounter msg;
+ PgStatEnvelope *env;
+ PgStat_StatDBEntry *dbentry;
+ PgStatTypes stattype;
+ TimestampTz ts;
 
- if (pgStatSock == PGINVALID_SOCKET)
- return;
+ env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+ false, NULL, NULL);
+ Assert(env);
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
- msg.m_databaseid = MyDatabaseId;
- msg.m_resettype = type;
- msg.m_objectid = objoid;
+ /* Set the reset timestamp for the whole database */
+ dbentry = (PgStat_StatDBEntry *) &env->body;
+ ts = GetCurrentTimestamp();
+ LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+ dbentry->stat_reset_timestamp = ts;
+ LWLockRelease(&env->lock);
 
- pgstat_send(&msg, sizeof(msg));
+ /* Remove object if it exists, ignore if not */
+ switch (type)
+ {
+ case RESET_TABLE:
+ stattype = PGSTAT_TYPE_TABLE;
+ break;
+ case RESET_FUNCTION:
+ stattype = PGSTAT_TYPE_FUNCTION;
+ break;
+ }
+
+ env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+ if (env == NULL)
+ return; /* object does not exist, ignore */
+ LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+ if (env->type == PGSTAT_TYPE_TABLE)
+ init_tabentry(env);
+ else
+ {
+ Assert(env->type == PGSTAT_TYPE_FUNCTION);
+ init_funcentry(env);
+ }
+ LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1383,48 +1541,63 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
- PgStat_MsgAutovacStart msg;
+ PgStat_StatDBEntry *dbentry;
+ TimestampTz ts;
 
- if (pgStatSock == PGINVALID_SOCKET)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
- msg.m_databaseid = dboid;
- msg.m_start_time = GetCurrentTimestamp();
+ ts = GetCurrentTimestamp();
 
- pgstat_send(&msg, sizeof(msg));
+ /*
+ * Store the last autovacuum time in the database's hash table entry.
+ */
+ dbentry = get_local_dbstat_entry(dboid);
+ dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- * Tell the collector about the table we just vacuumed.
+ * Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
  PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
- PgStat_MsgVacuum msg;
+ PgStat_TableStatus *tabentry;
+ TimestampTz ts;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
- msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
- msg.m_tableoid = tableoid;
- msg.m_autovacuum = IsAutoVacuumWorkerProcess();
- msg.m_vacuumtime = GetCurrentTimestamp();
- msg.m_live_tuples = livetuples;
- msg.m_dead_tuples = deadtuples;
- pgstat_send(&msg, sizeof(msg));
+ /* Store the data in the table's hash table entry. */
+ ts = GetCurrentTimestamp();
+ tabentry = get_local_tabstat_entry(tableoid, shared);
+
+ tabentry->t_counts.t_delta_live_tuples = livetuples;
+ tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+ if (IsAutoVacuumWorkerProcess())
+ {
+ tabentry->autovac_vacuum_timestamp = ts;
+ tabentry->t_counts.autovac_vacuum_count++;
+ }
+ else
+ {
+ tabentry->vacuum_timestamp = ts;
+ tabentry->t_counts.vacuum_count++;
+ }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- * Tell the collector about the table we just analyzed.
+ * Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1608,10 @@ pgstat_report_analyze(Relation rel,
   PgStat_Counter livetuples, PgStat_Counter deadtuples,
   bool resetcounter)
 {
- PgStat_MsgAnalyze msg;
+ PgStat_TableStatus *tabentry;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
  /*
@@ -1445,10 +1619,10 @@ pgstat_report_analyze(Relation rel,
  * already inserted and/or deleted rows in the target table. ANALYZE will
  * have counted such rows as live or dead respectively. Because we will
  * report our counts of such rows at transaction end, we should subtract
- * off these counts from what we send to the collector now, else they'll
- * be double-counted after commit.  (This approach also ensures that the
- * collector ends up with the right numbers if we abort instead of
- * committing.)
+ * off these counts from what is already written to shared stats now, else
+ * they'll be double-counted after commit.  (This approach also ensures
+ * that the shared stats ends up with the right numbers if we abort
+ * instead of committing.)
  */
  if (rel->pgstat_info != NULL)
  {
@@ -1466,158 +1640,172 @@ pgstat_report_analyze(Relation rel,
  deadtuples = Max(deadtuples, 0);
  }
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
- msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
- msg.m_tableoid = RelationGetRelid(rel);
- msg.m_autovacuum = IsAutoVacuumWorkerProcess();
- msg.m_resetcounter = resetcounter;
- msg.m_analyzetime = GetCurrentTimestamp();
- msg.m_live_tuples = livetuples;
- msg.m_dead_tuples = deadtuples;
- pgstat_send(&msg, sizeof(msg));
+ /* Store the data in the table's hash table entry. */
+ tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+   rel->rd_rel->relisshared);
+
+ tabentry->t_counts.t_delta_live_tuples = livetuples;
+ tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+ /*
+ * If commanded, reset changes_since_analyze to zero.  This forgets any
+ * changes that were committed while the ANALYZE was in progress, but we
+ * have no good way to estimate how many of those there were.
+ */
+ if (resetcounter)
+ tabentry->t_counts.reset_changed_tuples = true;
+
+ if (IsAutoVacuumWorkerProcess())
+ {
+ tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+ tabentry->t_counts.autovac_analyze_count++;
+ }
+ else
+ {
+ tabentry->analyze_timestamp = GetCurrentTimestamp();
+ tabentry->t_counts.analyze_count++;
+ }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- * Tell the collector about a Hot Standby recovery conflict.
+ * Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
- PgStat_MsgRecoveryConflict msg;
+ PgStat_StatDBEntry *dbent;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
- msg.m_databaseid = MyDatabaseId;
- msg.m_reason = reason;
- pgstat_send(&msg, sizeof(msg));
+ dbent = get_local_dbstat_entry(MyDatabaseId);
+
+ switch (reason)
+ {
+ case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+ /*
+ * Since we drop the information about the database as soon as it
+ * replicates, there is no point in counting these conflicts.
+ */
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+ dbent->counts.n_conflict_tablespace++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_LOCK:
+ dbent->counts.n_conflict_lock++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+ dbent->counts.n_conflict_snapshot++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+ dbent->counts.n_conflict_bufferpin++;
+ break;
+ case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+ dbent->counts.n_conflict_startup_deadlock++;
+ break;
+ }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- * Tell the collector about a deadlock detected.
+ * Report a deadlock detected.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
- PgStat_MsgDeadlock msg;
+ PgStat_StatDBEntry *dbent;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
- msg.m_databaseid = MyDatabaseId;
- pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- * Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
- PgStat_MsgChecksumFailure msg;
-
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
- return;
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
- msg.m_databaseid = dboid;
- msg.m_failurecount = failurecount;
- msg.m_failure_time = GetCurrentTimestamp();
-
- pgstat_send(&msg, sizeof(msg));
+ dbent = get_local_dbstat_entry(MyDatabaseId);
+ dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- * Tell the collector about a checksum failure.
+ * Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
- pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+ PgStat_StatDBEntry *dbent;
+
+ /* return if we are not collecting stats */
+ if (!area)
+ return;
+
+ dbent = get_local_dbstat_entry(MyDatabaseId);
+ dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- * Tell the collector about a temporary file.
+ * Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
- PgStat_MsgTempFile msg;
+ PgStat_StatDBEntry *dbent;
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
- msg.m_databaseid = MyDatabaseId;
- msg.m_filesize = filesize;
- pgstat_send(&msg, sizeof(msg));
+ if (filesize == 0) /* Is there a case where filesize is really 0? */
+ return;
+
+ dbent = get_local_dbstat_entry(MyDatabaseId);
+ dbent->counts.n_temp_bytes += filesize; /* XXX: should check for overflow */
+ dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- * Send some junk data to the collector to increase traffic.
- * ----------
+ * Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
- PgStat_MsgDummy msg;
+ PgStat_StatDBEntry *dbentry;
 
- if (pgStatSock == PGINVALID_SOCKET)
+ /* return if we are not collecting stats */
+ if (!area)
  return;
 
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
- pgstat_send(&msg, sizeof(msg));
+ dbentry = get_local_dbstat_entry(dboid);
+
+ /* add the given number of failures to the accumulated count */
+ dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- * Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
- PgStat_MsgInquiry msg;
-
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
- msg.clock_time = clock_time;
- msg.cutoff_time = cutoff_time;
- msg.databaseid = databaseid;
- pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
    PgStat_FunctionCallUsage *fcu)
 {
+ PgStatEnvelope *env;
  PgStat_BackendFunctionEntry *htabent;
  bool found;
 
@@ -1628,26 +1816,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
  return;
  }
 
- if (!pgStatFunctions)
- {
- /* First time through - initialize function stat table */
- HASHCTL hash_ctl;
+ env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+   fcinfo->flinfo->fn_oid, true, &found);
+ htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
- pgStatFunctions = hash_create("Function stat entries",
-  PGSTAT_FUNCTION_HASH_SIZE,
-  &hash_ctl,
-  HASH_ELEM | HASH_BLOBS);
- }
-
- /* Get the stats entry for this function, create if necessary */
- htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-  HASH_ENTER, &found);
  if (!found)
  MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+ htabent->f_id = fcinfo->flinfo->fn_oid;
+
  fcu->fs = &htabent->f_counts;
 
  /* save stats for this function, later used to compensate for recursion */
@@ -1660,31 +1837,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
  INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- * for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for the specified
+ *  function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
- if (pgStatFunctions == NULL)
+ PgStatEnvelope *env;
+
+ env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+   func_id, false, NULL);
+ if (!env)
  return NULL;
 
- return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-   (void *) &func_id,
-   HASH_FIND, NULL);
+ return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1725,9 +1909,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  fs->f_numcalls++;
  fs->f_total_time = f_total;
  INSTR_TIME_ADD(fs->f_self_time, f_self);
-
- /* indicate that we have something to send */
- have_function_stats = true;
 }
 
 
@@ -1739,8 +1920,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  * We assume that a relcache entry's pgstat_info field is zeroed by
  * relcache.c when the relcache entry is made; thereafter it is long-lived
- * data.  We can avoid repeated searches of the TabStatus arrays when the
- * same relation is touched repeatedly within a transaction.
+ * data.
  * ----------
  */
 void
@@ -1760,7 +1940,8 @@ pgstat_initstats(Relation rel)
  return;
  }
 
- if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+ /* return if we are not collecting stats */
+ if (!area)
  {
  /* We're not counting at all */
  rel->pgstat_info = NULL;
@@ -1776,116 +1957,157 @@ pgstat_initstats(Relation rel)
  return;
 
  /* Else find or make the PgStat_TableStatus entry, and update link */
- rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+ rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created if none exists yet; found
+ *  must be non-null in that case.
+ *
+ *  The caller is responsible for initializing the body part of the
+ *  returned envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+ bool create, bool *found)
 {
- TabStatHashEntry *hash_entry;
- PgStat_TableStatus *entry;
- TabStatusArray *tsa;
- bool found;
+ PgStatHashEntryKey key;
+ PgStatLocalHashEntry *entry;
 
- /*
- * Create hash table if we don't have it already.
- */
- if (pgStatTabHash == NULL)
+ if (pgStatLocalHash == NULL)
  {
  HASHCTL ctl;
 
- memset(&ctl, 0, sizeof(ctl));
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(TabStatHashEntry);
-
- pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
- TABSTAT_QUANTUM,
- &ctl,
- HASH_ELEM | HASH_BLOBS);
+ MemSet(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(PgStatHashEntryKey);
+ ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+ pgStatLocalHash = hash_create("Local stat entries",
+  PGSTAT_TABLE_HASH_SIZE,
+  &ctl,
+  HASH_ELEM | HASH_BLOBS);
+ }
+
+ /* Find an entry or create a new one. */
+ key.type = type;
+ key.databaseid = dbid;
+ key.objectid = objid;
+ entry = hash_search(pgStatLocalHash, &key,
+ create ? HASH_ENTER : HASH_FIND, found);
+
+ if (!create && !entry)
+ return NULL;
+
+ if (create && !*found)
+ {
+ int len = pgstat_localentsize[type];
+
+ entry->env = MemoryContextAlloc(CacheMemoryContext,
+ PgStatEnvelopeSize(len));
+ entry->env->type = type;
+ entry->env->len = len;
  }
 
+ return entry->env;
+}
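As an aside for anyone reviewing this without the whole patch in front of them: the envelope layout used above is a small common header (type, length) followed by a variable-size body that callers cast to the concrete stats struct. A minimal standalone sketch of that pattern — the names here are invented for illustration, not the patch's actual definitions:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { STAT_TYPE_DB, STAT_TYPE_TABLE } StatType;

/* Common header prepended to every stats entry body. */
typedef struct StatEnvelope
{
	StatType	type;
	size_t		len;			/* size of body[] */
	char		body[];			/* flexible array member (C99) */
} StatEnvelope;

#define EnvelopeSize(bodylen) (offsetof(StatEnvelope, body) + (bodylen))

/* Example concrete body type. */
typedef struct TableCounts
{
	long		tuples_inserted;
	long		tuples_deleted;
} TableCounts;

/* Allocate an envelope large enough for a body of the given size. */
static StatEnvelope *
make_envelope(StatType type, size_t bodylen)
{
	StatEnvelope *env = malloc(EnvelopeSize(bodylen));

	env->type = type;
	env->len = bodylen;
	memset(env->body, 0, bodylen);
	return env;
}
```

In the patch the analogous `&env->body` pointer is what get_local_stat_entry() hands back to callers such as get_local_tabstat_entry(), which then cast it to the concrete struct.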
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry
+ *  is created and initialized if it does not exist yet.
+ * ----------
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+ PgStatEnvelope *env;
+ PgStat_StatDBEntry *dbentry;
+ bool found;
+
  /*
  * Find an entry or create a new one.
  */
- hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+ env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+   true, &found);
+ dbentry = (PgStat_StatDBEntry *) &env->body;
+
  if (!found)
  {
- /* initialize new entry with null pointer */
- hash_entry->tsa_entry = NULL;
+ dbentry->databaseid = dbid;
+ dbentry->last_autovac_time = 0;
+ dbentry->last_checksum_failure = 0;
+ dbentry->stat_reset_timestamp = 0;
+ dbentry->stats_timestamp = 0;
+ MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
  }
 
- /*
- * If entry is already valid, we're done.
- */
- if (hash_entry->tsa_entry)
- return hash_entry->tsa_entry;
-
- /*
- * Locate the first pgStatTabList entry with free space, making a new list
- * entry if needed.  Note that we could get an OOM failure here, but if so
- * we have left the hashtable and the list in a consistent state.
- */
- if (pgStatTabList == NULL)
- {
- /* Set up first pgStatTabList entry */
- pgStatTabList = (TabStatusArray *)
- MemoryContextAllocZero(TopMemoryContext,
-   sizeof(TabStatusArray));
- }
+ return dbentry;
+}
 
- tsa = pgStatTabList;
- while (tsa->tsa_used >= TABSTAT_QUANTUM)
- {
- if (tsa->tsa_next == NULL)
- tsa->tsa_next = (TabStatusArray *)
- MemoryContextAllocZero(TopMemoryContext,
-   sizeof(TabStatusArray));
- tsa = tsa->tsa_next;
- }
 
- /*
- * Allocate a PgStat_TableStatus entry within this list entry.  We assume
- * the entry was already zeroed, either at creation or after last use.
- */
- entry = &tsa->tsa_entries[tsa->tsa_used++];
- entry->t_id = rel_id;
- entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is
+ *  created and initialized if it does not exist yet.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+ PgStatEnvelope *env;
+ PgStat_TableStatus *tabentry;
+ bool found;
 
- /*
- * Now we can fill the entry in pgStatTabHash.
- */
- hash_entry->tsa_entry = entry;
+ env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+   isshared ? InvalidOid : MyDatabaseId,
+   rel_id, true, &found);
 
- return entry;
+ tabentry = (PgStat_TableStatus *) &env->body;
+
+ if (!found)
+ {
+ tabentry->t_id = rel_id;
+ tabentry->t_shared = isshared;
+ tabentry->trans = NULL;
+ MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+ tabentry->vacuum_timestamp = 0;
+ tabentry->autovac_vacuum_timestamp = 0;
+ tabentry->analyze_timestamp = 0;
+ tabentry->autovac_analyze_timestamp = 0;
+ }
+
+ return tabentry;
 }
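One thing worth double-checking in the key handling: a byte-wise hash (HASH_BLOBS) sees padding bytes too, so a composite key struct either has to be padding-free or must be fully zeroed before being filled and passed to hash_search(). A standalone illustration of the safe construction — the struct and hash function are invented for the example, not taken from the patch (whether PgStatHashEntryKey actually contains padding depends on its field types):

```c
#include <stdint.h>
#include <string.h>

/* Composite hash key: (stats type, database OID, object OID). */
typedef struct StatKey
{
	uint8_t		type;			/* padding bytes follow on most ABIs */
	uint32_t	databaseid;
	uint32_t	objectid;
} StatKey;

/* FNV-1a over the raw key bytes, as a byte-wise hash would see them. */
static uint32_t
hash_bytes(const void *p, size_t len)
{
	const unsigned char *b = p;
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
		h = (h ^ b[i]) * 16777619u;
	return h;
}

/* Build a key safely: zero everything first so padding compares equal. */
static StatKey
make_key(uint8_t type, uint32_t dbid, uint32_t objid)
{
	StatKey		key;

	memset(&key, 0, sizeof(key));
	key.type = type;
	key.databaseid = dbid;
	key.objectid = objid;
	return key;
}
```

Two keys built this way hash and compare identically even when the struct has internal padding; assigning the fields of an uninitialized stack struct would not guarantee that.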
 
-/*
- * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
+
+/* ----------
+ * find_tabstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel, searching the
+ *  current database first and then shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
- TabStatHashEntry *hash_entry;
+ PgStatEnvelope *env;
 
- /* If hashtable doesn't exist, there are no entries at all */
- if (!pgStatTabHash)
- return NULL;
+ env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+   false, NULL);
+ if (!env)
+ env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+   false, NULL);
+ if (env)
+ return (PgStat_TableStatus *) &env->body;
 
- hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
- if (!hash_entry)
- return NULL;
-
- /* Note that this step could also return NULL, but that's correct */
- return hash_entry->tsa_entry;
+ return NULL;
 }
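The lookup order in find_tabstat_entry() — current database first, then shared relations filed under InvalidOid — can be sketched standalone like this; the toy array stands in for the real local hash and all names are invented:

```c
#include <stddef.h>

#define INVALID_OID 0
#define MY_DATABASE_ID 16384	/* assumed current database OID */

typedef struct Entry
{
	unsigned	dbid;
	unsigned	relid;
	long		numscans;
} Entry;

/* A toy stand-in for the local stats hash. */
static Entry entries[] = {
	{INVALID_OID, 1259, 7},		/* shared relation */
	{MY_DATABASE_ID, 50010, 2},	/* database-local relation */
};

static Entry *
lookup(unsigned dbid, unsigned relid)
{
	for (size_t i = 0; i < sizeof(entries) / sizeof(entries[0]); i++)
		if (entries[i].dbid == dbid && entries[i].relid == relid)
			return &entries[i];
	return NULL;
}

/* Current database first, then fall back to shared relations. */
static Entry *
find_tabstat(unsigned relid)
{
	Entry	   *e = lookup(MY_DATABASE_ID, relid);

	if (e == NULL)
		e = lookup(INVALID_OID, relid);
	return e;
}
```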
 
 /*
@@ -2362,7 +2584,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
  PgStat_TableStatus *pgstat_info;
 
  /* Find or create a tabstat entry for the rel */
- pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+ pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
  /* Same math as in AtEOXact_PgStat, commit case */
  pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2398,7 +2620,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
  PgStat_TableStatus *pgstat_info;
 
  /* Find or create a tabstat entry for the rel */
- pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+ pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
  /* Same math as in AtEOXact_PgStat, abort case */
  if (rec->t_truncated)
@@ -2415,88 +2637,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns a pointer to the snapshot of the shared entry for the key, or
+ *  NULL if not found.  Returned snapshots are stable during the current
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+ PgStatSnapshot *snap = NULL;
+ bool found;
+ PgStatHashEntryKey key;
+ size_t statentsize = pgstat_entsize[type];
+
+ Assert(type != PGSTAT_TYPE_ALL);
+
+ /*
+ * Create new hash, with rather arbitrary initial number of entries since
+ * we don't know how this hash will grow.
+ */
+ if (!pgStatSnapshotHash)
+ {
+ HASHCTL ctl;
+
+ /*
+ * Create the hash in the stats context.
+ *
+ * Each entry is prefixed with a common header represented by
+ * PgStatSnapshot.
+ */
+
+ ctl.keysize = sizeof(PgStatHashEntryKey);
+ ctl.entrysize = PgStatSnapshotSize(statentsize);
+ ctl.hcxt = pgStatSnapshotContext;
+ pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ }
+
+ /* Find a snapshot */
+ key.type = type;
+ key.databaseid = dbid;
+ key.objectid = objid;
+
+ snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+ /*
+ * Refer to the shared hash if the entry is not found in the snapshot
+ * hash.
+ *
+ * While in a transaction we create a snapshot entry for consistency;
+ * otherwise we return an up-to-date entry.  Even then we need a copy,
+ * since the shared stats entry can be modified at any time, and we
+ * reuse the same snapshot entry for that purpose.
+ */
+ if (!found || !IsTransactionState())
+ {
+ PgStatEnvelope *shenv;
+
+ shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+ if (shenv)
+ memcpy(&snap->body, &shenv->body, statentsize);
+
+ snap->negative = !shenv;
+ }
+
+ if (snap->negative)
+ return NULL;
+
+ return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- * Support function for the SQL-callable pgstat* functions. Returns
- * the collected statistics for one database or NULL. NULL doesn't mean
- * that the database doesn't exist, it is just not yet known by the
- * collector, so the caller is better off to report ZERO instead.
+ * Find a database stats entry for a backend.  The returned entry is
+ * cached until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
- /*
- * If not done for this transaction, read the statistics collector stats
- * file into some hash tables.
- */
- backend_read_statsfile();
+ /* should be called from backends */
+ Assert(IsUnderPostmaster);
 
- /*
- * Lookup the requested database; return NULL if not found
- */
- return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-  (void *) &dbid,
-  HASH_FIND, NULL);
+ /* If not done for this transaction, take a snapshot of global stats */
+ pgstat_snapshot_global_stats();
+
+ /* the caller has no business with snapshot-local members */
+ return (PgStat_StatDBEntry *)
+ snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  * Support function for the SQL-callable pgstat* functions. Returns
- * the collected statistics for one table or NULL. NULL doesn't mean
+ * the activity statistics for one table or NULL. NULL doesn't mean
  * that the table doesn't exist, it is just not yet known by the
- * collector, so the caller is better off to report ZERO instead.
+ * activity statistics facilities, so the caller is better off to
+ * report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
- Oid dbid;
- PgStat_StatDBEntry *dbentry;
  PgStat_StatTabEntry *tabentry;
 
- /*
- * If not done for this transaction, read the statistics collector stats
- * file into some hash tables.
- */
- backend_read_statsfile();
-
- /*
- * Lookup our database, then look in its table hash table.
- */
- dbid = MyDatabaseId;
- dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- (void *) &dbid,
- HASH_FIND, NULL);
- if (dbentry != NULL && dbentry->tables != NULL)
- {
- tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-   (void *) &relid,
-   HASH_FIND, NULL);
- if (tabentry)
- return tabentry;
- }
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+ if (tabentry != NULL)
+ return tabentry;
 
  /*
  * If we didn't find it, maybe it's a shared table.
  */
- dbid = InvalidOid;
- dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- (void *) &dbid,
- HASH_FIND, NULL);
- if (dbentry != NULL && dbentry->tables != NULL)
- {
- tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-   (void *) &relid,
-   HASH_FIND, NULL);
- if (tabentry)
- return tabentry;
- }
-
- return NULL;
+ tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+ return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ * Find a table stats entry for a backend.  The returned entry is cached
+ * until transaction end or pgstat_clear_snapshot() is called.
+ * ----------
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+ Oid dboid = (shared ? InvalidOid : MyDatabaseId);
+
+ /* should be called from backends */
+ Assert(IsUnderPostmaster);
+
+ return (PgStat_StatTabEntry *)
+ snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ * Support function for index swapping.  Copy a portion of the relation's
+ * counters to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+ PgStat_StatTabEntry *tabentry;
+
+ /* No point fetching tabentry when dst is NULL */
+ if (!dst)
+ return;
+
+ tabentry = pgstat_fetch_stat_tabentry(relid);
+
+ if (!tabentry)
+ return;
+
+ dst->t_counts.t_numscans = tabentry->numscans;
+ dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+ dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+ dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+ dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2510,24 +2820,48 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
- PgStat_StatDBEntry *dbentry;
- PgStat_StatFuncEntry *funcentry = NULL;
-
- /* load the stats file if needed */
- backend_read_statsfile();
-
- /* Lookup our database, then find the requested function.  */
- dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
- if (dbentry != NULL && dbentry->functions != NULL)
- {
- funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
- (void *) &func_id,
- HASH_FIND, NULL);
- }
-
- return funcentry;
+ /* should be called from backends */
+ Assert(IsUnderPostmaster);
+
+ return (PgStat_StatFuncEntry *)
+ snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
 }
 
+/* ----------
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of the global stats if not done yet.  It is kept until
+ * the next call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+ MemoryContext oldcontext;
+
+ attach_shared_stats();
+
+ /* Nothing to do if already done */
+ if (global_snapshot_is_valid)
+ return;
+
+ oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+ LWLockAcquire(StatsLock, LW_SHARED);
+ memcpy(&snapshot_globalStats, shared_globalStats,
+   sizeof(PgStat_GlobalStats));
+
+ memcpy(&snapshot_archiverStats, shared_archiverStats,
+   sizeof(PgStat_ArchiverStats));
+ LWLockRelease(StatsLock);
+
+ global_snapshot_is_valid = true;
+
+ MemoryContextSwitchTo(oldcontext);
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2599,9 +2933,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
- backend_read_statsfile();
+ /* If not done for this transaction, take a stats snapshot */
+ pgstat_snapshot_global_stats();
 
- return &archiverStats;
+ return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2951,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
- backend_read_statsfile();
+ /* If not done for this transaction, take a stats snapshot */
+ pgstat_snapshot_global_stats();
 
- return &globalStats;
+ return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3168,8 @@ pgstat_initialize(void)
  MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
  }
 
- /* Set up a process-exit hook to clean up */
- on_shmem_exit(pgstat_beshutdown_hook, 0);
+ /* needs to be called before dsm shutdown */
+ before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3345,15 @@ pgstat_bestart(void)
  /* Update app name to current GUC setting */
  if (application_name)
  pgstat_report_appname(application_name);
+
+ /* attach shared database stats area */
+ attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3366,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
  /*
  * If we got as far as discovering our own database ID, we can report what
- * we did to the collector.  Otherwise, we'd be sending an invalid
+ * we did to the shared stats.  Otherwise, we'd be sending an invalid
  * database ID, so forget it.  (This means that accesses to pg_database
  * during failed backend starts might never get counted.)
  */
@@ -3044,6 +3383,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
  beentry->st_procpid = 0; /* mark invalid */
 
  PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+ detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3645,8 @@ pgstat_read_current_status(void)
 #endif
  int i;
 
- Assert(!pgStatRunningInCollector);
+ Assert(IsUnderPostmaster);
+
  if (localBackendStatusTable)
  return; /* already done */
 
@@ -3599,9 +3941,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
  case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
  event_name = "LogicalLauncherMain";
  break;
- case WAIT_EVENT_PGSTAT_MAIN:
- event_name = "PgStatMain";
- break;
  case WAIT_EVENT_RECOVERY_WAL_STREAM:
  event_name = "RecoveryWalStream";
  break;
@@ -4230,94 +4569,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- * Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
- hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- * Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
- int rc;
-
- if (pgStatSock == PGINVALID_SOCKET)
- return;
-
- ((PgStat_MsgHdr *) msg)->m_size = len;
-
- /* We'll retry after EINTR, but ignore all other failures */
- do
- {
- rc = send(pgStatSock, msg, len, 0);
- } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
- /* In debug builds, log send failures ... */
- if (rc < 0)
- elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- * Tell the collector about the WAL file that we successfully
- * archived or failed to archive.
+ * Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
- PgStat_MsgArchiver msg;
+ TimestampTz now = GetCurrentTimestamp();
 
- /*
- * Prepare and send the message
- */
- pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
- msg.m_failed = failed;
- StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
- msg.m_timestamp = GetCurrentTimestamp();
- pgstat_send(&msg, sizeof(msg));
+ if (failed)
+ {
+ /* Failed archival attempt */
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ ++shared_archiverStats->failed_count;
+ strlcpy(shared_archiverStats->last_failed_wal, xlog,
+   sizeof(shared_archiverStats->last_failed_wal));
+ shared_archiverStats->last_failed_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
+ else
+ {
+ /* Successful archival operation */
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ ++shared_archiverStats->archived_count;
+ strlcpy(shared_archiverStats->last_archived_wal, xlog,
+   sizeof(shared_archiverStats->last_archived_wal));
+ shared_archiverStats->last_archived_timestamp = now;
+ LWLockRelease(StatsLock);
+ }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- * Send bgwriter statistics to the collector
+ * Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
  /* We assume this initializes to zeroes */
- static const PgStat_MsgBgWriter all_zeroes;
+ static const PgStat_BgWriter all_zeroes;
+
+ PgStat_BgWriter *l = &BgWriterStats;
 
  /*
  * This function can be called even if nothing at all has happened. In
- * this case, avoid sending a completely empty message to the stats
- * collector.
+ * this case, avoid taking the lock for completely empty stats.
  */
- if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+ if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
  return;
 
- /*
- * Prepare and send the message
- */
- pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
- pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+ LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+ shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+ shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+ shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+ shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+ shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+ shared_globalStats->buf_written_clean += l->buf_written_clean;
+ shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+ shared_globalStats->buf_written_backend += l->buf_written_backend;
+ shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+ shared_globalStats->buf_alloc += l->buf_alloc;
+ LWLockRelease(StatsLock);
 
  /*
  * Clear out the statistics buffer, so it can be re-used.
@@ -4326,424 +4642,30 @@ pgstat_send_bgwriter(void)
 }
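The memcmp-against-a-zeroed-static test used here to skip the lock when nothing was accumulated is worth keeping; condensed into a standalone sketch (invented names, a counter standing in for the LWLock; note that comparing a struct with memcmp relies on there being no uninitialized padding, which holds for the padding-free struct below):

```c
#include <string.h>

typedef struct BgWriterCounts
{
	long		timed_checkpoints;
	long		buf_written_clean;
} BgWriterCounts;

static BgWriterCounts local_counts;		/* accumulated between flushes */
static BgWriterCounts shared_counts;	/* stands in for the shared struct */
static int	lock_acquisitions;			/* counts how often we "locked" */

static void
flush_bgwriter_counts(void)
{
	static const BgWriterCounts all_zeroes;		/* zero-initialized */

	/* Nothing accumulated: skip the lock entirely. */
	if (memcmp(&local_counts, &all_zeroes, sizeof(BgWriterCounts)) == 0)
		return;

	lock_acquisitions++;		/* a real flush would take the LWLock here */
	shared_counts.timed_checkpoints += local_counts.timed_checkpoints;
	shared_counts.buf_written_clean += local_counts.buf_written_clean;

	/* Reset the local buffer so it can be re-used. */
	memset(&local_counts, 0, sizeof(BgWriterCounts));
}
```

The early return matters because this is called on every reporting cycle, and most cycles have nothing to flush.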
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- * Start up the statistics collector process.  This is the body of the
- * postmaster child process.
- *
- * The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
- int len;
- PgStat_Msg msg;
- int wr;
-
- /*
- * Ignore all signals usually bound to some action in the postmaster,
- * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
- * support latch operations, because we only use a local latch.
- */
- pqsignal(SIGHUP, SignalHandlerForConfigReload);
- pqsignal(SIGINT, SIG_IGN);
- pqsignal(SIGTERM, SIG_IGN);
- pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
- pqsignal(SIGALRM, SIG_IGN);
- pqsignal(SIGPIPE, SIG_IGN);
- pqsignal(SIGUSR1, SIG_IGN);
- pqsignal(SIGUSR2, SIG_IGN);
- /* Reset some signals that are accepted by postmaster but not here */
- pqsignal(SIGCHLD, SIG_DFL);
- PG_SETMASK(&UnBlockSig);
-
- MyBackendType = B_STATS_COLLECTOR;
- init_ps_display(NULL);
-
- /*
- * Read in existing stats files or initialize the stats to zero.
- */
- pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
- /*
- * Loop to process messages until we get SIGQUIT or detect ungraceful
- * death of our parent postmaster.
- *
- * For performance reasons, we don't want to do ResetLatch/WaitLatch after
- * every message; instead, do that only after a recv() fails to obtain a
- * message.  (This effectively means that if backends are sending us stuff
- * like mad, we won't notice postmaster death until things slack off a
- * bit; which seems fine.) To do that, we have an inner loop that
- * iterates as long as recv() succeeds.  We do check ConfigReloadPending
- * inside the inner loop, which means that such interrupts will get
- * serviced but the latch won't get cleared until next time there is a
- * break in the action.
- */
- for (;;)
- {
- /* Clear any already-pending wakeups */
- ResetLatch(MyLatch);
-
- /*
- * Quit if we get SIGQUIT from the postmaster.
- */
- if (ShutdownRequestPending)
- break;
-
- /*
- * Inner loop iterates as long as we keep getting messages, or until
- * ShutdownRequestPending becomes set.
- */
- while (!ShutdownRequestPending)
- {
- /*
- * Reload configuration if we got SIGHUP from the postmaster.
- */
- if (ConfigReloadPending)
- {
- ConfigReloadPending = false;
- ProcessConfigFile(PGC_SIGHUP);
- }
-
- /*
- * Write the stats file(s) if a new request has arrived that is
- * not satisfied by existing file(s).
- */
- if (pgstat_write_statsfile_needed())
- pgstat_write_statsfiles(false, false);
-
- /*
- * Try to receive and process a message.  This will not block,
- * since the socket is set to non-blocking mode.
- *
- * XXX On Windows, we have to force pgwin32_recv to cooperate,
- * despite the previous use of pg_set_noblock() on the socket.
- * This is extremely broken and should be fixed someday.
- */
-#ifdef WIN32
- pgwin32_noblock = 1;
-#endif
-
- len = recv(pgStatSock, (char *) &msg,
-   sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
- pgwin32_noblock = 0;
-#endif
-
- if (len < 0)
- {
- if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
- break; /* out of inner loop */
- ereport(ERROR,
- (errcode_for_socket_access(),
- errmsg("could not read statistics message: %m")));
- }
-
- /*
- * We ignore messages that are smaller than our common header
- */
- if (len < sizeof(PgStat_MsgHdr))
- continue;
-
- /*
- * The received length must match the length in the header
- */
- if (msg.msg_hdr.m_size != len)
- continue;
-
- /*
- * O.K. - we accept this message.  Process it.
- */
- switch (msg.msg_hdr.m_type)
- {
- case PGSTAT_MTYPE_DUMMY:
- break;
-
- case PGSTAT_MTYPE_INQUIRY:
- pgstat_recv_inquiry(&msg.msg_inquiry, len);
- break;
-
- case PGSTAT_MTYPE_TABSTAT:
- pgstat_recv_tabstat(&msg.msg_tabstat, len);
- break;
-
- case PGSTAT_MTYPE_TABPURGE:
- pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
- break;
-
- case PGSTAT_MTYPE_DROPDB:
- pgstat_recv_dropdb(&msg.msg_dropdb, len);
- break;
-
- case PGSTAT_MTYPE_RESETCOUNTER:
- pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
- break;
-
- case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
- pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-   len);
- break;
-
- case PGSTAT_MTYPE_RESETSINGLECOUNTER:
- pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-   len);
- break;
-
- case PGSTAT_MTYPE_AUTOVAC_START:
- pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
- break;
-
- case PGSTAT_MTYPE_VACUUM:
- pgstat_recv_vacuum(&msg.msg_vacuum, len);
- break;
-
- case PGSTAT_MTYPE_ANALYZE:
- pgstat_recv_analyze(&msg.msg_analyze, len);
- break;
-
- case PGSTAT_MTYPE_ARCHIVER:
- pgstat_recv_archiver(&msg.msg_archiver, len);
- break;
-
- case PGSTAT_MTYPE_BGWRITER:
- pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
- break;
-
- case PGSTAT_MTYPE_FUNCSTAT:
- pgstat_recv_funcstat(&msg.msg_funcstat, len);
- break;
-
- case PGSTAT_MTYPE_FUNCPURGE:
- pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
- break;
-
- case PGSTAT_MTYPE_RECOVERYCONFLICT:
- pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
- len);
- break;
-
- case PGSTAT_MTYPE_DEADLOCK:
- pgstat_recv_deadlock(&msg.msg_deadlock, len);
- break;
-
- case PGSTAT_MTYPE_TEMPFILE:
- pgstat_recv_tempfile(&msg.msg_tempfile, len);
- break;
-
- case PGSTAT_MTYPE_CHECKSUMFAILURE:
- pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
- len);
- break;
-
- default:
- break;
- }
- } /* end of inner message-processing loop */
-
- /* Sleep until there's something to do */
-#ifndef WIN32
- wr = WaitLatchOrSocket(MyLatch,
-   WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-   pgStatSock, -1L,
-   WAIT_EVENT_PGSTAT_MAIN);
-#else
-
- /*
- * Windows, at least in its Windows Server 2003 R2 incarnation,
- * sometimes loses FD_READ events.  Waking up and retrying the recv()
- * fixes that, so don't sleep indefinitely.  This is a crock of the
- * first water, but until somebody wants to debug exactly what's
- * happening there, this is the best we can do.  The two-second
- * timeout matches our pre-9.2 behavior, and needs to be short enough
- * to not provoke "using stale statistics" complaints from
- * backend_read_statsfile.
- */
- wr = WaitLatchOrSocket(MyLatch,
-   WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-   pgStatSock,
-   2 * 1000L /* msec */ ,
-   WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
- /*
- * Emergency bailout if postmaster has died.  This is to avoid the
- * necessity for manual cleanup of all postmaster children.
- */
- if (wr & WL_POSTMASTER_DEATH)
- break;
- } /* end of outer loop */
-
- /*
- * Save the final stats to reuse at next startup.
- */
- pgstat_write_statsfiles(true, true);
-
- exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
- HASHCTL hash_ctl;
-
- dbentry->n_xact_commit = 0;
- dbentry->n_xact_rollback = 0;
- dbentry->n_blocks_fetched = 0;
- dbentry->n_blocks_hit = 0;
- dbentry->n_tuples_returned = 0;
- dbentry->n_tuples_fetched = 0;
- dbentry->n_tuples_inserted = 0;
- dbentry->n_tuples_updated = 0;
- dbentry->n_tuples_deleted = 0;
- dbentry->last_autovac_time = 0;
- dbentry->n_conflict_tablespace = 0;
- dbentry->n_conflict_lock = 0;
- dbentry->n_conflict_snapshot = 0;
- dbentry->n_conflict_bufferpin = 0;
- dbentry->n_conflict_startup_deadlock = 0;
- dbentry->n_temp_files = 0;
- dbentry->n_temp_bytes = 0;
- dbentry->n_deadlocks = 0;
- dbentry->n_checksum_failures = 0;
- dbentry->last_checksum_failure = 0;
- dbentry->n_block_read_time = 0;
- dbentry->n_block_write_time = 0;
-
- dbentry->stat_reset_timestamp = GetCurrentTimestamp();
- dbentry->stats_timestamp = 0;
-
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
- dbentry->tables = hash_create("Per-database table",
-  PGSTAT_TAB_HASH_SIZE,
-  &hash_ctl,
-  HASH_ELEM | HASH_BLOBS);
-
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
- dbentry->functions = hash_create("Per-database function",
- PGSTAT_FUNCTION_HASH_SIZE,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
- PgStat_StatDBEntry *result;
- bool found;
- HASHACTION action = (create ? HASH_ENTER : HASH_FIND);
-
- /* Lookup or create the hash table entry for this database */
- result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
- &databaseid,
- action, &found);
-
- if (!create && !found)
- return NULL;
-
- /*
- * If not found, initialize the new one.  This creates empty hash tables
- * for tables and functions, too.
- */
- if (!found)
- reset_dbentry_counters(result);
-
- return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
- PgStat_StatTabEntry *result;
- bool found;
- HASHACTION action = (create ? HASH_ENTER : HASH_FIND);
-
- /* Lookup or create the hash table entry for this table */
- result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
- &tableoid,
- action, &found);
-
- if (!create && !found)
- return NULL;
-
- /* If not found, initialize the new one. */
- if (!found)
- {
- result->numscans = 0;
- result->tuples_returned = 0;
- result->tuples_fetched = 0;
- result->tuples_inserted = 0;
- result->tuples_updated = 0;
- result->tuples_deleted = 0;
- result->tuples_hot_updated = 0;
- result->n_live_tuples = 0;
- result->n_dead_tuples = 0;
- result->changes_since_analyze = 0;
- result->blocks_fetched = 0;
- result->blocks_hit = 0;
- result->vacuum_timestamp = 0;
- result->vacuum_count = 0;
- result->autovac_vacuum_timestamp = 0;
- result->autovac_vacuum_count = 0;
- result->analyze_timestamp = 0;
- result->analyze_count = 0;
- result->autovac_analyze_timestamp = 0;
- result->autovac_analyze_count = 0;
- }
-
- return result;
-}
-
-
 /* ----------
  * pgstat_write_statsfiles() -
- * Write the global statistics file, as well as requested DB files.
- *
- * 'permanent' specifies writing to the permanent files not temporary ones.
- * When true (happens only when the collector is shutting down), also remove
- * the temporary files so that backends starting up under a new postmaster
- * can't read old data before the new collector is ready.
- *
- * When 'allDbs' is false, only the requested databases (listed in
- * pending_write_requests) will be written; otherwise, all databases
- * will be written.
+ * Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
- HASH_SEQ_STATUS hstat;
- PgStat_StatDBEntry *dbentry;
  FILE   *fpout;
  int32 format_id;
- const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+ const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
  int rc;
+ PgStatEnvelope **envlist;
+ PgStatEnvelope **penv;
+
+ /* Stats have not been initialized yet; just return. */
+ if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+ return;
 
  elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+ create_missing_dbentries();
+
  /*
  * Open the statistics temp file to write out the current values.
  */
@@ -4760,7 +4682,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  /*
  * Set the timestamp of the stats file.
  */
- globalStats.stats_timestamp = GetCurrentTimestamp();
+ shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
  /*
  * Write the file header --- currently just a format ID.
@@ -4772,32 +4694,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  /*
  * Write global stats struct
  */
- rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+ rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
 
  /*
  * Write archiver stats struct
  */
- rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+ rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
 
  /*
  * Walk through the database table.
  */
- hash_seq_init(&hstat, pgStatDBHash);
- while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+ envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+ for (penv = envlist; *penv != NULL; penv++)
  {
+ PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
  /*
  * Write out the table and function stats for this DB into the
  * appropriate per-DB stat file, if required.
  */
- if (allDbs || pgstat_db_requested(dbentry->databaseid))
- {
- /* Make DB's timestamp consistent with the global stats */
- dbentry->stats_timestamp = globalStats.stats_timestamp;
+ /* Make DB's timestamp consistent with the global stats */
+ dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
- pgstat_write_db_statsfile(dbentry, permanent);
- }
+ pgstat_write_database_stats(dbentry);
 
  /*
  * Write out the DB entry. We don't write the tables or functions
@@ -4808,6 +4729,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  (void) rc; /* we'll check for error with ferror */
  }
 
+ pfree(envlist);
+
  /*
  * No more output to be done. Close the temp file and replace the old
  * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4840,55 +4763,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  tmpfile, statfile)));
  unlink(tmpfile);
  }
-
- if (permanent)
- unlink(pgstat_stat_filename);
-
- /*
- * Now throw away the list of requests.  Note that requests sent after we
- * started the write are still waiting on the network socket.
- */
- list_free(pending_write_requests);
- pending_write_requests = NIL;
 }
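
A side note on the pattern above: pgstat_write_statsfiles() writes everything to a temp file, checks ferror() once at the end instead of after every fwrite(), and only then renames the temp file over the permanent name. Here is a minimal standalone sketch of that write-tmp-then-rename idiom; the file names and payload are hypothetical, only the pattern mirrors the patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static int
write_stats_atomically(const char *tmpfile, const char *statfile,
					   const void *data, size_t len)
{
	FILE	   *fpout = fopen(tmpfile, "wb");

	if (fpout == NULL)
		return -1;

	fwrite(data, len, 1, fpout);

	/* a single ferror() check replaces per-fwrite error checks */
	if (ferror(fpout))
	{
		fclose(fpout);
		remove(tmpfile);
		return -1;
	}
	if (fclose(fpout) != 0)
	{
		remove(tmpfile);
		return -1;
	}

	/* atomically replace the old file with the fully written one */
	if (rename(tmpfile, statfile) != 0)
	{
		remove(tmpfile);
		return -1;
	}
	return 0;
}
```

Readers never see a half-written stats file this way: they either find the old file or the complete new one.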
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
- char *filename, int len)
-{
- int printed;
-
- /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
- printed = snprintf(filename, len, "%s/db_%u.%s",
-   permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-   pgstat_stat_directory,
-   databaseid,
-   tempname ? "tmp" : "stat");
- if (printed >= len)
- elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- * Write the stat file for a single database.
- *
- * If writing to the permanent file (happens when the collector is
- * shutting down only), remove the temporary file so that backends
- * starting up under a new postmaster can't read the old data before
- * the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
+ PgStatEnvelope **envlist;
+ PgStatEnvelope **penv;
  FILE   *fpout;
  int32 format_id;
  Oid dbid = dbentry->databaseid;
@@ -4896,8 +4783,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  char tmpfile[MAXPGPATH];
  char statfile[MAXPGPATH];
 
- get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
- get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+ get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+ get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
  elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4924,24 +4811,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  /*
  * Walk through the database's access stats per table.
  */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+ for (penv = envlist; *penv != NULL; penv++)
  {
+ PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
  fputc('T', fpout);
  rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
  }
+ pfree(envlist);
 
  /*
  * Walk through the database's function stats table.
  */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+ for (penv = envlist; *penv != NULL; penv++)
  {
+ PgStat_StatFuncEntry *funcentry =
+ (PgStat_StatFuncEntry *) &(*penv)->body;
+
  fputc('F', fpout);
  rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
  (void) rc; /* we'll check for error with ferror */
  }
+ pfree(envlist);
 
  /*
  * No more output to be done. Close the temp file and replace the old
@@ -4975,94 +4869,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  tmpfile, statfile)));
  unlink(tmpfile);
  }
+}
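
For reference, the per-database file this function writes is a sequence of tagged records: a one-byte tag ('T' for a table entry, 'F' for a function entry) followed by a fixed-size struct, terminated by 'E'. A standalone sketch of that layout, using toy structs rather than the real PgStat_StatTabEntry/PgStat_StatFuncEntry:

```c
#include <assert.h>
#include <stdio.h>

typedef struct ToyTabEntry  { unsigned oid; long scans; } ToyTabEntry;
typedef struct ToyFuncEntry { unsigned oid; long calls; } ToyFuncEntry;

static void
write_db_file(FILE *fp, const ToyTabEntry *tab, const ToyFuncEntry *fn)
{
	fputc('T', fp);
	fwrite(tab, sizeof(*tab), 1, fp);
	fputc('F', fp);
	fwrite(fn, sizeof(*fn), 1, fp);
	fputc('E', fp);				/* end-of-file marker */
}

/* returns the number of records read, or -1 on a corrupted file */
static int
read_db_file(FILE *fp, ToyTabEntry *tab, ToyFuncEntry *fn)
{
	int			nread = 0;

	for (;;)
	{
		switch (fgetc(fp))
		{
			case 'T':
				if (fread(tab, sizeof(*tab), 1, fp) != 1)
					return -1;
				nread++;
				break;
			case 'F':
				if (fread(fn, sizeof(*fn), 1, fp) != 1)
					return -1;
				nread++;
				break;
			case 'E':
				return nread;
			default:
				return -1;	/* unknown tag: corrupted file */
		}
	}
}
```

The reader in pgstat_read_db_statsfile() below follows the same dispatch-on-tag shape, treating any unknown tag or short read as corruption.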
 
- if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  There may be the case where database entry is missing for the database
+ *  where object stats are recorded. This function creates such missing
+ *  dbentries so that so that all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+ dshash_seq_status hstat;
+ PgStatHashEntry *p;
+ HTAB   *oidhash;
+ HASHCTL ctl;
+ HASH_SEQ_STATUS scan;
+ Oid   *poid;
+
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(Oid);
+ ctl.hcxt = CurrentMemoryContext;
+ oidhash = hash_create("Temporary table of OIDs",
+  PGSTAT_TABLE_HASH_SIZE,
+  &ctl,
+  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ /* Collect OID from the shared stats hash */
+ dshash_seq_init(&hstat, pgStatSharedHash, false);
+ while ((p = dshash_seq_next(&hstat)) != NULL)
+ hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+ dshash_seq_term(&hstat);
+
+ /* Create any database entries that do not exist yet. */
+ hash_seq_init(&scan, oidhash);
+ while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+ (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+  false, init_dbentry, NULL);
+
+ hash_destroy(oidhash);
+}
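
The function above is a two-pass scan: first collect the distinct database OIDs appearing in any stats key, then create a dbentry for each. A sketch of the dedup step with the dshash scan replaced by a plain array (all names illustrative):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;

/* returns the number of distinct OIDs, written into out[] */
static size_t
collect_distinct_dbids(const Oid *keys, size_t nkeys, Oid *out)
{
	size_t		nout = 0;

	for (size_t i = 0; i < nkeys; i++)
	{
		size_t		j;

		for (j = 0; j < nout; j++)
			if (out[j] == keys[i])
				break;			/* already collected */
		if (j == nout)
			out[nout++] = keys[i];
	}
	return nout;
}
```

The patch uses a temporary HTAB keyed by OID for this instead of a linear scan, which is the right call for the shared hash's entry counts.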
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ * Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created when none exists yet,
+ *  and the function is called on the new envelope.  If found is not NULL,
+ *  it is set to true if an existing entry was found, false otherwise.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+   bool nowait, entry_initializer initfunc, bool *found)
+{
+ bool create = (initfunc != NULL);
+ PgStatHashEntry *shent;
+ PgStatEnvelope *shenv = NULL;
+ PgStatHashEntryKey key;
+ bool myfound;
+
+ Assert(type != PGSTAT_TYPE_ALL);
+
+ key.type = type;
+ key.databaseid = dbid;
+ key.objectid = objid;
+ shent = dshash_find_extended(pgStatSharedHash, &key,
+ create, nowait, create, &myfound);
+ if (shent)
  {
- get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+ if (create && !myfound)
+ {
+ /* Create new stats envelope. */
+ size_t envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+ dsa_pointer chunk = dsa_allocate0(area, envsize);
 
- elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
- unlink(statfile);
+ shenv = dsa_get_address(area, chunk);
+ shenv->type = type;
+ shenv->databaseid = dbid;
+ shenv->objectid = objid;
+ shenv->len = pgstat_entsize[type];
+ LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+ /*
+ * The dshash lock is released just below.  Call the initializer
+ * callback before the entry is exposed to other processes.
+ */
+ if (initfunc)
+ initfunc(shenv);
+
+ /* Link the new entry from the hash entry. */
+ shent->env = chunk;
+ }
+ else
+ shenv = dsa_get_address(area, shent->env);
+
+ dshash_release_lock(pgStatSharedHash, shent);
  }
+
+ if (found)
+ *found = myfound;
+
+ return shenv;
 }
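
get_stat_entry()'s contract — find-or-create keyed lookup, with the initializer run before the entry becomes visible, and *found reporting whether the entry pre-existed — can be sketched standalone like this, using a fixed array in place of dshash/DSA (the types are toy stand-ins for the shared envelope):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyEntry { unsigned key; long counter; bool used; } ToyEntry;
typedef void (*entry_initializer) (ToyEntry *);

#define NSLOTS 8
static ToyEntry slots[NSLOTS];

static ToyEntry *
toy_get_entry(unsigned key, entry_initializer initfunc, bool *found)
{
	ToyEntry   *freeslot = NULL;

	for (int i = 0; i < NSLOTS; i++)
	{
		if (slots[i].used && slots[i].key == key)
		{
			if (found)
				*found = true;
			return &slots[i];
		}
		if (!slots[i].used && freeslot == NULL)
			freeslot = &slots[i];
	}

	if (found)
		*found = false;
	if (initfunc == NULL || freeslot == NULL)
		return NULL;			/* lookup-only mode, or table full */

	/* initialize before "publishing" the entry, as the patch does */
	freeslot->key = key;
	initfunc(freeslot);
	freeslot->used = true;
	return freeslot;
}

static void
init_zero(ToyEntry *e)
{
	e->counter = 0;
}
```

In the real code the "publish" step is storing the dsa_pointer into the dshash entry while still holding the partition lock, which is what makes the initialize-then-expose ordering safe.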
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- * Reads in some existing statistics collector files and returns the
- * databases hash table that is the top level of the data.
+ * Reads in existing activity statistics files into the shared stats hash.
  *
- * If 'onlydb' is not InvalidOid, it means we only want data for that DB
- * plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- * table for all databases, but we don't bother even creating table/function
- * hash tables for other databases.
- *
- * 'permanent' specifies reading from the permanent files not temporary ones.
- * When true (happens only when the collector is starting up), remove the
- * files after reading; the in-memory status is now authoritative, and the
- * files would be out of date in case somebody else reads them.
- *
- * If a 'deep' read is requested, table/function stats are read, otherwise
- * the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+ PgStatEnvelope *env;
  PgStat_StatDBEntry *dbentry;
  PgStat_StatDBEntry dbbuf;
- HASHCTL hash_ctl;
- HTAB   *dbhash;
  FILE   *fpin;
  int32 format_id;
  bool found;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
- /*
- * The tables will live in pgStatLocalContext.
- */
- pgstat_setup_memcxt();
+ /* shouldn't be called from postmaster */
+ Assert(IsUnderPostmaster);
 
- /*
- * Create the DB hashtable
- */
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
- hash_ctl.hcxt = pgStatLocalContext;
- dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /*
- * Clear out global and archiver statistics so they start from zero in
- * case we can't load an existing statsfile.
- */
- memset(&globalStats, 0, sizeof(globalStats));
- memset(&archiverStats, 0, sizeof(archiverStats));
+ elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
  /*
  * Set the current timestamp (will be kept only in case we can't load an
  * existing statsfile).
  */
- globalStats.stat_reset_timestamp = GetCurrentTimestamp();
- archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+ shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+ shared_archiverStats->stat_reset_timestamp =
+ shared_globalStats->stat_reset_timestamp;
 
  /*
  * Try to open the stats file. If it doesn't exist, the backends simply
- * return zero for anything and the collector simply starts from scratch
- * with empty counters.
+ * return zero for anything and the activity statistics simply start
+ * from scratch with empty counters.
  *
- * ENOENT is a possibility if the stats collector is not running or has
- * not yet written the stats file the first time.  Any other failure
+ * ENOENT is a possibility if the stats file has not yet been written
+ * for the first time.  Any other failure
  * condition is suspicious.
  */
  if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
  {
  if (errno != ENOENT)
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errcode_for_file_access(),
  errmsg("could not open statistics file \"%s\": %m",
  statfile)));
- return dbhash;
+ return;
  }
 
  /*
@@ -5071,7 +5036,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
  format_id != PGSTAT_FILE_FORMAT_ID)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
  goto done;
  }
@@ -5079,38 +5044,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  /*
  * Read global stats struct
  */
- if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+ if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+ sizeof(*shared_globalStats))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
- memset(&globalStats, 0, sizeof(globalStats));
+ MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
  goto done;
  }
 
- /*
- * In the collector, disregard the timestamp we read from the permanent
- * stats file; we should be willing to write a temp stats file immediately
- * upon the first request from any backend.  This only matters if the old
- * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
- * an unusual scenario.
- */
- if (pgStatRunningInCollector)
- globalStats.stats_timestamp = 0;
-
  /*
  * Read archiver stats struct
  */
- if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+ if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+ sizeof(*shared_archiverStats))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
- memset(&archiverStats, 0, sizeof(archiverStats));
+ MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
  goto done;
  }
 
  /*
- * We found an existing collector stats file. Read it and put all the
- * hashtable entries into place.
+ * We found an existing activity statistics file. Read it and put all the
+ * hash table entries into place.
  */
  for (;;)
  {
@@ -5124,7 +5081,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
   fpin) != offsetof(PgStat_StatDBEntry, tables))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5133,76 +5090,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
  /*
  * Add to the DB hash
  */
- dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
- (void *) &dbbuf.databaseid,
- HASH_ENTER,
- &found);
+
+ env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+ InvalidOid,
+ false, init_dbentry, &found);
+ dbentry = (PgStat_StatDBEntry *) &env->body;
+
+ /* don't allow duplicate dbentries */
  if (found)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
- memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
- dbentry->tables = NULL;
- dbentry->functions = NULL;
-
- /*
- * In the collector, disregard the timestamp we read from the
- * permanent stats file; we should be willing to write a temp
- * stats file immediately upon the first request from any
- * backend.
- */
- if (pgStatRunningInCollector)
- dbentry->stats_timestamp = 0;
-
- /*
- * Don't create tables/functions hashtables for uninteresting
- * databases.
- */
- if (onlydb != InvalidOid)
- {
- if (dbbuf.databaseid != onlydb &&
- dbbuf.databaseid != InvalidOid)
- break;
- }
-
- memset(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
- hash_ctl.hcxt = pgStatLocalContext;
- dbentry->tables = hash_create("Per-database table",
-  PGSTAT_TAB_HASH_SIZE,
-  &hash_ctl,
-  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- hash_ctl.keysize = sizeof(Oid);
- hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
- hash_ctl.hcxt = pgStatLocalContext;
- dbentry->functions = hash_create("Per-database function",
- PGSTAT_FUNCTION_HASH_SIZE,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /*
- * If requested, read the data from the database-specific
- * file.  Otherwise we just leave the hashtables empty.
- */
- if (deep)
- pgstat_read_db_statsfile(dbentry->databaseid,
- dbentry->tables,
- dbentry->functions,
- permanent);
+ memcpy(dbentry, &dbbuf,
+   offsetof(PgStat_StatDBEntry, tables));
 
+ /* Read the data from the database-specific file. */
+ pgstat_read_db_statsfile(dbentry);
  break;
 
  case 'E':
  goto done;
 
  default:
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5212,59 +5126,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
  FreeFile(fpin);
 
- /* If requested to read the permanent file, also get rid of it. */
- if (permanent)
- {
- elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
- unlink(statfile);
- }
+ elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+ unlink(statfile);
 
- return dbhash;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- * Reads in the existing statistics collector file for the given database,
- * filling the passed-in tables and functions hash tables.
- *
- * As in pgstat_read_statsfiles, if the permanent file is requested, it is
- * removed after reading.
- *
- * Note: this code has the ability to skip storing per-table or per-function
- * data, if NULL is passed for the corresponding hashtable.  That's not used
- * at the moment though.
+ * Reads in the at-rest statistics file for one database and creates the
+ * corresponding shared statistics entries.  The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
- bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+ PgStatEnvelope *env;
  PgStat_StatTabEntry *tabentry;
  PgStat_StatTabEntry tabbuf;
  PgStat_StatFuncEntry funcbuf;
  PgStat_StatFuncEntry *funcentry;
  FILE   *fpin;
  int32 format_id;
  bool found;
  char statfile[MAXPGPATH];
 
- get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+ get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
  /*
  * Try to open the stats file. If it doesn't exist, the backends simply
- * return zero for anything and the collector simply starts from scratch
- * with empty counters.
+ * return zero for anything and the activity statistics simply start
+ * from scratch with empty counters.
  *
- * ENOENT is a possibility if the stats collector is not running or has
- * not yet written the stats file the first time.  Any other failure
+ * ENOENT is a possibility if the stats file has not yet been written
+ * for the first time.  Any other failure
  * condition is suspicious.
  */
  if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
  {
  if (errno != ENOENT)
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errcode_for_file_access(),
  errmsg("could not open statistics file \"%s\": %m",
  statfile)));
@@ -5277,14 +5182,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
  format_id != PGSTAT_FILE_FORMAT_ID)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"", statfile)));
  goto done;
  }
 
  /*
- * We found an existing collector stats file. Read it and put all the
- * hashtable entries into place.
+ * We found an existing activity statistics file. Read it and put all the
+ * hash table entries into place.
  */
  for (;;)
  {
@@ -5297,25 +5202,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
   fpin) != sizeof(PgStat_StatTabEntry))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
- /*
- * Skip if table data not wanted.
- */
- if (tabhash == NULL)
- break;
-
- tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-   (void *) &tabbuf.tableid,
-   HASH_ENTER, &found);
+ env = get_stat_entry(PGSTAT_TYPE_TABLE,
+ dbentry->databaseid, tabbuf.tableid,
+ false, init_tabentry, &found);
+ tabentry = (PgStat_StatTabEntry *) &env->body;
 
+ /* don't allow duplicate entries */
  if (found)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5331,25 +5232,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
   fpin) != sizeof(PgStat_StatFuncEntry))
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
  }
 
- /*
- * Skip if function data not wanted.
- */
- if (funchash == NULL)
- break;
-
- funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
- (void *) &funcbuf.functionid,
- HASH_ENTER, &found);
+ env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+ funcbuf.functionid,
+ false, init_funcentry, &found);
+ funcentry = (PgStat_StatFuncEntry *) &env->body;
 
  if (found)
  {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5365,7 +5261,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  goto done;
 
  default:
- ereport(pgStatRunningInCollector ? LOG : WARNING,
+ ereport(LOG,
  (errmsg("corrupted statistics file \"%s\"",
  statfile)));
  goto done;
@@ -5373,292 +5269,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
  }
 
 done:
- FreeFile(fpin);
-
- if (permanent)
- {
- elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
- unlink(statfile);
- }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- * Attempt to determine the timestamp of the last db statfile write.
- * Returns true if successful; the timestamp is stored in *ts.
- *
- * This needs to be careful about handling databases for which no stats file
- * exists, such as databases without a stat entry or those not yet written:
- *
- * - if there's a database entry in the global file, return the corresponding
- * stats_timestamp value.
- *
- * - if there's no db stat entry (e.g. for a new or inactive database),
- * there's no stats_timestamp value, but also nothing to write so we return
- * the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-   TimestampTz *ts)
-{
- PgStat_StatDBEntry dbentry;
- PgStat_GlobalStats myGlobalStats;
- PgStat_ArchiverStats myArchiverStats;
- FILE   *fpin;
- int32 format_id;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
- /*
- * Try to open the stats file.  As above, anything but ENOENT is worthy of
- * complaining about.
- */
- if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
- {
- if (errno != ENOENT)
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errcode_for_file_access(),
- errmsg("could not open statistics file \"%s\": %m",
- statfile)));
- return false;
- }
-
- /*
- * Verify it's of the expected format.
- */
- if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
- format_id != PGSTAT_FILE_FORMAT_ID)
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
- FreeFile(fpin);
- return false;
- }
-
- /*
- * Read global stats struct
- */
- if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-  fpin) != sizeof(myGlobalStats))
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
- FreeFile(fpin);
- return false;
- }
-
- /*
- * Read archiver stats struct
- */
- if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-  fpin) != sizeof(myArchiverStats))
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
- FreeFile(fpin);
- return false;
- }
-
- /* By default, we're going to return the timestamp of the global file. */
- *ts = myGlobalStats.stats_timestamp;
-
- /*
- * We found an existing collector stats file.  Read it and look for a
- * record for the requested database.  If found, use its timestamp.
- */
- for (;;)
- {
- switch (fgetc(fpin))
- {
- /*
- * 'D' A PgStat_StatDBEntry struct describing a database
- * follows.
- */
- case 'D':
- if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-  fpin) != offsetof(PgStat_StatDBEntry, tables))
- {
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"",
- statfile)));
- goto done;
- }
-
- /*
- * If this is the DB we're looking for, save its timestamp and
- * we're done.
- */
- if (dbentry.databaseid == databaseid)
- {
- *ts = dbentry.stats_timestamp;
- goto done;
- }
-
- break;
-
- case 'E':
- goto done;
-
- default:
- ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"",
- statfile)));
- goto done;
- }
- }
+ if (tabhash)
+ dshash_detach(tabhash);
+ if (funchash)
+ dshash_detach(funchash);
 
-done:
  FreeFile(fpin);
- return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
- TimestampTz min_ts = 0;
- TimestampTz ref_ts = 0;
- Oid inquiry_db;
- int count;
-
- /* already read it? */
- if (pgStatDBHash)
- return;
- Assert(!pgStatRunningInCollector);
-
- /*
- * In a normal backend, we check staleness of the data for our own DB, and
- * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
- * check staleness of the shared-catalog data, and send InvalidOid in
- * inquiry messages so as not to force writing unnecessary data.
- */
- if (IsAutoVacuumLauncherProcess())
- inquiry_db = InvalidOid;
- else
- inquiry_db = MyDatabaseId;
-
- /*
- * Loop until fresh enough stats file is available or we ran out of time.
- * The stats inquiry message is sent repeatedly in case collector drops
- * it; but not every single time, as that just swamps the collector.
- */
- for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
- {
- bool ok;
- TimestampTz file_ts = 0;
- TimestampTz cur_ts;
-
- CHECK_FOR_INTERRUPTS();
-
- ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
- cur_ts = GetCurrentTimestamp();
- /* Calculate min acceptable timestamp, if we didn't already */
- if (count == 0 || cur_ts < ref_ts)
- {
- /*
- * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
- * msec before now.  This indirectly ensures that the collector
- * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
- * an autovacuum worker, however, we want a lower delay to avoid
- * using stale data, so we use PGSTAT_RETRY_DELAY (since the
- * number of workers is low, this shouldn't be a problem).
- *
- * We don't recompute min_ts after sleeping, except in the
- * unlikely case that cur_ts went backwards.  So we might end up
- * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
- * practice that shouldn't happen, though, as long as the sleep
- * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
- * tell the collector that our cutoff time is less than what we'd
- * actually accept.
- */
- ref_ts = cur_ts;
- if (IsAutoVacuumWorkerProcess())
- min_ts = TimestampTzPlusMilliseconds(ref_ts,
- -PGSTAT_RETRY_DELAY);
- else
- min_ts = TimestampTzPlusMilliseconds(ref_ts,
- -PGSTAT_STAT_INTERVAL);
- }
-
- /*
- * If the file timestamp is actually newer than cur_ts, we must have
- * had a clock glitch (system time went backwards) or there is clock
- * skew between our processor and the stats collector's processor.
- * Accept the file, but send an inquiry message anyway to make
- * pgstat_recv_inquiry do a sanity check on the collector's time.
- */
- if (ok && file_ts > cur_ts)
- {
- /*
- * A small amount of clock skew between processors isn't terribly
- * surprising, but a large difference is worth logging.  We
- * arbitrarily define "large" as 1000 msec.
- */
- if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
- {
- char   *filetime;
- char   *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- filetime = pstrdup(timestamptz_to_str(file_ts));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "stats collector's time %s is later than backend local time %s",
- filetime, mytime);
- pfree(filetime);
- pfree(mytime);
- }
-
- pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
- break;
- }
-
- /* Normal acceptance case: file is not older than cutoff time */
- if (ok && file_ts >= min_ts)
- break;
-
- /* Not there or too old, so kick the collector and wait a bit */
- if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
- pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
- }
-
- if (count >= PGSTAT_POLL_LOOP_COUNT)
- ereport(LOG,
- (errmsg("using stale statistics instead of current ones "
- "because stats collector is not responding")));
-
- /*
- * Autovacuum launcher wants stats about all databases, but a shallow read
- * is sufficient.  Regular backends want a deep read for just the tables
- * they can see (MyDatabaseId + shared catalogs).
- */
- if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
- else
- pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- * Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
- if (!pgStatLocalContext)
- pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-   "Statistics snapshot",
-   ALLOCSET_SMALL_SIZES);
+ elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+ unlink(statfile);
 }
 
 
@@ -5677,741 +5296,25 @@ pgstat_clear_snapshot(void)
 {
  /* Release memory, if any was allocated */
  if (pgStatLocalContext)
+ {
  MemoryContextDelete(pgStatLocalContext);
 
- /* Reset variables */
- pgStatLocalContext = NULL;
- pgStatDBHash = NULL;
- localBackendStatusTable = NULL;
- localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- * Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
- /*
- * If there's already a write request for this DB, there's nothing to do.
- *
- * Note that if a request is found, we return early and skip the below
- * check for clock skew.  This is okay, since the only way for a DB
- * request to be present in the list is that we have been here since the
- * last write round.  It seems sufficient to check for clock skew once per
- * write round.
- */
- if (list_member_oid(pending_write_requests, msg->databaseid))
- return;
-
- /*
- * Check to see if we last wrote this database at a time >= the requested
- * cutoff time.  If so, this is a stale request that was generated before
- * we updated the DB file, and we don't need to do so again.
- *
- * If the requestor's local clock time is older than stats_timestamp, we
- * should suspect a clock glitch, ie system time going backwards; though
- * the more likely explanation is just delayed message receipt.  It is
- * worth expending a GetCurrentTimestamp call to be sure, since a large
- * retreat in the system clock reading could otherwise cause us to neglect
- * to update the stats file for a long time.
- */
- dbentry = pgstat_get_db_entry(msg->databaseid, false);
- if (dbentry == NULL)
- {
- /*
- * We have no data for this DB.  Enter a write request anyway so that
- * the global stats will get updated.  This is needed to prevent
- * backend_read_statsfile from waiting for data that we cannot supply,
- * in the case of a new DB that nobody has yet reported any stats for.
- * See the behavior of pgstat_read_db_statsfile_timestamp.
- */
+ /* Reset variables */
+ pgStatLocalContext = NULL;
+ localBackendStatusTable = NULL;
+ localNumBackends = 0;
  }
- else if (msg->clock_time < dbentry->stats_timestamp)
- {
- TimestampTz cur_ts = GetCurrentTimestamp();
-
- if (cur_ts < dbentry->stats_timestamp)
- {
- /*
- * Sure enough, time went backwards.  Force a new stats file write
- * to get back in sync; but first, log a complaint.
- */
- char   *writetime;
- char   *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG,
- "stats_timestamp %s is later than collector's time %s for database %u",
- writetime, mytime, dbentry->databaseid);
- pfree(writetime);
- pfree(mytime);
- }
- else
- {
- /*
- * Nope, it's just an old request.  Assuming msg's clock_time is
- * >= its cutoff_time, it must be stale, so we can ignore it.
- */
- return;
- }
- }
- else if (msg->cutoff_time <= dbentry->stats_timestamp)
- {
- /* Stale request, ignore it */
- return;
- }
-
- /*
- * We need to write this DB, so create a request.
- */
- pending_write_requests = lappend_oid(pending_write_requests,
- msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- * Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- int i;
- bool found;
 
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- /*
- * Update database-wide stats.
- */
- dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
- dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
- dbentry->n_block_read_time += msg->m_block_read_time;
- dbentry->n_block_write_time += msg->m_block_write_time;
-
- /*
- * Process all table entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++)
+ if (pgStatSnapshotContext)
  {
- PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
- tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-   (void *) &(tabmsg->t_id),
-   HASH_ENTER, &found);
-
- if (!found)
- {
- /*
- * If it's a new table entry, initialize counters to the values we
- * just got.
- */
- tabentry->numscans = tabmsg->t_counts.t_numscans;
- tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
- tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
- tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
- tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
- tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
- tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
- tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
- tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
- tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
- tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
- tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
+ MemoryContextReset(pgStatSnapshotContext);
 
- tabentry->vacuum_timestamp = 0;
- tabentry->vacuum_count = 0;
- tabentry->autovac_vacuum_timestamp = 0;
- tabentry->autovac_vacuum_count = 0;
- tabentry->analyze_timestamp = 0;
- tabentry->analyze_count = 0;
- tabentry->autovac_analyze_timestamp = 0;
- tabentry->autovac_analyze_count = 0;
- }
- else
- {
- /*
- * Otherwise add the values to the existing entry.
- */
- tabentry->numscans += tabmsg->t_counts.t_numscans;
- tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
- tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
- tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
- tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
- tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
- tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
- /* If table was truncated, first reset the live/dead counters */
- if (tabmsg->t_counts.t_truncated)
- {
- tabentry->n_live_tuples = 0;
- tabentry->n_dead_tuples = 0;
- }
- tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
- tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
- tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
- tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
- tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
- }
-
- /* Clamp n_live_tuples in case of negative delta_live_tuples */
- tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
- /* Likewise for n_dead_tuples */
- tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
- /*
- * Add per-table stats to the per-database entry, too.
- */
- dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
- dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
- dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
- dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
- dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
- dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
- dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
+ /* Reset variables that pointed to the context */
+ global_snapshot_is_valid = false;
+ pgStatSnapshotHash = NULL;
  }
 }
 
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- * Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- int i;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- /*
- * No need to purge if we don't even know the database.
- */
- if (!dbentry || !dbentry->tables)
- return;
-
- /*
- * Process all table entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++)
- {
- /* Remove from hashtable if present; we don't care if it's not. */
- (void) hash_search(dbentry->tables,
-   (void *) &(msg->m_tableid[i]),
-   HASH_REMOVE, NULL);
- }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- * Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
- Oid dbid = msg->m_databaseid;
- PgStat_StatDBEntry *dbentry;
-
- /*
- * Lookup the database in the hashtable.
- */
- dbentry = pgstat_get_db_entry(dbid, false);
-
- /*
- * If found, remove it (along with the db statfile).
- */
- if (dbentry)
- {
- char statfile[MAXPGPATH];
-
- get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
- elog(DEBUG2, "removing stats file \"%s\"", statfile);
- unlink(statfile);
-
- if (dbentry->tables != NULL)
- hash_destroy(dbentry->tables);
- if (dbentry->functions != NULL)
- hash_destroy(dbentry->functions);
-
- if (hash_search(pgStatDBHash,
- (void *) &dbid,
- HASH_REMOVE, NULL) == NULL)
- ereport(ERROR,
- (errmsg("database hash table corrupted during cleanup --- abort")));
- }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- * Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- /*
- * Lookup the database in the hashtable.  Nothing to do if not there.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- if (!dbentry)
- return;
-
- /*
- * We simply throw away all the database's table entries by recreating a
- * new hash table for them.
- */
- if (dbentry->tables != NULL)
- hash_destroy(dbentry->tables);
- if (dbentry->functions != NULL)
- hash_destroy(dbentry->functions);
-
- dbentry->tables = NULL;
- dbentry->functions = NULL;
-
- /*
- * Reset database-level stats, too.  This creates empty hash tables for
- * tables and functions.
- */
- reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- * Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
- if (msg->m_resettarget == RESET_BGWRITER)
- {
- /* Reset the global background writer statistics for the cluster. */
- memset(&globalStats, 0, sizeof(globalStats));
- globalStats.stat_reset_timestamp = GetCurrentTimestamp();
- }
- else if (msg->m_resettarget == RESET_ARCHIVER)
- {
- /* Reset the archiver statistics for the cluster. */
- memset(&archiverStats, 0, sizeof(archiverStats));
- archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
- }
-
- /*
- * Presumably the sender of this message validated the target, don't
- * complain here if it's not valid
- */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- * Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- if (!dbentry)
- return;
-
- /* Set the reset timestamp for the whole database */
- dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
- /* Remove object if it exists, ignore it if not */
- if (msg->m_resettype == RESET_TABLE)
- (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-   HASH_REMOVE, NULL);
- else if (msg->m_resettype == RESET_FUNCTION)
- (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-   HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- * Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- /*
- * Store the last autovacuum time in the database's hashtable entry.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- * Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
-
- /*
- * Store the data in the table's hashtable entry.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
- tabentry->n_live_tuples = msg->m_live_tuples;
- tabentry->n_dead_tuples = msg->m_dead_tuples;
-
- if (msg->m_autovacuum)
- {
- tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
- tabentry->autovac_vacuum_count++;
- }
- else
- {
- tabentry->vacuum_timestamp = msg->m_vacuumtime;
- tabentry->vacuum_count++;
- }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- * Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
-
- /*
- * Store the data in the table's hashtable entry.
- */
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
- tabentry->n_live_tuples = msg->m_live_tuples;
- tabentry->n_dead_tuples = msg->m_dead_tuples;
-
- /*
- * If commanded, reset changes_since_analyze to zero.  This forgets any
- * changes that were committed while the ANALYZE was in progress, but we
- * have no good way to estimate how many of those there were.
- */
- if (msg->m_resetcounter)
- tabentry->changes_since_analyze = 0;
-
- if (msg->m_autovacuum)
- {
- tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
- tabentry->autovac_analyze_count++;
- }
- else
- {
- tabentry->analyze_timestamp = msg->m_analyzetime;
- tabentry->analyze_count++;
- }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- * Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
- if (msg->m_failed)
- {
- /* Failed archival attempt */
- ++archiverStats.failed_count;
- memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-   sizeof(archiverStats.last_failed_wal));
- archiverStats.last_failed_timestamp = msg->m_timestamp;
- }
- else
- {
- /* Successful archival operation */
- ++archiverStats.archived_count;
- memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-   sizeof(archiverStats.last_archived_wal));
- archiverStats.last_archived_timestamp = msg->m_timestamp;
- }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- * Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
- globalStats.timed_checkpoints += msg->m_timed_checkpoints;
- globalStats.requested_checkpoints += msg->m_requested_checkpoints;
- globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
- globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
- globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
- globalStats.buf_written_clean += msg->m_buf_written_clean;
- globalStats.maxwritten_clean += msg->m_maxwritten_clean;
- globalStats.buf_written_backend += msg->m_buf_written_backend;
- globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
- globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- * Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- switch (msg->m_reason)
- {
- case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
- /*
- * Since we drop the information about the database as soon as it
- * replicates, there is no point in counting these conflicts.
- */
- break;
- case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
- dbentry->n_conflict_tablespace++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_LOCK:
- dbentry->n_conflict_lock++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
- dbentry->n_conflict_snapshot++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
- dbentry->n_conflict_bufferpin++;
- break;
- case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
- dbentry->n_conflict_startup_deadlock++;
- break;
- }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- * Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- * Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->n_checksum_failures += msg->m_failurecount;
- dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- * Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- dbentry->n_temp_bytes += msg->m_filesize;
- dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- * Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
- PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
- PgStat_StatDBEntry *dbentry;
- PgStat_StatFuncEntry *funcentry;
- int i;
- bool found;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
- /*
- * Process all function entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++, funcmsg++)
- {
- funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
- (void *) &(funcmsg->f_id),
- HASH_ENTER, &found);
-
- if (!found)
- {
- /*
- * If it's a new function entry, initialize counters to the values
- * we just got.
- */
- funcentry->f_numcalls = funcmsg->f_numcalls;
- funcentry->f_total_time = funcmsg->f_total_time;
- funcentry->f_self_time = funcmsg->f_self_time;
- }
- else
- {
- /*
- * Otherwise add the values to the existing entry.
- */
- funcentry->f_numcalls += funcmsg->f_numcalls;
- funcentry->f_total_time += funcmsg->f_total_time;
- funcentry->f_self_time += funcmsg->f_self_time;
- }
- }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- * Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
- PgStat_StatDBEntry *dbentry;
- int i;
-
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
- /*
- * No need to purge if we don't even know the database.
- */
- if (!dbentry || !dbentry->functions)
- return;
-
- /*
- * Process all function entries in the message.
- */
- for (i = 0; i < msg->m_nentries; i++)
- {
- /* Remove from hashtable if present; we don't care if it's not. */
- (void) hash_search(dbentry->functions,
-   (void *) &(msg->m_functionid[i]),
-   HASH_REMOVE, NULL);
- }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- * Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
- if (pending_write_requests != NIL)
- return true;
-
- /* Everything was written recently */
- return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- * Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
- /*
- * If any requests are outstanding at all, we should write the stats for
- * shared catalogs (the "database" with OID 0).  This ensures that
- * backends will see up-to-date stats for shared catalogs, even though
- * they send inquiry messages mentioning only their own DB.
- */
- if (databaseid == InvalidOid && pending_write_requests != NIL)
- return true;
-
- /* Search to see if there's an open request to write this database. */
- if (list_member_oid(pending_write_requests, databaseid))
- return true;
-
- return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fab4a9dd51..d418fe3bd0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
  WalReceiverPID = 0,
  AutoVacPID = 0,
  PgArchPID = 0,
- PgStatPID = 0,
  SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
  PGPROC   *AuxiliaryProcs;
  PGPROC   *PreparedXactProcs;
  PMSignalData *PMSignalState;
- InheritableSocket pgStatSock;
  pid_t PostmasterPid;
  TimestampTz PgStartTime;
  TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
  */
  RemovePgTempFiles();
 
- /*
- * Initialize stats collection subsystem (this does NOT start the
- * collector process!)
- */
- pgstat_init();
-
  /*
  * Initialize the autovacuum subsystem (again, no process start yet)
  */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
  start_autovac_launcher = false; /* signal processed */
  }
 
- /* If we have lost the stats collector, try to start a new one */
- if (PgStatPID == 0 &&
- (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
- PgStatPID = pgstat_start();
-
  /* If we have lost the archiver, try to start a new one. */
  if (PgArchPID == 0 && PgArchStartupAllowed())
  PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
  signal_child(PgArchPID, SIGHUP);
  if (SysLoggerPID != 0)
  signal_child(SysLoggerPID, SIGHUP);
- if (PgStatPID != 0)
- signal_child(PgStatPID, SIGHUP);
 
  /* Reload authentication config files too */
  if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
  AutoVacPID = StartAutoVacLauncher();
  if (PgArchStartupAllowed() && PgArchPID == 0)
  PgArchPID = StartArchiver();
- if (PgStatPID == 0)
- PgStatPID = pgstat_start();
 
  /* workers may be scheduled to start now */
  maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
  SignalChildren(SIGUSR2);
 
  pmState = PM_SHUTDOWN_2;
-
- /*
- * We can also shut down the stats collector now; there's
- * nothing left for it to do.
- */
- if (PgStatPID != 0)
- signal_child(PgStatPID, SIGQUIT);
  }
  else
  {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
  continue;
  }
 
- /*
- * Was it the statistics collector?  If so, just try to start a new
- * one; no need to force reset of the rest of the system.  (If fail,
- * we'll try again in future cycles of the main loop.)
- */
- if (pid == PgStatPID)
- {
- PgStatPID = 0;
- if (!EXIT_STATUS_0(exitstatus))
- LogChildExit(LOG, _("statistics collector process"),
- pid, exitstatus);
- if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
- PgStatPID = pgstat_start();
- continue;
- }
-
  /* Was it the system logger?  If so, try to start a new one */
  if (pid == SysLoggerPID)
  {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
  signal_child(PgArchPID, SIGQUIT);
  }
 
- /*
- * Force a power-cycle of the pgstat process too.  (This isn't absolutely
- * necessary, but it seems like a good idea for robustness, and it
- * simplifies the state-machine logic in the case where a shutdown request
- * arrives during crash processing.)
- */
- if (PgStatPID != 0 && take_action)
- {
- ereport(DEBUG2,
- (errmsg_internal("sending %s to process %d",
- "SIGQUIT",
- (int) PgStatPID)));
- signal_child(PgStatPID, SIGQUIT);
- allow_immediate_pgstat_restart();
- }
-
  /* We do NOT restart the syslogger */
 
  if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
  SignalChildren(SIGQUIT);
  if (PgArchPID != 0)
  signal_child(PgArchPID, SIGQUIT);
- if (PgStatPID != 0)
- signal_child(PgStatPID, SIGQUIT);
  }
  }
  }
@@ -3928,8 +3870,7 @@ PostmasterStateMachine(void)
  * normal state transition leading up to PM_WAIT_DEAD_END, or during
  * FatalError processing.
  */
- if (dlist_is_empty(&BackendList) &&
- PgArchPID == 0 && PgStatPID == 0)
+ if (dlist_is_empty(&BackendList) && PgArchPID == 0)
  {
  /* These other guys should be dead already */
  Assert(StartupPID == 0);
@@ -4130,8 +4071,6 @@ TerminateChildren(int signal)
  signal_child(AutoVacPID, signal);
  if (PgArchPID != 0)
  signal_child(PgArchPID, signal);
- if (PgStatPID != 0)
- signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5048,6 @@ SubPostmasterMain(int argc, char *argv[])
 
  StartBackgroundWorker();
  }
- if (strcmp(argv[1], "--forkarch") == 0)
- {
- /* Do not want to attach to shared memory */
-
- PgArchiverMain(argc, argv); /* does not return */
- }
- if (strcmp(argv[1], "--forkcol") == 0)
- {
- /* Do not want to attach to shared memory */
-
- PgstatCollectorMain(argc, argv); /* does not return */
- }
  if (strcmp(argv[1], "--forklog") == 0)
  {
  /* Do not want to attach to shared memory */
@@ -5239,12 +5166,6 @@ sigusr1_handler(SIGNAL_ARGS)
  if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
  pmState == PM_RECOVERY && Shutdown == NoShutdown)
  {
- /*
- * Likewise, start other special children as needed.
- */
- Assert(PgStatPID == 0);
- PgStatPID = pgstat_start();
-
  ereport(LOG,
  (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6060,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6115,6 @@ save_backend_variables(BackendParameters *param, Port *port,
  param->AuxiliaryProcs = AuxiliaryProcs;
  param->PreparedXactProcs = PreparedXactProcs;
  param->PMSignalState = PMSignalState;
- if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
- return false;
 
  param->PostmasterPid = PostmasterPid;
  param->PgStartTime = PgStartTime;
@@ -6431,7 +6349,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
  AuxiliaryProcs = param->AuxiliaryProcs;
  PreparedXactProcs = param->PreparedXactProcs;
  PMSignalState = param->PMSignalState;
- read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
  PostmasterPid = param->PostmasterPid;
  PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..26414dadb2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1947,7 +1947,7 @@ BufferSync(int flags)
  if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
  {
  TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
- BgWriterStats.m_buf_written_checkpoints++;
+ BgWriterStats.buf_written_checkpoints++;
  num_written++;
  }
  }
@@ -2057,7 +2057,7 @@ BgBufferSync(WritebackContext *wb_context)
  strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
  /* Report buffer alloc counts to pgstat */
- BgWriterStats.m_buf_alloc += recent_alloc;
+ BgWriterStats.buf_alloc += recent_alloc;
 
  /*
  * If we're not running the LRU scan, just stop after doing the stats
@@ -2247,7 +2247,7 @@ BgBufferSync(WritebackContext *wb_context)
  reusable_buffers++;
  if (++num_written >= bgwriter_lru_maxpages)
  {
- BgWriterStats.m_maxwritten_clean++;
+ BgWriterStats.maxwritten_clean++;
  break;
  }
  }
@@ -2255,7 +2255,7 @@ BgBufferSync(WritebackContext *wb_context)
  reusable_buffers++;
  }
 
- BgWriterStats.m_buf_written_clean += num_written;
+ BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
  elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
  size = add_size(size, BTreeShmemSize());
  size = add_size(size, SyncScanShmemSize());
  size = add_size(size, AsyncShmemSize());
+ size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
  size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
  BTreeShmemInit();
  SyncScanShmemInit();
  AsyncShmemInit();
+ StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
  LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
  LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
  LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+ LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
  /* Register named tranches. */
  for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock 41
 OldSnapshotTimeMapLock 42
 LogicalRepWorkerLock 43
 CLogTruncationLock 44
+StatsLock 45
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 00c77b66c7..e2998f965e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3189,6 +3189,12 @@ ProcessInterrupts(void)
 
  if (ParallelMessagePending)
  HandleParallelMessages();
+
+ if (IdleStatsUpdateTimeoutPending)
+ {
+ IdleStatsUpdateTimeoutPending = false;
+ pgstat_report_stat(true);
+ }
 }
 
 
@@ -3763,6 +3769,7 @@ PostgresMain(int argc, char *argv[],
  sigjmp_buf local_sigjmp_buf;
  volatile bool send_ready_for_query = true;
  bool disable_idle_in_transaction_timeout = false;
+ bool disable_idle_stats_update_timeout = false;
 
  /* Initialize startup process environment if necessary. */
  if (!IsUnderPostmaster)
@@ -4201,6 +4208,8 @@ PostgresMain(int argc, char *argv[],
  }
  else
  {
+ long stats_timeout;
+
  /* Send out notify signals and transmit self-notifies */
  ProcessCompletedNotifies();
 
@@ -4213,8 +4222,13 @@ PostgresMain(int argc, char *argv[],
  if (notifyInterruptPending)
  ProcessNotifyInterrupt();
 
- pgstat_report_stat(false);
-
+ stats_timeout = pgstat_report_stat(false);
+ if (stats_timeout > 0)
+ {
+ disable_idle_stats_update_timeout = true;
+ enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+ stats_timeout);
+ }
  set_ps_display("idle");
  pgstat_report_activity(STATE_IDLE, NULL);
  }
@@ -4249,7 +4263,7 @@ PostgresMain(int argc, char *argv[],
  DoingCommandRead = false;
 
  /*
- * (5) turn off the idle-in-transaction timeout
+ * (5) turn off the idle-in-transaction timeout and stats update timeout
  */
  if (disable_idle_in_transaction_timeout)
  {
@@ -4257,6 +4271,12 @@ PostgresMain(int argc, char *argv[],
  disable_idle_in_transaction_timeout = false;
  }
 
+ if (disable_idle_stats_update_timeout)
+ {
+ disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+ disable_idle_stats_update_timeout = false;
+ }
+
  /*
  * (6) check for any other interesting events that happened while we
  * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index cea01534a5..a1304dc3ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1244,7 +1241,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_xact_commit);
+ result = (int64) (dbentry->counts.n_xact_commit);
 
  PG_RETURN_INT64(result);
 }
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_xact_rollback);
+ result = (int64) (dbentry->counts.n_xact_rollback);
 
  PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_blocks_fetched);
+ result = (int64) (dbentry->counts.n_blocks_fetched);
 
  PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_blocks_hit);
+ result = (int64) (dbentry->counts.n_blocks_hit);
 
  PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_returned);
+ result = (int64) (dbentry->counts.n_tuples_returned);
 
  PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_fetched);
+ result = (int64) (dbentry->counts.n_tuples_fetched);
 
  PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_inserted);
+ result = (int64) (dbentry->counts.n_tuples_inserted);
 
  PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_updated);
+ result = (int64) (dbentry->counts.n_tuples_updated);
 
  PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_tuples_deleted);
+ result = (int64) (dbentry->counts.n_tuples_deleted);
 
  PG_RETURN_INT64(result);
 }
@@ -1405,7 +1402,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = dbentry->n_temp_files;
+ result = dbentry->counts.n_temp_files;
 
  PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = dbentry->n_temp_bytes;
+ result = dbentry->counts.n_temp_bytes;
 
  PG_RETURN_INT64(result);
 }
@@ -1436,7 +1433,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_tablespace);
+ result = (int64) (dbentry->counts.n_conflict_tablespace);
 
  PG_RETURN_INT64(result);
 }
@@ -1451,7 +1448,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_lock);
+ result = (int64) (dbentry->counts.n_conflict_lock);
 
  PG_RETURN_INT64(result);
 }
@@ -1466,7 +1463,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_snapshot);
+ result = (int64) (dbentry->counts.n_conflict_snapshot);
 
  PG_RETURN_INT64(result);
 }
@@ -1481,7 +1478,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_bufferpin);
+ result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
  PG_RETURN_INT64(result);
 }
@@ -1496,7 +1493,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_startup_deadlock);
+ result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
  PG_RETURN_INT64(result);
 }
@@ -1511,11 +1508,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_conflict_tablespace +
-  dbentry->n_conflict_lock +
-  dbentry->n_conflict_snapshot +
-  dbentry->n_conflict_bufferpin +
-  dbentry->n_conflict_startup_deadlock);
+ result = (int64) (dbentry->counts.n_conflict_tablespace +
+  dbentry->counts.n_conflict_lock +
+  dbentry->counts.n_conflict_snapshot +
+  dbentry->counts.n_conflict_bufferpin +
+  dbentry->counts.n_conflict_startup_deadlock);
 
  PG_RETURN_INT64(result);
 }
@@ -1530,7 +1527,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_deadlocks);
+ result = (int64) (dbentry->counts.n_deadlocks);
 
  PG_RETURN_INT64(result);
 }
@@ -1548,7 +1545,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = (int64) (dbentry->n_checksum_failures);
+ result = (int64) (dbentry->counts.n_checksum_failures);
 
  PG_RETURN_INT64(result);
 }
@@ -1585,7 +1582,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = ((double) dbentry->n_block_read_time) / 1000.0;
+ result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
  PG_RETURN_FLOAT8(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
  if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
  result = 0;
  else
- result = ((double) dbentry->n_block_write_time) / 1000.0;
+ result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
  PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
  RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
  RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
  IdleInTransactionSessionTimeoutHandler);
+ RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+ IdleStatsUpdateTimeoutHandler);
  }
 
  /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
  SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+ IdleStatsUpdateTimeoutPending = true;
+ InterruptPending = true;
+ SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
- qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+ qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
  is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..9f1de1e42f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a07012bf4b..6fad13c4be 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  * pgstat.h
  *
- * Definitions for the PostgreSQL statistics collector daemon.
+ * Definitions for the PostgreSQL activity statistics facility.
  *
  * Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
  TRACK_FUNC_ALL
 } TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
- PGSTAT_MTYPE_DUMMY,
- PGSTAT_MTYPE_INQUIRY,
- PGSTAT_MTYPE_TABSTAT,
- PGSTAT_MTYPE_TABPURGE,
- PGSTAT_MTYPE_DROPDB,
- PGSTAT_MTYPE_RESETCOUNTER,
- PGSTAT_MTYPE_RESETSHAREDCOUNTER,
- PGSTAT_MTYPE_RESETSINGLECOUNTER,
- PGSTAT_MTYPE_AUTOVAC_START,
- PGSTAT_MTYPE_VACUUM,
- PGSTAT_MTYPE_ANALYZE,
- PGSTAT_MTYPE_ARCHIVER,
- PGSTAT_MTYPE_BGWRITER,
- PGSTAT_MTYPE_FUNCSTAT,
- PGSTAT_MTYPE_FUNCPURGE,
- PGSTAT_MTYPE_RECOVERYCONFLICT,
- PGSTAT_MTYPE_TEMPFILE,
- PGSTAT_MTYPE_DEADLOCK,
- PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -111,18 +85,17 @@ typedef struct PgStat_TableCounts
  PgStat_Counter t_delta_live_tuples;
  PgStat_Counter t_delta_dead_tuples;
  PgStat_Counter t_changed_tuples;
+ bool reset_changed_tuples;
 
  PgStat_Counter t_blocks_fetched;
  PgStat_Counter t_blocks_hit;
+
+ PgStat_Counter vacuum_count;
+ PgStat_Counter autovac_vacuum_count;
+ PgStat_Counter analyze_count;
+ PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
- RESET_ARCHIVER,
- RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -156,6 +129,10 @@ typedef struct PgStat_TableStatus
  Oid t_id; /* table's OID */
  bool t_shared; /* is it a shared catalog? */
  struct PgStat_TableXactStatus *trans; /* lowest subxact's counts */
+ TimestampTz vacuum_timestamp; /* user initiated vacuum */
+ TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
+ TimestampTz analyze_timestamp; /* user initiated */
+ TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
  PgStat_TableCounts t_counts; /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -181,280 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
- StatMsgType m_type;
- int m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
- PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry Sent by a backend to ask the collector
- * to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
- PgStat_MsgHdr m_hdr;
- TimestampTz clock_time; /* observed local clock time */
- TimestampTz cutoff_time; /* minimum acceptable file timestamp */
- Oid databaseid; /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
- Oid t_id;
- PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat Sent by the backend to report table
- * and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter)) \
- / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- int m_xact_commit;
- int m_xact_rollback;
- PgStat_Counter m_block_read_time; /* times in microseconds */
- PgStat_Counter m_block_write_time;
- PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge Sent by the backend to tell the collector
- * about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
- / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- Oid m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb Sent by the backend to tell the collector
- * about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter Sent by the backend to tell the collector
- * to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- * to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
- PgStat_MsgHdr m_hdr;
- PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- * to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- PgStat_Single_Reset_Type m_resettype;
- Oid m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart Sent by the autovacuum daemon to signal
- * that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum Sent by the backend or autovacuum daemon
- * after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- Oid m_tableoid;
- bool m_autovacuum;
- TimestampTz m_vacuumtime;
- PgStat_Counter m_live_tuples;
- PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze Sent by the backend or autovacuum daemon
- * after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- Oid m_tableoid;
- bool m_autovacuum;
- bool m_resetcounter;
- TimestampTz m_analyzetime;
- PgStat_Counter m_live_tuples;
- PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
- PgStat_MsgHdr m_hdr;
- bool m_failed; /* Failed attempt */
- char m_xlog[MAX_XFN_CHARS + 1];
- TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
- PgStat_MsgHdr m_hdr;
-
- PgStat_Counter m_timed_checkpoints;
- PgStat_Counter m_requested_checkpoints;
- PgStat_Counter m_buf_written_checkpoints;
- PgStat_Counter m_buf_written_clean;
- PgStat_Counter m_maxwritten_clean;
- PgStat_Counter m_buf_written_backend;
- PgStat_Counter m_buf_fsync_backend;
- PgStat_Counter m_buf_alloc;
- PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
- PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
- PgStat_MsgHdr m_hdr;
-
- Oid m_databaseid;
- int m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile Sent by the backend upon creating a temp file
+ * PgStat_BgWriter bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
- PgStat_MsgHdr m_hdr;
-
- Oid m_databaseid;
- size_t m_filesize;
-} PgStat_MsgTempFile;
+ PgStat_Counter timed_checkpoints;
+ PgStat_Counter requested_checkpoints;
+ PgStat_Counter buf_written_checkpoints;
+ PgStat_Counter buf_written_clean;
+ PgStat_Counter maxwritten_clean;
+ PgStat_Counter buf_written_backend;
+ PgStat_Counter buf_fsync_backend;
+ PgStat_Counter buf_alloc;
+ PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+ PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +215,8 @@ typedef struct PgStat_FunctionEntry
  PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat Sent by the backend to report function
- * usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
- / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge Sent by the backend to tell the collector
- * about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
- ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
- / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_nentries;
- Oid m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock Sent by the backend to tell the collector
- * about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure Sent by the backend to tell the collector
- * about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
- PgStat_MsgHdr m_hdr;
- Oid m_databaseid;
- int m_failurecount;
- TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
- PgStat_MsgHdr msg_hdr;
- PgStat_MsgDummy msg_dummy;
- PgStat_MsgInquiry msg_inquiry;
- PgStat_MsgTabstat msg_tabstat;
- PgStat_MsgTabpurge msg_tabpurge;
- PgStat_MsgDropdb msg_dropdb;
- PgStat_MsgResetcounter msg_resetcounter;
- PgStat_MsgResetsharedcounter msg_resetsharedcounter;
- PgStat_MsgResetsinglecounter msg_resetsinglecounter;
- PgStat_MsgAutovacStart msg_autovacuum_start;
- PgStat_MsgVacuum msg_vacuum;
- PgStat_MsgAnalyze msg_analyze;
- PgStat_MsgArchiver msg_archiver;
- PgStat_MsgBgWriter msg_bgwriter;
- PgStat_MsgFuncstat msg_funcstat;
- PgStat_MsgFuncpurge msg_funcpurge;
- PgStat_MsgRecoveryConflict msg_recoveryconflict;
- PgStat_MsgDeadlock msg_deadlock;
- PgStat_MsgTempFile msg_tempfile;
- PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures, on disk and in shared memory, follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
- Oid databaseid;
  PgStat_Counter n_xact_commit;
  PgStat_Counter n_xact_rollback;
  PgStat_Counter n_blocks_fetched;
@@ -600,7 +237,6 @@ typedef struct PgStat_StatDBEntry
  PgStat_Counter n_tuples_inserted;
  PgStat_Counter n_tuples_updated;
  PgStat_Counter n_tuples_deleted;
- TimestampTz last_autovac_time;
  PgStat_Counter n_conflict_tablespace;
  PgStat_Counter n_conflict_lock;
  PgStat_Counter n_conflict_snapshot;
@@ -610,29 +246,55 @@ typedef struct PgStat_StatDBEntry
  PgStat_Counter n_temp_bytes;
  PgStat_Counter n_deadlocks;
  PgStat_Counter n_checksum_failures;
- TimestampTz last_checksum_failure;
  PgStat_Counter n_block_read_time; /* times in microseconds */
  PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+ Oid databaseid;
+ TimestampTz last_autovac_time;
+ TimestampTz last_checksum_failure;
  TimestampTz stat_reset_timestamp;
- TimestampTz stats_timestamp; /* time of db stats file update */
+ TimestampTz stats_timestamp; /* time of db stats update */
+
+ PgStat_StatDBCounts counts;
 
  /*
- * tables and functions must be last in the struct, because we don't write
- * the pointers out to the stats file.
+ * The following members must be last in the struct, because we don't write
+ * them out to the stats file.
  */
- HTAB   *tables;
- HTAB   *functions;
+ dshash_table_handle tables; /* current gen tables hash */
+ dshash_table_handle functions; /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+ HTAB   *snapshot_tables; /* table entry snapshot */
+ HTAB   *snapshot_functions; /* function entry snapshot */
+ dshash_table *dshash_tables; /* attached tables dshash */
+ dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry The collector's data per table (or index)
+ * PgStat_StatTabEntry The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
  Oid tableid;
+ TimestampTz vacuum_timestamp; /* user initiated vacuum */
+ TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
+ TimestampTz analyze_timestamp; /* user initiated */
+ TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
 
  PgStat_Counter numscans;
 
@@ -651,19 +313,15 @@ typedef struct PgStat_StatTabEntry
  PgStat_Counter blocks_fetched;
  PgStat_Counter blocks_hit;
 
- TimestampTz vacuum_timestamp; /* user initiated vacuum */
  PgStat_Counter vacuum_count;
- TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
  PgStat_Counter autovac_vacuum_count;
- TimestampTz analyze_timestamp; /* user initiated */
  PgStat_Counter analyze_count;
- TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
  PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry The collector's data per function
+ * PgStat_StatFuncEntry The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +336,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +352,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -760,7 +418,6 @@ typedef enum
  WAIT_EVENT_CHECKPOINTER_MAIN,
  WAIT_EVENT_LOGICAL_APPLY_MAIN,
  WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
- WAIT_EVENT_PGSTAT_MAIN,
  WAIT_EVENT_RECOVERY_WAL_STREAM,
  WAIT_EVENT_SYSLOGGER_MAIN,
  WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1004,7 +661,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing machinery
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1201,13 +858,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but kept until the GUC is removed */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1222,29 +881,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1405,8 +1061,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
   void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1415,11 +1071,15 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+ Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
  LWTRANCHE_TBM,
  LWTRANCHE_PARALLEL_APPEND,
  LWTRANCHE_SXACT,
+ LWTRANCHE_STATS,
  LWTRANCHE_FIRST_USER_DEFINED
 } BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
  STANDBY_TIMEOUT,
  STANDBY_LOCK_TIMEOUT,
  IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+ IDLE_STATS_UPDATE_TIMEOUT,
  /* First user-definable timeout reason */
  USER_TIMEOUT,
  /* Maximum number of timeout reasons */
--
2.18.2
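An aside on the PgStat_StatDBCounts refactoring in the pgstat.h hunk above: pulling the plain counters out into a struct of uniform PgStat_Counter fields lets pending counts be merged into a shared entry, or written out and reset, as one unit. A minimal standalone sketch of that pattern — illustrative only, with hypothetical names, not the patch's actual code:

```c
#include <stddef.h>
#include <stdint.h>

typedef int64_t PgStat_Counter;

/* Counter-only struct, mirroring the PgStat_StatDBCounts idea. */
typedef struct DBCounts
{
	PgStat_Counter n_xact_commit;
	PgStat_Counter n_xact_rollback;
	PgStat_Counter n_tuples_inserted;
} DBCounts;

/* Identity and timestamps live outside the counts, as in the patch. */
typedef struct DBEntry
{
	unsigned	dboid;
	long		last_autovac_time;
	DBCounts	counts;
} DBEntry;

/*
 * Accumulate every counter of "src" into "dst" with one generic loop.
 * This assumes the struct holds only PgStat_Counter fields with no
 * padding, which holds for uniformly-typed members on common ABIs.
 */
static void
accum_counts(DBCounts *dst, const DBCounts *src)
{
	PgStat_Counter *d = (PgStat_Counter *) dst;
	const PgStat_Counter *s = (const PgStat_Counter *) src;
	size_t		n = sizeof(DBCounts) / sizeof(PgStat_Counter);

	for (size_t i = 0; i < n; i++)
		d[i] += s[i];
}
```

Whether such a generic merge beats explicit per-field additions is a style call; the point of the split is only that the counters form a contiguous, separately copyable block.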


From 23f99af7ca3754bcd8bb567e2d8424a0e0abd5b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v27 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 355b408b0a..680e1c3564 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6999,11 +6999,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7019,14 +7019,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7057,9 +7056,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8077,7 +8076,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8090,7 +8089,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics collection. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index bc4d98fe03..d56afa17db 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2357,12 +2357,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on the primary,
+    so replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e87fb9634e..80ad6e72dc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the subsystem tracks
+   the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since activity tracking adds some overhead to query
+   execution, the system can be configured to enable or disable it.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   Activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, and no
+   more frequently than once
+   per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
 
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (not yet written out to the
+   shared activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         at transaction end or while idle.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4170,9 +4162,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 13bd320b31..52c61d222a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1259,11 +1259,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   tracked by the activity statistics.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
--
2.18.2
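The transaction-scoped snapshot behavior described in the monitoring.sgml changes above — the first statistics access in a transaction takes a snapshot that is reused until pg_stat_clear_snapshot() discards it — boils down to a lazily filled cache. A standalone sketch of that shape (hypothetical names and a single counter standing in for whole hash tables; not the server's implementation):

```c
#include <stdbool.h>

/* Stand-in for a counter a backend would read from shared memory. */
long		shared_counter = 0;

/* Transaction-local snapshot state; invalid means "not taken yet". */
static long snapshot_value;
static bool snapshot_valid = false;

/* First call copies the shared value; later calls reuse the copy. */
long
fetch_stat(void)
{
	if (!snapshot_valid)
	{
		snapshot_value = shared_counter;	/* take the snapshot */
		snapshot_valid = true;
	}
	return snapshot_value;
}

/* Analogue of pg_stat_clear_snapshot(): discard the cached copy. */
void
clear_snapshot(void)
{
	snapshot_valid = false;
}
```

The control flow — check validity, fill lazily, invalidate on clear — is what gives stable numbers across several queries in one transaction.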


From d7b5e4d7a44e75a973ff1cefe68e12325417e320 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <[hidden email]>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v27 7/7] Remove the GUC stats_temp_directory

This GUC used to specify the directory in which to store temporary
statistics files. The statistics subsystem no longer needs it, but the
programs in bin and contrib, and perhaps other extensions, still refer to
it. Thus this patch removes the GUC itself, while some backing variables
and macro definitions are left in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 680e1c3564..43d0d303ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7111,25 +7111,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c0760854f4..053cd467fd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool pgstat_track_counts = false;
 int pgstat_track_functions = TRACK_FUNC_OFF;
 int pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable.  It is no longer used in this file, but
+ * is kept, with its default value, for backward compatibility with
+ * extensions.
  */
-char   *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char   *pgstat_stat_filename = NULL;
-char   *pgstat_stat_tmpname = NULL;
+char   *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /*
  * Shared stats bootstrap information, protected by StatsLock.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..7b7d87b938 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
  TimeLineID endtli;
  StringInfo labelfile;
  StringInfo tblspc_map_file = NULL;
- int datadirpathlen;
  List   *tablespaces = NIL;
 
  backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
  backup_total);
  }
 
- datadirpathlen = strlen(DataDir);
-
  backup_started_in_recovery = RecoveryInProgress();
 
  labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
  * Calculate the relative path of temporary statistics directory in
  * order to skip the files which are located in that directory later.
  */
- if (is_absolute_path(pgstat_stat_directory) &&
- strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
- statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
- else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
- statrelpath = psprintf("./%s", pgstat_stat_directory);
- else
- statrelpath = pgstat_stat_directory;
+
+ Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+ statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
  /* Add a node for the base directory at the end */
  ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..cabeb806c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4231,17 +4230,6 @@ static struct config_string ConfigureNamesString[] =
  NULL, NULL, NULL
  },
 
- {
- {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
- gettext_noop("Writes temporary statistics files to the specified directory."),
- NULL,
- GUC_SUPERUSER_ONLY
- },
- &pgstat_temp_directory,
- PG_STAT_TMP_DIR,
- check_canonical_path, assign_pgstat_temp_directory, NULL
- },
-
  {
  {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
  gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11518,35 +11506,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
  return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
- /* check_canonical_path already canonicalized newval for us */
- char   *dname;
- char   *tname;
- char   *fname;
-
- /* directory */
- dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
- sprintf(dname, "%s", newval);
-
- /* global stats */
- tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
- sprintf(tname, "%s/global.tmp", newval);
- fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
- sprintf(fname, "%s/global.stat", newval);
-
- if (pgstat_stat_directory)
- free(pgstat_stat_directory);
- pgstat_stat_directory = dname;
- if (pgstat_stat_tmpname)
- free(pgstat_stat_tmpname);
- pgstat_stat_tmpname = tname;
- if (pgstat_stat_filename)
- free(pgstat_stat_filename);
- pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..207e042e99 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -573,7 +573,6 @@
 #track_io_timing = off
 #track_functions = none # none, pl, all
 #track_activity_query_size = 1024 # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6fad13c4be..4971a88c70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9575268bd7..f3340f726c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
  print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
   if defined $ENV{TEMP_CONFIG};
 
- # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
- # concurrently must not share a stats_temp_directory.
- print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
  if ($params{allows_streaming})
  {
  if ($params{allows_streaming} eq "logical")
--
2.18.2


Re: shared-memory based stats collector

Alvaro Herrera-9
On 2020-Mar-27, Kyotaro Horiguchi wrote:

> +/*
> + * XLogArchiveWakeupEnd - Set up archiver wakeup stuff
> + */
> +void
> +XLogArchiveWakeupStart(void)
> +{
> + Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
> +
> + SpinLockAcquire(&XLogCtl->info_lck);
> + old_latch = XLogCtl->archiverWakeupLatch;
> + XLogCtl->archiverWakeupLatch = MyLatch;
> + SpinLockRelease(&XLogCtl->info_lck);
> + Assert (old_latch == NULL);
> +}

The comment is wrong about the function name; OTOH I don't think the
old_latch assignment in the fourth line will work well in non-assert
builds.  But why do you need those shenanigans?  Surely
"Assert(XLogCtl->archiverWakeupLatch == NULL)" in the locked region
before assigning MyLatch should be sufficient and acceptable?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

