psql is hanging

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

psql is hanging

John Smith-2
We have a long script of sql that we run, several thousand lines of sql. If I execute the script from start to finish, somewhere in the middle of it, one sql command will hang and take 2 to 3 hours. During this time, "htop" shows 100% cpu usage with a postgresql process, so it appears to be doing something.

If I stop that command, then immediately rerun the rest of the script starting from the command that "hung", the "hung" command then completes in 5 seconds and the rest of the script just continues on perfectly. I can interrupt the command within 30 mins of hanging, or 2 hours of hanging; regardless of when I interrrupt the hung command, it always immediately completes successully in < 5 seconds when I rerun the script immediately after that.

I have noticed this in multiple scripts, multiple commands; one was an "insert into", a different script hung on a "update" and a third one would hang on a "create table x as (select ... )". These are all short running commands normally, and we have indexes on all the columns for the appropriate "where" clauses.

I can't send the sql (forbidden), but the actual command does not appear to be the issue. Rerunning the same little section of sql manually which includes the hung command, always takes less than a minute, but the entire sql which should only take 30 minutes in total will always hang at the same spot; and then interrupting it (regardless of how long it was running) and rerunning immediately completes.

The "hung" command does eventually finish at times after 2 to 3 hours, most often we can't afford to wait for it to run since this batch of sql has to run daily.

Some other notes:
- Environment is Ubuntu 16.04; postgresql version was originaly 9.5(?) but we have upgraded to 11.1 using postgresql packages for ubuntu and the problem has persisted across postgresql versions.
- There are no other queries running on the postgresql server during this time, and server load is idling (other than whatever psql is doing).
- It is an amazon server, with ZFS as the files system, it has 50% disk free on this storage partition that postgresql database is on.
- We have 60 gigs of ram total, 8 cores.
- We have run the entire script using "create unlogged table" everywhere as well, no difference.
- We have made tunings to the postgresql configuration to make it use more RAM.
- All sql commands are sent through a custom Java application; so we use java + JDBC to connect to postgresql for this. The script has custom commands which are interpreted by the Java app, so we can't test running the exact script start to finish in psql native tools since the native tools won't understand our custom sql. However the java app is very simple. All individual sql commands are committed immediately after they run so there is no long transaction in play here. The psql jdbc driver is "postgresql-42.2.4".
- One may suspect the custom java app; but again, htop shows postgresql using 100% CPU usage on one core, so postgresql is chewing on the command, doing something.

I have attached our current postgresql.conf file.

We're kind of pulling out our hair here, any ideas?

Thanks!

postgresql.conf.zip (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Adrian Klaver-4
On 11/30/18 7:53 AM, John Smith wrote:
> We have a long script of sql that we run, several thousand lines of sql.
> If I execute the script from start to finish, somewhere in the middle of
> it, one sql command will hang and take 2 to 3 hours. During this time,
> "htop" shows 100% cpu usage with a postgresql process, so it appears to
> be doing something.

First, a naming note. psql is the name of the Postgres command line
client. What you are actually referring to is the Postgres server.

Second, without the actual commands solving this is going to be
difficult to impossible.

That being said I would suspect some sort of locking issue. I would
start with pg_stat_activity:

https://www.postgresql.org/docs/10/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW

See what it says when you are in the 'hung' phase.

>
> If I stop that command, then immediately rerun the rest of the script
> starting from the command that "hung", the "hung" command then completes
> in 5 seconds and the rest of the script just continues on perfectly. I
> can interrupt the command within 30 mins of hanging, or 2 hours of
> hanging; regardless of when I interrrupt the hung command, it always
> immediately completes successully in < 5 seconds when I rerun the script
> immediately after that.
>
> I have noticed this in multiple scripts, multiple commands; one was an
> "insert into", a different script hung on a "update" and a third one
> would hang on a "create table x as (select ... )". These are all short
> running commands normally, and we have indexes on all the columns for
> the appropriate "where" clauses.
>
> I can't send the sql (forbidden), but the actual command does not appear
> to be the issue. Rerunning the same little section of sql manually which
> includes the hung command, always takes less than a minute, but the
> entire sql which should only take 30 minutes in total will always hang
> at the same spot; and then interrupting it (regardless of how long it
> was running) and rerunning immediately completes.
>
> The "hung" command does eventually finish at times after 2 to 3 hours,
> most often we can't afford to wait for it to run since this batch of sql
> has to run daily.
>
> Some other notes:
> - Environment is Ubuntu 16.04; postgresql version was originaly 9.5(?)
> but we have upgraded to 11.1 using postgresql packages for ubuntu and
> the problem has persisted across postgresql versions.
> - There are no other queries running on the postgresql server during
> this time, and server load is idling (other than whatever psql is doing).
> - It is an amazon server, with ZFS as the files system, it has 50% disk
> free on this storage partition that postgresql database is on.
> - We have 60 gigs of ram total, 8 cores.
> - We have run the entire script using "create unlogged table" everywhere
> as well, no difference.
> - We have made tunings to the postgresql configuration to make it use
> more RAM.
> - All sql commands are sent through a custom Java application; so we use
> java + JDBC to connect to postgresql for this. The script has custom
> commands which are interpreted by the Java app, so we can't test running
> the exact script start to finish in psql native tools since the native
> tools won't understand our custom sql. However the java app is very
> simple. All individual sql commands are committed immediately after they
> run so there is no long transaction in play here. The psql jdbc driver
> is "postgresql-42.2.4".
> - One may suspect the custom java app; but again, htop shows postgresql
> using 100% CPU usage on one core, so postgresql is chewing on the
> command, doing something.
>
> I have attached our current postgresql.conf file.
>
> We're kind of pulling out our hair here, any ideas?
>
> Thanks!


--
Adrian Klaver
[hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Christian Mair
In reply to this post by John Smith-2

> We're kind of pulling out our hair here, any ideas?

You might try issuing the command

   analyze;

right *before* the command that hangs.

The rationale behind this idea is that your script changed data
and the "hung" command uses a wrong plan based on outdated statistics.
By the time you run it manually it would be fast again, because in the
mean time the statistics have been updated automatically.

Analyze forces an immediate update of the statistics on the whole
database.

This can by itself take some time according to size of the
database. If my idea works, you can then further optimize by
doing analyze only on the changed tables.

Bye,
Chris.



Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Christian Mair
>> We're kind of pulling out our hair here, any ideas?
>
> You might try issuing the command
>
>    analyze;
>
> right *before* the command that hangs.

Clarification: I meant to say: "put it into the script at the location right before the command, that hangs".

Bye,
Chris.



Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Ron-2
In reply to this post by John Smith-2
On 11/30/2018 09:53 AM, John Smith wrote:
We have a long script of sql that we run, several thousand lines of sql. If I execute the script from start to finish, somewhere in the middle of it, one sql command will hang and take 2 to 3 hours. During this time, "htop" shows 100% cpu usage with a postgresql process, so it appears to be doing something.

If I stop that command, then immediately rerun the rest of the script starting from the command that "hung", the "hung" command then completes in 5 seconds and the rest of the script just continues on perfectly. I can interrupt the command within 30 mins of hanging, or 2 hours of hanging; regardless of when I interrrupt the hung command, it always immediately completes successully in < 5 seconds when I rerun the script immediately after that.

I have noticed this in multiple scripts, multiple commands; one was an "insert into", a different script hung on a "update" and a third one would hang on a "create table x as (select ... )". These are all short running commands normally, and we have indexes on all the columns for the appropriate "where" clauses.

I can't send the sql (forbidden), but the actual command does not appear to be the issue. Rerunning the same little section of sql manually which includes the hung command, always takes less than a minute, but the entire sql which should only take 30 minutes in total will always hang at the same spot; and then interrupting it (regardless of how long it was running) and rerunning immediately completes.

The "hung" command does eventually finish at times after 2 to 3 hours, most often we can't afford to wait for it to run since this batch of sql has to run daily.

Some other notes:
- Environment is Ubuntu 16.04; postgresql version was originaly 9.5(?) but we have upgraded to 11.1 using postgresql packages for ubuntu and the problem has persisted across postgresql versions.
- There are no other queries running on the postgresql server during this time, and server load is idling (other than whatever psql is doing).
- It is an amazon server, with ZFS as the files system, it has 50% disk free on this storage partition that postgresql database is on.
- We have 60 gigs of ram total, 8 cores.
- We have run the entire script using "create unlogged table" everywhere as well, no difference.
- We have made tunings to the postgresql configuration to make it use more RAM.
- All sql commands are sent through a custom Java application; so we use java + JDBC to connect to postgresql for this. The script has custom commands which are interpreted by the Java app, so we can't test running the exact script start to finish in psql native tools since the native tools won't understand our custom sql. However the java app is very simple. All individual sql commands are committed immediately after they run so there is no long transaction in play here. The psql jdbc driver is "postgresql-42.2.4".
- One may suspect the custom java app; but again, htop shows postgresql using 100% CPU usage on one core, so postgresql is chewing on the command, doing something.

I have attached our current postgresql.conf file.

We're kind of pulling out our hair here, any ideas?

We have a seemingly identical problem in v9.6 on our big prod databases.  It worked perfectly in 8.4 and works perfectly in 9.6 our small staging servers.

A giant sql script that drops the oldest child tables from a large set of partitioned tables (by dropping all relevant triggers, indexes, functions etc, dropping the oldest children, adding the new children then recreating all the relevant triggers, indexes, functions etc).

In 9.6 It works perfectly when I run the script directly from psql.

--
Angular momentum makes the world go 'round.
Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Steve Crawford
In reply to this post by Christian Mair


On Fri, Nov 30, 2018 at 8:05 AM Chris Mair <[hidden email]> wrote:

> We're kind of pulling out our hair here, any ideas?

You might try issuing the command

   analyze;

right *before* the command that hangs.


You might consider trying the "auto_explain" module ( https://www.postgresql.org/docs/current/auto-explain.html ). This will let you "trap" the query plan used for the long-running query. Then compare that query plan with a manually run explain when it runs quickly to see if they differ. If they do, it suggests that bad statistics are a likely culprit and Chris' suggestion of running analyze will help.

You only need to analyze those tables used in the query and, most likely, only tables that were substantially changed within a moderately short period prior to the start of the query.

Autovacuum, which will handles analyze as well, typically defaults to checking for tables that need attention every minute so for processes that have a lot of steps it becomes "luck of the draw" whether or not a needed analyze is run after a substantial table change and before that table is used.

We frequently put specific "analyze" statements in such scripts immediately following bulk-update statements.

Cheers,
Steve



Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Tom Lane-2
In reply to this post by Ron-2
Ron <[hidden email]> writes:
> We have a seemingly identical problem in v9.6 on our big prod databases.  It
> worked perfectly in 8.4 *and works perfectly in 9.6 our small staging servers*.
> A giant sql script that drops the oldest child tables from a large set of
> partitioned tables (by dropping all relevant triggers, indexes, functions
> etc, dropping the oldest children, adding the new children then recreating
> all the relevant triggers, indexes, functions etc).

A script like that is going to be taking exclusive locks on a whole lot
of tables.  My guess is that you are running up against lock conflicts
against other activity in the DB, and that the "hang" is just waiting
for other transactions to release their table locks.  Looking into
the pg_locks view could be informative.

                        regards, tom lane

Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Geoff Winkless
In reply to this post by John Smith-2
On Fri, 30 Nov 2018 at 15:53, John Smith <[hidden email]> wrote:
> We have a long script of sql that we run, several thousand lines of sql. If I execute the script
> from start to finish, somewhere in the middle of it, one sql command will hang and take 2 to
> 3 hours. During this time, "htop" shows 100% cpu usage with a postgresql process, so it
> appears to be doing something.
>
> If I stop that command, then immediately rerun the rest of the script starting from the command that "hung",
> the "hung" command then completes in 5 seconds and the rest of the script just continues on perfectly.

I assume that JDBC doesn't put a transaction around your whole script
if you send it in one hit?

Geoff

Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Ron-2
In reply to this post by Tom Lane-2
On 11/30/2018 10:38 AM, Tom Lane wrote:

> Ron <[hidden email]> writes:
>> We have a seemingly identical problem in v9.6 on our big prod databases.  It
>> worked perfectly in 8.4 *and works perfectly in 9.6 our small staging servers*.
>> A giant sql script that drops the oldest child tables from a large set of
>> partitioned tables (by dropping all relevant triggers, indexes, functions
>> etc, dropping the oldest children, adding the new children then recreating
>> all the relevant triggers, indexes, functions etc).
> A script like that is going to be taking exclusive locks on a whole lot
> of tables.  My guess is that you are running up against lock conflicts
> against other activity in the DB, and that the "hang" is just waiting
> for other transactions to release their table locks.  Looking into
> the pg_locks view could be informative.

We thought of that, but...
1. Even when that process is the only one touching the database, it hangs.
2. It worked perfectly on 8.4.
3. It works perfectly on 9.6 using the small staging database.
4. It works perfectly when running the script from psql.

Before that weekly job runs again, the application team is going to upgrade
to JDBC driver v42.2.5 (don't know what they're using now), and add
protocolVersion=2 to the connection string.  I'll post an update next Wednesday.


--
Angular momentum makes the world go 'round.

Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

John Smith-2
In reply to this post by Steve Crawford
Thanks Chris and Steve for the analyze suggestion; That will be my next test!

On Fri, Nov 30, 2018 at 11:27 AM Steve Crawford <[hidden email]> wrote:


On Fri, Nov 30, 2018 at 8:05 AM Chris Mair <[hidden email]> wrote:

> We're kind of pulling out our hair here, any ideas?

You might try issuing the command

   analyze;

right *before* the command that hangs.


You might consider trying the "auto_explain" module ( https://www.postgresql.org/docs/current/auto-explain.html ). This will let you "trap" the query plan used for the long-running query. Then compare that query plan with a manually run explain when it runs quickly to see if they differ. If they do, it suggests that bad statistics are a likely culprit and Chris' suggestion of running analyze will help.

You only need to analyze those tables used in the query and, most likely, only tables that were substantially changed within a moderately short period prior to the start of the query.

Autovacuum, which will handles analyze as well, typically defaults to checking for tables that need attention every minute so for processes that have a lot of steps it becomes "luck of the draw" whether or not a needed analyze is run after a substantial table change and before that table is used.

We frequently put specific "analyze" statements in such scripts immediately following bulk-update statements.

Cheers,
Steve



Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Christopher Browne-3
I like the "add an analyze" idea; the two most likely causes of the
phenomenon (to my mind) are either:

a) Something's getting locked and Tom Lane's idea of checking pg_locks
when you notice it's stopped can help track down the problem.

Further to that, the thing I'd be expecting to see if the problem is
locking is that the connection that's blocked up will be waiting on a
lock held by another connection.

It's near certain that you'll find that the connection doing the work
will have LOTS of locks outstanding; that's not a problem at all;
that's perfectly normal.  You need to look keenly for locks that have
not yet been granted.

b) I find it common in my environments to need to do manual ANALYZE
requests all the time because I'll set up temporary tables (that the
autovacuum daemon can't do anything about) which, as the stats are
lacking, will lead to awful query plans that make queries run badly.

If you use temporary tables, that's like a heavy "thumb on the scale"
that can lead to awful performance, unless those tables get an ANALYZE
after getting populated.

Unpredictable slowness can certainly result from tables having changed
size leading to pathological query plans.  ANALYZE will help.

Reply | Threaded
Open this post in threaded view
|

Re: psql is hanging

Ron-2
In reply to this post by John Smith-2
On 11/30/2018 09:53 AM, John Smith wrote:

> We have a long script of sql that we run, several thousand lines of sql.
> If I execute the script from start to finish, somewhere in the middle of
> it, one sql command will hang and take 2 to 3 hours. During this time,
> "htop" shows 100% cpu usage with a postgresql process, so it appears to be
> doing something.
>
> If I stop that command, then immediately rerun the rest of the script
> starting from the command that "hung", the "hung" command then completes
> in 5 seconds and the rest of the script just continues on perfectly. I can
> interrupt the command within 30 mins of hanging, or 2 hours of hanging;
> regardless of when I interrrupt the hung command, it always immediately
> completes successully in < 5 seconds when I rerun the script immediately
> after that.
>
> I have noticed this in multiple scripts, multiple commands; one was an
> "insert into", a different script hung on a "update" and a third one would
> hang on a "create table x as (select ... )". These are all short running
> commands normally, and we have indexes on all the columns for the
> appropriate "where" clauses.
>
> I can't send the sql (forbidden), but the actual command does not appear
> to be the issue. Rerunning the same little section of sql manually which
> includes the hung command, always takes less than a minute, but the entire
> sql which should only take 30 minutes in total will always hang at the
> same spot; and then interrupting it (regardless of how long it was
> running) and rerunning immediately completes.
>
> The "hung" command does eventually finish at times after 2 to 3 hours,
> most often we can't afford to wait for it to run since this batch of sql
> has to run daily.
>
> Some other notes:
> - Environment is Ubuntu 16.04; postgresql version was originaly 9.5(?) but
> we have upgraded to 11.1 using postgresql packages for ubuntu and the
> problem has persisted across postgresql versions.
> - There are no other queries running on the postgresql server during this
> time, and server load is idling (other than whatever psql is doing).
> - It is an amazon server, with ZFS as the files system, it has 50% disk
> free on this storage partition that postgresql database is on.
> - We have 60 gigs of ram total, 8 cores.
> - We have run the entire script using "create unlogged table" everywhere
> as well, no difference.
> - We have made tunings to the postgresql configuration to make it use more
> RAM.
> - All sql commands are sent through a custom Java application; so we use
> java + JDBC to connect to postgresql for this. The script has custom
> commands which are interpreted by the Java app, so we can't test running
> the exact script start to finish in psql native tools since the native
> tools won't understand our custom sql. However the java app is very
> simple. All individual sql commands are committed immediately after they
> run so there is no long transaction in play here. The psql jdbc driver is
> "postgresql-42.2.4".
> - One may suspect the custom java app; but again, htop shows postgresql
> using 100% CPU usage on one core, so postgresql is chewing on the command,
> doing something.
>
> I have attached our current postgresql.conf file.
>
> We're kind of pulling out our hair here, any ideas?

The (apparent) solution for us was to upgrade the JDBC driver from 42.2.1 to
42.2.5.  Note, though, that our giant script only does DDL statements.

--
Angular momentum makes the world go 'round.