Re: dsa_allocate() faliure

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
Hi, I'm also experiencing the problem: dsa_allocate could not find 7 free pages CONTEXT: parallel worker

I'm running: PostgreSQL 10.5 (Ubuntu 10.5-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit

query plan: (select statement over parent table to many partitions):
select ... 
from fa 
where c_id in (<ID_LIST>) and 
datetime >= '2018/01/01' 
and ((dims ? 'p' and dims ? 'mcp') 
or (datasource in (FA', 'GA'))) 
and not datasource = 'm' 
GROUP BY datasource, dims ->'ct', dims ->'mcp', dims -> 'p', dims -> 'sp':
Finalize GroupAggregate  (cost=31514757.77..31519357.77 rows=40000 width=223)
   Group Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims -> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text))
   ->  Sort  (cost=31514757.77..31515057.77 rows=120000 width=223)
         Sort Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims -> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text))
         ->  Gather  (cost=31491634.17..31504634.17 rows=120000 width=223)
               Workers Planned: 3
               ->  Partial HashAggregate  (cost=31490634.17..31491634.17 rows=40000 width=223)
                     Group Key: fa.datasource, (fa.dims -> 'ct'::text), (fa.dims -> 'mcp'::text), (fa.dims -> 'p'::text), (fa.dims -> 'sp'::text)
                     ->  Result  (cost=0.00..31364713.39 rows=5596479 width=175)
                           ->  Append  (cost=0.00..31252783.81 rows=5596479 width=659)
                                 ->  Parallel Seq Scan on fa  (cost=0.00..0.00 rows=1 width=580)
                                       Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('{<ID_LIST>}'::bigint[])))
                                 ->  Parallel Bitmap Heap Scan on fa_10  (cost=1226.36..53641.49 rows=1 width=1290)
                                       Recheck Cond: (datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone)
                                       Filter: (((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text = ANY ('<ID_LIST>'::bigint[])))
                                       ->  Bitmap Index Scan on fa_10_rangestart  (cost=0.00..1226.36 rows=32259 width=0)
                                             Index Cond: (datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone)
                                 ->  Parallel Seq Scan on fa_105  (cost=0.00..11.99 rows=1 width=580)
                                       Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('<ID_LIST>'::bigint[])))
                                 ->  Parallel Seq Scan on fa_106  (cost=0.00..11.99 rows=1 width=580)
                                       Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('<ID_LIST>..........


--
regards,
Jakub Glapa
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
Looks like my email didn't match the right thread: https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com
Any chance to get some feedback on this?
--
regards,
Jakub Glapa


On Tue, Nov 13, 2018 at 2:08 PM Jakub Glapa <[hidden email]> wrote:
Hi, I'm also experiencing the problem: dsa_allocate could not find 7 free pages CONTEXT: parallel worker

I'm running: PostgreSQL 10.5 (Ubuntu 10.5-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit

query plan: (select statement over parent table to many partitions):
select ... 
from fa 
where c_id in (<ID_LIST>) and 
datetime >= '2018/01/01' 
and ((dims ? 'p' and dims ? 'mcp') 
or (datasource in (FA', 'GA'))) 
and not datasource = 'm' 
GROUP BY datasource, dims ->'ct', dims ->'mcp', dims -> 'p', dims -> 'sp':
Finalize GroupAggregate  (cost=31514757.77..31519357.77 rows=40000 width=223)
   Group Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims -> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text))
   ->  Sort  (cost=31514757.77..31515057.77 rows=120000 width=223)
         Sort Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims -> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text))
         ->  Gather  (cost=31491634.17..31504634.17 rows=120000 width=223)
               Workers Planned: 3
               ->  Partial HashAggregate  (cost=31490634.17..31491634.17 rows=40000 width=223)
                     Group Key: fa.datasource, (fa.dims -> 'ct'::text), (fa.dims -> 'mcp'::text), (fa.dims -> 'p'::text), (fa.dims -> 'sp'::text)
                     ->  Result  (cost=0.00..31364713.39 rows=5596479 width=175)
                           ->  Append  (cost=0.00..31252783.81 rows=5596479 width=659)
                                 ->  Parallel Seq Scan on fa  (cost=0.00..0.00 rows=1 width=580)
                                       Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('{<ID_LIST>}'::bigint[])))
                                 ->  Parallel Bitmap Heap Scan on fa_10  (cost=1226.36..53641.49 rows=1 width=1290)
                                       Recheck Cond: (datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone)
                                       Filter: (((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text = ANY ('<ID_LIST>'::bigint[])))
                                       ->  Bitmap Index Scan on fa_10_rangestart  (cost=0.00..1226.36 rows=32259 width=0)
                                             Index Cond: (datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone)
                                 ->  Parallel Seq Scan on fa_105  (cost=0.00..11.99 rows=1 width=580)
                                       Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('<ID_LIST>'::bigint[])))
                                 ->  Parallel Seq Scan on fa_106  (cost=0.00..11.99 rows=1 width=580)
                                       Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('<ID_LIST>..........


--
regards,
Jakub Glapa
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Justin Pryzby
On Wed, Nov 21, 2018 at 03:26:42PM +0100, Jakub Glapa wrote:
> Looks like my email didn't match the right thread:
> https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com
> Any chance to get some feedback on this?

In the related thread, it looks like Thomas backpatched a fix to v10, and so I
guess this should be resolved in 10.6, which was released couple weeks ago.
https://www.postgresql.org/message-id/CAEepm%3D0QxoUSkFqYbvmxi2eNvvU6BkqH6fTOu4oOzc1MRAT4Dw%40mail.gmail.com

Could you upgrade and check ?

38763d67784c6563d08dbea5c9f913fa174779b8 in master

|commit ba20d392584cdecc2808fe936448d127f43f2c07
|Author: Thomas Munro <[hidden email]>
|Date:   Thu Sep 20 15:52:39 2018 +1200
|
|    Fix segment_bins corruption in dsa.c.

Justin

Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
Hi Justin, I've upgrade to 10.6 but the error still shows up:

psql db@host as user => select version();
                                                                   version                                                                   
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 PostgreSQL 10.6 (Ubuntu 10.6-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit
(1 row)

Time: 110.512 ms

psql db@host as user  => select <COLUMNS> from fa where client_id in (<IDS>) and datetime >= '2018/01/01' and ((dims ? 'p' and dimensions ? 'mcp') or (datasource in ('FA', 'GA'))) and not datasource = 'M' GROUP BY datasource, dims ->'ct', dimensions ->'mct', dims -> 'p', dims -> 'sp';
ERROR:  XX000: dsa_allocate could not find 7 free pages
CONTEXT:  parallel worker
LOCATION:  dsa_allocate_extended, dsa.c:729
Time: 131400.831 ms (02:11.401)

the above is execute with max_parallel_workers=8
If I set it to max_parallel_workers=0 I also get and my connection is being closed (but the server is alive):

psql db@host as user => set max_parallel_workers=0;
SET
Time: 89.542 ms
psql db@host as user => SELECT <QUERY>;
FATAL:  XX000: dsa_allocate could not find 7 free pages
LOCATION:  dsa_allocate_extended, dsa.c:729
SSL connection has been closed unexpectedly
The connection to the server was lost. Attempting reset: Succeeded.
Time: 200390.466 ms (03:20.390)



--
regards,
Jakub Glapa


On Thu, Nov 22, 2018 at 5:10 PM Justin Pryzby <[hidden email]> wrote:
On Wed, Nov 21, 2018 at 03:26:42PM +0100, Jakub Glapa wrote:
> Looks like my email didn't match the right thread:
> https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com
> Any chance to get some feedback on this?

In the related thread, it looks like Thomas backpatched a fix to v10, and so I
guess this should be resolved in 10.6, which was released couple weeks ago.
https://www.postgresql.org/message-id/CAEepm%3D0QxoUSkFqYbvmxi2eNvvU6BkqH6fTOu4oOzc1MRAT4Dw%40mail.gmail.com

Could you upgrade and check ?

38763d67784c6563d08dbea5c9f913fa174779b8 in master

|commit ba20d392584cdecc2808fe936448d127f43f2c07
|Author: Thomas Munro <[hidden email]>
|Date:   Thu Sep 20 15:52:39 2018 +1200
|
|    Fix segment_bins corruption in dsa.c.

Justin
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Justin Pryzby
On Fri, Nov 23, 2018 at 03:31:41PM +0100, Jakub Glapa wrote:
> Hi Justin, I've upgrade to 10.6 but the error still shows up:
>
> If I set it to max_parallel_workers=0 I also get and my connection is being
> closed (but the server is alive):
>
> psql db@host as user => set max_parallel_workers=0;

Can you show the plan (explain without analyze) for the nonparallel case?

Also, it looks like the server crashed in that case (even if it restarted
itself quickly).  Can you confirm ?

For example: dmesg |tail might show "postmaster[8582]: segfault [...]" or
similar.  And other clients would've been disconnected.  (For example, you'd
get an error in another, previously-connected session the next time you run:
SELECT 1).

In any case, could you try to find a minimal way to reproduce the problem ?  I
mean, is the dataset and query small and something you can publish, or can you
reproduce with data generated from (for example) generate_series() ?

Thanks,
Justin

Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
So, the issue occurs only on production db an right now I cannot reproduce it.
I had a look at dmesg and indeed I see something like:


--
regards,
pozdrawiam,
Jakub Glapa


On Fri, Nov 23, 2018 at 5:10 PM Justin Pryzby <[hidden email]> wrote:
On Fri, Nov 23, 2018 at 03:31:41PM +0100, Jakub Glapa wrote:
> Hi Justin, I've upgrade to 10.6 but the error still shows up:
>
> If I set it to max_parallel_workers=0 I also get and my connection is being
> closed (but the server is alive):
>
> psql db@host as user => set max_parallel_workers=0;

Can you show the plan (explain without analyze) for the nonparallel case?

Also, it looks like the server crashed in that case (even if it restarted
itself quickly).  Can you confirm ?

For example: dmesg |tail might show "postmaster[8582]: segfault [...]" or
similar.  And other clients would've been disconnected.  (For example, you'd
get an error in another, previously-connected session the next time you run:
SELECT 1).

In any case, could you try to find a minimal way to reproduce the problem ?  I
mean, is the dataset and query small and something you can publish, or can you
reproduce with data generated from (for example) generate_series() ?

Thanks,
Justin
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
sorry, the message was sent out to early.

So, the issue occurs only on production db an right now I cannot reproduce it.
I had a look at dmesg and indeed I see something like:

postgres[30667]: segfault at 0 ip 0000557834264b16 sp 00007ffc2ce1e030 error 4 in postgres[557833db7000+6d5000]

and AFAIR other sessions I had opened at that time were indeed disconnected.

When it comes to the execution plan for max_parallel_workers=0. 
There is no real difference.
I guess max_parallel_workers has no effect and max_parallel_workers_per_gather should have been used.
 Why it caused a server crash is unknown right now.

I cannot really give a reproducible recipe. 
My case is that I have a parent table with ~300 partitions. 
And I initiate a select on ~100 of them with select [...] from fa where client_id(<IDS>) and [filters].  
I know this is not effective. Every partition has several indexes and this query acquires a lot of locks... even for relations not used in the query. 
PG11 should have better partition pruning mechanism but I'm not there yet to upgrade.
Some of the partitions have millions of rows.

I'll keep observing maybe I'l find a pattern when this occurs.


--
regards,
pozdrawiam,
Jakub Glapa


On Mon, Nov 26, 2018 at 4:26 PM Jakub Glapa <[hidden email]> wrote:
So, the issue occurs only on production db an right now I cannot reproduce it.
I had a look at dmesg and indeed I see something like:


--
regards,
Jakub Glapa


On Fri, Nov 23, 2018 at 5:10 PM Justin Pryzby <[hidden email]> wrote:
On Fri, Nov 23, 2018 at 03:31:41PM +0100, Jakub Glapa wrote:
> Hi Justin, I've upgrade to 10.6 but the error still shows up:
>
> If I set it to max_parallel_workers=0 I also get and my connection is being
> closed (but the server is alive):
>
> psql db@host as user => set max_parallel_workers=0;

Can you show the plan (explain without analyze) for the nonparallel case?

Also, it looks like the server crashed in that case (even if it restarted
itself quickly).  Can you confirm ?

For example: dmesg |tail might show "postmaster[8582]: segfault [...]" or
similar.  And other clients would've been disconnected.  (For example, you'd
get an error in another, previously-connected session the next time you run:
SELECT 1).

In any case, could you try to find a minimal way to reproduce the problem ?  I
mean, is the dataset and query small and something you can publish, or can you
reproduce with data generated from (for example) generate_series() ?

Thanks,
Justin
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Justin Pryzby
Hi, thanks for following through.

On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> I had a look at dmesg and indeed I see something like:
>
> postgres[30667]: segfault at 0 ip 0000557834264b16 sp 00007ffc2ce1e030
> error 4 in postgres[557833db7000+6d5000]

That's useful, I think "at 0" means a null pointer dereferenced.

Can you check /var/log/messages (or ./syslog or similar) and verify the
timestamp matches the time of the last crash (and not an unrelated crash) ?

The logs might also indicate if the process dumped a core file anywhere.

I don't know what distribution/OS you're using, but it might be good to install
abrt (RHEL) or apport (ubuntu) or other mechanism to save coredumps, or to
manually configure /proc/sys/kernel/core_pattern.

On centos, I usually set:
/etc/abrt/abrt-action-save-package-data.conf
OpenGPGCheck = no

Also, it might be good to install debug symbols, in case you do find a core
dump now or get one later.

On centos: yum install postgresql10-debuginfo or debuginfo-install postgresql10-server
Make sure this exactly matches the debug symbols exactly match the server version.

Justin

Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
Justin thanks for the information! 
I'm running Ubuntu 16.04. 
I'll try to prepare for the next crash. 
Couldn't find anything this time.


--
regards,
Jakub Glapa


On Mon, Nov 26, 2018 at 4:52 PM Justin Pryzby <[hidden email]> wrote:
Hi, thanks for following through.

On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> I had a look at dmesg and indeed I see something like:
>
> postgres[30667]: segfault at 0 ip 0000557834264b16 sp 00007ffc2ce1e030
> error 4 in postgres[557833db7000+6d5000]

That's useful, I think "at 0" means a null pointer dereferenced.

Can you check /var/log/messages (or ./syslog or similar) and verify the
timestamp matches the time of the last crash (and not an unrelated crash) ?

The logs might also indicate if the process dumped a core file anywhere.

I don't know what distribution/OS you're using, but it might be good to install
abrt (RHEL) or apport (ubuntu) or other mechanism to save coredumps, or to
manually configure /proc/sys/kernel/core_pattern.

On centos, I usually set:
/etc/abrt/abrt-action-save-package-data.conf
OpenGPGCheck = no

Also, it might be good to install debug symbols, in case you do find a core
dump now or get one later.

On centos: yum install postgresql10-debuginfo or debuginfo-install postgresql10-server
Make sure this exactly matches the debug symbols exactly match the server version.

Justin
Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Alvaro Herrera-9
On 2018-Nov-26, Jakub Glapa wrote:

> Justin thanks for the information!
> I'm running Ubuntu 16.04.
> I'll try to prepare for the next crash.
> Couldn't find anything this time.

As I recall, the appport stuff in Ubuntu is terrible ... I've seen it
take 40 minutes to write the crash dump to disk, during which the
database was "down".  I don't know why it is so slow (it's a rather
silly python script that apparently processes the core dump one byte at
a time, and you can imagine that with a few gigabytes of shared memory
that takes a while).  Anyway my recommendation was to *remove* that
stuff from the server and make sure the core file is saved by normal
means.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Justin Pryzby
In reply to this post by Justin Pryzby
Hi,

On Mon, Nov 26, 2018 at 09:52:07AM -0600, Justin Pryzby wrote:
> Hi, thanks for following through.
>
> On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> > I had a look at dmesg and indeed I see something like:
> >
> > postgres[30667]: segfault at 0 ip 0000557834264b16 sp 00007ffc2ce1e030
> > error 4 in postgres[557833db7000+6d5000]
>
> That's useful, I think "at 0" means a null pointer dereferenced.

Thomas fixed several bugs in DSA, which will be in next release, postgres 10.8
and 11.3.

However that doesn't explain the segfault you saw, and I don't see anything
which looks relevant changed since in 10.5.

If you still see that using the latest minor release (10.7), please try to
capture a core file and send a backtrace with a new thread on pgsql-hackers.

Thanks,
Justin

Reply | Threaded
Open this post in threaded view
|

Re: dsa_allocate() faliure

Jakub Glapa
Hi I just checked the dmesg.
The segfault I wrote about is the only one I see, dated Nov 24 last year.
Since then no other segfaults happened although dsa_allocated failures happen daily.
I'll report if anything occurs.
I have the core dumping setup in place.



--
regards,
pozdrawiam,
Jakub Glapa


On Sun, Feb 17, 2019 at 11:21 PM Justin Pryzby <[hidden email]> wrote:
Hi,

On Mon, Nov 26, 2018 at 09:52:07AM -0600, Justin Pryzby wrote:
> Hi, thanks for following through.
>
> On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> > I had a look at dmesg and indeed I see something like:
> >
> > postgres[30667]: segfault at 0 ip 0000557834264b16 sp 00007ffc2ce1e030
> > error 4 in postgres[557833db7000+6d5000]
>
> That's useful, I think "at 0" means a null pointer dereferenced.

Thomas fixed several bugs in DSA, which will be in next release, postgres 10.8
and 11.3.

However that doesn't explain the segfault you saw, and I don't see anything
which looks relevant changed since in 10.5.

If you still see that using the latest minor release (10.7), please try to
capture a core file and send a backtrace with a new thread on pgsql-hackers.

Thanks,
Justin