seems like a bug in pgbench -R

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

seems like a bug in pgbench -R

Tomas Vondra-4
Hi,

There seems to be a bug in pgbench when used with '-R' option, resulting
in stuck pgbench process. Reproducing it is pretty easy:

echo 'select 1' > select.sql

while /bin/true; do
  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
  date;
done;

I do get a stuck pgbench within a few rounds (~10-20), but YMMV. It gets
stuck like this (this is on REL_11_STABLE):

0x00007f3b1814fcb3 in __GI___select (nfds=nfds@entry=0,
readfds=readfds@entry=0x7fff9d668ec0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/select.c:41
41      return SYSCALL_CANCEL (select, nfds, readfds, writefds, exceptfds,
(gdb) bt
#0  0x00007f3b1814fcb3 in __GI___select (nfds=nfds@entry=0,
readfds=readfds@entry=0x7fff9d668ec0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x000000000040834d in threadRun (arg=arg@entry=0x1c9f370) at
pgbench.c:5810
#2  0x0000000000403e0a in main (argc=<optimized out>, argv=<optimized
out>) at pgbench.c:5583

All the connections are already closed at this point (there is nothing
in pg_stat_activity and log_disconnections=on confirms that), and then
there's this:

(gdb) p remains
$1 = 0
(gdb) p nstate
$1 = 1
(gdb) p state[0]
$2 = {con = 0x0, id = 0, state = CSTATE_FINISHED, cstack = 0xf21b10,
use_file = 0, command = 1, variables = 0xf2b0b0, nvariables = 4,
vars_sorted = false, txn_scheduled = 699536519281, sleep_until = 0,
txn_begin = {tv_sec = 699536, tv_nsec = 518478603}, stmt_begin = {tv_sec
= 0,  tv_nsec = 0}, prepared = {false <repeats 128 times>}, cnt = 132,
ecnt = 0}

So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
one of the other commits touching this part of the code.

cheers

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: seems like a bug in pgbench -R

Fabien COELHO-3

> echo 'select 1' > select.sql
>
> while /bin/true; do
>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
>  date;
> done;

Indeed. I'll look at it over the weekend.

> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> one of the other commits touching this part of the code.

Yep, possibly.

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: seems like a bug in pgbench -R

Fabien COELHO-3

>> echo 'select 1' > select.sql
>>
>> while /bin/true; do
>>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
>>  date;
>> done;
>
> Indeed. I'll look at it over the weekend.
>
>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
>> one of the other commits touching this part of the code.

I could not reproduce this issue on head, but I confirm on 11.2.

--
Fabien.

Reply | Threaded
Open this post in threaded view
|

Re: seems like a bug in pgbench -R

Tomas Vondra-4
On 3/15/19 5:16 PM, Fabien COELHO wrote:

>
>>> echo 'select 1' > select.sql
>>>
>>> while /bin/true; do
>>>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
>>>  date;
>>> done;
>>
>> Indeed. I'll look at it over the weekend.
>>
>>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
>>> one of the other commits touching this part of the code.
>
> I could not reproduce this issue on head, but I confirm on 11.2.
>

AFAICS on head it's fixed by 3bac77c48f166b9024a5ead984df73347466ae12

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Reply | Threaded
Open this post in threaded view
|

Re: seems like a bug in pgbench -R

Fabien COELHO-3

Hello Tomas,

>>>> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
>>>> one of the other commits touching this part of the code.
>>
>> I could not reproduce this issue on head, but I confirm on 11.2.
>>
>
> AFAICS on head it's fixed by 3bac77c48f166b9024a5ead984df73347466ae12

Thanks for the information.

I pinpointed the exact issue in one go: no surprise given that the patch
was motivated by cleaning up this kind of external state machine changes
which I thought doubtful and error prone.

Attached is a fix to apply on pg11.

--
Fabien.

pg11-pgbench-rate-fix-1.patch (642 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: seems like a bug in pgbench -R

Imai, Yoshikazu
Hi Fabien,

On Fri, Mar 15, 2019 at 4:17 PM, Fabien COELHO wrote:

> >> echo 'select 1' > select.sql
> >>
> >> while /bin/true; do
> >>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
> >>  date;
> >> done;
> >
> > Indeed. I'll look at it over the weekend.
> >
> >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> >> one of the other commits touching this part of the code.
>
> I could not reproduce this issue on head, but I confirm on 11.2.

I could reproduce the stuck on 11.4.

On Sat, Mar 16, 2019 at 10:14 AM, Fabien COELHO wrote:
> Attached is a fix to apply on pg11.

I confirm the stuck doesn't happen after applying your patch.

It passes make check-world.

This change seems not to affect performance, so I didn't do any performance
test.

> + /* under throttling we may have finished the last client above */
> + if (remains == 0)
> + break;

If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients,
a thread needs to wait the results or sleep. In that logic, there are the case
that a thread tried to wait the results when there are no clients wait the
results, and this causes the issue. This is happened when there are only
CSTATE_THROTLE clients and pgbench timeout is occured. Those clients will be
finished and "remains" will be 0.

I confirmed above codes prevent such a case.


I almost think this is ready for committer, but I have one question.

Is it better adding any check like if(maxsock != -1) before the select?


else                /* no explicit delay, select without timeout */
{
    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
}

--
Yoshikazu Imai


Reply | Threaded
Open this post in threaded view
|

RE: seems like a bug in pgbench -R

Fabien COELHO-3

Hello Yoshikazu,

>> I could not reproduce this issue on head, but I confirm on 11.2.
>
> I could reproduce the stuck on 11.4.
>
>> Attached is a fix to apply on pg11.
>
> I confirm the stuck doesn't happen after applying your patch.

Ok, thanks for the feedback.

>> + /* under throttling we may have finished the last client above */
>> + if (remains == 0)
>> + break;
>
> If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients,
> a thread needs to wait the results or sleep. In that logic, there are the case
> that a thread tried to wait the results when there are no clients wait the
> results, and this causes the issue. This is happened when there are only
> CSTATE_THROTLE clients and pgbench timeout is occured. Those clients will be
> finished and "remains" will be 0.
>
> I confirmed above codes prevent such a case.

Yep.

> I almost think this is ready for committer,

Almost great, then.

> but I have one question. Is it better adding any check like if(maxsock
> != -1) before the select?
>
> else                /* no explicit delay, select without timeout */
> {
>    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
> }

I think that it is not necessary because this case cannot happen: If some
clients are still running (remains > 0), they are either sleeping, in
which case there would be a timeout, or they are waiting for something
from the server, otherwise the script could be advanced further so there
would be something else to do for the thread.

We could check this by adding "Assert(maxsock != -1);" before this select,
but I would not do that for a released version.

--
Fabien.


Reply | Threaded
Open this post in threaded view
|

RE: seems like a bug in pgbench -R

Imai, Yoshikazu
On Wed, July 24, 2019 at 7:02 PM, Fabien COELHO wrote:

> > but I have one question. Is it better adding any check like if(maxsock
> > != -1) before the select?
> >
> > else                /* no explicit delay, select without timeout */
> > {
> >    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); }
>
> I think that it is not necessary because this case cannot happen: If some
> clients are still running (remains > 0), they are either sleeping, in
> which case there would be a timeout, or they are waiting for something
> from the server, otherwise the script could be advanced further so there
> would be something else to do for the thread.

Ah, I understand.

> We could check this by adding "Assert(maxsock != -1);" before this select,
> but I would not do that for a released version.

Yeah I also imagined that we can use Assert, but ah, it's released version.
I got it. Thanks for telling that.

So I'll mark this ready for committer.

--
Yoshikazu Imai



Reply | Threaded
Open this post in threaded view
|

RE: seems like a bug in pgbench -R

Fabien COELHO-3

Hello Yoshikazu,

> |...] So I'll mark this ready for committer.

Ok, thanks for the review.

--
Fabien.


Reply | Threaded
Open this post in threaded view
|

Re: seems like a bug in pgbench -R

Tom Lane-2
Fabien COELHO <[hidden email]> writes:
>> |...] So I'll mark this ready for committer.

> Ok, thanks for the review.

LGTM, pushed.

                        regards, tom lane