Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
Hi,

I've been running some TPC-H benchmarks on master, to check if there's
something unexpected, and I ran into some annoying issues with Q17 and
Q20. I'll use Q17 as it's a bit simpler.

I think there are two related problems - with costing and with excessive
I/O due to using logical tapes.

Let's talk about the costing first. On the 75GB scale (with parallel
query disabled), the execution plan looks like this:

                                                       QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------
      Limit  (cost=16997740.10..16997740.12 rows=1 width=32)
        ->  Aggregate  (cost=16997740.10..16997740.12 rows=1 width=32)
              ->  Nested Loop  (cost=14204895.82..16997574.11 rows=66397 width=8)
                    Join Filter: (part.p_partkey = lineitem.l_partkey)
                    ->  Hash Join  (cost=14204895.25..16251060.84 rows=6640 width=40)
                          Hash Cond: (lineitem_1.l_partkey = part.p_partkey)
                          ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36)
                                Group Key: lineitem_1.l_partkey
                                Planned Partitions: 128
                                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)
                          ->  Hash  (cost=227058.33..227058.33 rows=6846 width=4)
                                ->  Seq Scan on part  (cost=0.00..227058.33 rows=6846 width=4)
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                    ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.57..112.30 rows=10 width=17)
                          Index Cond: (l_partkey = lineitem_1.l_partkey)
                          Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
     (16 rows)

and if I disable hash aggregate (or spilling to disk), it changes to this:

                                                            QUERY PLAN                                                        
     -------------------------------------------------------------------------------------------------------------------------
      Limit  (cost=44577524.39..44577524.40 rows=1 width=32)
        ->  Aggregate  (cost=44577524.39..44577524.40 rows=1 width=32)
              ->  Merge Join  (cost=41772792.17..44577358.39 rows=66397 width=8)
                    Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36)
                          Group Key: lineitem_1.l_partkey
                          ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9)
                                Sort Key: lineitem_1.l_partkey
                                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)
                    ->  Materialize  (cost=1.00..1191105.89 rows=205371 width=21)
                          ->  Nested Loop  (cost=1.00..1190592.46 rows=205371 width=21)
                                ->  Index Scan using part_pkey on part  (cost=0.43..329262.21 rows=6846 width=4)
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.57..125.51 rows=31 width=17)
                                      Index Cond: (l_partkey = part.p_partkey)
     (16 rows)

The problem is that the hashagg plan runs in ~1400 seconds, while the
groupagg only takes ~360. And per explain analyze, the difference really
is in the aggregation - if we subtract the seqscan, the sort+groupagg
takes about 310s:

     ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
           Group Key: lineitem_1.l_partkey
           ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
                 Sort Key: lineitem_1.l_partkey
                 Sort Method: external merge  Disk: 3569544kB
                 ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)

while the hashagg takes ~1330s:

     ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
           Group Key: lineitem_1.l_partkey
           Planned Partitions: 128
           Peak Memory Usage: 4249 kB
           Disk Usage: 26321840 kB
           HashAgg Batches: 16512
           ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)

And that's while only writing 26GB, compared to 35GB in the sorted plan,
and with cost being ~16M vs. ~43M (so roughly inverse).

OK, let's make the hashagg plan more expensive - that'll fix it, right?
But how do you do that? I might lower work_mem, say from 4MB to 1MB,
which gets us from ~16M

     ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36)
           Group Key: lineitem_1.l_partkey
           Planned Partitions: 128
           ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)

to ~20M (I'm a bit surprised that the planned partitions dropped 4x, but
I suspect there's an explanation for that).

      ->  HashAggregate  (cost=17727162.59..20632321.45 rows=6206695 width=36)
            Group Key: lineitem_1.l_partkey
            Planned Partitions: 32
            ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)

Anyway, this did not really solve anything, apparently. The cost is
still much lower than for groupagg, and moreover I don't want to lower
work_mem - I want to increase the cost for a given work_mem value. And
lowering work_mem also increases the sort cost estimate, of course.

As I'll show in a minute, I believe most of this is due to the I/O pattern
on the logical tapes, which is very different between sort and hashagg.
So it'd be natural to consider seq_page_cost/random_page_cost on the
temp tablespace, but that's not how it works - we just ignore that :-(
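
To illustrate the kind of disk-cost term this is about (purely
hypothetical - not what cost_agg() currently does, and all names below
are made up): if the spilled I/O behaves more like random I/O, the
per-page charge could blend seq_page_cost and random_page_cost instead
of treating every spill page as sequential.

    /*
     * Purely illustrative - NOT the actual cost_agg() code.  The point is
     * that a spill-cost term could blend seq_page_cost and random_page_cost
     * depending on how random the spill I/O is expected to be, instead of
     * charging everything as sequential.  All names here are made up.
     */
    static double
    hashagg_spill_io_cost(double pages_spilled,
                          double seq_page_cost,
                          double random_page_cost,
                          double random_fraction)   /* 0.0 .. 1.0 */
    {
        double  per_page = (1.0 - random_fraction) * seq_page_cost +
                           random_fraction * random_page_cost;

        /* each spilled page is written once and read back once */
        return 2.0 * pages_spilled * per_page;
    }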


Why do I think this is due to a difference in I/O pattern on the logical
tape set? I've moved the temporary tablespace to a separate SSD device,
and used iosnoop [1] to collect all I/O requests associated with this
query. Attached are four charts showing blocks (sectors) accessed over
time, both for the groupagg and hashagg plans.


1) sort + groupagg

For groupagg (tempio-sort.png) the chart looks a bit chaotic, but it's
reasonable - it shows the sort does merges, etc. Nothing particularly
surprising, IMHO.

It's also interesting to look at statistics of block sizes, and deltas
of the blocks, for different request types. The most common block sizes
look something like this (the last column is the percentage of all
requests with the same request type):

      type |  bytes  | count |   pct  
     ------+---------+-------+-------
      RA   |  131072 | 26034 | 59.92
      RA   |   16384 |  6160 | 14.18
      RA   |    8192 |  3636 |  8.37
      RA   |   32768 |  3406 |  7.84
      RA   |   65536 |  3270 |  7.53
      RA   |   24576 |   361 |  0.83
      ...
      W    | 1310720 |  8070 | 34.26
      W    |  262144 |  1213 |  5.15
      W    |  524288 |  1056 |  4.48
      W    | 1056768 |   689 |  2.93
      W    |  786432 |   292 |  1.24
      W    |  802816 |   199 |  0.84
      ...

The writes are buffered and so are done by kworkers, which seem to be
able to coalesce them into fairly large chunks (e.g. 34% are 1280kB).
The reads come from the postgres backend, and generally are 128kB reads.

The deltas (in 512B sectors) are mostly consistent with this:

      type | block_delta | count |   pct
     ------+-------------+-------+-------
      RA   |         256 | 13432 | 30.91
      RA   |          16 |  3291 |  7.57
      RA   |          32 |  3272 |  7.53
      RA   |          64 |  3266 |  7.52
      RA   |         128 |  2877 |  6.62
      RA   |        1808 |  1278 |  2.94
      RA   |       -2320 |   483 |  1.11
      RA   |       28928 |   386 |  0.89
      ...
      W    |        2560 |  7856 | 33.35
      W    |        2064 |  4921 | 20.89
      W    |        2080 |   586 |  2.49
      W    |       30960 |   300 |  1.27
      W    |        2160 |   253 |  1.07
      W    |        1024 |   248 |  1.05
      ...

I believe this suggests most of the I/O is pretty sequential. E.g. 31%
of the reads are 256 sectors (128kB) apart, which is consistent with the
128kB reads.
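
For reference, aggregating the trace into these histograms only takes
something like the following (a simplified, self-contained sketch, not
necessarily the exact tooling used here - it assumes the iosnoop output
has been reduced to three columns per line: request type, starting
sector and size in bytes):

    /*
     * Simplified sketch of the aggregation behind the tables above.  Reads
     * lines of "<type> <start_sector> <bytes>" on stdin and prints two
     * histograms: request sizes, and deltas (in 512B sectors) between
     * consecutive requests of the same class (reads vs. writes).
     * Percentages and sorting are left out for brevity.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXBUCKETS 4096

    typedef struct { char type[8]; long key; long count; } Bucket;

    static Bucket sizes[MAXBUCKETS], deltas[MAXBUCKETS];
    static int nsizes, ndeltas;

    static void
    bump(Bucket *tab, int *n, const char *type, long key)
    {
        for (int i = 0; i < *n; i++)
            if (tab[i].key == key && strcmp(tab[i].type, type) == 0)
            {
                tab[i].count++;
                return;
            }
        if (*n < MAXBUCKETS)
        {
            snprintf(tab[*n].type, sizeof(tab[*n].type), "%s", type);
            tab[*n].key = key;
            tab[*n].count = 1;
            (*n)++;
        }
    }

    int
    main(void)
    {
        char    type[8];
        long    sector, bytes;
        long    last_read = -1, last_write = -1;

        while (scanf("%7s %ld %ld", type, &sector, &bytes) == 3)
        {
            long   *last = (type[0] == 'W') ? &last_write : &last_read;

            bump(sizes, &nsizes, type, bytes);
            if (*last >= 0)
                bump(deltas, &ndeltas, type, sector - *last);
            *last = sector;
        }

        printf("-- request sizes (bytes) --\n");
        for (int i = 0; i < nsizes; i++)
            printf("%-4s %10ld %10ld\n", sizes[i].type, sizes[i].key, sizes[i].count);

        printf("-- deltas (512B sectors) --\n");
        for (int i = 0; i < ndeltas; i++)
            printf("%-4s %10ld %10ld\n", deltas[i].type, deltas[i].key, deltas[i].count);

        return 0;
    }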


2) hashagg

The I/O pattern is illustrated by the tempio-hash.png chart, and it's
clearly very different from the sort one. We're reading over and over
in a zig-zag way. I'm pretty sure there are ~128 cycles, matching the
number of partitions in the explain analyze output, which end up being
interleaved in the temporary file.

But even at the partition level this is not very sequential - there
are two "zoom" charts showing smaller parts in more detail, and there's
a very obvious nested zig-zag pattern.

Also, let's look at the block / delta stats:

      type |  bytes  |  count  |    pct
     ------+---------+---------+--------
      RA   |    8192 | 3087724 |  95.42
      RA   |   24576 |   69511 |   2.15
      RA   |   16384 |   49297 |   1.52
      RA   |   32768 |   15589 |   0.48
      ...
      W    |    8192 |  321089 |  65.72
      W    |   16384 |   74097 |  15.17
      W    |   24576 |   27785 |   5.69
      W    | 1310720 |   16860 |   3.45
      W    |   32768 |   13823 |   2.83
      W    |   40960 |    7771 |   1.59
      W    |   49152 |    4767 |   0.98
      ...

Well, that's not great - we're not really coalescing writes or reads,
everything is pretty much 8kB blocks. The writes especially are somewhat
surprising/concerning, because they show the kernel is unable to combine
the requests.

The deltas look very different too:

      type | block_delta | count |   pct
     ------+-------------+-------+-------
      RA   |        2016 | 72399 |  2.24
      RA   |        2032 | 72351 |  2.24
      RA   |        1984 | 72183 |  2.23
      RA   |        2000 | 71964 |  2.22
      RA   |        2048 | 71718 |  2.22
      RA   |        2064 | 71387 |  2.21
      RA   |        1968 | 71363 |  2.21
      RA   |        1952 | 70412 |  2.18
      RA   |        2080 | 70189 |  2.17
      RA   |        2096 | 69568 |  2.15
      RA   |        1936 | 69109 |  2.14
      RA   |        1920 | 67660 |  2.09
      RA   |        2112 | 67248 |  2.08
      RA   |        1904 | 66026 |  2.04
     ...

There's no clear winner matching the block size, or anything. In fact,
the deltas oscillate around 2000 sectors, i.e. 1MB. And 128 partitions
multiplied by one 8kB block per partition is ... 1MB (tadaaaa!).

This however makes any read-ahead attempts ineffective :-(
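
To make the arithmetic concrete, here's a tiny self-contained program
(not PostgreSQL code, just a toy model of one-block-per-partition
assignment) showing where the blocks of a single partition end up when
128 partitions take turns grabbing one 8kB block each:

    /*
     * Toy model of the block interleaving described above - not PostgreSQL
     * code.  NPART partitions take turns being assigned one 8kB block each;
     * we print where the blocks of partition 0 land and how far apart they
     * are, which is what a reader of that partition sees.
     */
    #include <stdio.h>

    #define NPART 128               /* matches "Planned Partitions: 128" */

    int
    main(void)
    {
        long    next_free = 0;      /* next free block in the temp file */
        long    prev = -1;

        for (int round = 0; round < 4; round++)
            for (int p = 0; p < NPART; p++)
            {
                long    blk = next_free++;

                if (p == 0)
                {
                    if (prev >= 0)
                        printf("partition 0: block %ld, delta %ld blocks (%ld kB)\n",
                               blk, blk - prev, (blk - prev) * 8);
                    prev = blk;
                }
            }

        return 0;
    }

Every block of partition 0 lands 128 blocks (1MB) past the previous one,
which is roughly the ~2000-sector stride in the delta table above.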

And let me repeat - this is on a machine with temp tablespace moved to
an SSD, so the random I/O is not entirely terrible. On a different box
with temp tablespace on 3x SATA RAID, the impact is much worse.


This kinda makes me question whether logical tapes are the right tool
for hashagg. I've read the explanation in logtape.c of why it's about the
same amount of I/O as using separate files, but IMO that only really
works for I/O patterns similar to merge sort - the more I think about
this, the more I'm convinced we should just do what hashjoin is doing.

But maybe I'm wrong, and logical tapes are the best thing we can do
here. In that case I think we need to improve the costing, so that
it reflects the very different I/O pattern.


[1] https://github.com/brendangregg/perf-tools/blob/master/iosnoop

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments: tempio-sort.png, tempio-hash.png, tempio-hash-1-zoom.png, tempio-hash-2-zoom.png, plans.txt

Re: Trouble with hashagg spill I/O pattern and costing

Jeff Davis-8
On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
> I think there are two related problem - with costing and with
> excessive
> I/O due to using logical tapes.

Thank you for the detailed analysis. I am still digesting this
information.

> This kinda makes me question whether logical tapes are the right tool
> for hashagg. I've read the explanation in logtape.c why it's about
> the
> same amount of I/O as using separate files, but IMO that only really
> works for I/O patters similar to merge sort - the more I think about
> this, the more I'm convinced we should just do what hashjoin is
> doing.

Fundamentally, sort writes sequentially and reads randomly; while
HashAgg writes randomly and reads sequentially.

If the random writes of HashAgg end up fragmented too much on disk,
then clearly the sequential reads are not so sequential anyway. The
only way to avoid fragmentation on disk is to preallocate for the
tape/file.

BufFile (relying more on the OS) would probably do a better job of
preallocating the disk space in a useful way; whereas logtape.c makes
it easier to manage buffers and the overall number of files created
(thereby allowing higher fanout of partitions).

We have a number of possibilities here:

1. Improve costing to reflect that HashAgg is creating more random IOs
than Sort.
2. Reduce the partition fanout in the hopes that the OS does a better
job with readahead.
3. Switch back to BufFile, in which case we probably need to reduce the
fanout for other reasons.
4. Change logtape.c to allow preallocation or to write in larger
blocks.
5. Change BufFile to allow more control over buffer usage, and switch
to that.

#1 or #2 are the least invasive, and I think we can get a satisfactory
solution by combining those.

I saw good results with the high fanout and low work_mem when there is
still a lot of system memory. That's a nice benefit, but perhaps it's
safer to use a lower fanout (which will lead to recursion) until we get
a better handle on the IO patterns.

Perhaps you can try recompiling with a lower max partitions and rerun
the query? How much would we have to lower it for either the cost to
approach reality or the OS readahead to become effective?
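
(The cap in question is a compile-time constant in nodeAgg.c; the names
and values below are quoted from memory, so treat them as approximate
and worth double-checking. Lowering HASHAGG_MAX_PARTITIONS before
rebuilding is the experiment being suggested.)

    /* src/backend/executor/nodeAgg.c (approximate - quoted from memory) */
    #define HASHAGG_PARTITION_FACTOR 1.50
    #define HASHAGG_MIN_PARTITIONS 4
    #define HASHAGG_MAX_PARTITIONS 1024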

Regards,
        Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Tue, May 19, 2020 at 09:27:34AM -0700, Jeff Davis wrote:

>On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
>> I think there are two related problem - with costing and with
>> excessive
>> I/O due to using logical tapes.
>
>Thank you for the detailed analysis. I am still digesting this
>information.
>
>> This kinda makes me question whether logical tapes are the right tool
>> for hashagg. I've read the explanation in logtape.c why it's about
>> the
>> same amount of I/O as using separate files, but IMO that only really
>> works for I/O patters similar to merge sort - the more I think about
>> this, the more I'm convinced we should just do what hashjoin is
>> doing.
>
>Fundamentally, sort writes sequentially and reads randomly; while
>HashAgg writes randomly and reads sequentially.
>

Not sure. I think the charts and stats of iosnoop data show that an
awful lot of the reads during sort are actually pretty sequential.
Moreover, sort manages to read the data in much larger blocks - 128kB
instead of just 8kB (which is what hashagg seems to be doing).

I wonder why that is and whether we could achieve that for hashagg too ...

>If the random writes of HashAgg end up fragmented too much on disk,
>then clearly the sequential reads are not so sequential anyway. The
>only way to avoid fragmentation on disk is to preallocate for the
>tape/file.
>

And is there a way to pre-allocate larger chunks? Presumably we could
assign the blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
instead of just a single block. I haven't seen anything like that in
logtape.c, though ...

>BufFile (relying more on the OS) would probably do a better job of
>preallocating the disk space in a useful way; whereas logtape.c makes
>it easier to manage buffers and the overall number of files created
>(thereby allowing higher fanout of partitions).
>
>We have a number of possibilities here:
>
>1. Improve costing to reflect that HashAgg is creating more random IOs
>than Sort.

I think we'll need to do something about this, but I think we should try
improving the behavior first and then model the costing based on that.

>2. Reduce the partition fanout in the hopes that the OS does a better
>job with readahead.

I doubt this will make a significant difference. I think the problem is
the partitions end up interleaved way too much in the temp file, and I
don't see how a lower fanout would fix that.

BTW what do you mean when you say "fanout"? Do you mean how fast we
increase the number of partitions, or some parameter in particular?

>3. Switch back to BufFile, in which case we probably need to reduce the
>fanout for other reasons.

Maybe, although that seems pretty invasive post beta1.

>4. Change logtape.c to allow preallocation or to write in larger
>blocks.

I think this is what I suggested above (allocating 16 blocks at a time,
or something). I wonder how wasteful this would be, but I think not very
much. Essentially, with 1024 partitions and pre-allocating space in
128kB chunks, that means 128MB may end up unused, which seems ok-ish,
and I guess we could further restrict that by starting with a lower value
and gradually increasing the number. Or something like that ...

>5. Change BufFile to allow more control over buffer usage, and switch
>to that.
>

Maybe. I don't recall exactly what the issue with buffer usage is, but I
think it has the same invasiveness issue as (3). OTOH it's what hashjoin
does, and we've lived with it for ages ...

>#1 or #2 are the least invasive, and I think we can get a satisfactory
>solution by combining those.
>

OK. I think tweaking the costing (and essentially reverting to what 12
does for those queries) is perfectly reasonable. But if we can actually
get some speedup thanks to hashagg, even better.

>I saw good results with the high fanout and low work_mem when there is
>still a lot of system memory. That's a nice benefit, but perhaps it's
>safer to use a lower fanout (which will lead to recursion) until we get
>a better handle on the IO patterns.
>

I don't know how much we can rely on that - once we push some of the
data out of the page cache, we run into the issues I described. The
trouble is people may not have enough memory to keep everything in cache,
otherwise they might just as well bump up work_mem and not spill at all.

>Perhaps you can try recompiling with a lower max partitions and rerun
>the query? How much would we have to lower it for either the cost to
>approach reality or the OS readahead to become effective?
>

I can try that, of course. Which parameters should I tweak / how?

I can also try running it with BufFile, in case you prepare a WIP patch.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Jeff Davis-8
On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>
> And if there a way to pre-allocate larger chunks? Presumably we could
> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
> 8kB)
> instead of just single block. I haven't seen anything like that in
> tape.c, though ...

It turned out to be simple (at least a POC) so I threw together a
patch. I just added a 32-element array of block numbers to each tape.
When we need a new block, we retrieve a block number from that array;
or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
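
In pseudo-C, the idea is roughly this (a minimal sketch rather than the
attached patch - the struct and function names are invented, while
LogicalTapeSet and ltsGetFreeBlock() are existing logtape.c pieces):

    /*
     * Minimal sketch of the approach described above (not the attached
     * patch; names invented for illustration).  Each tape keeps a small
     * cache of free block numbers and hands them out one by one, refilling
     * the cache from ltsGetFreeBlock() when it runs dry.  Since
     * ltsGetFreeBlock() prefers the lowest free block (or extends the
     * file), each refill tends to be a contiguous run, so one tape's
     * writes land close together on disk.
     */
    #define TAPE_PREALLOC 32

    typedef struct TapePreallocCache
    {
        long    blocks[TAPE_PREALLOC];  /* cached free block numbers */
        int     used;                   /* how many were handed out */
        int     filled;                 /* how many entries are valid */
    } TapePreallocCache;

    static long
    tape_next_block(LogicalTapeSet *lts, TapePreallocCache *cache)
    {
        if (cache->used >= cache->filled)
        {
            for (int i = 0; i < TAPE_PREALLOC; i++)
                cache->blocks[i] = ltsGetFreeBlock(lts);
            cache->used = 0;
            cache->filled = TAPE_PREALLOC;
        }
        return cache->blocks[cache->used++];
    }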

I reproduced the problem on a smaller scale (330M groups, ~30GB of
memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.

Unpatched master:
   Sort: 250s
   HashAgg: 310s
Patched master:
   Sort: 245s
   HashAgg: 262s

That's a nice improvement for such a simple patch. We can tweak the
number of blocks to preallocate, or do other things like double from a
small number up to a maximum. Also, a proper patch would probably
release the blocks back as free when the tape was rewound.

As long as the number of block numbers to preallocate is not too large,
I don't think we need to change the API. It seems fine for sort to do
the same thing, even though there's not any benefit.

Regards,
        Jeff Davis



Attachment: logtape-prealloc.patch

Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:

>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And if there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>
>I reproduced the problem on a smaller scale (330M groups, ~30GB of
>memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.
>
>Unpatched master:
>   Sort: 250s
>   HashAgg: 310s
>Patched master:
>   Sort: 245s
>   HashAgg: 262s
>
>That's a nice improvement for such a simple patch. We can tweak the
>number of blocks to preallocate, or do other things like double from a
>small number up to a maximum. Also, a proper patch would probably
>release the blocks back as free when the tape was rewound.
>
>As long as the number of block numbers to preallocate is not too large,
>I don't think we need to change the API. It seems fine for sort to do
>the same thing, even though there's not any benefit.
>

I gave it a try on the machine with temp tablespace on SSD, and I can
confirm it improves performance. I've tried with different work_mem
values and I've also increased the number of pre-allocated blocks to 64
and 128 blocks, and the numbers look like this:

master

                sort      hash
    ----------------------------
       4MB       335      1331
     128MB       220      1208


patched (32)

                sort      hash
    ----------------------------
       4MB       344       685
     128MB       217       641


patched (64)

                 sort      hash
    ----------------------------
       4MB       329       545
     128MB       214       493

patched (128)

                sort      hash
    ----------------------------
       4MB       331       478
     128MB       222       434


I agree that's pretty nice. I wonder how far we would need to go before
reaching a plateau. I'll try this on the other machine with the temporary
tablespace on SATA, but that'll take longer.

The I/O pattern changed significantly - it's not visible on the charts,
so I'm not attaching them. But the statistics of block sizes and "gaps"
are pretty clear.


size of I/O requests
--------------------

a) master

      type |  bytes  |  count  |  pct  
     ------+---------+---------+--------
      RA   |    8192 | 2905948 |  95.83
      RA   |   24576 |   63470 |   2.09
      RA   |   16384 |   40155 |   1.32
      W    |    8192 |  149295 |  52.85
      W    |   16384 |   51781 |  18.33
      W    |   24576 |   22247 |   7.88
      W    | 1310720 |   15493 |   5.48
      W    |   32768 |   11856 |   4.20

b) patched, 32 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 247686 |  41.75
      RA   |    8192 |  95746 |  16.14
      RA   |   16384 |  82314 |  13.87
      RA   |   32768 |  82146 |  13.85
      RA   |   65536 |  82126 |  13.84
      W    | 1310720 |  16815 |  52.19
      W    |  262144 |   3628 |  11.26
      W    |  524288 |   2791 |   8.66

c) patched, 64 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 213556 |  56.18
      RA   |    8192 |  47663 |  12.54
      RA   |   16384 |  39358 |  10.35
      RA   |   32768 |  39308 |  10.34
      RA   |   65536 |  39304 |  10.34
      W    | 1310720 |  18132 |  65.27
      W    |  524288 |   3722 |  13.40
      W    |  262144 |    581 |   2.09
      W    | 1048576 |    405 |   1.46
      W    |    8192 |    284 |   1.02

d) patched, 128 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 200816 |  70.93
      RA   |    8192 |  23640 |   8.35
      RA   |   16384 |  19324 |   6.83
      RA   |   32768 |  19279 |   6.81
      RA   |   65536 |  19273 |   6.81
      W    | 1310720 |  18000 |  65.91
      W    |  524288 |   2074 |   7.59
      W    | 1048576 |    660 |   2.42
      W    |    8192 |    409 |   1.50
      W    |  786432 |    354 |   1.30

Clearly, the I/O requests are much larger - both reads and writes
shifted from 8kB to much larger ones, and the larger the number of
blocks the more significant the shift is. This means the workload is
getting more "sequential" and the write combining / read-ahead becomes
more effective.


deltas between I/O requests
---------------------------

I'll only show reads to save space, it's about the same for writes.

a) master

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 569237 | 18.77
      RA   |         240 | 475182 | 15.67
      RA   |         272 | 437260 | 14.42
      RA   |         224 | 328604 | 10.84
      RA   |         288 | 293628 |  9.68
      RA   |         208 | 199530 |  6.58
      RA   |         304 | 181695 |  5.99
      RA   |         192 | 109472 |  3.61
      RA   |         320 | 105211 |  3.47
      RA   |         336 |  57423 |  1.89

b) patched, 32 blocks

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 165071 | 27.82
      RA   |          32 |  82129 | 13.84
      RA   |          64 |  82122 | 13.84
      RA   |         128 |  82077 | 13.83
      RA   |          16 |  82042 | 13.83
      RA   |        7440 |  45168 |  7.61
      RA   |        7952 |   9838 |  1.66

c) patched, 64 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 173737 | 45.70
      RA   |          32 |  39301 | 10.34
      RA   |          64 |  39299 | 10.34
      RA   |         128 |  39291 | 10.34
      RA   |          16 |  39250 | 10.32
      RA   |       15120 |  21202 |  5.58
      RA   |       15376 |   4448 |  1.17

d) patched, 128 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 180955 | 63.91
      RA   |          32 |  19274 |  6.81
      RA   |          64 |  19273 |  6.81
      RA   |         128 |  19264 |  6.80
      RA   |          16 |  19203 |  6.78
      RA   |       30480 |   9835 |  3.47

The way I understand it, this needs to be interpreted together with
block size stats - in a perfectly sequential workload the two stats
would match. For master that's clearly not the case - the most common
read request size is 8kB, but the most common delta is 128kB (256
sectors, which is the read-ahead for the SSD device). The patched
results are much closer, mostly thanks to switching to 128kB reads.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote:

>
> ...
>
>The problem is that the hashagg plan runs in ~1400 seconds, while the
>groupagg only takes ~360. And per explain analyze, the difference really
>is in the aggregation - if we subtract the seqscan, the sort+groupagg
>takes about 310s:
>
>    ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
>          Group Key: lineitem_1.l_partkey
>          ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
>                Sort Key: lineitem_1.l_partkey
>                Sort Method: external merge  Disk: 3569544kB
>                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)
>
>while the hashagg takes ~1330s:
>
>    ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
>          Group Key: lineitem_1.l_partkey
>          Planned Partitions: 128
>          Peak Memory Usage: 4249 kB
>          Disk Usage: 26321840 kB
>          HashAgg Batches: 16512
>          ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)
>
>And that's while only writing 26GB, compared to 35GB in the sorted plan,
>and with cost being ~16M vs. ~43M (so roughly inverse).
>

I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB
in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan
writes out *way less* data to the temp file.

The reason is revealed by explain verbose:

   ->  GroupAggregate
         Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
         Group Key: lineitem_1.l_partkey
         ->  Sort
               Output: lineitem_1.l_partkey, lineitem_1.l_quantity
               Sort Key: lineitem_1.l_partkey
               ->  Seq Scan on public.lineitem lineitem_1
                     Output: lineitem_1.l_partkey, lineitem_1.l_quantity

   ->  HashAggregate
         Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
         Group Key: lineitem_1.l_partkey
         ->  Seq Scan on public.lineitem lineitem_1
               Output: lineitem_1.l_orderkey, lineitem_1.l_partkey,
                       lineitem_1.l_suppkey, lineitem_1.l_linenumber,
                       lineitem_1.l_quantity, lineitem_1.l_extendedprice,
                       lineitem_1.l_discount, lineitem_1.l_tax,
                       lineitem_1.l_returnflag, lineitem_1.l_linestatus,
                       lineitem_1.l_shipdate, lineitem_1.l_commitdate,
                       lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct,
                       lineitem_1.l_shipmode, lineitem_1.l_comment

It seems that in the hashagg case we're not applying projection in the
seqscan, forcing us to serialize way too much data (the whole lineitem
table, essentially).

It's probably still worth tweaking the I/O pattern, I think.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 03:41:22PM +0200, Tomas Vondra wrote:

>On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote:
>>
>>...
>>
>>The problem is that the hashagg plan runs in ~1400 seconds, while the
>>groupagg only takes ~360. And per explain analyze, the difference really
>>is in the aggregation - if we subtract the seqscan, the sort+groupagg
>>takes about 310s:
>>
>>   ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
>>         Group Key: lineitem_1.l_partkey
>>         ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
>>               Sort Key: lineitem_1.l_partkey
>>               Sort Method: external merge  Disk: 3569544kB
>>               ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)
>>
>>while the hashagg takes ~1330s:
>>
>>   ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
>>         Group Key: lineitem_1.l_partkey
>>         Planned Partitions: 128
>>         Peak Memory Usage: 4249 kB
>>         Disk Usage: 26321840 kB
>>         HashAgg Batches: 16512
>>         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)
>>
>>And that's while only writing 26GB, compared to 35GB in the sorted plan,
>>and with cost being ~16M vs. ~43M (so roughly inverse).
>>
>
>I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB
>in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan
>writes out *way less* data to the temp file.
>
>The reason is revealed by explain verbose:
>
>  ->  GroupAggregate
>        Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
>        Group Key: lineitem_1.l_partkey
>        ->  Sort
>              Output: lineitem_1.l_partkey, lineitem_1.l_quantity
>              Sort Key: lineitem_1.l_partkey
>              ->  Seq Scan on public.lineitem lineitem_1
>                    Output: lineitem_1.l_partkey, lineitem_1.l_quantity
>
>  ->  HashAggregate
>        Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
>        Group Key: lineitem_1.l_partkey
>        ->  Seq Scan on public.lineitem lineitem_1
>              Output: lineitem_1.l_orderkey, lineitem_1.l_partkey,
>                      lineitem_1.l_suppkey, lineitem_1.l_linenumber,
>                      lineitem_1.l_quantity, lineitem_1.l_extendedprice,
>                      lineitem_1.l_discount, lineitem_1.l_tax,
>                      lineitem_1.l_returnflag, lineitem_1.l_linestatus,
>                      lineitem_1.l_shipdate, lineitem_1.l_commitdate,
>                      lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct,
>                      lineitem_1.l_shipmode, lineitem_1.l_comment
>
>It seems that in the hashagg case we're not applying projection in the
>seqscan, forcing us to serialize way much data (the whole lineitem
>table, essentially).
>
>It's probably still worth tweaking the I/O pattern, I think.
>
OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to
CP_SMALL_TLIST) addresses this for me. I've only tried it on the patched
version that pre-allocates 128 blocks, and the results seem pretty nice:

                sort      hash      hash+tlist
    ------------------------------------------
       4MB       331       478            188
     128MB       222       434            210

which I guess is what we wanted ...
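
For reference, the change is a one-liner along these lines (assuming the
relevant call site is the one in create_agg_plan() in createplan.c - the
attached patch is the authoritative version):

    /*
     * Sketch of the tweak (assuming the call site is the one in
     * create_agg_plan() in src/backend/optimizer/plan/createplan.c).
     * Previously the child was created with CP_LABEL_TLIST, i.e. it kept
     * its full target list; asking for CP_SMALL_TLIST makes the child
     * project away columns the aggregate doesn't need, so a spilling
     * HashAggregate no longer serializes the whole input rows.
     */
    subplan = create_plan_recurse(root, best_path->subpath, CP_SMALL_TLIST);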

I'll give it a try on the other machine (temp on SATA), but I don't see
why it would not behave similarly nicely.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: agg-tlist-fix.patch

Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 02:12:55AM +0200, Tomas Vondra wrote:
>
> ...
>
>I agree that's pretty nice. I wonder how far would we need to go before
>reaching a plateau. I'll try this on the other machine with temporary
>tablespace on SATA, but that'll take longer.
>

OK, I've managed to get some numbers from the other machine, with the 75GB
data set and temp tablespace on SATA RAID. I haven't collected I/O data
using iosnoop this time, because we already know how that changes from
the other machine. I've also only done this with 128MB work_mem, because
of how long a single run takes, and with 128 blocks pre-allocation.

The patched+tlist column means both pre-allocation and the tlist tweak
I posted to this thread a couple of minutes ago:

                master       patched       patched+tlist
    -----------------------------------------------------
     sort         485            472                 462
     hash       24686           3060                 559

So the pre-allocation makes it 10x faster, and the tlist tweak makes it
5x faster. Not bad, I guess.

Note: I've slightly tweaked read-ahead on the RAID device(s) on those
patched runs, but the effect was pretty negligible (compared to other
patched runs with the old read-ahead setting).


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Robert Haas
On Thu, May 21, 2020 at 10:45 AM Tomas Vondra
<[hidden email]> wrote:
> So the pre-allocation makes it 10x faster, and the tlist tweak makes it
> 5x faster. Not bad, I guess.

That is pretty great stuff, Tomas.

FWIW, I agree that CP_SMALL_TLIST seems like the right thing here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Trouble with hashagg spill I/O pattern and costing

Jeff Davis-8
On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote:
> OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST
> to
> CP_SMALL_TLIST) addresses this for me.

Great!

There were a couple of plan changes where it introduced a Subquery Scan.
I'm not sure that I understand why it's doing that - can you verify that
it is a reasonable thing to do?

Aside from that, feel free to commit it.

Regards,
        Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote:

>On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote:
>> OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST
>> to
>> CP_SMALL_TLIST) addresses this for me.
>
>Great!
>
>There were a couple plan changes where it introduced a Subquery Scan.
>I'm not sure that I understand why it's doing that, can you verify that
>it is a reasonable thing to do?
>
>Aside from that, feel free to commit it.
>

It's doing that because we're doing projection everywhere, even in cases
when it may not be necessary - but I think that's actually OK.

At first I thought we might only do it conditionally when we expect to
spill to disk, but that wouldn't work for cases where we only realize we
need to spill to disk during execution.

So I think the plan changes are correct and expected.

I think we should do the pre-allocation patch too. I haven't tried yet
but I believe the tlist fix alone won't do nearly as well.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 08:34:05PM +0200, Tomas Vondra wrote:
>On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote:
>
> ...
>
>I think we should do the pre-allocation patch too. I haven't tried yet
>but I believe the tlist fix alone won't do nearly as good.
>

I've done some measurements on the smaller (SSD) machine, and the
comparison looks like this:

             sort   hash   hash+prealloc+tlist   hash+tlist
    --------------------------------------------------------
       4MB    331    478                   188          330
     128MB    222    434                   210          350


The last column is master with the tlist tweak alone - it's better than
hashagg on master alone, but it's not nearly as good as with both tlist
and prealloc patches.

I can't test this on the larger box with SATA temporary tablespace at
the moment (other tests are running), but I believe the difference will
be even more pronounced there.

I don't think we're under a lot of pressure - beta1 is out anyway, so we
have time to do proper testing first.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Jeff Davis-8
On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote:
> The last column is master with the tlist tweak alone - it's better
> than
> hashagg on master alone, but it's not nearly as good as with both
> tlist
> and prealloc patches.

Right, I certainly think we should do the prealloc change, as well.

I'm tweaking the patch to be a bit more flexible. I'm thinking we
should start the preallocation list size at ~8 and then double it up to
~128 (depending on your results). That would reduce the waste in case
we have a large number of small partitions.

Regards,
        Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:

>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And if there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just single block. I haven't seen anything like that in
>> tape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>

I think the PoC patch goes in the right direction. I have two ideas for
how to improve it a bit:

1) Instead of assigning the pages one by one, we can easily extend the
API to allow getting a range of blocks, so that we don't need to call
ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
with the requested number of blocks. And we could keep just a min/max
of free blocks, not an array with a fixed number of elements.

2) We could make it self-tuning, by increasing the number of blocks
we pre-allocate. So every time we exhaust the range, we double the
number of blocks (with a reasonable maximum, like 1024 or so). Or we
might just increment it by 32, or something.

IIUC the danger of pre-allocating blocks is that we might not fill them,
resulting in a temp file much larger than necessary. It might be harmless
on some (most?) current filesystems that don't actually allocate space
for blocks that are never written, but it also confuses our accounting
of temporary file sizes. So we should try to limit that, and growing the
number of pre-allocated blocks over time seems reasonable.

Both (1) and (2) seem fairly simple, not much more complex than the
current PoC patch.

I also wonder if we could collect / report useful statistics about I/O
on the temporary file, not just the size. I mean, how many pages we've
written/read, how sequential it was, etc. But some of that is probably
only visible at the OS level (e.g. we have no insight into how the
kernel combines writes in the page cache, etc.). This is clearly a matter
for v14, though.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 12:04:19PM -0700, Jeff Davis wrote:

>On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote:
>> The last column is master with the tlist tweak alone - it's better
>> than
>> hashagg on master alone, but it's not nearly as good as with both
>> tlist
>> and prealloc patches.
>
>Right, I certainly think we should do the prealloc change, as well.
>
>I'm tweaking the patch to be a bit more flexible. I'm thinking we
>should start the preallocation list size ~8 and then double it up to
>~128 (depending on your results). That would reduce the waste in case
>we have a large number of small partitions.
>

You're reading my mind ;-)

I don't think 128 is necessarily the maximum we should use - it's just
that I haven't tested higher values. I wouldn't be surprised if higher
values made it a bit faster. But we can test and tune that, and I agree
with growing the number of pre-allocated blocks over time.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Jeff Davis-8
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
> 1) Instead of assigning the pages one by one, we can easily extend
> the
> API to allow getting a range of blocks, so that we don't need to call
> ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
> with the requested number of blocks.

ltsGetFreeBlock() just draws one element from a minheap. Is there some
more efficient way to get many elements from a minheap at once?

>  And we could keep just a min/max
> of free blocks, not an array with fixed number of elements.

I don't quite know what you mean. Can you elaborate?

Regards,
        Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 12:40:23PM -0700, Jeff Davis wrote:

>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>> 1) Instead of assigning the pages one by one, we can easily extend
>> the
>> API to allow getting a range of blocks, so that we don't need to call
>> ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
>> with the requested number of blocks.
>
>ltsGetFreeBlock() just draws one element from a minheap. Is there some
>more efficient way to get many elements from a minheap at once?
>
>>  And we could keep just a min/max
>> of free blocks, not an array with fixed number of elements.
>
>I don't quite know what you mean. Can you elaborate?
>

Ah, I forgot there's an internal minheap thing - I thought we were just
incrementing some internal counter or something like that, but with the
minheap we can't just get a range of blocks. So just disregard that,
you're right, we need the array.



regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Jeff Davis-8
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
> 2) We could make it self-tuning, by increasing the number of blocks
> we pre-allocate. So every time we exhaust the range, we double the
> number of blocks (with a reasonable maximum, like 1024 or so). Or we
> might just increment it by 32, or something.

Attached a new version that uses the doubling behavior, and cleans it
up a bit. It also returns the unused prealloc blocks back to
lts->freeBlocks when the tape is rewound for reading.
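
Roughly, the idea is along these lines (a simplified sketch with
illustrative field and constant names - the attached patch is the
authoritative version):

    /*
     * Sketch of the doubling behavior (not the attached patch; field and
     * constant names invented for illustration).  The per-tape batch of
     * preallocated blocks starts small and doubles up to a cap, so tapes
     * that spill little waste few blocks while heavy spillers still get
     * long contiguous runs.  Allocation of the prealloc array (sized for
     * PREALLOC_MAX entries) is elided.
     */
    #define PREALLOC_INIT   8
    #define PREALLOC_MAX    128

    static void
    tape_prealloc_refill(LogicalTapeSet *lts, LogicalTape *lt)
    {
        if (lt->prealloc_size == 0)
            lt->prealloc_size = PREALLOC_INIT;
        else
            lt->prealloc_size = Min(lt->prealloc_size * 2, PREALLOC_MAX);

        for (int i = 0; i < lt->prealloc_size; i++)
            lt->prealloc[i] = ltsGetFreeBlock(lts);
        lt->nprealloc = lt->prealloc_size;
    }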

> IIUC the danger of pre-allocating blocks is that we might not fill
> them,
> resulting in temp file much larger than necessary. It might be
> harmless
> on some (most?) current filesystems that don't actually allocate
> space
> for blocks that are never written, but it also confuses our
> accounting
> of temporary file sizes. So we should try to limit that, and growing
> the
> number of pre-allocated blocks over time seems reasonable.
There's another danger here: it doesn't matter how well the filesystem
deals with sparse writes, because ltsWriteBlock fills in the holes with
zeros anyway. That's potentially a significant amount of wasted IO
effort if we aren't careful.
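
For context, the relevant bit of logtape.c behaves roughly like this
(paraphrased and simplified, error handling trimmed - see ltsWriteBlock()
for the real code):

    /*
     * Paraphrase of ltsWriteBlock() in logtape.c (simplified).  Writing a
     * block past the current end of the temp file first writes zero-filled
     * dummy blocks for every block in between, so preallocated-but-unused
     * blocks cost real write I/O rather than ending up as filesystem holes.
     */
    static void
    ltsWriteBlock_sketch(LogicalTapeSet *lts, long blocknum, void *buffer)
    {
        PGAlignedBlock zerobuf;

        MemSet(zerobuf.data, 0, sizeof(zerobuf));

        /* pad the file out to the target block with dummy blocks */
        while (blocknum > lts->nBlocksWritten)
            ltsWriteBlock_sketch(lts, lts->nBlocksWritten, zerobuf.data);

        BufFileSeekBlock(lts->pfile, blocknum);
        BufFileWrite(lts->pfile, buffer, BLCKSZ);

        if (blocknum == lts->nBlocksWritten)
            lts->nBlocksWritten++;
    }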

Regards,
        Jeff Davis


Attachment: logtape-prealloc-v2.patch

Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote:

>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>> 2) We could make it self-tuning, by increasing the number of blocks
>> we pre-allocate. So every time we exhaust the range, we double the
>> number of blocks (with a reasonable maximum, like 1024 or so). Or we
>> might just increment it by 32, or something.
>
>Attached a new version that uses the doubling behavior, and cleans it
>up a bit. It also returns the unused prealloc blocks back to lts-
>freeBlocks when the tape is rewound for reading.
>

Ah, returning the blocks is a nice idea - that should limit the overhead
quite a bit, I think.

>> IIUC the danger of pre-allocating blocks is that we might not fill
>> them,
>> resulting in temp file much larger than necessary. It might be
>> harmless
>> on some (most?) current filesystems that don't actually allocate
>> space
>> for blocks that are never written, but it also confuses our
>> accounting
>> of temporary file sizes. So we should try to limit that, and growing
>> the
>> number of pre-allocated blocks over time seems reasonable.
>
>There's another danger here: it doesn't matter how well the filesystem
>deals with sparse writes, because ltsWriteBlock fills in the holes with
>zeros anyway. That's potentially a significant amount of wasted IO
>effort if we aren't careful.
>

True. I'll give it a try on both machines and report some numbers. Might
take a couple of days.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

Tomas Vondra-4
On Thu, May 21, 2020 at 11:41:22PM +0200, Tomas Vondra wrote:

>On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote:
>>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>>>2) We could make it self-tuning, by increasing the number of blocks
>>>we pre-allocate. So every time we exhaust the range, we double the
>>>number of blocks (with a reasonable maximum, like 1024 or so). Or we
>>>might just increment it by 32, or something.
>>
>>Attached a new version that uses the doubling behavior, and cleans it
>>up a bit. It also returns the unused prealloc blocks back to lts-
>>freeBlocks when the tape is rewound for reading.
>>
>
>Ah, the returning is a nice idea, that should limit the overhead quite a
>bit, I think.
>
>>>IIUC the danger of pre-allocating blocks is that we might not fill
>>>them,
>>>resulting in temp file much larger than necessary. It might be
>>>harmless
>>>on some (most?) current filesystems that don't actually allocate
>>>space
>>>for blocks that are never written, but it also confuses our
>>>accounting
>>>of temporary file sizes. So we should try to limit that, and growing
>>>the
>>>number of pre-allocated blocks over time seems reasonable.
>>
>>There's another danger here: it doesn't matter how well the filesystem
>>deals with sparse writes, because ltsWriteBlock fills in the holes with
>>zeros anyway. That's potentially a significant amount of wasted IO
>>effort if we aren't careful.
>>
>
>True. I'll give it a try on both machines and report some numbers. Might
>take a couple of days.
>
OK, so I do have some numbers to share. I think there's a clear
conclusion that the two patches are a huge improvement, but there's also
something fishy about planning of parallel queries.

Firstly, I have two machines that I used for testing:

1) small one: i5-2500k (4 cores), 8GB RAM, SSD RAID for data, SSD for
temporary tablespace, using TPC-H 32GB data set

2) big one: 2x xeon e5-2620v3 (8 cores), 64GB RAM, NVME SSD for data,
temporary tablespace on SATA RAID0 (3 x 7.2k), using TPC-H 75GB


serial queries (no parallelism)
===============================

Results with parallel query disabled on the two machines look like this:

1) small one (SSD)

     algorithm  master  prealloc  tlist  prealloc-tlist
     --------------------------------------------------
          hash    1365       437    368             213
          sort     226       214    224             215

The sort row simply means "enable_hashagg = off" and AFAIK the patches
should not have a lot of influence here - the prealloc does, but it's
fairly negligible.

It's not always exactly on par - I've seen cases where hash or sort were
a bit faster (probably depending on work_mem), but I think we can ignore
that for now.


2) big one (SATA)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash   25534   5120      2402             540
          sort     460    460       465             485

The effect is even more pronounced, thanks to poor handling of random
I/O by the SATA RAID device. It's not exactly on par with sort, but it's
close enough ...


parallel queries
================

And now the fun begins ...


1) small one (SSD, max_parallel_workers_per_gather = 2)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash   693      390       177             128
          sort   103       99       101              99

This looks pretty nice - the patches have the expected effect, and it got
faster than with just a single CPU, etc.


2) big one (SATA, max_parallel_workers_per_gather = 16)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash       ?  25000         ?            3132
          sort     248    234       216             200

Well, not that nice :-( The hash queries take so much time that I've
decided not to wait for them and the two numbers are actually just
estimates (after processing just a couple of logical tapes).

Plus it actually gets slower than with serial execution, so what's the
problem here? Especially considering it worked OK on the small machine?

At first I thought it was something about SSD vs. SATA, but it seems to be
more about how we construct the plans, because the plans between the two
machines are very different. And it seems to depend on the number of
workers per gather - for a low number of workers the plan looks like this
(the plans are attached in plans.txt in case the formatting gets broken
by your client):


                                                       QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------
      Limit
        ->  Aggregate
              ->  Hash Join
                    Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  Gather
                          Workers Planned: 2
                          ->  Nested Loop
                                ->  Parallel Seq Scan on part
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                ->  Index Scan using idx_lineitem_part_supp on lineitem
                                      Index Cond: (l_partkey = part.p_partkey)
                    ->  Hash
                          ->  Finalize HashAggregate
                                Group Key: lineitem_1.l_partkey
                                ->  Gather
                                      Workers Planned: 2
                                      ->  Partial HashAggregate
                                            Group Key: lineitem_1.l_partkey
                                            ->  Parallel Seq Scan on lineitem lineitem_1
     (20 rows)

but then if I crank the number of workers up, it switches to this:

                                                          QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------------
      Limit
        ->  Finalize Aggregate
              ->  Gather
                    Workers Planned: 5
                    ->  Partial Aggregate
                          ->  Nested Loop
                                Join Filter: (part.p_partkey = lineitem.l_partkey)
                                ->  Hash Join
                                      Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                                      ->  Parallel Seq Scan on part
                                            Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                      ->  Hash
                                            ->  HashAggregate
                                                  Group Key: lineitem_1.l_partkey
                                                  ->  Seq Scan on lineitem lineitem_1
                                ->  Index Scan using idx_lineitem_part_supp on lineitem
                                      Index Cond: (l_partkey = lineitem_1.l_partkey)
                                      Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
     (18 rows)


Notice that in the first plan, the hashagg is on top of a parallel-aware
path - so each worker builds a hashagg only on a subset of the data, and
also spills only a fraction of the input rows (so that all workers
combined spill roughly the "whole" table).

In the second plan, the hashagg is on the non-partitioned side of the
join, so each worker builds a hash aggregate on the *whole* set of
input rows. This means that (a) we need much more disk space for temp
files, making it unlikely to fit into the page cache, and (b) there's a
lot of contention for I/O, making it much more random.

Now, I haven't seen the second plan with sort-based aggregation - no
matter how I set the number of workers, it always looks like this:

                                                          QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------------
      Limit
        ->  Aggregate
              ->  Merge Join
                    Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  Finalize GroupAggregate
                          Group Key: lineitem_1.l_partkey
                          ->  Gather Merge
                                Workers Planned: 8
                                ->  Partial GroupAggregate
                                      Group Key: lineitem_1.l_partkey
                                      ->  Sort
                                            Sort Key: lineitem_1.l_partkey
                                            ->  Parallel Seq Scan on lineitem lineitem_1
                    ->  Materialize
                          ->  Gather Merge
                                Workers Planned: 6
                                ->  Nested Loop
                                      ->  Parallel Index Scan using part_pkey on part
                                            Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                      ->  Index Scan using idx_lineitem_part_supp on lineitem
                                            Index Cond: (l_partkey = part.p_partkey)
     (22 rows)

How come we don't have the same issue here? Is there something in the
optimizer that prevents us from creating the "silly" plans with
groupagg, and should we do the same thing for hashagg?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments: plans.txt, sata-parallel.png, sata-serial.png, ssd-parallel.png, ssd-serial.png