Polyphase merge is obsolete

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Polyphase merge is obsolete

Heikki Linnakangas
The beauty of the polyphase merge algorithm is that it allows reusing
input tapes as output tapes efficiently. So if you have N tape drives,
you can keep them all busy throughout the merge.

That doesn't matter, when we can easily have as many "tape drives" as we
want. In PostgreSQL, a tape drive consumes just a few kB of memory, for
the buffers. With the patch being discussed to allow a tape to be
"paused" between write passes [1], we don't even keep the tape buffers
around, when a tape is not actively read written to, so all it consumes
is the memory needed for the LogicalTape struct.

The number of *input* tapes we can use in each merge pass is still
limited, by the memory needed for the tape buffers and the merge heap,
but only one output tape is active at any time. The inactive output
tapes consume very few resources. So the whole idea of trying to
efficiently reuse input tapes as output tapes is pointless.

Let's switch over to a simple k-way balanced merge. Because it's
simpler. If you're severely limited on memory, like when sorting 1GB of
data with work_mem='1MB' or less, it's also slightly faster. I'm not too
excited about the performance aspect, because in the typical case of a
single-pass merge, there's no difference. But it would be worth changing
on simplicity grounds, since we're mucking around in tuplesort.c anyway.

I came up with the attached patch to do that. This patch applies on top
of my latest "Logical tape pause/resume" patches [1]. It includes
changes to the logtape.c interface, that make it possible to create and
destroy LogicalTapes in a tapeset on the fly. I believe this will also
come handy for Peter's parallel tuplesort patch set.

[1] Logical tape pause/resume,
https://www.postgresql.org/message-id/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

PS. I finally bit the bullet and got self a copy of The Art of Computer
Programming, Vol 3, 2nd edition. In section 5.4 on External Sorting,
Knuth says:

"
When this book was first written, magnetic tapes were abundant and disk
drives were expensive. But disks became enormously better during the
1980s, and by the late 1990s they had almost completely replaced
magnetic tape units on most of the world's computer systems. Therefore
the once-crucial topic of tape merging has become of limited relevance
to current needs.

Yet many of the patterns are quite beautiful, and the associated
algorithms reflect some of the best research done in computer science
during its early days; the techniques are just too nice to be discarded
abruptly onto the rubbish heap of history. Indeed, the ways in which
these methods blend theory with practice are especially instructive.
Therefore merging patterns are discussed carefully and completely below,
in what may be their last grand appearance before they accept the final
curtain call.
"

Yep, the polyphase and other merge patterns are beautiful. I enjoyed
reading through those sections. Now let's put them to rest in PostgreSQL.

- Heikki


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

0001-Replace-polyphase-merge-algorithm-with-a-simple-bala.patch (66K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Tom Lane-2
Heikki Linnakangas <[hidden email]> writes:
> The beauty of the polyphase merge algorithm is that it allows reusing
> input tapes as output tapes efficiently ... So the whole idea of trying to
> efficiently reuse input tapes as output tapes is pointless.

It's been awhile since I looked at that code, but I'm quite certain that
it *never* thought it was dealing with actual tapes.  Rather, the point of
sticking with polyphase merge was that it allowed efficient incremental
re-use of temporary disk files, so that the maximum on-disk footprint was
only about equal to the volume of data to be sorted, rather than being a
multiple of that.  Have we thrown that property away?

                        regards, tom lane


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Heikki Linnakangas
On 10/12/2016 08:27 PM, Tom Lane wrote:

> Heikki Linnakangas <[hidden email]> writes:
>> The beauty of the polyphase merge algorithm is that it allows reusing
>> input tapes as output tapes efficiently ... So the whole idea of trying to
>> efficiently reuse input tapes as output tapes is pointless.
>
> It's been awhile since I looked at that code, but I'm quite certain that
> it *never* thought it was dealing with actual tapes.  Rather, the point of
> sticking with polyphase merge was that it allowed efficient incremental
> re-use of temporary disk files, so that the maximum on-disk footprint was
> only about equal to the volume of data to be sorted, rather than being a
> multiple of that.  Have we thrown that property away?

No, there's no difference to that behavior. logtape.c takes care of
incremental re-use of disk space, regardless of the merging pattern.

- Heikki



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Peter Geoghegan-3
In reply to this post by Heikki Linnakangas
On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas <[hidden email]> wrote:
> Let's switch over to a simple k-way balanced merge. Because it's simpler. If
> you're severely limited on memory, like when sorting 1GB of data with
> work_mem='1MB' or less, it's also slightly faster. I'm not too excited about
> the performance aspect, because in the typical case of a single-pass merge,
> there's no difference. But it would be worth changing on simplicity grounds,
> since we're mucking around in tuplesort.c anyway.

This analysis seems sound. I suppose we might as well simplify things
while we're at it.


--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Peter Geoghegan-3
In reply to this post by Heikki Linnakangas
On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas <[hidden email]> wrote:
> The number of *input* tapes we can use in each merge pass is still limited,
> by the memory needed for the tape buffers and the merge heap, but only one
> output tape is active at any time. The inactive output tapes consume very
> few resources. So the whole idea of trying to efficiently reuse input tapes
> as output tapes is pointless

I picked this up again. The patch won't apply cleanly. Can you rebase?
Also, please look at my bugfix for logtape.c free block management [1]
before doing so, as that might be prerequisite. Finally, I don't think
that the Logical tape pause/resume idea is compelling. Is it hard to
not do that, but still do everything else that you propose here?
That's what I lean towards doing right now.

Anyway, efficient use of tapes certainly mattered a lot more when
rewinding meant sitting around for a magnetic tape deck to physically
rewind. There is another algorithm in AoCP Vol III that lets us write
to tapes backwards, actually, which is motivated by similar obsolete
considerations about hardware. Why not write while we rewind, to avoid
doing nothing else while rewinding?!

Perhaps this patch should make a clean break from the "rewinding"
terminology. Perhaps you should rename LogicalTapeRewindForRead() to
LogicalTapePrepareForRead(), and so on. It's already a bit awkward
that that routine is called LogicalTapeRewindForRead(), because it
behaves significantly differently when a tape is frozen, and because
the whole point of logtape.c is space reuse that is completely
dissimilar to rewinding. (Space reuse is thus totally unlike how
polyphase merge is supposed to reuse space, which is all about
rewinding, and isn't nearly as eager. Same applies to K-way balanced
merge, of course.)

I think that the "rewinding" terminology does more harm than good, now
that it doesn't even help the Knuth reader to match Knuth's remarks to
what's going on in tuplesort.c. Just a thought.

> Let's switch over to a simple k-way balanced merge. Because it's simpler. If
> you're severely limited on memory, like when sorting 1GB of data with
> work_mem='1MB' or less, it's also slightly faster. I'm not too excited about
> the performance aspect, because in the typical case of a single-pass merge,
> there's no difference. But it would be worth changing on simplicity grounds,
> since we're mucking around in tuplesort.c anyway.

I actually think that the discontinuities in the merge scheduling are
worse than you suggest here. There doesn't have to be as extreme a
difference between work_mem and the size of input as you describe
here. As an example:

create table seq_tab(t int);
insert into seq_tab select generate_series(1, 10000000);
set work_mem = '4MB';
select count(distinct t) from seq_tab;

The trace_sort output ends like this:

30119/2017-01-16 17:17:05 PST LOG:  begin datum sort: workMem = 4096,
randomAccess = f
30119/2017-01-16 17:17:05 PST LOG:  switching to external sort with 16
tapes: CPU: user: 0.07 s, system: 0.00 s, elapsed: 0.06 s
30119/2017-01-16 17:17:05 PST LOG:  starting quicksort of run 1: CPU:
user: 0.07 s, system: 0.00 s, elapsed: 0.06 s
30119/2017-01-16 17:17:05 PST LOG:  finished quicksort of run 1: CPU:
user: 0.07 s, system: 0.00 s, elapsed: 0.07 s
*** SNIP ***
30119/2017-01-16 17:17:08 PST LOG:  finished writing run 58 to tape 0:
CPU: user: 2.50 s, system: 0.27 s, elapsed: 2.78 s
30119/2017-01-16 17:17:08 PST LOG:  using 4095 KB of memory for read
buffers among 15 input tapes
30119/2017-01-16 17:17:08 PST LOG:  finished 1-way merge step: CPU:
user: 2.52 s, system: 0.28 s, elapsed: 2.80 s
30119/2017-01-16 17:17:08 PST LOG:  finished 4-way merge step: CPU:
user: 2.58 s, system: 0.30 s, elapsed: 2.89 s
30119/2017-01-16 17:17:08 PST LOG:  finished 14-way merge step: CPU:
user: 2.86 s, system: 0.34 s, elapsed: 3.20 s
30119/2017-01-16 17:17:08 PST LOG:  finished 14-way merge step: CPU:
user: 3.09 s, system: 0.41 s, elapsed: 3.51 s
30119/2017-01-16 17:17:09 PST LOG:  finished 15-way merge step: CPU:
user: 3.61 s, system: 0.52 s, elapsed: 4.14 s
30119/2017-01-16 17:17:09 PST LOG:  performsort done (except 15-way
final merge): CPU: user: 3.61 s, system: 0.52 s, elapsed: 4.14 s
30119/2017-01-16 17:17:10 PST LOG:  external sort ended, 14678 disk
blocks used: CPU: user: 4.93 s, system: 0.57 s, elapsed: 5.51 s

(This is the test case that Cy posted earlier today, for the bug that
was just fixed in master.)

The first 1-way merge step is clearly kind of a waste of time. We
incur no actual comparisons during this "merge", since there is only
one real run from each input tape (all other active tapes contain only
dummy runs). We are, in effect, just shoveling the tuples from that
single run from one tape to another (from one range in the underlying
logtape.c BufFile space to another). I've seen this quite a lot
before, over the years, while working on sorting. It's not that big of
a deal, but it's a degenerate case that illustrates how polyphase
merge can do badly. What's probably of more concern here is that the
final on-the-fly merge we see merges 15 input runs, but 10 of the
inputs only have runs that are sized according to how much we could
quicksort at once (actually, I should say 11 out of 15, what with that
1-way merge). So, 4 of those runs are far larger, which is not great.
ISTM that having the merge be "balanced" is pretty important, at least
as far as multi-pass sorts in the year 2017 go.

Still, I agree with you that this patch is much more about refactoring
than about performance, and I agree that the refactoring is
worthwhile. I just think that you may have somewhat underestimated how
much this could help when we're low on memory, but not ridiculously
so. It doesn't seem necessary to prove this, though. (We need only
verify that there are no regressions.)

That's all I have right now.

[1] https://commitfest.postgresql.org/13/955/
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Peter Geoghegan-4
On Mon, Jan 16, 2017 at 5:56 PM, Peter Geoghegan <[hidden email]> wrote:
> On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas <[hidden email]> wrote:
>> The number of *input* tapes we can use in each merge pass is still limited,
>> by the memory needed for the tape buffers and the merge heap, but only one
>> output tape is active at any time. The inactive output tapes consume very
>> few resources. So the whole idea of trying to efficiently reuse input tapes
>> as output tapes is pointless
>
> I picked this up again. The patch won't apply cleanly. Can you rebase?

Since we have an awful lot of stuff in the last CF, and this patch
doesn't seem particularly strategic, I've marked it "Returned with
Feedback".

Thanks
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Peter Geoghegan-4
On Mon, Feb 27, 2017 at 2:45 PM, Peter Geoghegan <[hidden email]> wrote:
> Since we have an awful lot of stuff in the last CF, and this patch
> doesn't seem particularly strategic, I've marked it "Returned with
> Feedback".

I noticed that this is in the upcoming CF 1 for v11. I'm signed up to review.

I'd like to point out that replacement selection is also obsolete,
which is something I brought up recently [1]. I don't actually have
any feature-driven reason to want to kill replacement selection - it's
just an annoyance at this point. I do think that RS is more deserving
of being killed than Polyphase merge, because it actually costs users
something to continue to support it. The replacement_sort_tuples GUC
particularly deserves to be removed.

It would be nice if killing RS was put in scope here. I'd appreciate
it, at least, since it would simplify the heap routines noticeably.
The original analysis that led to adding replacement_sort_tuples was
based on certain performance characteristics of merging that have
since changed by quite a bit, due to our work for v10.

[1] postgr.es/m/[hidden email]
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Robert Haas
On Wed, Aug 30, 2017 at 2:54 PM, Peter Geoghegan <[hidden email]> wrote:

> I noticed that this is in the upcoming CF 1 for v11. I'm signed up to review.
>
> I'd like to point out that replacement selection is also obsolete,
> which is something I brought up recently [1]. I don't actually have
> any feature-driven reason to want to kill replacement selection - it's
> just an annoyance at this point. I do think that RS is more deserving
> of being killed than Polyphase merge, because it actually costs users
> something to continue to support it. The replacement_sort_tuples GUC
> particularly deserves to be removed.
>
> It would be nice if killing RS was put in scope here. I'd appreciate
> it, at least, since it would simplify the heap routines noticeably.
> The original analysis that led to adding replacement_sort_tuples was
> based on certain performance characteristics of merging that have
> since changed by quite a bit, due to our work for v10.

These are separate topics.  They should each be discussed on their own
thread.  Please don't hijack this thread to talk about something else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Peter Geoghegan-4
On Wed, Aug 30, 2017 at 12:48 PM, Robert Haas <[hidden email]> wrote:
> These are separate topics.  They should each be discussed on their own
> thread.  Please don't hijack this thread to talk about something else.

I don't think that that is a fair summary.

Heikki has done a number of things in this area, of which this is only
the latest. I'm saying "hey, have you thought about RS too?". Whether
or not I'm "hijacking" this thread is, at best, ambiguous.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Tomas Vondra-4
In reply to this post by Heikki Linnakangas
Hi,

I planned to do some benchmarking on this patch, but apparently the
patch no longer applies. Rebase please?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|

Re: Polyphase merge is obsolete

Heikki Linnakangas
On 11/09/2017 13:37, Tomas Vondra wrote:
> I planned to do some benchmarking on this patch, but apparently the
> patch no longer applies. Rebase please?

Here's a rebase of this. Sorry to keep you waiting :-).

I split the patch into two, one patch to refactor the logtape.c APIs,
and another to change the merge algorithm. With the new logtape.c API,
you can create and destroy LogicalTapes in a tape set on the fly, which
simplified the new hash agg spilling code somewhat. I think that's nice,
even without the sort merge algorithm change.

I did some benchmarking too. I used four data sets: ordered integers,
random integers, ordered text, and random text, with 1 GB of data in
each. I sorted those datasets with different work_mem settings, and
plotted the results. I only ran each combination once, and all on my
laptop, so there's a fair amount of noise in individual data points. The
trend is clear, though.

The results show that this makes sorting faster with small work_mem
settings. As work_mem grows so that there's only one merge pass, there's
no difference. The rightmost data points are with work_mem='2500 MB', so
that the data fits completely in work_mem and you get a quicksort.
That's unaffected by the patch.

I've attached the test scripts I used for the plots, although they are
quite messy and not very automated. I don't expect them to be very
helpful to anyone, but might as well have it archived.

- Heikki

v2-0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch (69K) Download Attachment
v2-0002-Replace-polyphase-merge-algorithm-with-a-simple-b.patch (41K) Download Attachment
ordered_ints.png (9K) Download Attachment
random_ints.png (9K) Download Attachment
ordered_text.png (9K) Download Attachment
random_text.png (9K) Download Attachment
benchmark-scripts.patch (16K) Download Attachment