Parallel tuplesort (for parallel B-Tree index creation)

Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
As some of you know, I've been working on parallel sort. I think I've
gone as long as I can without feedback on the design (and I see that
we're accepting stuff for September CF now), so I'd like to share what
I came up with. This project is something that I've worked on
inconsistently since late last year. It can be thought of as the
Postgres 10 follow-up to the 9.6 work on external sorting.

Attached WIP patch series:

* Adds a parallel sorting capability to tuplesort.c.

* Adds a new client of this capability: btbuild()/nbtsort.c can now
create B-Trees in parallel.

Most of the complexity here relates to the first item; the tuplesort
module has been extended to support sorting in parallel. This is
usable in principle by every existing tuplesort caller, without any
restriction imposed by the newly expanded tuplesort.h interface. So,
for example, randomAccess MinimalTuple support has been added,
although it goes unused for now.

I went with CREATE INDEX as the first client of parallel sort in part
because the cost model and so on can be relatively straightforward.
Even CLUSTER uses the optimizer to determine if a sort strategy is
appropriate, and that would need to be taught about parallelism if its
tuplesort is to be parallelized. I suppose that I'll probably try to
get CLUSTER (with a tuplesort) done in the Postgres 10 development
cycle too, but not just yet.

For now, I would prefer to focus discussion on tuplesort itself. If
you can only look at one part of this patch, please look at the
high-level description of the interface/caller contract that was added
to tuplesort.h.

Performance
===========

Without further ado, I'll demonstrate how the patch series improves
performance in one case. This benchmark was run on an AWS server with
many disks. A d2.4xlarge instance was used, with 16 vCPUs, 122 GiB
RAM, 12 x 2 TB HDDs, running Amazon Linux. Apparently, this AWS
instance type can sustain 1,750 MB/second of I/O, which I was able to
verify during testing (when a parallel sequential scan ran, iotop
reported read throughput slightly above that for multi-second bursts).
Disks were configured in software RAID0. These instances have disks
that are optimized for sequential performance, which suits the patch
quite well. I don't usually trust AWS EC2 for performance testing, but
it seemed to work well here (results were pretty consistent).

Setup:

CREATE TABLE parallel_sort_test AS
    SELECT hashint8(i) randint,
    md5(i::text) collate "C" padding1,
    md5(i::text || '2') collate "C" padding2
    FROM generate_series(0, 1e9::bigint) i;

CHECKPOINT;

This leaves us with a parallel_sort_test table that is 94 GB in size.

SET maintenance_work_mem = '8GB';

-- Serial case (external sort, should closely match master branch):
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 0);

Total time: 00:15:42.15

-- Patch with 8 tuplesort "sort-and-scan" workers (leader process
participates as a worker here):
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 7);

Total time: 00:06:03.86

As you can see, the parallel case is 2.58x faster (while using more
memory, though it's not the case that a higher maintenance_work_mem
setting speeds up the serial/baseline index build). 8 workers are a
bit faster than 4, but not by much (not shown). 16 are a bit slower,
but not by much (not shown).

trace_sort output for "serial_idx" case:
"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
switching to external sort with 501 tapes: CPU 7.81s/25.54u sec
elapsed 33.95 sec
*** SNIP ***
performsort done (except 7-way final merge): CPU 53.52s/666.89u sec
elapsed 731.67 sec
external sort ended, 2443786 disk blocks used: CPU 74.40s/854.52u sec
elapsed 942.15 sec
"""

trace_sort output for "patch_8_idx" case:
"""
begin index sort: unique = f, workMem = 8388608, randomAccess = f
*** SNIP ***
sized memtuples 1.62x from worker's 130254158 (3052832 KB) to
210895910 (4942873 KB) for leader merge (0 KB batch memory conserved)
*** SNIP ***
tape -1/7 initially used 411907 KB of 430693 KB batch (0.956) and
26361986 out of 26361987 slots (1.000)
performsort done (except 8-way final merge): CPU 12.28s/101.76u sec
elapsed 129.01 sec
parallel external sort ended, 2443805 disk blocks used: CPU
30.08s/318.15u sec elapsed 363.86 sec
"""

This is roughly the degree of improvement that I expected when I first
undertook this project late last year. As I go into in more detail
below, I believe that we haven't exhausted all avenues to make
parallel CREATE INDEX faster still, but I do think what's left on the
table is not enormous.

There is less benefit when sorting on a C locale text attribute,
because the overhead of merging dominates parallel sorts, and that's
even more pronounced with text. So, many text cases tend to work out
at about only 2x - 2.2x faster. We could work on this indirectly.

I've seen cases where a CREATE INDEX ended up more than 3x faster,
though. I benchmarked this case in the interest of simplicity (the
serial case is intended to be comparable, making the test fair).
Encouragingly, as you can see from the trace_sort output, the 8
parallel workers are 5.67x faster at getting to the final merge (a
merge that even it performs serially). Note that the final merge for
each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8
workers). Not bad!

Design:  New, key concepts for tuplesort.c
==========================================

The heap is scanned in parallel, and worker processes also merge in
parallel if required (it isn't required in the example above). The
implementation makes heavy use of existing external sort
infrastructure. In fact, it's almost the case that the implementation
is a generalization of external sorting that allows workers to perform
heap scanning and run sorting independently, with tapes then "unified"
in the leader process for merging. At that point, the state held by
the leader is more or less consistent with the leader being a serial
external sort process that has reached its merge phase in the
conventional manner (serially).

The steps callers must take are described fully in tuplesort.h. The
general idea is that a Tuplesortstate is aware that it might not be a
self-contained sort; it may instead be one part of a parallel sort
operation. You might say that the tuplesort caller must "build its own
sort" from participant worker process Tuplesortstates. The caller
creates a dynamic shared memory segment + TOC for each parallel sort
operation (could be more than one concurrent sort operation, of
course), passes that to tuplesort to initialize and manage, and
creates a "leader" Tuplesortstate in private memory, plus one or more
"worker" Tuplesortstates, each presumably managed by a different
parallel worker process.

tuplesort.c does most of the heavy lifting, including having processes
wait on each other to respect its ordering dependencies. Caller is
responsible for spawning workers to do the work, reporting details of
the workers to tuplesort through shared memory, and having workers
call tuplesort to actually perform sorting. Caller consumes final
output through leader Tuplesortstate in leader process.

I think that this division of labor works well for us.
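
To make that contract a bit more concrete, here is a rough sketch of
the sequence a caller might follow. All of the names below are
illustrative only; the authoritative description is the one the patch
adds to tuplesort.h:

    /* Leader, before launching workers (sketch; names are made up): */
    seg = dsm_create(parallel_sort_shared_size(nworkers), 0);
    shared = shm_toc_allocate(toc, parallel_sort_shared_size(nworkers));
    tuplesort_initialize_shared(shared, nworkers, seg);
    /* ... caller launches the worker processes itself ... */

    /* In each worker process (the leader may also act as one): */
    wstate = tuplesort_begin_index_btree(..., shared, my_worker_number);
    while (worker_scan_next_tuple(scan, &tup))  /* worker's slice of the heap */
        tuplesort_puttuple(wstate, tup);
    tuplesort_performsort(wstate);  /* sort; dump one run onto this worker's tape */
    tuplesort_end(wstate);

    /* Leader, once all launched workers have finished: */
    lstate = tuplesort_begin_index_btree(..., shared, LEADER);
    tuplesort_performsort(lstate);  /* unify worker tapes; set up final merge */
    while (tuplesort_gettuple(lstate, &tup))
        _bt_buildadd(...);          /* consume output in sorted order */
    tuplesort_end(lstate);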

Tape unification
----------------

Sort operations have a unique identifier, generated before any workers
are launched, using a scheme based on the leader's PID, and a unique
temp file number. This makes all on-disk state (temp files managed by
logtape.c) discoverable by the leader process. State in shared memory
is sized in proportion to the number of workers, so the only thing
about the data being sorted that gets passed around in shared memory
is a little logtape.c metadata for tapes, describing for example how
large each constituent BufFile is (a BufFile associated with one
particular worker's tapeset).
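
As a rough illustration of the sort of naming scheme meant here (the
exact path layout in the patch may differ), any backend can derive the
path of any participant's temp file from the shared identifier plus a
worker number, with nothing else passed through shared memory:

    /* Sketch only; hypothetical layout, not the patch's actual code. */
    #include <stdio.h>

    static void
    shared_tape_path(char *dst, size_t len,
                     int leader_pid, long temp_file_num, int worker_num)
    {
        snprintf(dst, len, "base/pgsql_tmp/pgsql_tmp%d.%ld.%d",
                 leader_pid, temp_file_num, worker_num);
    }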

(See below also for notes on buffile.c's role in all of this, fd.c and
resource management, etc.)

workMem
-------

Each worker process claims workMem as if it was an independent node.

The new implementation reuses much of what was originally designed for
external sorts. As such, parallel sorts are necessarily external
sorts, even when the workMem (i.e. maintenance_work_mem) budget could
in principle allow for parallel sorting to take place entirely in
memory. The implementation arguably *insists* on making such cases
external sorts, when they don't really need to be. This is much less
of a problem than you might think, since the 9.6 work on external
sorting does somewhat blur the distinction between internal and
external sorts (just consider how much time trace_sort indicates is
spent waiting on writes in workers; it's typically a small part of the
total time spent). Since parallel sort is really only compelling for
large sorts, it makes sense to make them external, or at least to
prioritize the cases that should be performed externally.

Anyway, workMem-not-exceeded cases require special handling to not
completely waste memory. Statistics about worker observations are used
at later stages, to at least avoid blatant waste, and to ensure that
memory is used optimally more generally.
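
The "sized memtuples 1.62x from worker's ..." trace_sort line above is
one example of this. A rough illustration of the kind of adjustment
meant (the names here are made up, not the patch's):

    /*
     * Sketch: pick a memtuples array size for the leader's merge based on
     * what the workers actually observed, rather than a worst-case guess.
     */
    static long
    leader_memtuples_slots(long worker_tuples, long worker_mem_bytes,
                           long leader_budget_bytes, long per_slot_bytes)
    {
        double  avg_tuple_bytes = (double) worker_mem_bytes / worker_tuples;

        return (long) (leader_budget_bytes / (avg_tuple_bytes + per_slot_bytes));
    }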

Merging
=======

The model that I've come up with is that every worker process is
guaranteed to output one materialized run onto one tape for the leader
to merge from its "unified" tapeset. This is the case
regardless of how much workMem is available, or any other factor. The
leader always assumes that the worker runs/tapes are present and
discoverable based only on the number of known-launched worker
processes, and a little metadata on each that is passed through shared
memory.

Producing one output run/materialized tape from all input tuples in a
worker often happens without the worker running out of workMem, which
you saw above. A straight quicksort and dump of all tuples is
therefore possible, without any merging required in the worker.
Alternatively, it may prove necessary to do some amount of merging in
each worker to generate one materialized output run. This case is
handled in the same way as a randomAccess case that requires one
materialized output tape to support random access by the caller. This
worker merging does necessitate another pass over all temp files for
the worker, but that's a much lower cost than you might imagine, in
part because the newly expanded use of batch memory makes merging here
cache efficient.

Batch allocation is used for all merging involved here, not just the
leader's own final-on-the-fly merge, so merging is consistently cache
efficient. (Workers that must merge on their own are therefore similar
to traditional randomAccess callers, so these cases become important
enough to optimize with the batch memory patch, although that's still
independently useful.)

No merging in parallel
----------------------

Currently, merging worker *output* runs may only occur in the leader
process. In other words, we always keep n worker processes busy with
scanning-and-sorting (and maybe some merging), but then all processes
but the leader process grind to a halt (note that the leader process
can participate as a scan-and-sort tuplesort worker, just as it will
everywhere else, which is why I specified "parallel_workers = 7" but
talked about 8 workers).

One leader process is kept busy with merging these n output runs on
the fly, so things will bottleneck on that, which you saw in the
example above. As already described, workers will sometimes merge in
parallel, but only their own runs -- never another worker's runs. I
did attempt to address the leader merge bottleneck by implementing
cross-worker run merging in workers. I got as far as implementing a
very rough version of this, but initial results were disappointing,
and so that was not pursued further than the experimentation stage.

Parallel merging is a possible future improvement that could be added
to what I've come up with, but I don't think that it will move the
needle in a really noticeable way.

Partitioning for parallelism (samplesort style "bucketing")
-----------------------------------------------------------

Perhaps a partition-based approach would be more effective than
parallel merging (e.g., redistribute slices of worker runs across
workers along predetermined partition boundaries, sort a range of
values within dedicated workers, then concatenate to get final result,
a bit like the in-memory samplesort algorithm). That approach would
not suit CREATE INDEX, because the approach's great strength is that
the workers can run in parallel for the entire duration, since there
is no merge bottleneck (this assumes good partition boundaries, which
is a bit of a risky assumption). Parallel CREATE INDEX wants something
where the workers can independently write the index, and independently
WAL log, and independently create a unified set of internal pages, all
of which is hard.

This patch series will tend to proportionally speed up CREATE INDEX
statements at a level that is comparable to other major database
systems. That's enough progress for one release. I think that
partitioning to sort is more useful for query execution than for
utility statements like CREATE INDEX.

Partitioning and merge joins
----------------------------

Robert has often speculated about what it would take to make merge
joins work well in parallel. I think that "range
distribution"/bucketing will prove an important component of that.
It's just too useful to aggregate tuples in shared memory initially,
and have workers sort them without any serial merge bottleneck;
arguments about misestimations, data skew, and so on should not deter
us from this, long term. This approach has minimal IPC overhead,
especially with regard to LWLock contention.

This kind of redistribution probably belongs in a Gather-like node,
though, which has access to the context necessary to determine a
range, and even dynamically alter the range in the event of a
misestimation. Under this scheme, tuplesort.c just needs to be
instructed that these worker-private Tuplesortstates are
range-partitioned (i.e., the sorts are virtually independent, as far
as it's concerned). That's a bit messy, but it is still probably the
way to go for merge joins and other sort-reliant executor nodes.

buffile.c, and "unification"
============================

There has been significant new infrastructure added to make logtape.c
aware of workers. buffile.c has in turn been taught about unification
as a first class part of the abstraction, with low-level management of
certain details occurring within fd.c. So, "tape unification" is
added, which lets a process open other backends' logical tapes in
order to generate a unified logical tapeset for the leader to merge.
This is probably the
single biggest source of complexity for the patch, since I must
consider:

* Creating a general, reusable abstraction for other possible BufFile
users (logtape.c only has to serve tuplesort.c, though).

* Logical tape free space management.

* Resource management, file lifetime, etc. fd.c resource management
can now close a file at xact end for temp files, while not deleting it
in the leader backend (only the "owning" worker backend deletes the
temp file it owns).

* Crash safety (e.g., when to truncate existing temp files, and when not to).

CREATE INDEX user interface
===========================

There are two ways of determining how many parallel workers a CREATE
INDEX requests:

* A cost model, which is closely based on create_plain_partial_paths()
at the moment. This needs more work, particularly to model things like
maintenance_work_mem. Even so, it isn't terrible.

* A parallel_workers storage parameter, which completely bypasses the
cost model. This is the "DBA knows best" approach, and is what I've
consistently used during testing.

Corey Huinker has privately assisted me with performance testing the
patch, using his own datasets. Testing has exclusively used the
storage parameter.

I've added a new GUC, max_parallel_workers_maintenance, which is
essentially the utility statement equivalent of
max_parallel_workers_per_gather. This is clearly necessary, since
we're using up to maintenance_work_mem per worker, which is of course
typically much higher than work_mem. I didn't feel the need to create
maintenance-oriented variants of other GUCs, like
min_parallel_relation_size, though. Only this one new GUC is added
(plus the new storage parameter, parallel_workers, not to be confused
with the existing table storage parameter of the same name).

I am much more concerned about the tuplesort.h interface than the
CREATE INDEX user interface as such. The user interface is merely a
facade on top of tuplesort.c and nbtsort.c (and not one that I'm
particularly attached to).

--
Peter Geoghegan



Attachments:
0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz (2K)
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch.gz (11K)
0003-Rearrange-header-file-include-directives.patch.gz (1K)
0004-Add-parallel-B-tree-index-build-sorting.patch.gz (74K)
0005-Add-force_btree_randomaccess-GUC-for-testing.patch.gz (2K)

Re: Parallel tuplesort (for parallel B-Tree index creation)

Robert Haas
On Mon, Aug 1, 2016 at 6:18 PM, Peter Geoghegan <[hidden email]> wrote:
> As some of you know, I've been working on parallel sort. I think I've
> gone as long as I can without feedback on the design (and I see that
> we're accepting stuff for September CF now), so I'd like to share what
> I came up with. This project is something that I've worked on
> inconsistently since late last year. It can be thought of as the
> Postgres 10 follow-up to the 9.6 work on external sorting.

I am glad that you are working on this.

Just a first thought after reading the email:

> As you can see, the parallel case is 2.58x faster (while using more
> memory, though it's not the case that a higher maintenance_work_mem
> setting speeds up the serial/baseline index build). 8 workers are a
> bit faster than 4, but not by much (not shown). 16 are a bit slower,
> but not by much (not shown).
...
> I've seen cases where a CREATE INDEX ended up more than 3x faster,
> though. I benchmarked this case in the interest of simplicity (the
> serial case is intended to be comparable, making the test fair).
> Encouragingly, as you can see from the trace_sort output, the 8
> parallel workers are 5.67x faster at getting to the final merge (a
> merge that even it performs serially). Note that the final merge for
> each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8
> workers). Not bad!

I'm not going to say it's bad to be able to do things 2-2.5x faster,
but linear scalability this ain't - particularly because your 2.58x
faster case is using up to 7 or 8 times as much memory.  The
single-process case would be faster in that case, too: you could
quicksort.  I feel like for sorting, in particular, we probably ought
to be setting the total memory budget, not the per-process memory
budget.  Or if not, then any CREATE INDEX benchmarking had better
compare using scaled values for maintenance_work_mem; otherwise,
you're measuring the impact of using more memory as much as anything
else.

I also think that Amdahl's law is going to pinch pretty severely here.
If the final merge phase is a significant percentage of the total
runtime, picking an algorithm that can't parallelize the final merge
is going to limit the speedups to small multiples.  That's an OK place
to be as a result of not having done all the work yet, but you don't
want to get locked into it.  If we're going to have a substantial
portion of the work that can never be parallelized, maybe we've picked
the wrong algorithm.

The work on making the logtape infrastructure parallel-aware seems
very interesting and potentially useful for other things.  Sadly, I
don't have time to look at it right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <[hidden email]> wrote:
> I'm not going to say it's bad to be able to do things 2-2.5x faster,
> but linear scalability this ain't - particularly because your 2.58x
> faster case is using up to 7 or 8 times as much memory.  The
> single-process case would be faster in that case, too: you could
> quicksort.

Certainly, there are cases where a parallel version could benefit from
having more memory more so than actually parallelizing the underlying
task. However, this case was pointedly chosen to *not* be such a case.
When maintenance_work_mem exceeds about 5GB, I've observed that since
9.6 increasing it is just as likely to hurt as to help by about +/-5%
(unless and until it's all in memory, which still doesn't help much).
In general, there isn't all that much point in doing a very large sort
like this in memory. You just don't get that much of a benefit for the
memory you use, because linearithmic CPU costs eventually really
dominate linear sequential I/O costs.

I think you're focusing on the fact that there is a large absolute
disparity in memory used in this one benchmark, but that isn't
something that the gains shown particularly hinge upon. There isn't
that much difference when workers must merge their own runs, for
example. It saves the serial leader merge some work, and in particular
makes it more cache efficient (by having fewer runs/tapes).

Finally, while about 8x as much memory is used, the memory used over
and above the serial case is almost all freed when the final merge
begins (the final merges are therefore very similar in both cases,
including in terms of memory use). So, for as long as you use 8x as
much memory for 8 active processes, you get a 5.67x speed-up of that
part alone. You still keep a few extra KiBs of memory for worker tapes
and things like that during the leader's merge, but that's a close to
negligible amount.

> I feel like for sorting, in particular, we probably ought
> to be setting the total memory budget, not the per-process memory
> budget.  Or if not, then any CREATE INDEX benchmarking had better
> compare using scaled values for maintenance_work_mem; otherwise,
> you're measuring the impact of using more memory as much as anything
> else.

As I said, the benchmark was chosen to avoid that (and to be simple
and reproducible). I am currently neutral on the question of whether
or not maintenance_work_mem should be doled out per process or per
sort operation. I do think that making it a per-process allowance is
far closer to what we do for hash joins today, and is simpler.

What's nice about the idea of making the workMem/maintenance_work_mem
budget per sort is that that leaves the leader process with license to
greatly increase the amount of memory it can use for the merge.
Increasing the amount of memory used for the merge will improve things
for longer than it will for workers. I've simulated it already.

> I also think that Amdahl's law is going to pinch pretty severely here.

Doesn't that almost always happen, though? Isn't that what you
generally see with queries that show off the parallel join capability?

> If the final merge phase is a significant percentage of the total
> runtime, picking an algorithm that can't parallelize the final merge
> is going to limit the speedups to small multiples.  That's an OK place
> to be as a result of not having done all the work yet, but you don't
> want to get locked into it.  If we're going to have a substantial
> portion of the work that can never be parallelized, maybe we've picked
> the wrong algorithm.

I suggest that this work be compared to something with similar
constraints. I used Google to try to get some indication of how much
of a difference parallel CREATE INDEX makes in other major database
systems. This is all I could find:

https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

It seems like the degree of parallelism used for SQL Server tends to
affect index build time in a way that is strikingly similar to what
I've come up with (which may be a coincidence; I don't know anything
about SQL Server). So, I suspect that the performance of this is
fairly good in an apples-to-apples comparison.

Parallelizing merging can hurt or help, because there is a cost in
memory bandwidth (if not I/O) for the extra passes that are used to
keep more CPUs busy, which is kind of analogous to the situation with
polyphase merge. I'm not saying that we shouldn't do that even still,
but I believe that there are sharply diminishing returns. Tuple
comparisons during merging are much more expensive than quicksort
tuple comparisons, which tend to benefit a lot from abbreviated keys.

As I've said, there is probably a good argument to be made for
partitioning to increase parallelism. But, that involves risks around
the partitioning being driven by statistics or a cost model, and I
don't think you'd be too on board with the idea of every CREATE INDEX
after bulk loading needing an ANALYZE first. I tend to think of that
as more of a parallel query thing, because you can often push down a
lot more there, dynamic sampling might be possible, and there isn't a
need to push all the tuples through one point in the end. Nothing I've
done here precludes your idea of a sort-order-preserving gather node.
I think that we may well need both.

Since merging is a big bottleneck with this, we should probably also
work to address that indirectly.

> The work on making the logtape infrastructure parallel-aware seems
> very interesting and potentially useful for other things.  Sadly, I
> don't have time to look at it right now.

I would be happy to look at generalizing that further, to help
parallel hash join. As you know, Thomas Munro and I have discussed
this privately.

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Robert Haas
On Wed, Aug 3, 2016 at 5:13 PM, Peter Geoghegan <[hidden email]> wrote:
> On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <[hidden email]> wrote:
>> I'm not going to say it's bad to be able to do things 2-2.5x faster,
>> but linear scalability this ain't - particularly because your 2.58x
>> faster case is using up to 7 or 8 times as much memory.  The
>> single-process case would be faster in that case, too: you could
>> quicksort.
>
> [ lengthy counter-argument ]

None of this convinces me that testing this in a way that is not
"apples to apples" is a good idea, nor will any other argument.

>> I also think that Amdahl's law is going to pinch pretty severely here.
>
> Doesn't that almost always happen, though?

To some extent, sure, absolutely.  But it's our job as developers to
try to foresee and minimize those cases.  When Noah was at
EnterpriseDB a few years ago and we were talking about parallel
internal sort, Noah started by doing a survey of the literature and
identified parallel quicksort as the algorithm that seemed best for
our use case.  Of course, every time quicksort partitions the input,
you get two smaller sorting problems, so it's easy to see how to use 2
CPUs after the initial partitioning step has been completed and 4 CPUs
after each of those partitions has been partitioned again, and so on.
However, that turns out not to be good enough because the first
partitioning step can consume a significant percentage of the total
runtime - so if you only start parallelizing after that, you're
leaving too much on the table.  To avoid that, the algorithm he was
looking at had a (complicated) way of parallelizing the first
partitioning step; then you can, it seems, do the full sort in
parallel.

There are some somewhat outdated and perhaps naive ideas about this
that we wrote up here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Anyway, you're proposing an algorithm that can't be fully
parallelized.  Maybe that's OK.  But I'm a little worried about it.
I'd feel more confident if we knew that the merge could be done in
parallel and were just leaving that to a later development stage; or
if we picked an algorithm like the one above that doesn't leave a
major chunk of the work unparallelizable.

> Isn't that what you
> generally see with queries that show off the parallel join capability?

For nested loop joins, no.  The whole join operation can be done in
parallel.  For hash joins, yes: building the hash table once per
worker can run afoul of Amdahl's law in a big way.  That's why Thomas
Munro is working on fixing it:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap

Obviously, parallel query is subject to a long list of annoying
restrictions at this point.  On queries that don't hit any of those
restrictions we can get 4-5x speedup with a leader and 4 workers.  As
we expand the range of plan types that we can construct, I think we'll
see those kinds of speedups for a broader range of queries.  (The
question of exactly why we top out with as few workers as currently
seems to be the case needs more investigation, too; maybe contention
effects?)

>> If the final merge phase is a significant percentage of the total
>> runtime, picking an algorithm that can't parallelize the final merge
>> is going to limit the speedups to small multiples.  That's an OK place
>> to be as a result of not having done all the work yet, but you don't
>> want to get locked into it.  If we're going to have a substantial
>> portion of the work that can never be parallelized, maybe we've picked
>> the wrong algorithm.
>
> I suggest that this work be compared to something with similar
> constraints. I used Google to try to get some indication of how much
> of a difference parallel CREATE INDEX makes in other major database
> systems. This is all I could find:
>
> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

I do agree that it is important not to have unrealistic expectations.

> As I've said, there is probably a good argument to be made for
> partitioning to increase parallelism. But, that involves risks around
> the partitioning being driven by statistics or a cost model, and I
> don't think you'd be too on board with the idea of every CREATE INDEX
> after bulk loading needing an ANALYZE first. I tend to think of that
> as more of a parallel query thing, because you can often push down a
> lot more there, dynamic sampling might be possible, and there isn't a
> need to push all the tuples through one point in the end. Nothing I've
> done here precludes your idea of a sort-order-preserving gather node.
> I think that we may well need both.

Yes.  Rushabh is working on that, and Finalize GroupAggregate ->
Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
pretty sweet.

>> The work on making the logtape infrastructure parallel-aware seems
>> very interesting and potentially useful for other things.  Sadly, I
>> don't have time to look at it right now.
>
> I would be happy to look at generalizing that further, to help
> parallel hash join. As you know, Thomas Munro and I have discussed
> this privately.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <[hidden email]> wrote:

> To some extent, sure, absolutely.  But it's our job as developers to
> try to foresee and minimize those cases.  When Noah was at
> EnterpriseDB a few years ago and we were talking about parallel
> internal sort, Noah started by doing a survey of the literature and
> identified parallel quicksort as the algorithm that seemed best for
> our use case.  Of course, every time quicksort partitions the input,
> you get two smaller sorting problems, so it's easy to see how to use 2
> CPUs after the initial partitioning step has been completed and 4 CPUs
> after each of those partitions has been partitioned again, and so on.
> However, that turns out not to be good enough because the first
> partitioning step can consume a significant percentage of the total
> runtime - so if you only start parallelizing after that, you're
> leaving too much on the table.  To avoid that, the algorithm he was
> looking at had a (complicated) way of parallelizing the first
> partitioning step; then you can, it seems, do the full sort in
> parallel.
>
> There are some somewhat outdated and perhaps naive ideas about this
> that we wrote up here:
>
> https://wiki.postgresql.org/wiki/Parallel_Sort

I'm familiar with that effort. I think that when researching topics like
sorting, it can sometimes be a mistake to not look at an approach
specifically recommended by the database research community. A lot of
the techniques we've benefited from within tuplesort.c have been a
matter of addressing memory latency as a bottleneck; techniques that
are fairly simple and not worth writing a general interest paper on.
Also, things like abbreviated keys are beneficial in large part
because people tend to follow the first normal form, and therefore an
abbreviated key can contain a fair amount of entropy most of the time.
Similarly, Radix sort seems really cool, but our requirements around
generality seem to make it impractical.

> Anyway, you're proposing an algorithm that can't be fully
> parallelized.  Maybe that's OK.  But I'm a little worried about it.
> I'd feel more confident if we knew that the merge could be done in
> parallel and were just leaving that to a later development stage; or
> if we picked an algorithm like the one above that doesn't leave a
> major chunk of the work unparallelizable.

I might be able to resurrect the parallel merge stuff, just to guide
reviewer intuition on how much that can help or hurt. I can probably
repurpose it to show you the mixed picture on how effective it is. I
think it might help more with collatable text that doesn't have
abbreviated keys, for example, because you can use more of the
machine's memory bandwidth for longer. But for integers, it can hurt.
(That's my recollection; I prototyped parallel merge a couple of
months ago now.)

>> Isn't that what you
>> generally see with queries that show off the parallel join capability?
>
> For nested loop joins, no.  The whole join operation can be done in
> parallel.

Sure, I know, but I'm suggesting that laws-of-physics problems may
still be more significant than implementation deficiencies, even
though those deficiencies should be stamped out. Linear
scalability is really quite rare for most database workloads.

> Obviously, parallel query is subject to a long list of annoying
> restrictions at this point.  On queries that don't hit any of those
> restrictions we can get 4-5x speedup with a leader and 4 workers.  As
> we expand the range of plan types that we can construct, I think we'll
> see those kinds of speedups for a broader range of queries.  (The
> question of exactly why we top out with as few workers as currently
> seems to be the case needs more investigation, too; maybe contention
> effects?)

You're probably bottlenecked on memory bandwidth. Note that I showed
improvements with 8 workers, not 4. 4 workers are slower than 8, but
not by that much.

>> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/
>
> I do agree that it is important not to have unrealistic expectations.

Great. My ambition for this patch is that it put parallel CREATE INDEX
on a competitive footing against the implementations featured in other
major systems. I don't think we need to do everything at once, but I
have no intention of pushing forward with something that doesn't do
respectably there. I also want to avoid partitioning in the first
version of this, and probably in any version that backs CREATE INDEX.
I've only made minimal changes to the tuplesort.h interface here to
support parallelism. That flexibility counts for a lot, IMV.

>> As I've said, there is probably a good argument to be made for
>> partitioning to increase parallelism. But, that involves risks around
>> the partitioning being driven by statistics or a cost model

> Yes.  Rushabh is working on that, and Finalize GroupAggregate ->
> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
> pretty sweet.

A "Gather Merge" node doesn't really sound like what I'm talking
about. Isn't that something to do with table-level partitioning? I'm
talking about dynamic partitioning, typically of a single table, of
course.

>>> The work on making the logtape infrastructure parallel-aware seems
>>> very interesting and potentially useful for other things.  Sadly, I
>>> don't have time to look at it right now.
>>
>> I would be happy to look at generalizing that further, to help
>> parallel hash join. As you know, Thomas Munro and I have discussed
>> this privately.
>
> Right.

By the way, the patch is in better shape from that perspective, as
compared to the early version Thomas (CC'd) had access to. The BufFile
stuff is now credible as a general-purpose abstraction.

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

akapila
On Sat, Aug 6, 2016 at 2:16 AM, Peter Geoghegan <[hidden email]> wrote:

> On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <[hidden email]> wrote:
>> There are some somewhat outdated and perhaps naive ideas about this
>> that we wrote up here:
>>
>> https://wiki.postgresql.org/wiki/Parallel_Sort
>
> I'm familiar with that effort. I think that when researching topics like
> sorting, it can sometimes be a mistake to not look at an approach
> specifically recommended by the database research community. A lot of
> the techniques we've benefited from within tuplesort.c have been a
> matter of addressing memory latency as a bottleneck; techniques that
> are fairly simple and not worth writing a general interest paper on.
> Also, things like abbreviated keys are beneficial in large part
> because people tend to follow the first normal form, and therefore an
> abbreviated key can contain a fair amount of entropy most of the time.
> Similarly, Radix sort seems really cool, but our requirements around
> generality seem to make it impractical.
>
>> Anyway, you're proposing an algorithm that can't be fully
>> parallelized.  Maybe that's OK.  But I'm a little worried about it.
>> I'd feel more confident if we knew that the merge could be done in
>> parallel and were just leaving that to a later development stage; or
>> if we picked an algorithm like the one above that doesn't leave a
>> major chunk of the work unparallelizable.
>
> I might be able to resurrect the parallel merge stuff, just to guide
> reviewer intuition on how much that can help or hurt.
>

I think here some of the factors like how many workers will be used
for merge phase might impact the performance.   Having too many
workers can lead to more communication cost and having too few workers
might not yield best results for merge.  One thing, I have noticed
that in general for sorting, some of the other databases use range
partitioning [1], now that might not be what is good for us.  I see
you mentioned above that why it is not good [2], but I don't
understand why you think it is a risky assumption to assume good
partition boundaries for parallelizing sort.


[1] -
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
Refer Producer or Consumer Operations section.

[2] -
"That approach would not suit CREATE INDEX, because the approach's
great strength is that the workers can run in parallel for the entire
duration, since there is no merge bottleneck (this assumes good
partition boundaries, which is a bit of a risky assumption)"


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
On Sat, Aug 6, 2016 at 6:46 AM, Amit Kapila <[hidden email]> wrote:
> I think here some of the factors like how many workers will be used
> for merge phase might impact the performance.   Having too many
> workers can lead to more communication cost and having too few workers
> might not yield best results for merge.  One thing, I have noticed
> that in general for sorting, some of the other databases use range
> partitioning [1], now that might not be what is good for us.

I don't disagree with anything you say here. I acknowledged that
partitioning will probably be important for sorting in my introductory
e-mail, after all.

> I see
> you mentioned above that why it is not good [2], but I don't
> understand why you think it is a risky assumption to assume good
> partition boundaries for parallelizing sort.

Well, apparently there are numerous problems with partitioning in
systems like SQL Server and Oracle in the worst case. For one thing,
in the event of a misestimation (or failure of the dynamic sampling
that I presume can sometimes be used), workers can be completely
starved of work for the entire duration of the sort. And for CREATE
INDEX to get much of any benefit, all workers must write their part of
the index independently, too. This can affect the physical structure
of the final index. SQL Server also has a caveat in its documentation
about this resulting in an unbalanced final index, which I imagine
could be quite bad in the worst case.

I believe that it's going to be hard to get any version of this that
writes the index simultaneously in each worker accepted for these
reasons. This patch I came up with isn't very different from the
serial case at all. Any index built in parallel by the patch ought to
have relfilenode files on the filesystem that are 100% identical to
those produced by the serial case, in fact (since CREATE INDEX does
not set LSNs in the new index pages). I've actually developed a simple
way of "fingerprinting" indexes during testing of this patch, knowing
that hashing the files on disk ought to produce a perfect match
compared to a master branch serial sort case.

At the same time, any information that I've seen about how much
parallel CREATE INDEX speeds things up in these other systems
indicates that the benefits are very similar. It tends to be in the 2x
- 3x range, with the same reduction in throughput seen at about 16
workers, after we peak at about 8 workers. So, I think that the
benefits of partitioning are not really seen with CREATE INDEX (I
think of partitioning as more of a parallel query thing). Obviously,
any benefit that might still exist for CREATE INDEX in particular,
when weighed against the costs, makes partitioning look pretty
unattractive as a next step.

I think that during the merge phase of parallel CREATE INDEX as
implemented, the system generally still isn't that far from being I/O
bound. Whereas, with parallel query, partitioning makes every worker
(not just one) able to return a tuple from its own separate range
very quickly (presumably, each worker merges non-overlapping
"ranges" from runs initially sorted in each worker. Each worker
subsequently merges after a partition-wise redistribution of the
initial fully sorted runs, allowing for dynamic sampling to optimize
the actual range used for load balancing.). The workers can then do
more CPU-bound processing in whatever node is fed by each worker's
ranged merge; everything is kept busy. That's the approach that I
personally had in mind for partitioning, at least. It's really nice
for parallel query to be able to totally separate workers after the
point of redistribution. CREATE INDEX is not far from being I/O bound
anyway, though, so it benefits far less. (Consider how fast the merge
phase still is at writing out the index in *absolute* terms.)

Look at figure 9 in this paper: http://www.vldb.org/pvldb/vol7/p85-balkesen.pdf

Even in good cases for "independent sorting", there is only a benefit
seen at 8 cores. At the same time, I can only get about 6x scaling
with 8 workers, just for the initial generation of runs.

All of these factors are why I believe I'm able to compete well with
other systems with this relatively straightforward, evolutionary
approach. I have a completely open mind about partitioning, but my
approach makes sense in this context.

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
In reply to this post by Peter Geoghegan-3
On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <[hidden email]> wrote:
> Since merging is a big bottleneck with this, we should probably also
> work to address that indirectly.

I attach a patch that changes how we maintain the heap invariant
during tuplesort merging. I already mentioned this over on the
"Parallel tuplesort, partitioning, merging, and the future" thread. As
noted already on that thread, this patch makes merging clustered
numeric input about 2.1x faster overall in one case, which is
particularly useful in the context of a serial final/leader merge
during a parallel CREATE INDEX. Even *random* non-C-collated text
input is made significantly faster. This work is totally orthogonal to
parallelism, though; it's just very timely, given our discussion of
the merge bottleneck on this thread.

If I benchmark a parallel build of a 100 million row index, with
presorted input, I can see a 71% reduction in *comparisons* with 8
tapes/workers, and an 80% reduction in comparisons with 16
workers/tapes in one instance (the numeric case I just mentioned).
With random input, we can still come out significantly ahead, but not
to the same degree. I was able to see a reduction in comparisons
during a leader merge, from 1,468,522,397 comparisons to 999,755,569
comparisons, which is obviously still quite significant (worker
merges, if any, benefit too). I think I need to redo my parallel
CREATE INDEX benchmark, so that you can take this into account. Also,
I think that this patch will make very large external sorts that
naturally have tens of runs to merge significantly faster, but I
didn't bother to benchmark that.

The patch is intended to be applied on top of parallel B-Tree patches
0001-* and 0002-* [1]. I happened to test it with parallelism, but
these are all independently useful, and will be entered as a separate
CF entry (perhaps better to commit the earlier two patches first, to
avoid merge conflicts). I'm optimistic that we can get those 3 patches
in the series out of the way early, without blocking on discussing
parallel sort.

The patch makes tuplesort merging shift down and displace the root
tuple with the tape's next preread tuple, rather than compacting and
then inserting into the heap anew. This approach to maintaining the
heap as tuples are returned to caller will always produce fewer
comparisons overall. The new approach is also simpler. We were already
shifting down to compact the heap within the misleadingly named [2]
function tuplesort_heap_siftup() -- why not instead just use the
caller tuple (the tuple that we currently go on to insert) when
initially shifting down (not the heap's preexisting last tuple, which
is guaranteed to go straight to the leaf level again)? That way, we
don't need to enlarge the heap at all through insertion, shifting up,
etc. We're done, and are *guaranteed* to have performed less work
(fewer comparisons and swaps) than with the existing approach (this is
the reason for my optimism about getting this stuff out of the way
early).
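
For anyone reading along without the patch applied, the underlying
technique looks roughly like this. This is a generic sketch over an
array of ints, not the patch's code (which operates on SortTuples and
goes through the tuplesort comparators):

    /*
     * Replace the root of a binary min-heap with a new element, then sift
     * it down to restore the heap invariant.  One top-down pass, instead
     * of compacting the heap and then re-inserting from the bottom up.
     */
    static void
    heap_root_displace(int *heap, int nelems, int newval)
    {
        int     i = 0;

        for (;;)
        {
            int     left = 2 * i + 1;
            int     right = left + 1;
            int     child;

            if (left >= nelems)
                break;                  /* no children; newval lands here */
            child = left;
            if (right < nelems && heap[right] < heap[left])
                child = right;          /* pick the smaller child */
            if (newval <= heap[child])
                break;                  /* invariant restored; newval lands here */
            heap[i] = heap[child];      /* pull the smaller child up */
            i = child;
        }
        heap[i] = newval;
    }

During the merge, "newval" is the next preread tuple from the tape
whose tuple was just returned to the caller; if that tape is
exhausted, the heap simply shrinks by one instead.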

This new approach is more or less the *conventional* way to maintain
the heap invariant when returning elements from a heap during k-way
merging. Our existing approach is convoluted; merging was presumably
only coded that way because the generic functions
tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
available. Perhaps the problem was masked by unrelated bottlenecks
that existed at the time, too.

I think that I could push this further (a minimum of 2 comparisons per
item returned when 3 or more tapes are active still seems like 1
comparison too many), but what I have here gets us most of the
benefit. And, it does so while not actually adding code that could be
called "overly clever", IMV. I'll probably leave clever, aggressive
optimization of merging for a later release.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@...
[2] https://www.postgresql.org/message-id/CAM3SWZQ+2gJMNV7ChxwEXqXopLfb_FEW2RfEXHJ+GsYF39f6MQ@...
--
Peter Geoghegan



Attachment: 0003-Displace-heap-s-root-during-tuplesort-merge.patch (10K)

Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
In reply to this post by Peter Geoghegan-3
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <[hidden email]> wrote:
> Attached WIP patch series:

This has bitrot, since commit da1c9163 changed the interface for
checking parallel safety. I'll have to fix that, and will probably
take the opportunity to change how workers have maintenance_work_mem
apportioned while I'm at it. To recap, it would probably be better if
maintenance_work_mem remained a high watermark for the entire CREATE
INDEX, rather than applying as a per-worker allowance.
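
In other words, something along these lines (illustrative only, not
the patch's code):

    /* Per-worker allowance, as the WIP patch currently does it: */
    worker_budget = maintenance_work_mem;

    /*
     * High watermark for the whole CREATE INDEX, divided among the
     * participants (including the leader, when it participates):
     */
    worker_budget = maintenance_work_mem / nparticipants;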


--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Heikki Linnakangas
In reply to this post by Peter Geoghegan-3
On 08/16/2016 03:33 AM, Peter Geoghegan wrote:

> I attach a patch that changes how we maintain the heap invariant
> during tuplesort merging. I already mentioned this over on the
> "Parallel tuplesort, partitioning, merging, and the future" thread. As
> noted already on that thread, this patch makes merging clustered
> numeric input about 2.1x faster overall in one case, which is
> particularly useful in the context of a serial final/leader merge
> during a parallel CREATE INDEX. Even *random* non-C-collated text
> input is made significantly faster. This work is totally orthogonal to
> parallelism, though; it's just very timely, given our discussion of
> the merge bottleneck on this thread.

Nice!

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were already
> shifting down to compact the heap within the misleadingly named [2]
> function tuplesort_heap_siftup() -- why not instead just use the
> caller tuple (the tuple that we currently go on to insert) when
> initially shifting down (not the heap's preexisting last tuple, which
> is guaranteed to go straight to the leaf level again)? That way, we
> don't need to enlarge the heap at all through insertion, shifting up,
> etc. We're done, and are *guaranteed* to have performed less work
> (fewer comparisons and swaps) than with the existing approach (this is
> the reason for my optimism about getting this stuff out of the way
> early).

Makes sense.

> This new approach is more or less the *conventional* way to maintain
> the heap invariant when returning elements from a heap during k-way
> merging. Our existing approach is convoluted; merging was presumably
> only coded that way because the generic functions
> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
> available. Perhaps the problem was masked by unrelated bottlenecks
> that existed at the time, too.

Yeah, this seems like a very obvious optimization. Is there a standard
name for this technique in the literature? I'm OK with "displace", or
perhaps just "replace" or "siftup+insert", but if there's a standard
name for this, let's use that.

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

Heikki Linnakangas
In reply to this post by Peter Geoghegan-3
I'm reviewing patches 1-3 in this series, i.e. those patches that are
not directly related to parallelism, but are independent improvements to
merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:

> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at least 32
> * BLCKSZ buffer space per tape), rejecting Knuth's theoretically
> justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's P),
> improving performance when the sort thereby completed in one pass.
> However, it's still true that there are unlikely to be benefits from
> increasing the number of tapes past 7 once the amount of data to be
> sorted significantly exceeds available memory; that commit probably
> mostly just improved matters where it enabled all merging to be done in
> a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes previously
> wasted as much as 8% of the available memory budget; tens of thousands
> of tapes could be logically allocated for a sort that will only benefit
> from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If
I understand correctly, that makes the runs shorter than necessary,
leading to more runs.

> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
> This is a conservative estimate of the number of runs at which doing all
> merging on-the-fly no longer allows greater overlapping of I/O and
> computation.

Hmm. Surely there are cases where, with > 501 tapes, you could do it
in one merge pass, but now you need two? And that would hurt
performance, no?

Why do we reserve the buffer space for all the tapes right at the
beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD)
call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every
time we start a new run, until we reach maxTapes?
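
Something like the following, to sketch what I mean (untested, not
actual patch code):

    /* When starting a new run, rather than all up front in inittapes(): */
    if (state->currentRun < state->maxTapes)
        USEMEM(state, TAPE_BUFFER_OVERHEAD);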

- Heikki




Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
In reply to this post by Heikki Linnakangas
On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <[hidden email]> wrote:
>> I attach a patch that changes how we maintain the heap invariant
>> during tuplesort merging.

> Nice!

Thanks!

>> This new approach is more or less the *conventional* way to maintain
>> the heap invariant when returning elements from a heap during k-way
>> merging. Our existing approach is convoluted; merging was presumably
>> only coded that way because the generic functions
>> tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
>> available. Perhaps the problem was masked by unrelated bottlenecks
>> that existed at the time, too.
>
>
> Yeah, this seems like a very obvious optimization. Is there a standard name
> for this technique in the literature? I'm OK with "displace", or perhaps
> just "replace" or "siftup+insert", but if there's a standard name for this,
> let's use that.

I used the term "displace" specifically because it wasn't a term with
a well-defined meaning in the context of the analysis of algorithms.
Just like "insert" isn't for tuplesort_heap_insert(). I'm not
particularly attached to the name tuplesort_heap_root_displace(), but
I do think that whatever it ends up being called should at least not
be named after an implementation detail. For example,
tuplesort_heap_root_replace() also seems fine.

I think that tuplesort_heap_siftup() should be called something like
tuplesort_heap_compact instead [1], since what it actually does
(shifting down -- the existing name is completely backwards!) is just
an implementation detail involved in compacting the heap (notice that
it decrements memtupcount, which, by now, means the k-way merge heap
gets one element smaller). I can write a patch to do this renaming, if
you're interested. Someone should fix it, because independent of all
this, it's just wrong.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@...
--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
On Tue, Sep 6, 2016 at 12:39 PM, Peter Geoghegan <[hidden email]> wrote:
> On Tue, Sep 6, 2016 at 12:08 AM, Heikki Linnakangas <[hidden email]> wrote:
>>> I attach a patch that changes how we maintain the heap invariant
>>> during tuplesort merging.
>
>> Nice!
>
> Thanks!

BTW, the way that k-way merging is made more efficient by this
approach makes the case for replacement selection even weaker than it
was just before we almost killed it. I hate to say it, but I have to
wonder if we shouldn't get rid of the new-to-9.6
replacement_sort_tuples because of this, and completely kill
replacement selection. I'm not going to go on about it, but that seems
sensible to me.


--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Claudio Freire
In reply to this post by Peter Geoghegan-3
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan <[hidden email]> wrote:
> The patch is intended to be applied on top of parallel B-Tree patches
> 0001-* and 0002-* [1]. I happened to test it with parallelism, but
> these are all independently useful, and will be entered as a separate
> CF entry (perhaps better to commit the earlier two patches first, to
> avoid merge conflicts). I'm optimistic that we can get those 3 patches
> in the series out of the way early, without blocking on discussing
> parallel sort.

Applied patches 1 and 2, builds fine, regression tests run fine. It
was a prerequisite to reviewing patch 3 (which I'm going to do below),
so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were already
> shifting down to compact the heap within the misleadingly named [2]
> function tuplesort_heap_siftup() -- why not instead just use the
> caller tuple (the tuple that we currently go on to insert) when
> initially shifting down (not the heap's preexisting last tuple, which
> is guaranteed to go straight to the leaf level again)? That way, we
> don't need to enlarge the heap at all through insertion, shifting up,
> etc. We're done, and are *guaranteed* to have performed less work
> (fewer comparisons and swaps) than with the existing approach (this is
> the reason for my optimism about getting this stuff out of the way
> early).
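
In case it helps other reviewers, here is a minimal sketch of that idea on
a bare min-heap of ints. This is not the patch's code and all of the names
are made up; it just shows why a single top-down pass can replace the old
siftup-then-insert pair:

#include <assert.h>

/*
 * Fill the "hole" at the root with the incoming element and sift it
 * down, rather than compacting the heap and then re-inserting the new
 * element from the bottom.
 */
static void
heap_root_displace(int *heap, int n, int newval)
{
    int     i = 0;              /* caller's "hole" starts at the root */

    assert(n >= 1);             /* heap must be valid, i.e. non-empty */
    for (;;)
    {
        int     left = 2 * i + 1;
        int     right = left + 1;
        int     imin;

        if (left >= n)
            break;              /* the hole reached a leaf */
        imin = left;
        if (right < n && heap[right] < heap[left])
            imin = right;       /* pick the smaller child */
        if (newval <= heap[imin])
            break;              /* heap invariant already holds here */
        heap[i] = heap[imin];   /* pull the smaller child up into the hole */
        i = imin;
    }
    heap[i] = newval;           /* new element lands in the final hole */
}

Each level costs at most two comparisons, and nothing is moved down to the
leaf level only to bubble straight back up.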

Patch 3 applies fine to git master as of
25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2).

Builds fine and without warnings on gcc 4.8.5 AFAICT, regression test
suite runs without issues as well.

Patch lacks any new tests, but the changed code paths seem covered
sufficiently by existing tests. A little bit of fuzzing on the patch
itself, like reverting some key changes, or flipping some key
comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+                */
+               memtuples[i] = memtuples[imin];
+               i = imin;
+       }
+
+       Assert(state->memtupcount > 1 || imin == 0);
+       memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n) ?


In the meanwhile, I'll go and do some perf testing.

Assuming the speedup is realized during testing, LGTM.



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
In reply to this post by Heikki Linnakangas
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <[hidden email]> wrote:
> I'm reviewing patches 1-3 in this series, i.e. those patches that are not
> directly related to parallelism, but are independent improvements to
> merging.

That's fantastic! Thanks!

I'm really glad you're picking those up. I feel that I'm far too
dependent on Robert's review for this stuff. That shouldn't be taken
as a statement against Robert -- it's intended as quite the opposite
-- but it's just personally difficult to rely on exactly one other
person for something that I've put so much work into. Robert has been
involved with every sorting patch I've written, generally with far
less input from anyone else, and at this point that's really rather a
lot of complex patches.

> Let's begin with patch 1:
>
> On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
>>
>> Cap the number of tapes used by external sorts

> Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I
> understand correctly, that makes the runs shorter than necessary, leading to
> more runs.

Right. Quite simply, whatever you could have used the workMem for
prior to the merge step, now you can't. It's not so bad during the
merge step of a final on-the-fly merge (or, with the 0002-* patch, any
final merge), since you can get a "refund" of unused (though logically
allocated by USEMEM()) tapes to grow memtuples with (other overhead
forms the majority of the refund, though). That still isn't much
consolation to the user, because run generation is typically much more
expensive (we really just refund unused tapes because it's easy).

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
>> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
>> This is a conservative estimate of the number of runs at which doing all
>> merging on-the-fly no longer allows greater overlapping of I/O and
>> computation.
>
>
> Hmm. Surely there are cases where, with > 501 tapes, you could do it in
> one merge pass, but now you need two? And that would hurt performance, no?

In theory, yes, that could be true, and not just for my proposed new
cap of 500 for merge order (501 tapes), but for any such cap. I
noticed that the Greenplum tuplesort.c uses a max of 250, so I guess I
just thought to double that. Way back in 2006, Tom and Simon talked
about a cap too on several occasions, but I think that it was in the
thousands then.

Hundreds of runs are typically quite rare. It isn't that painful to do
a second pass, because the merge process may be more CPU cache
efficient as a result, and CPU cost tends to be the dominant cost
these days (over and above the extra I/O that an extra pass requires).

This seems like a very familiar situation to me: I pick a
quasi-arbitrary limit or cap for something, and it's not clear that
it's optimal. Everyone more or less recognizes the need for such a
cap, but is uncomfortable about the exact figure chosen, not because
it's objectively bad, but because it's clearly something pulled from
the air, to some degree. It may not make you feel much better about
it, but I should point out that I've read a paper that claims "Modern
servers of the day have hundreds of GB operating memory and tens of TB
storage capacity. Hence, if the sorted data fit the persistent
storage, the first phase will generate hundreds of runs at most." [1].

Feel free to make a counter-proposal for a cap. I'm not attached to
500. I'm mostly worried about blatant waste with very large workMem
sizings. Tens of thousands of tapes is just crazy. The amount of data
that you need to have as input is very large when workMem is big
enough for this new cap to be enforced.
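
To put rough numbers on that (assuming a per-tape cost of about 280 kB of
workMem -- a 256 kB merge buffer plus roughly 24 kB of overhead): the cap
of 501 only starts to bind once workMem exceeds roughly 500 * 280 kB, or
about 140 MB. At that point each quicksorted run is itself on the order of
140 MB, so you'd need something like 500 * 140 MB, or about 70 GB of
input, before the cap forced a second merge pass.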

> Why do we reserve the buffer space for all the tapes right at the beginning?
> Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) call in
> inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every time we
> start a new run, until we reach maxTapes?

No, because then you have no way to clamp back memory, which is now
almost all used (we hold off from making LACKMEM() continually true,
if at all possible, which is almost always the case). You can't really
continually shrink memtuples to make space for new tapes, which is
what it would take.

[1] http://ceur-ws.org/Vol-1343/paper8.pdf
--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
On Tue, Sep 6, 2016 at 2:46 PM, Peter Geoghegan <[hidden email]> wrote:
> Feel free to make a counter-proposal for a cap. I'm not attached to
> 500. I'm mostly worried about blatant waste with very large workMem
> sizings. Tens of thousands of tapes is just crazy. The amount of data
> that you need to have as input is very large when workMem is big
> enough for this new cap to be enforced.

If tuplesort callers passed a hint about the number of tuples that
would ultimately be sorted, and (for the sake of argument) it was
magically 100% accurate, then theoretically we could just allocate the
right number of tapes up-front. That discussion is a big can of worms,
though. There are of course obvious disadvantages that come with a
localized cost model, even if you're prepared to add some "slop" to
the allocation size or whatever.

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
In reply to this post by Claudio Freire
On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <[hidden email]> wrote:

> Patch lacks any new tests, but the changed code paths seem covered
> sufficiently by existing tests. A little bit of fuzzing on the patch
> itself, like reverting some key changes, or flipping some key
> comparisons, induces test failures as it should, mostly in cluster.
>
> The logic in tuplesort_heap_root_displace seems sound, except:
>
> +                */
> +               memtuples[i] = memtuples[imin];
> +               i = imin;
> +       }
> +
> +       Assert(state->memtupcount > 1 || imin == 0);
> +       memtuples[imin] = *newtup;
> +}
>
> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?

There might only be one or two elements in the heap. Note that the
heap size is indicated by state->memtupcount at this point in the
sort, which is a little confusing (that differs from how memtupcount
is used elsewhere, where we don't partition memtuples into a heap
portion and a preread tuples portion, as we do here).

> In the meanwhile, I'll go and do some perf testing.
>
> Assuming the speedup is realized during testing, LGTM.

Thanks. I suggest spending at least as much time on unsympathetic
cases (e.g., only 2 or 3 tapes must be merged). At the same time, I
suggest focusing on a type that has relatively expensive comparisons,
such as collated text, to make differences clearer.

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Claudio Freire
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan <[hidden email]> wrote:

> On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire <[hidden email]> wrote:
>> Patch lacks any new tests, but the changed code paths seem covered
>> sufficiently by existing tests. A little bit of fuzzing on the patch
>> itself, like reverting some key changes, or flipping some key
>> comparisons, induces test failures as it should, mostly in cluster.
>>
>> The logic in tuplesort_heap_root_displace seems sound, except:
>>
>> +                */
>> +               memtuples[i] = memtuples[imin];
>> +               i = imin;
>> +       }
>> +
>> +       Assert(state->memtupcount > 1 || imin == 0);
>> +       memtuples[imin] = *newtup;
>> +}
>>
>> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?
>
> There might only be one or two elements in the heap. Note that the
> heap size is indicated by state->memtupcount at this point in the
> sort, which is a little confusing (that differs from how memtupcount
> is used elsewhere, where we don't partition memtuples into a heap
> portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount

+       Assert(memtuples[0].tupindex == newtup->tupindex);
+
+       CHECK_FOR_INTERRUPTS();
+
+       n = state->memtupcount;         /* n is heap's size, including old root */
+       imin = 0;                       /* start with caller's "hole" in root */
+       i = imin;

In fact, the assert on the patch would allow writing memtuples outside
the heap, as in calling tuplesort_heap_root_displace if
memtupcount==0, but I don't think that should be legal (memtuples[0]
== memtuples[imin] would be outside the heap).

Sure, that's a weird enough case (that assert up there already reads
memtuples[0] which would be equally illegal if memtupcount==0), but it
goes on to show that the assert expression just seems odd for its
intent.

BTW, I know it's not the scope of the patch, but shouldn't
root_displace be usable on the TSS_BOUNDED phase?



Re: Parallel tuplesort (for parallel B-Tree index creation)

Peter Geoghegan-3
On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <[hidden email]> wrote:

> I noticed, but here n = state->memtupcount
>
> +       Assert(memtuples[0].tupindex == newtup->tupindex);
> +
> +       CHECK_FOR_INTERRUPTS();
> +
> +       n = state->memtupcount;         /* n is heap's size, including old root */
> +       imin = 0;                       /* start with caller's "hole" in root */
> +       i = imin;

I'm fine with using "n" in the later assertion you mentioned, if
that's clearer to you. memtupcount is broken out as "n" simply because
that's less verbose, in a place where that makes things far clearer.

> In fact, the assert on the patch would allow writing memtuples outside
> the heap, as in calling tuplesort_heap_root_displace if
> memtupcount==0, but I don't think that should be legal (memtuples[0]
> == memtuples[imin] would be outside the heap).

You have to have a valid heap (i.e. there must be at least one
element) to call tuplesort_heap_root_displace(), and it doesn't
directly compact the heap, so it must remain valid on return. The
assertion exists to make sure that everything is okay with a
one-element heap, a case which is quite possible. If you want to see a
merge involving one input tape, apply the entire parallel CREATE INDEX
patch set, set "force_parallal_mode = regress", and note that the
leader merge merges only 1 input tape, making the heap only ever
contain one element. In general, most use of the heap for k-way
merging will eventually end up as a one element heap, at the very end.

Maybe that assertion you mention is overkill, but I like to err on the
side of overkill with assertions. It doesn't seem that important,
though.

> Sure, that's a weird enough case (that assert up there already reads
> memtuples[0] which would be equally illegal if memtupcount==0), but it
> goes on to show that the assert expression just seems odd for its
> intent.
>
> BTW, I know it's not the scope of the patch, but shouldn't
> root_displace be usable on the TSS_BOUNDED phase?

I don't think it should be, no. With a top-n heap sort, the
expectation is that after a little while, we can immediately determine
that most tuples do not belong in the heap (more than one comparison
per tuple is only needed when the tuple actually does belong in the
heap, which should become fairly rare after a time). That's why that
general strategy can be so much faster, of course.

Note that that heap is "reversed" -- the sort order is inverted, so
that we can use a minheap. The top of the heap is the most marginal
tuple in the top-n heap so far, and so is the next to be removed from
consideration entirely (not the next to be returned to caller, when
merging).
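
To illustrate the shape of it (reusing the toy heap_root_displace() sketch
from earlier in the thread, and storing negated keys so that the min-heap
acts as a max-heap -- none of these names are tuplesort's):

/*
 * Toy bounded (top-N-smallest) filter.  negheap[] holds the N best
 * candidates seen so far with their keys negated, so negheap[0] is the
 * most marginal value kept.  Most candidates lose this one comparison
 * and are discarded; only the rare winner displaces the root.
 */
static void
bounded_heap_consider(int *negheap, int n, int candidate)
{
    if (-candidate > negheap[0])
        heap_root_displace(negheap, n, -candidate);
}

The heap_root_displace() call stands in for what is a siftup-then-insert
pair today; the guarding comparison, which rejects the vast majority of
tuples, is the same either way.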

Anyway, I just don't think that this is important enough to change --
it couldn't possibly be worth much of any risk. I can see the appeal
of consistency, but I also see the appeal of sticking to how things
work there: continually and explicitly inserting into and compacting
the heap seems like a good enough way of framing what a top-n heap
does, since there are no groupings of tuples (tapes) involved there.

--
Peter Geoghegan



Re: Parallel tuplesort (for parallel B-Tree index creation)

Claudio Freire
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <[hidden email]> wrote:

> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire <[hidden email]> wrote:
>> I noticed, but here n = state->memtupcount
>>
>> +       Assert(memtuples[0].tupindex == newtup->tupindex);
>> +
>> +       CHECK_FOR_INTERRUPTS();
>> +
>> +       n = state->memtupcount;         /* n is heap's size, including old root */
>> +       imin = 0;                       /* start with caller's "hole" in root */
>> +       i = imin;
>
> I'm fine with using "n" in the later assertion you mentioned, if
> that's clearer to you. memtupcount is broken out as "n" simply because
> that's less verbose, in a place where that makes things far clearer.
>
>> In fact, the assert on the patch would allow writing memtuples outside
>> the heap, as in calling tuplesort_heap_root_displace if
>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>> == memtuples[imin] would be outside the heap).
>
> You have to have a valid heap (i.e. there must be at least one
> element) to call tuplesort_heap_root_displace(), and it doesn't
> directly compact the heap, so it must remain valid on return. The
> assertion exists to make sure that everything is okay with a
> one-element heap, a case which is quite possible.

Rather than using "n" or "memtupcount", what I'm saying is to assert
that memtuples[imin] is inside the heap, which would catch the same
errors the original assert would, and more.

Assert(imin < state->memtupcount)

If you prefer.

The original assert allows any value of imin for memtupcount > 1, and
that's my main concern. It shouldn't.


On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan <[hidden email]> wrote:

>> Sure, that's a weird enough case (that assert up there already reads
>> memtuples[0] which would be equally illegal if memtupcount==0), but it
>> goes on to show that the assert expression just seems odd for its
>> intent.
>>
>> BTW, I know it's not the scope of the patch, but shouldn't
>> root_displace be usable on the TSS_BOUNDED phase?
>
> I don't think it should be, no. With a top-n heap sort, the
> expectation is that after a little while, we can immediately determine
> that most tuples do not belong in the heap (more than one comparison
> per tuple is only needed when the tuple actually does belong in the
> heap, which should become fairly rare after a time). That's why that
> general strategy can be so much faster, of course.

I wasn't proposing getting rid of that optimization, but just
replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so
> that we can use a minheap. The top of the heap is the most marginal
> tuple in the top-n heap so far, and so is the next to be removed from
> consideration entirely (not the next to be returned to caller, when
> merging).

...but I didn't pause to consider that point.

It still looks like a valid optimization: instead of rearranging the
heap twice (siftup + insert), do it once (replace + relocate).

However, I agree that it's not worth the risk of conflating the two
optimizations. That one can be done later as a separate patch.

