I'd like to discuss scaleout at PGCon


I'd like to discuss scaleout at PGCon

MauMau
Hello,

I'm going to attend PGCon in Ottawa for the first time.  I am happy if
I can meet you.

Because I'm visually impaired, I only have vision to sense light.  If
you see a Japanese man with a height of 171 cm with a white cane, it's
probably me.  I'd be happy if you talk to me.  But as I'm still far
from good at listening and speaking English, I'm sorry if I take an
unfriendly attitude or if I can not keep on talking for a long time.


I'd like to have a session on scaleout design at the unconference.
I've created a wiki page for that (this is still just a memo; I'd like
to populate this page with you as the discussion in the community
progresses).  I'd appreciate it if someone could stand with me and
facilitate the discussion at the unconference.

https://wiki.postgresql.org/wiki/Scaleout_Design

The background is ... our company is faced with an immediate need to
develop the read-write scaleout feature on PostgreSQL.  We tried
Postgres-XL with much hope, but we found it difficult to achieve our
performance goal.  I will tell you the details at the conference.  But
personally, Postgres-XL seems to be very nice software, and I feel
that good parts of it should be integrated into core.

I know that many great hackers from 2ndQuadrant, EnterpriseDB, NTT,
Postgres Professional, CitusData, and so on are addressing this
difficult scaleout feature.  I don't think we are yet competent to
lead this development.

On the other hand, we have a proprietary RDBMS called Symfoware (I'm
sure you don't know it), which is not based on PostgreSQL, that
provides the scaleout feature.  Its architecture is a mix of shared
nothing and shared everything.  It implements deadlock detection and
resolution without a central node or periodic monitoring, parallel 2PC
across nodes, parallel crash recovery, client connection routing and
failover without any overhead of intermediary middleware during SQL
execution, etc.  So we may be able to help in some way.  I'd be happy
if we could help the community to proceed with development of
scaleout.

If you have a session for scaleout outside the unconference, could you
call me and let me join it?


By the way, the popularity score of PostgreSQL finally exceeded 400
points in the DB-Engines ranking!  The popularity difference with the
top products has shrunk greatly.  Let's make PostgreSQL more popular.

https://db-engines.com/en/ranking

    [as of May 27, 2018]
    Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
    PostgreSQL=400.90  MongoDB=342.11
    (Oracle / PostgreSQL ratio is 3.2)

    [as of Feb 2016, according to a memo at hand]
    Oracle=1476.14  MySQL=1321.13  SQL Server=??
    MongoDB=??  PostgreSQL=288.66
    (Oracle / PostgreSQL ratio is 5.1)


Regards
MauMau



Re: I'd like to discuss scaleout at PGCon

Robert Haas
On Sun, May 27, 2018 at 1:20 AM, MauMau <[hidden email]> wrote:
> I'm going to attend PGCon in Ottawa for the first time.  I am happy if
> I can meet you.

It was nice to meet you in person.

> I'd like to have a session on scaleout design at the unconference.
> I've created a wiki page for that (this is still just a memo; I'd like
> to populate this page with you as the discussion in the community
> progresses).  I'd appreciate it if someone could stand with me and
> facilitate the discussion at the unconference.
>
> https://wiki.postgresql.org/wiki/Scaleout_Design

We didn't have time in the unconference session to discuss these
topics in detail, for you have raised many issues here each of which
deserves discussion individually and in detail.  I wrote a blog post
somewhat related to this topic recently which you can find at
http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html
-- it focuses on a somewhat different set of issues than the ones you
raise but that's good, because we need to hear all perspectives.  In
terms of high-level architecture, I think you are right to wonder
about the possibility of a cloud-native mode based on separating
storage and compute.  Amazon Aurora offers that feature, but we might
want to have it in PostgreSQL.

Another, somewhat different thing that we might want is a fully
distributed database, with a distributed buffer cache, distributed
lock manager, distributed invalidation queue, etc.  That would be like
what Oracle RAC does, but it's a tremendous amount of work, and a lot
of that work has no value by itself.  You don't get the payoff until
it's all working.  There are a few pieces that might be independently
useful, though, like a distributed deadlock detector.  The goal of
this kind of effort is, I think, to support zillions of connections --
scale further than you can with just one node.  This would be a lot
better if combined with the cloud-native storage, but of course that's
even more work.

The FDW approach, of which I have been a supporter for some years now,
is really aiming at a different target, which is to allow efficient
analytics queries across a multi-node cluster.  I think we're getting
pretty close to being able to do that -- IMHO, the last fundamental
building block that we need is asynchronous execution, which Andres is
working on.  After that, it's a matter of adding other features that
people want (like cross-node MVCC) and improving the plans for queries
that still don't perform well (like joins that involve redistributing
one of the tables involved).

We might not want to confine ourselves strictly to the FDW interface
-- for example, I've thought about the idea of introducing a
new relkind for a "distributed table".  A distributed table may be
present on the local node, in which case it can be scanned like a
local table, or it may not be present, in which case it can be scanned
like a foreign table by connecting to a node on which it is present.
The set of nodes on which a table is present is metadata that is
shared throughout the cluster.  Multi-master logical replication
propagates changes between all nodes on which the table is present.
With a concept like this, there is a lot of opportunity to optimize
queries by, for example, deciding to perform a join on a node on which
both input tables are present, to minimize data movement.  But even if
we create something like this, I see it as fundamentally an extension
of the FDW approach that would hopefully manage to reuse a lot of
what's already been built there.  I don't think we need to (or should)
throw away the work that's been done on FDW pushdown and start over --
we should find a way to build on top of it and add ease-of-use and
management features.

In fact, even if we said that we want a fully distributed database,
we'd probably still need some kind of distributed table concept.
Unless every node has a full copy of everything in the database, you
still need to reason about which data is present on which nodes and
optimize queries accordingly.
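
(For concreteness, the FDW-based sharding that is possible today can be
sketched with declarative partitioning plus postgres_fdw roughly as
follows; the server, table, and partition names are made up for
illustration:)

    -- On the node the application connects to: one local partition and
    -- one partition that lives on another node behind postgres_fdw.
    CREATE EXTENSION postgres_fdw;

    CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node2', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard2
        OPTIONS (user 'app');

    CREATE TABLE orders (id bigint, region text, payload text)
        PARTITION BY LIST (region);
    CREATE TABLE orders_east PARTITION OF orders
        FOR VALUES IN ('east');
    CREATE FOREIGN TABLE orders_west PARTITION OF orders
        FOR VALUES IN ('west')
        SERVER shard2 OPTIONS (table_name 'orders_west');

A "distributed table" relkind would presumably replace this manual setup
with a single DDL command plus cluster-wide metadata about which nodes
hold which pieces.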

> By the way, the popularity score of PostgreSQL finally exceeded 400
> points in the DB-Engines ranking!  The popularity difference with the
> top products has shrunk greatly.  Let's make PostgreSQL more popular.
>
> https://db-engines.com/en/ranking
>
>     [as of May 27, 2018]
>     Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
>     PostgreSQL=400.90  MongoDB=342.11
>     (Oracle / PostgreSQL ratio is 3.2)
>
>     [as of Feb 2016, according to a memo at hand]
>     Oracle=1476.14  MySQL=1321.13  SQL Server=??
>     MongoDB=??  PostgreSQL=288.66
>     (Oracle / PostgreSQL ratio is 5.1)

From the chart view, in February 2016, SQL Server was at 1150.227, and
MongoDB was at 305.599.  Generally it looks like the "big three" --
Oracle, MySQL, and SQL Server -- are all slowly declining in the
rankings while PostgreSQL and MongoDB are slowly rising.

It's good to see PostgreSQL going up, but we have a long way to go to
really be in the big leagues.  I think in the last couple of years we
have done a much better job of adding interesting features than we did
for a few years before that.  Between PostgreSQL 9.6 and 11, we will
have added (and in many cases significantly improved) partitioning,
logical replication, parallel query, and stored procedures.  Along the
way we've improved a lot of other things, not least executor
performance.  I don't know how directly those things affect the
DB-Engines ranking, but they are such major things that I have to
believe they make PostgreSQL a real possibility for many users who
wouldn't previously have found it a viable option, and I assume that's
at least part of what is driving the numbers, though there are
probably other factors too, such as Oracle's acquisition of MySQL.  I
think it's pretty clear that we need to both continue to improve some
of these major new features we've added and at the same time keep
introducing even more new things if we want to continue to gain market
share and mind share.  I hope that features like scale-out and also
zheap are going to help us continue to whittle away at the gap, and I
look forward to seeing what else anyone may have in mind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: I'd like to discuss scaleout at PGCon

MauMau
2018-05-31 11:26 GMT+09:00, Robert Haas <[hidden email]>:
> It was nice to meet you in person.

Me too.  And it was very kind of you to help me display the wiki
page and to guide the session.  When I first heard your voice at the
Developer Meeting, I thought Bruce Momjian was speaking, because your
voice sounded similar to his...


> We didn't have time in the unconference session to discuss these
> topics in detail, for you have raised many issues here each of which
> deserves discussion individually and in detail.  I wrote a blog post
> somewhat related to this topic recently which you can find at
> http://rhaas.blogspot.com/2018/05/built-in-sharding-for-postgresql.html

Yes, I read this article before PGCon.  Your articles are always
helpful to catch the current overall situation of the community.


> In
> terms of high-level architecture, I think you are right to wonder
> about the possibility of a cloud-native mode based on separating
> storage and compute.  Amazon Aurora offers that feature, but we might
> want to have it in PostgreSQL.

> Another, somewhat different thing that we might want is a fully
> distributed database, with a distributed buffer cache, distributed
> lock manager, distributed invalidation queue, etc.  That would be like
> what Oracle RAC does, but it's a tremendous amount of work, and a lot
> of that work has no value by itself.  You don't get the payoff until
> it's all working.  There are a few pieces that might be independently
> useful, though, like a distributed deadlock detector.  The goal of
> this kind of effort is, I think, to support zillions of connections --
> scale further than you can with just one node.  This would be a lot
> better if combined with the cloud-native storage, but of course that's
> even more work.

Yes, I can understand the difficulty.  So, I simply wanted to ask
opinions at the unconference on which (hard) direction the community
wants to go and what database we want PostgreSQL to be like
ultimately.  Without that fundamental consensus, development work
might be wasted, facing objections after submitting patches.  As you
mentioned in your blog post and in a past email, I don't think anyone
yet has a clear image of what the scaleout of PostgreSQL should be.
How should we proceed?  Which approach should we take to minimize
rework?

a) Define a functional specification together with the basic overall
architecture (this doesn't mean writing a heavy, detailed design
document or manual; I think a general README or wiki page would be
sufficient).  At that point, I expect we can evaluate how the scaleout
feature would be used and whether it's reasonably easy to use.  Then
we can proceed to design and code each part with confidence -- 2PC,
global consistency, failure detection and failover, distributed lock
management and deadlock handling, etc.

b) Various developers design and code each part, bring together those
patches, and then all try to figure out how to combine them.

I'm in favor of a), at least at the basic architecture level.
Otherwise, an unhappy exchange like this could happen:
"Hey, I made a patch to implement a distributed cache management like
Oracle Cache Fusion."
"No, we don't want features based on shared everything architecture."

I anticipated a decision process at the unconference like this:
"Do we want to build on shared everything architecture?"
"No, because it limits scalability, requires expensive shared storage,
and it won't run on many clouds."
"Then do we want to employ a new architecture like AWS Aurora?"
"That may be interesting.  But AWS could do it because they have an
infinitely scalable storage layer which has been built and used for a
long time.  This architecture may not be an option for us.  But let's
keep our eye on leveraging the services of major cloud vendors, just
as Vertica has done recently.  Cloud services are now like traditional
vendor-specific hardware.  Maybe PostgreSQL should utilize them just
as we use CPU-specific instructions now and GPU/persistent memory in
the near future."
"Then, it seems that we should go with the shared nothing architecture.
Is it OK?"
"Yes."



> The FDW approach, of which I have been a supporter for some years now,
> is really aiming at a different target, which is to allow efficient
> analytics queries across a multi-node cluster.

Oh, I didn't know you support FDW approach mainly for analytics.  I
guessed the first target was OLTP read-write scalability.


> We might not want to confine ourselves strictly to the FDW interface
> -- for example, I've thought about the idea of introducing a
> new relkind for a "distributed table".  A distributed table may be
> present on the local node, in which case it can be scanned like a
> local table, or it may not be present, in which case it can be scanned
> like a foreign table by connecting to a node on which it is present.
> The set of nodes on which a table is present is metadata that is
> shared throughout the cluster.  Multi-master logical replication
> propagates changes between all nodes on which the table is present.
> With a concept like this, there is a lot of opportunity to optimize
> queries by, for example, deciding to perform a join on a node on which
> both input tables are present, to minimize data movement.

I agree.  XL, Oracle Sharding, and possibly MySQL Cluster do that,
too.  It seems like a must-do thing.


> But even if
> we create something like this, I see it as fundamentally an extension
> of the FDW approach that would hopefully manage to reuse a lot of
> what's already been built there.  I don't think we need to (or should)
> throw away the work that's been done on FDW pushdown and start over --
> we should find a way to build on top of it and add ease-of-use and
> management features.

Agreed.  I also think we should not write much code from scratch.  On
the other hand, if we have to support sharding natively without FDW, I
wonder if we can reuse the FDW work -- I mean, extracting the
necessary logic from the FDW code into common functions that the
native sharding code also calls.


> In fact, even if we said that we want a fully distributed database,
> we'd probably still need some kind of distributed table concept.
> Unless every node has a full copy of everything in the database, you
> still need to reason about which data is present on which nodes and
> optimize queries accordingly.

Then, how about building the cluster membership management first,
including node management and failure detection/failover?  I think
that node management is necessary anyway, and other developers could
experiment with other things on top of that cluster infrastructure.
Do you think it would be helpful or wasteful?  I'm trying to find what
we can do for an early scaleout release.


> From the chart view, in February 2016, SQL Server was at 1150.227, and
> MongoDB was at 305.599.  Generally it looks like the "big three" --

Thank you for looking at the chart and telling me the figures.

> I
> think it's pretty clear that we need to both continue to improve some
> of these major new features we've added and at the same time keep
> introducing even more new things if we want to continue to gain market
> share and mind share.  I hope that features like scale-out and also
> zheap are going to help us continue to whittle away at the gap, and I
> look forward to seeing what else anyone may have in mind.

Definitely.  I couldn't agree more.


Regards
MauMau


Re: I'd like to discuss scaleout at PGCon

Robert Haas
On Thu, May 31, 2018 at 8:12 AM, MauMau <[hidden email]> wrote:

> I anticipated a decision process at the unconference like this:
> "Do we want to build on shared everything architecture?"
> "No, because it limits scalability, requires expensive shared storage,
> and it won't run on many clouds."
> "Then do we want to employ a new architecture like AWS Aurora?"
> "That may be interesting.  But AWS could do it because they have an
> infinitely scalable storage layer which is built and used for a long
> time.  This architecture may not be our option.  But let's keep our
> eye on leveraging services of major cloud vendors just like Vertica
> does recently.  Cloud services are now like traditional
> vendor-specific hardware.  Maybe PostgreSQL should utilize them just
> like we use CPU-specific instructions now and GPU/persistent memory in
> the near future."
> "Then, it seems that we should go on the shared nothing architecture.
> Is it OK?"
> "Yes."

I think all of that except for the last two sentences accords with my
view, at least.  As for the last two sentences, I'm happy to see
someone develop some kind of cloud-native storage if they want, but I
have no immediate plans in that area myself.

>> The FDW approach, of which I have been a supporter for some years now,
>> is really aiming at a different target, which is to allow efficient
>> analytics queries across a multi-node cluster.
>
> Oh, I didn't know you support FDW approach mainly for analytics.  I
> guessed the first target was OLTP read-write scalability.

That seems like a harder target to me, because you will have an extra
hop involved -- SQL from the client to the first server, then via SQL
to a second server.  The work of parsing and planning also has to be
done twice, once for the foreign table and again for the table.  For
longer-running queries this overhead doesn't matter as much, but for
short-running queries it is significant.

> Agreed.  I think we should not write much code from scratch, too.  On
> the other hand, if we have to support sharding natively without FDW, I
> wonder if we can reuse the FDW artifact.  I mean, extracting necessary
> logics from FDW into common functions, and native sharding code also
> calls them.

I wouldn't do it that way.  I want to use postgres_fdw as a whole, not
break it up into pieces and reuse the individual bits of code.  We
should also think about whether other FDWs could be part of it,
although maybe that's not realistic.

>> In fact, even if we said that we want a fully distributed database,
>> we'd probably still need some kind of distributed table concept.
>> Unless every node has a full copy of everything in the database, you
>> still need to reason about which data is present on which nodes and
>> optimize queries accordingly.
>
> Then, how about building the cluster membership management first,
> including node management and failure detection/failover?  I think
> that node management is necessary anyway, and other developers can
> experiment other things on that cluster infrastructure.  Do you think
> it would be helpful or wasteful?  I'm finding what we can do for early
> scaleout release.

I don't know what "node management" and "failure detection/failover"
mean specifically.  I'd like to hear proposals, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: I'd like to discuss scaleout at PGCon

MauMau
2018-05-31 22:44 GMT+09:00, Robert Haas <[hidden email]>:

> On Thu, May 31, 2018 at 8:12 AM, MauMau <[hidden email]> wrote:
>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>> guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server.  The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the table.  For
> longer-running queries this overhead doesn't matter as much, but for
> short-running queries it is significant.

Yes, that extra hop and the double parsing/planning were the killer
for our performance goal when we tried to meet our customer's scaleout
needs with XL.  The application executes 82 DML statements in one
transaction.  Those DMLs consist of INSERTs, UPDATEs and SELECTs, each
of which accesses only one row by primary key.  There are only a few
target tables, so the application PREPAREs a few statements and
EXECUTEs them repeatedly.  We placed the coordinator node of XL on the
same host as the application, and the data nodes and GTM on other,
separate hosts.
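
(Roughly, the statement pattern was like the following; the table and
column names are invented for illustration, not the customer's actual
schema:)

    PREPARE get_item (bigint) AS
        SELECT quantity FROM stock WHERE item_id = $1;
    PREPARE add_item (integer, bigint) AS
        UPDATE stock SET quantity = quantity + $1 WHERE item_id = $2;

    BEGIN;
    EXECUTE get_item(1001);
    EXECUTE add_item(5, 1001);
    -- ... roughly 82 such single-row DML statements per transaction ...
    COMMIT;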

The response time with XL was 2.4 times that of plain PostgreSQL, and
the throughput (tps) was 43% of it.  Interestingly, perf showed that
base_yyparse() was the top CPU consumer on both the coordinator and
the data node with XL, whereas on plain PostgreSQL it appeared near
the bottom of the ranking.  The SQL processor should be one layer, not
two layers.

In the above benchmark, each transaction only accessed data on one
data node.  That's what sharding principles recommend.  The FDW
approach would be no problem as long as the application follows the
sharding recommendation.

But not all applications will or can follow the sharding
recommendation.  The above application, which was migrated from a
mainframe, uses INSERTs to load data, inserting rows onto various
nodes.  Considering your concern about double parsing/planning for a
local foreign table and the remote real table, wouldn't the FDW
approach hit a wall?


> I don't know what "node management" and "failure detection/failover"
> mean specifically.  I'd like to hear proposals, though.

That's nothing special or new.  Things like:

* Define a set of nodes that can join the cluster.
* Initialize or configure a node according to its role in the cluster.
* Decommission a node from the cluster.
* Define a node group in which all member nodes have the same data set
for redundancy.
* One command to start and shut down the entire cluster.
* System tables to display the member nodes and node groups.
* Each node's in-memory view of the current cluster state.
* Which other nodes each node monitors, and how.
* Elect a new primary node within a node group when the current
primary node fails.
* Whether each node group should be configured with a master-slave
replication topology, or a multi-master topology like MySQL Group
Replication.

Some of the above may end up looking like XL's facilities, such as the
pgxc_node/pgxc_group system tables, the pgxc_ctl command, and the
CREATE/DROP NODE / NODE GROUP commands.
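
(To make the idea concrete, such DDL might look roughly like this.
This is hypothetical syntax loosely modeled on XL, not an existing
PostgreSQL feature; the node names and the pg_cluster_nodes view are
invented:)

    CREATE NODE dn1 WITH (TYPE = 'datanode', HOST = 'host1', PORT = 5433);
    CREATE NODE dn2 WITH (TYPE = 'datanode', HOST = 'host2', PORT = 5433);
    CREATE NODE GROUP rw_group WITH (dn1, dn2);

    -- Hypothetical system view: this node's view of cluster membership.
    SELECT * FROM pg_cluster_nodes;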


Regards
MauMau


Re: I'd like to discuss scaleout at PGCon

Ashutosh Bapat
On Thu, May 31, 2018 at 11:00 PM, MauMau <[hidden email]> wrote:

> 2018-05-31 22:44 GMT+09:00, Robert Haas <[hidden email]>:
>> On Thu, May 31, 2018 at 8:12 AM, MauMau <[hidden email]> wrote:
>>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>>> guessed the first target was OLTP read-write scalability.
>>
>> That seems like a harder target to me, because you will have an extra
>> hop involved -- SQL from the client to the first server, then via SQL
>> to a second server.  The work of parsing and planning also has to be
>> done twice, once for the foreign table and again for the table.  For
>> longer-running queries this overhead doesn't matter as much, but for
>> short-running queries it is significant.
>
> Yes, that extra hop and double parsing/planning were the killer for
> our performance goal when we tried to meet our customer's scaleout
> needs with XL.  The application executes 82 DML statements in one
> transaction.  Those DMLs consist of INSERT, UPDATE and SELECT that
> only accesses one row with a primary key.  The target tables are only
> a few, so the application PREPAREs a few statements and EXECUTEs them
> repeatedly.  We placed the coordinator node of XL on the same host as
> the application, and data nodes and GTM on other individual nodes.
>

I agree that there's double parsing happening, but I am hesitant to
agree with the double planning claim. We do plan, say, a join between
two foreign tables on the local server, but only to decide whether
it's efficient to join locally or on the foreign server. That means we
create foreign paths for scans on the foreign tables, maybe as many
parameterized plans as the number of join conditions, and one path for
the join pushdown -- that's it. We then create local join paths, but
we need those to decide whether it's efficient to join locally and, if
so, which way. We don't create paths for how the foreign server would
plan that join. That's not double planning, since we do not create the
same paths locally and on the foreign server.

In order to avoid double parsing, we might want to find a way to pass
a "normalized" parse tree down to the foreign server. We need to
normalize the OIDs in the parse tree since those may be different
across the nodes.

>
>
>> I don't know what "node management" and "failure detection/failover"
>> mean specifically.  I'd like to hear proposals, though.
>
> That's nothing special or new.  Things like:

That's a good summary of what we need here. Thanks for the summary.

>
> * Define a set of nodes that can join the cluster.
> * Initialize or configure a node according to its role in the cluster.
> * Decommission a node from the cluster.
> * Define a node group in which all member nodes have the same data set
> for redundancy.
> * One command to start and shutdown the entire cluster.

Right.

> * System tables to display the member nodes and node groups.

I think we need system tables on each node to store the cluster
configuration as seen by that node, not just to display it. But that's
a bit of a detail.

> * Each node's in-memory view of the current cluster state.
> * How each node monitors which other nodes.

That's where we can have multiple approaches. Either each node
monitors the other nodes, or we have some kind of watchdog or a
central place (not necessarily a node; a file shared across the nodes
might as well serve that purpose) from which we can fetch the "last
known" state of the cluster. But that, along with the split-brain
problem, is a much larger problem to solve. I agree we need some way
to know the "last known" status of the cluster and then correct it
when cluster members have conflicting views of it, and do all of that
without much communication overhead.

> * Elect a new primary node within a node group when the current
> primary node fails.

Well, some configurations might want more than one primary or
coordinator node. Having a single primary in itself creates an SPOF,
which should be avoided. But then there will be some workloads which
will be happy with a single primary and the risks arising out of that.

> * Whether each node group should be configured with a master-slaves
> replication topology, or a multi-master topology like MySQL Group
> Replication

Instead of a master-slave configuration, we might want to use logical
replication or some such method to create replicas of tables on
multiple nodes in a cluster and let the optimizer take advantage of
that for join push-down or load balancing.
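
(With the logical replication that already exists, replicating a table
to another node looks roughly like this; the host, database, table, and
publication names are illustrative:)

    -- On the node that owns the table:
    CREATE PUBLICATION items_pub FOR TABLE items;

    -- On each node that should hold a replica (the items table must
    -- already exist there with a matching definition):
    CREATE SUBSCRIPTION items_sub
        CONNECTION 'host=node1 dbname=app'
        PUBLICATION items_pub;

The missing glue is making the optimizer aware that such replicas
exist, so that it can push a join down to whichever node already has
both tables.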

>
> Some of the above may end up with XL's things like
> pgxc_node/pgxc_group system tables, pgxc_ctl command, CREATE/DROP
> NODE/NODE GROUP commands, etc.
>

I agree.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: I'd like to discuss scaleout at PGCon

Simon Riggs
On 1 June 2018 at 15:44, Ashutosh Bapat <[hidden email]> wrote:

> On Thu, May 31, 2018 at 11:00 PM, MauMau <[hidden email]> wrote:
>> 2018-05-31 22:44 GMT+09:00, Robert Haas <[hidden email]>:
>>> On Thu, May 31, 2018 at 8:12 AM, MauMau <[hidden email]> wrote:
>>>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>>>> guessed the first target was OLTP read-write scalability.
>>>
>>> That seems like a harder target to me, because you will have an extra
>>> hop involved -- SQL from the client to the first server, then via SQL
>>> to a second server.  The work of parsing and planning also has to be
>>> done twice, once for the foreign table and again for the table.  For
>>> longer-running queries this overhead doesn't matter as much, but for
>>> short-running queries it is significant.
>>
>> Yes, that extra hop and double parsing/planning were the killer for
>> our performance goal when we tried to meet our customer's scaleout
>> needs with XL.  The application executes 82 DML statements in one
>> transaction.  Those DMLs consist of INSERT, UPDATE and SELECT that
>> only accesses one row with a primary key.  The target tables are only
>> a few, so the application PREPAREs a few statements and EXECUTEs them
>> repeatedly.  We placed the coordinator node of XL on the same host as
>> the application, and data nodes and GTM on other individual nodes.
>>
>
> I agree that there's double parsing happening, but I am hesitant to
> agree with the double planning claim. We do plan, let's say a join
> between two foreign tables, on the local server, but that's only to
> decide whether it's efficient to join locally or on the foreign
> server. That means we create foreign paths for scan on the foreign
> tables, may be as many parameterized plans as the number of join
> conditions, and one path for the join pushdown that's it. We then
> create local join paths but we need those to decide whether it's
> efficient to join locally and if yes, which way. But don't create
> paths as to how the foreign server would plan that join. That's not
> double planning since we do not create same paths locally and on the
> foreign server.
>
> In order to avoid double parsing, we might want to find a way to pass
> a "normalized" parse tree down to the foreign server. We need to
> normalize the OIDs in the parse tree since those may be different
> across the nodes.

Passing detailed info between servers is exactly what XL does.

It requires us to define a cluster, exactly as XL does.

And yes, it's a good idea to replicate some tables to all nodes, as XL does.

So it seems we have at last some agreement that some of the things XL
does are the correct approaches.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: I'd like to discuss scaleout at PGCon

Simon Riggs
In reply to this post by MauMau
On 1 June 2018 at 04:00, MauMau <[hidden email]> wrote:

> The SQL processor should be one layer, not two layers.

For OLTP, that would be best. But it would be restricted to
single-node requests, leaving you the problem of how you know ahead of
time whether an SQL statement was single node or not.

Using a central coordinator node allows us to hide the decision of
single-node/multi-node from the user which seems essential for general
SQL. If you are able to restrict the types of requests users make then
we can do direct access to partitions - so there is scope for a
single-node API, as Mongo provides.

Using a central coordinator also allows multi-node transaction
control, global deadlock detection etc..

And that is why both XL and "FDW approach" rely on a central coordinator.

FDWs alone are not enough. It is clear that some more tight coupling
is required to get things to work well. For example, supporting SQL
query plans that allow for redistribution of data for joins.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: I'd like to discuss scaleout at PGCon

Tom Lane-2
In reply to this post by Ashutosh Bapat
Ashutosh Bapat <[hidden email]> writes:
> In order to avoid double parsing, we might want to find a way to pass
> a "normalized" parse tree down to the foreign server. We need to
> normalize the OIDs in the parse tree since those may be different
> across the nodes.

I don't think this is a good idea at all.  It breaks any hope of
supporting remote servers that are not the identical version to the local
one (since their parsetrees might be different).  And "normalized OIDs"
sounds like "pie in the sky".  You might get away with assuming that
built-in functions have stable OIDs, but you can't expect that for
functions in extensions.

                        regards, tom lane


Re: I'd like to discuss scaleout at PGCon

Ashutosh Bapat
In reply to this post by Simon Riggs
On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <[hidden email]> wrote:
>
> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection etc..

But that becomes an SPOF and then we have to configure a standby for
that. I am not saying that that's a bad design but it's not very good
for many work-loads. But it would be good if we could avoid any
"central server" in this configuration.

>
> And that is why both XL and "FDW approach" rely on a central coordinator.

I don't think we ever specified that "FDW approach" "relies" on a
central coordinator. One could configure and setup a cluster with
multiple coordinators using FDWs.

>
> FDWs alone are not enough. It is clear that some more tight coupling
> is required to get things to work well. For example, supporting SQL
> query plans that allow for redistribution of data for joins.

I think partitioning + FDW provide the basic infrastructure for
distributing data and planning queries that work with such data. We
need more glue to support node management and cluster configuration.
So, I agree with your statement. But I think it was clear from the
beginning that we need more than FDW and partitioning.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: I'd like to discuss scaleout at PGCon

Ashutosh Bapat
In reply to this post by Tom Lane-2
On Fri, Jun 1, 2018 at 11:27 AM, Tom Lane <[hidden email]> wrote:

> Ashutosh Bapat <[hidden email]> writes:
>> In order to avoid double parsing, we might want to find a way to pass
>> a "normalized" parse tree down to the foreign server. We need to
>> normalize the OIDs in the parse tree since those may be different
>> across the nodes.
>
> I don't think this is a good idea at all.  It breaks any hope of
> supporting remote servers that are not the identical version to the local
> one (since their parsetrees might be different).  And "normalized OIDs"
> sounds like "pie in the sky".  You might get away with assuming that

Sorry for confusing writeup. I didn't mean "normalized OIDs" as I
mentioned in my last sentence. I meant "normalized parse-tree" as in
the first sentence. In order to normalize parse trees, we need to at
least replace various OIDs in parse-tree with something that the
foreign server will understand correctly like table name on the
foreign table pointed to by local foreign table OR (schema qualified)
function names  and so on. There might be more things to "normalize"
in the parse tree other than OIDs, but I can't think of anything right
now.


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: I'd like to discuss scaleout at PGCon

Merlin Moncure-2
In reply to this post by Robert Haas
On Wed, May 30, 2018 at 9:26 PM Robert Haas <[hidden email]> wrote:
> The FDW approach, of which I have been a supporter for some years now,
> is really aiming at a different target, which is to allow efficient
> analytics queries across a multi-node cluster.  I think we're getting
> pretty close to being able to do that -- IMHO, the last fundamental
> building block that we need is asynchronous execution, which Andres is
> working on.  After that, it's a matter of adding other features that
> people want (like cross-node MVCC) and improving the plans for queries
> that still don't perform well (like joins that involve redistributing
> one of the tables involved).

FWIW, distributed analytical queries are the right market to be in.
This is the field in which I work, and this is where the action is at.
I am very, very sure about this.  My view is that many of the
existing solutions to this problem (in particular hadoop-class
solutions) have major architectural downsides that make them
inappropriate in use cases that postgres really shines at -- direct
hookups to low latency applications, for example.  postgres is
fundamentally a more capable 'node' with its multiple man-millennia of
engineering behind it.  Unlimited vertical scaling (RAC etc) is
interesting too, but this is not the way the market is moving, as
hardware advancements have reduced or eliminated the need for that in
many spheres.

The direction of the project is sound and we are on the cusp of the
point where multiple independent coalescing features (FDW, logical
replication, parallel query, executor enhancements) will open new
scaling avenues that will not require trading off the many other
benefits of SQL that competing contemporary solutions might.  The
broader development market is starting to realize this and that is a
major driver of the recent upswing in popularity.  This is benefiting
me tremendously personally due to having gone 'all-in' with postgres
almost 20 years ago :-D. (Time sure flies)    These are truly
wonderful times for the community.

merlin


Re: I'd like to discuss scaleout at PGCon

Simon Riggs
In reply to this post by Ashutosh Bapat
On 1 June 2018 at 16:56, Ashutosh Bapat <[hidden email]> wrote:

> On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <[hidden email]> wrote:
>>
>> Using a central coordinator also allows multi-node transaction
>> control, global deadlock detection etc..
>
> But that becomes an SPOF and then we have to configure a standby for
> that. I am not saying that that's a bad design but it's not very good
> for many work-loads. But it would be good if we could avoid any
> "central server" in this configuration.
>
>>
>> And that is why both XL and "FDW approach" rely on a central coordinator.
>
> I don't think we ever specified that "FDW approach" "relies" on a
> central coordinator. One could configure and setup a cluster with
> multiple coordinators using FDWs.

Yes, of course. You're just misunderstanding me. I'm talking about a
query coordinator "role". There can be many coordinator components and
they can be spread out in various ways, but for any one SQL query
there needs to be one coordinator node. Not a SPOF.

>>
>> FDWs alone are not enough. It is clear that some more tight coupling
>> is required to get things to work well. For example, supporting SQL
>> query plans that allow for redistribution of data for joins.
>
> I think partitioning + FDW provide basic infrastructure for
> distributing data, planning queries working with such data. We need
> more glue to support node management, cluster configuration. So, I
> agree with your statement. But I think it was clear from the beginning
> that we need more than FDW and partitioning.

No, it wasn't clear. But I'm glad to hear it. It might actually work then.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: I'd like to discuss scaleout at PGCon

Ashutosh Bapat
On Sat, Jun 2, 2018 at 4:05 AM, Simon Riggs <[hidden email]> wrote:

> On 1 June 2018 at 16:56, Ashutosh Bapat <[hidden email]> wrote:
>> On Fri, Jun 1, 2018 at 11:10 AM, Simon Riggs <[hidden email]> wrote:
>>>
>>> Using a central coordinator also allows multi-node transaction
>>> control, global deadlock detection etc..
>>
>> But that becomes an SPOF and then we have to configure a standby for
>> that. I am not saying that that's a bad design but it's not very good
>> for many work-loads. But it would be good if we could avoid any
>> "central server" in this configuration.
>>
>>>
>>> And that is why both XL and "FDW approach" rely on a central coordinator.
>>
>> I don't think we ever specified that "FDW approach" "relies" on a
>> central coordinator. One could configure and setup a cluster with
>> multiple coordinators using FDWs.
>
> Yes, of course. You're just misunderstanding me. I'm talking about a
> query coordinator "role". There can be many coordinator components and
> they can be spread out in various ways, but for any one SQL query
> there needs to be one coordinator node. Not a SPOF.

In your earlier mail, which is included above, you mentioned a central
coordinator for multi-node transaction control and global deadlock
detection. That doesn't sound like a "query coordinator role". It
sounds more like GTM, which is an SPOF. Anyway, I am happy to clarify
that the "FDW approach" relies on a query coordinator, the server
which faces the client. But I don't think we have decided how
transaction management and deadlock detection would work in a shared
nothing cluster of PostgreSQL servers. There was a discussion at the
developer unconference this year, but I was not part of it, as I was
holding another session at the same time. Maybe somebody who attended
that session can post a summary here or provide a link to a summary
written elsewhere.

>
>>>
>>> FDWs alone are not enough. It is clear that some more tight coupling
>>> is required to get things to work well. For example, supporting SQL
>>> query plans that allow for redistribution of data for joins.
>>
>> I think partitioning + FDW provide basic infrastructure for
>> distributing data, planning queries working with such data. We need
>> more glue to support node management, cluster configuration. So, I
>> agree with your statement. But I think it was clear from the beginning
>> that we need more than FDW and partitioning.
>
> No, it wasn't clear. But I'm glad to hear it. It might actually work then.

Good to see some agreement.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: I'd like to discuss scaleout at PGCon

Simon Riggs
On 2 June 2018 at 22:46, Ashutosh Bapat <[hidden email]> wrote:

>>>> And that is why both XL and "FDW approach" rely on a central coordinator.
>>>
>>> I don't think we ever specified that "FDW approach" "relies" on a
>>> central coordinator. One could configure and setup a cluster with
>>> multiple coordinators using FDWs.
>>
>> Yes, of course. You're just misunderstanding me. I'm talking about a
>> query coordinator "role". There can be many coordinator components and
>> they can be spread out in various ways, but for any one SQL query
>> there needs to be one coordinator node. Not a SPOF.
>
> In your earlier mail, which is included above, you mentioned central
> coordinator for multi-node transaction control and global deadlock
> detection. That doesn't sound like a "query coordinator role". It
> sounds more like GTM, which is an SPOF.

In XL, GTM is a single component managing transaction IDs. That has a
standby, so it is not a SPOF.

But that is not what I mean. I don't believe that a GTM-style
component is necessary in a future in-core scalability solution.

Each incoming query needs to be planned and executed from one
coordinator component, then the work performed across many workers on
different nodes (or just one). We could have coordinator components on
each worker node, or we could have a set of coordinator nodes and a
set of worker nodes.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: I'd like to discuss scaleout at PGCon

Ashutosh Bapat
On Sun, Jun 3, 2018 at 2:00 AM, Simon Riggs <[hidden email]> wrote:
>
> In XL, GTM is a single component managing transaction IDs. That has a
> standby, so is not a SPOF.
>
> But that is not what I mean. I don't believe that a GTM-style
> component is necessary in a future in-core scalability solution.
>

I agree. I think a distributed algorithm, which does not need a
single GTM-style node, would be better. That automatically provides
high availability without configuring a standby.

> Each incoming query needs to be planned and executed from one
> coordinator component, then the work performed across many workers on
> different nodes (or just one).

Each node needs to be configured and maintained. That requires effort,
so we need to keep the number of nodes to a minimum. With a
coordinator/worker node segregation, we require at least two nodes in
a cluster, and just that configuration doesn't provide much
scalability. Having each node function as both coordinator (facing
clients) and worker (facing other coordinators) keeps the number of
nodes to a minimum. It is also good for HA.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: I'd like to discuss scaleout at PGCon

MauMau
In reply to this post by Simon Riggs
From: Simon Riggs
> Passing detailed info between servers is exactly what XL does.
>
> It requires us to define a cluster, exactly as XL does.
>
> And yes, it's a good idea to replicate some tables to all nodes, as
> XL does.
>
> So it seems we have at last some agreement that some of the things
> XL does are the correct approaches.

Exactly.  I was impressed when I read the paper on XC for the first
time.  I respect what Suzuki-san and the people from NTT,
EnterpriseDB, and 2ndQuadrant have done for XC/XL.  Just as some
people believe we should leverage the FDW work, I think we should
leverage the code and ideas of XC/XL.

Regards
MauMau



Re: I'd like to discuss scaleout at PGCon

MauMau
In reply to this post by Simon Riggs
From: Robert Haas
On Thu, May 31, 2018 at 8:12 AM, MauMau <[hidden email]> wrote:
>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>> guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server.  The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the table.  For
> longer-running queries this overhead doesn't matter as much, but for
> short-running queries it is significant.


From: Simon Riggs
On 1 June 2018 at 04:00, MauMau <[hidden email]> wrote:
>> The SQL processor should be one layer, not two layers.

> For OLTP, that would be best. But it would be restricted to
> single-node requests, leaving you the problem of how you know ahead of
> time whether an SQL statement was single node or not.
>
> Using a central coordinator node allows us to hide the decision of
> single-node/multi-node from the user which seems essential for general
> SQL. If you are able to restrict the types of requests users make then
> we can do direct access to partitions - so there is scope for a
> single-node API, as Mongo provides.

I don't think an intermediate server like the coordinator in XL is
necessary.  That extra hop can be eliminated by putting both the
coordinator and the data node roles in the same server process.  That
is, the node to which an application connects communicates with other
nodes only when it does not have the necessary data locally.

Furthermore, an extra hop and double parsing/planning could matter for
analytic queries, too.  For example, SAP HANA boasts of scanning 1
billion rows in one second.  In HANA's scaleout architecture, an
application can connect to any worker node and the node communicates
with other nodes only when necessary (there's one special node called
"master", but it manages the catalog and transactions; it's not an
extra hop like the coordinator in XL).  Vertica is an MPP analytics
database, but it doesn't have a node like the coordinator, either.  To
achieve maximum performance for real-time queries, the scaleout
architecture should avoid an extra hop when possible.


> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection etc..

VoltDB does not have a mandatory extra hop like the coordinator in XL.
Our proprietary RDBMS named Symfoware, which is not based on
PostgreSQL, also doesn't have an extra hop, and can handle distributed
transactions and deadlock detection/resolution without any special
node like GTM.


Regards
MauMau



Re: I'd like to discuss scaleout at PGCon

MauMau
In reply to this post by Ashutosh Bapat
From: Ashutosh Bapat
> In order to normalize parse trees, we need to at
> least replace various OIDs in parse-tree with something that the
> foreign server will understand correctly like table name on the
> foreign table pointed to by local foreign table OR (schema qualified)
> function names  and so on.

Yes, that's the drawback of each node in the cluster having different
OIDs for the same object.  That applies to XL, too.  In XL, the data
node returns the data type names of the columns in the result set to
the coordinator.  The coordinator then seemed to parse each data type
name with base_yyparse() to convert the name to its OID on the
coordinator.  That's why base_yyparse() appeared at the top of the
perf profile.  That repeated typename-to-OID conversion could be
reduced by caching the conversion result, as PostgreSQL's logical
replication does.  But managing the catalog in one place and using the
same OID values everywhere seems more concise to me as a concept.


Regards
MauMau

