Pluggable Storage - Andres's take


Re: Pluggable Storage - Andres's take

Haribabu Kommi-2

On Tue, Oct 23, 2018 at 5:49 PM Haribabu Kommi <[hidden email]> wrote:
I was able to construct a simple test and found the problem. The issue is with
the following SQL:

SELECT *
   INTO TABLE xacttest
   FROM aggtest;

During the processing of the above query, the tuple selected from aggtest is
passed to the intorel_receive() function, and the same tuple is reused for the
insert. Because of this, the tuple's xmin is overwritten, and selecting the
data from another query then fails. I fixed this issue by materializing the slot.
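The fix essentially amounts to the following (a minimal sketch of the change;
materializing copies the tuple into slot-local memory, so the insert path can
no longer scribble on the tuple still being scanned):

    /* In intorel_receive(), before table_insert(): copy the tuple into
     * memory owned by the slot, so the insert cannot overwrite header
     * fields (such as xmin) of the tuple still being scanned. */
    if (!TTS_IS_VIRTUAL(slot))
        ExecMaterializeSlot(slot);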

The wrong patch was attached in the earlier mail, sorry for the inconvenience.
The proper fix patch is attached here.

I will look into the isolation test failures.

Regards,
Haribabu Kommi
Fujitsu Australia

0002-Materialize-the-slot-before-they-are-processed-using.patch (1K)

Re: Pluggable Storage - Andres's take

Haribabu Kommi-2

On Tue, Oct 23, 2018 at 6:11 PM Haribabu Kommi <[hidden email]> wrote:

> On Tue, Oct 23, 2018 at 5:49 PM Haribabu Kommi <[hidden email]> wrote:
> I was able to construct a simple test and found the problem. The issue is
> with the following SQL:
>
> SELECT *
>    INTO TABLE xacttest
>    FROM aggtest;
>
> During the processing of the above query, the tuple selected from aggtest is
> passed to the intorel_receive() function, and the same tuple is reused for
> the insert. Because of this, the tuple's xmin is overwritten, and selecting
> the data from another query then fails. I fixed this issue by materializing
> the slot.
>
> The wrong patch was attached in the earlier mail, sorry for the
> inconvenience. The proper fix patch is attached here.
>
> I will look into the isolation test failures.

Here I have attached the cumulative patch with all the fixes shared in my
earlier mails. Except for the fast_default test, the rest of the test failures
are fixed.

Regards,
Haribabu Kommi
Fujitsu Australia

0002-init-fork-API.patch (9K)
0001-Further-fixes-and-cleanup.patch (16K)

Re: Pluggable Storage - Andres's take

Dmitry Dolgov
> On Fri, 26 Oct 2018 at 13:25, Haribabu Kommi <[hidden email]> wrote:
>
> Here I have attached the cumulative patch with all the fixes shared in my earlier mails.
> Except for the fast_default test, the rest of the test failures are fixed.

Hi,

If I understand correctly, these patches are for the branch "pluggable-storage"
in [1] (at least I couldn't apply them cleanly to the "pluggable-zheap" branch),
right? I've tried to experiment a bit with the current state of the patch, and
accidentally stumbled upon what seems to be an issue: when I run pgbench
against it with a significant number of clients and the script [2]:

    $ pgbench -T 60 -c 128 -j 64 -f zipfian.sql

For some clients, I got the following error:

    client 117 aborted in command 5 (SQL) of script 0; ERROR:
unrecognized heap_update status: 1

This problem couldn't be reproduced on the master branch, so I've tried to
investigate it. It comes from nodeModifyTable.c:1267, when we get
HeapTupleInvisible as a result, and this value in turn comes from
table_lock_tuple. Everything points to the new way of handling the
HeapTupleUpdated result from heap_update, where the table_lock_tuple call was
introduced. Since I don't see anything similar in the master branch, can anyone
clarify why this lock is necessary here? Out of curiosity I've rearranged the
code that handles HeapTupleUpdated back to a switch and removed table_lock_tuple
(see the attached patch, which can be applied on top of the latest two patches
posted by Haribabu), and it seems to solve the issue.

[1]: https://github.com/anarazel/postgres-pluggable-storage
[2]: https://gist.github.com/erthalion/c85ba0e12146596d24c572234501e756

unrecognized_heap_status.patch (3K)

Re: Pluggable Storage - Andres's take

Haribabu Kommi-2
On Mon, Oct 29, 2018 at 7:40 AM Dmitry Dolgov <[hidden email]> wrote:
> On Fri, 26 Oct 2018 at 13:25, Haribabu Kommi <[hidden email]> wrote:
>
> Here I have attached the cumulative patch with all the fixes shared in my earlier mails.
> Except for the fast_default test, the rest of the test failures are fixed.

Hi,

> If I understand correctly, these patches are for the branch "pluggable-storage"
> in [1] (at least I couldn't apply them cleanly to the "pluggable-zheap" branch),
> right?

Yes, the attached patches are for the pluggable-storage branch.
 
> I've tried to experiment a bit with the current state of the patch, and
> accidentally stumbled upon what seems to be an issue: when I run pgbench
> against it with a significant number of clients and the script [2]:
>
>     $ pgbench -T 60 -c 128 -j 64 -f zipfian.sql

Thanks for testing the patches.
 
> For some clients, I got the following error:
>
>     client 117 aborted in command 5 (SQL) of script 0; ERROR:
> unrecognized heap_update status: 1

This error is for the tuple state HeapTupleInvisible. As per the comments in
heap_lock_tuple, this state is possible during ON CONFLICT UPDATE, but because
table_lock_tuple was reorganized out of EvalPlanQual(), the invisible result is
now returned in other cases as well. This case was missed in the new code.
 
> This problem couldn't be reproduced on the master branch, so I've tried to
> investigate it. It comes from nodeModifyTable.c:1267, when we get
> HeapTupleInvisible as a result, and this value in turn comes from
> table_lock_tuple. Everything points to the new way of handling the
> HeapTupleUpdated result from heap_update, where the table_lock_tuple call was
> introduced. Since I don't see anything similar in the master branch, can
> anyone clarify why this lock is necessary here?

In the master branch code also, there is a tuple lock that happens inside the
EvalPlanQual() function, but in the pluggable-storage code the lock is kept
outside, along with some function call rearrangements, to make it easier for
table access methods to provide their own MVCC implementation.
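Roughly sketched, using the signature of the patch's tuple_lock callback quoted
later in this thread (illustrative only, not the exact patch code):

    /* Sketch of the rearranged flow: instead of EvalPlanQual() taking
     * the tuple lock internally, the caller locks the newest version
     * first and only then re-evaluates the quals against it. */
    case HeapTupleUpdated:
        result = table_lock_tuple(relation, tupleid, estate->es_snapshot,
                                  slot, estate->es_output_cid,
                                  lockmode, LockWaitBlock, flags, &hufd);
        if (result == HeapTupleMayBeUpdated)
        {
            /* newest version is locked; now run the EvalPlanQual recheck */
        }
        break;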
 
> Out of curiosity I've rearranged the code that handles HeapTupleUpdated back
> to a switch and removed table_lock_tuple (see the attached patch, which can
> be applied on top of the latest two patches posted by Haribabu), and it seems
> to solve the issue.

Thanks for the patch. I haven't reproduced the problem, but based on the error
in your mail, the attached draft patch handling invisible tuples in the update
and delete cases should also fix it.
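The shape of the fix is essentially an explicit case for the invisible state
instead of letting it fall through to the default branch (a sketch, not the
attached patch itself):

    /* Sketch: report the invisible tuple explicitly, as callers of
     * heap_lock_tuple do on master, instead of reaching the default
     * "unrecognized heap_update status" branch. */
    case HeapTupleInvisible:
        elog(ERROR, "attempted to lock invisible tuple");
        break;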

Regards,
Haribabu Kommi
Fujitsu Australia

0001-Handling-HeapTupleInvisible-case.patch (1K)

Re: Pluggable Storage - Andres's take

Dmitry Dolgov
> On Mon, 29 Oct 2018 at 05:56, Haribabu Kommi <[hidden email]> wrote:
>
>> This problem couldn't be reproduced on the master branch, so I've tried to
>> investigate it. It comes from nodeModifyTable.c:1267, when we get
>> HeapTupleInvisible as a result, and this value in turn comes from
>> table_lock_tuple. Everything points to the new way of handling the
>> HeapTupleUpdated result from heap_update, where the table_lock_tuple call
>> was introduced. Since I don't see anything similar in the master branch,
>> can anyone clarify why this lock is necessary here?
>
>
> In the master branch code also, there is a tuple lock that happens inside
> the EvalPlanQual() function, but in the pluggable-storage code the lock is
> kept outside, along with some function call rearrangements, to make it
> easier for table access methods to provide their own MVCC implementation.

Yes, now I see it, thanks. Also I can confirm that the attached patch solves
this issue.

FYI, alongside reviewing the code changes I've run a few performance tests
(that's why I hit this issue with pgbench in the first place). At high
concurrency I see a small performance degradation so far in comparison with the
master branch (about 2-5% in average latency, depending on the level of
concurrency), but I can't really say why exactly (perf just shows barely
noticeable overhead here and there, so maybe what I see is actually a
cumulative impact).


Re: Pluggable Storage - Andres's take

Haribabu Kommi-2
On Wed, Oct 31, 2018 at 9:34 PM Dmitry Dolgov <[hidden email]> wrote:
> On Mon, 29 Oct 2018 at 05:56, Haribabu Kommi <[hidden email]> wrote:
>
>> This problem couldn't be reproduced on the master branch, so I've tried to
>> investigate it. It comes from nodeModifyTable.c:1267, when we get
>> HeapTupleInvisible as a result, and this value in turn comes from
>> table_lock_tuple. Everything points to the new way of handling the
>> HeapTupleUpdated result from heap_update, where the table_lock_tuple call
>> was introduced. Since I don't see anything similar in the master branch,
>> can anyone clarify why this lock is necessary here?
>
>
> In the master branch code also, there is a tuple lock that happens inside
> the EvalPlanQual() function, but in the pluggable-storage code the lock is
> kept outside, along with some function call rearrangements, to make it
> easier for table access methods to provide their own MVCC implementation.

> Yes, now I see it, thanks. Also I can confirm that the attached patch solves
> this issue.

Thanks for the testing and confirmation.
 
> FYI, alongside reviewing the code changes I've run a few performance tests
> (that's why I hit this issue with pgbench in the first place). At high
> concurrency I see a small performance degradation so far in comparison with
> the master branch (about 2-5% in average latency, depending on the level of
> concurrency), but I can't really say why exactly (perf just shows barely
> noticeable overhead here and there, so maybe what I see is actually a
> cumulative impact).

Thanks for sharing your observations. I will also analyze this and try to find
the performance bottlenecks that are causing the overhead.

Here I have attached the cumulative fixes to the patches, the new API additions
for zheap, and a basic outline of the documentation.

Regards,
Haribabu Kommi
Fujitsu Australia

0003-First-draft-of-pluggable-storage-documentation.patch (43K)
0002-New-API-s-are-added.patch (12K)
0001-Further-fixes-and-cleanup.patch (18K)

Re: Pluggable Storage - Andres's take

Haribabu Kommi-2

On Fri, Nov 2, 2018 at 11:17 AM Haribabu Kommi <[hidden email]> wrote:
On Wed, Oct 31, 2018 at 9:34 PM Dmitry Dolgov <[hidden email]> wrote: 
> FYI, alongside reviewing the code changes I've run a few performance tests
> (that's why I hit this issue with pgbench in the first place). At high
> concurrency I see a small performance degradation so far in comparison with
> the master branch (about 2-5% in average latency, depending on the level of
> concurrency), but I can't really say why exactly (perf just shows barely
> noticeable overhead here and there, so maybe what I see is actually a
> cumulative impact).

> Thanks for sharing your observations. I will also analyze this and try to
> find the performance bottlenecks that are causing the overhead.

I tried running the pgbench performance tests with a minimal number of clients
on my laptop and didn't find any performance issues; maybe the issue is visible
only with higher client counts. Even with the perf tool, I am not able to
identify a clear problem function. As you said, combining all the changes may
lead to some overhead.

Here I have attached the cumulative patches with further fixes and also basic syntax regression tests.

Regards,
Haribabu Kommi
Fujitsu Australia

0003-First-draft-of-pluggable-storage-documentation.patch (43K)
0002-New-API-s-are-added.patch (12K)
0001-Further-fixes-and-cleanup.patch (32K)

Re: Pluggable Storage - Andres's take

Asim R P
Ashwin (copied) and I got a chance to go through the latest code from
Andres' github repository.  We would like to share some
comments/questions:

The TupleTableSlot argument is well suited for row-oriented storage.
For a column-oriented storage engine, a projection list indicating the
columns to be scanned may be necessary.  Is it possible to share this
information with the current interface?
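For instance (purely hypothetical, just to illustrate the idea), the scan-begin
callback could take the set of needed attributes:

    /* Hypothetical addition: a bitmapset of attribute numbers lets a
     * column store fetch only the columns the executor will read. */
    TableScanDesc (*scan_begin_with_projection) (Relation relation,
                                                 Snapshot snapshot,
                                                 Bitmapset *proj_attnos);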

We realized that DDLs such as heap_create_with_catalog() are not
generalized.  Haribabu's latest patch that adds
SetNewFileNode_function() and CreateInitFork_function() is a step
towards this end.  However, the current API assumes that the storage
engine uses relation forks.  Isn't that too restrictive?

TupleDelete_function() accepts changingPart as a parameter to indicate
whether this deletion is part of a movement from one partition to another.
Partitioning is a higher-level abstraction than storage.
Ideally, the storage layer should have no knowledge of partitioning. The
tuple delete API should not accept any parameter related to
partitioning.

The API needs to be more accommodating towards block sizes used in
storage engines.  Currently, the same block size as heap seems to be
assumed, as is evident from the types of some members of the generic
scan object:

typedef struct TableScanDescData
{
  /* state set up at initscan time */
  BlockNumber rs_nblocks;     /* total number of blocks in rel */
  BlockNumber rs_startblock;  /* block # to start at */
  BlockNumber rs_numblocks;   /* max number of blocks to scan */
  /* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
  bool        rs_syncscan;    /* report location to syncscan logic? */
}           TableScanDescData;

Using bytes to represent this information would be more generic, e.g.
rs_startlocation as a byte offset instead of rs_startblock, and so on.
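Something along these lines, purely as an illustration of the suggestion:

    /* Hypothetical block-size-agnostic variant: positions are byte
     * offsets into the relation rather than heap block numbers. */
    typedef struct TableScanDescData
    {
        uint64   rs_totalsize;      /* total size of rel, in bytes */
        uint64   rs_startlocation;  /* byte offset to start at */
        uint64   rs_scansize;       /* max number of bytes to scan */
        bool     rs_syncscan;       /* report location to syncscan logic? */
    } TableScanDescData;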

Asim


Re: Pluggable Storage - Andres's take

Haribabu Kommi-2

On Thu, Nov 22, 2018 at 1:12 PM Asim R P <[hidden email]> wrote:
> Ashwin (copied) and I got a chance to go through the latest code from
> Andres' github repository.  We would like to share some
> comments/questions:

Thanks for the review.
 
> The TupleTableSlot argument is well suited for row-oriented storage.
> For a column-oriented storage engine, a projection list indicating the
> columns to be scanned may be necessary.  Is it possible to share this
> information with the current interface?

Currently all the interfaces are designed for row-oriented storage; as you
said, we need a new API for a projection list. The current patch set itself
is big and needs to be stabilized first; in the next set of patches, new
APIs useful for columnar storage will be added.
 
> We realized that DDLs such as heap_create_with_catalog() are not
> generalized.  Haribabu's latest patch that adds
> SetNewFileNode_function() and CreateInitFork_function() is a step
> towards this end.  However, the current API assumes that the storage
> engine uses relation forks.  Isn't that too restrictive?

The current set of APIs has many assumptions and uses the existing framework.
Thanks for your point; I will check how to enhance it.
 
> TupleDelete_function() accepts changingPart as a parameter to indicate
> whether this deletion is part of a movement from one partition to another.
> Partitioning is a higher-level abstraction than storage.
> Ideally, the storage layer should have no knowledge of partitioning. The
> tuple delete API should not accept any parameter related to
> partitioning.

Thanks for your point; I will look into how to extract it.
 
> The API needs to be more accommodating towards block sizes used in
> storage engines.  Currently, the same block size as heap seems to be
> assumed, as is evident from the types of some members of the generic
> scan object:
>
> typedef struct TableScanDescData
> {
>   /* state set up at initscan time */
>   BlockNumber rs_nblocks;     /* total number of blocks in rel */
>   BlockNumber rs_startblock;  /* block # to start at */
>   BlockNumber rs_numblocks;   /* max number of blocks to scan */
>   /* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
>   bool        rs_syncscan;    /* report location to syncscan logic? */
> }           TableScanDescData;
>
> Using bytes to represent this information would be more generic, e.g.
> rs_startlocation as a byte offset instead of rs_startblock, and so on.

I suspect this is not the only thing that needs a change to support different
block sizes for different storage interfaces. Thanks for your point; this can
definitely be taken care of in the next set of patches.

Andres, now that the TupleTableSlot changes are committed, do you want me to
share the rebased pluggable storage patch? Or are you already working on it?

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

Andres Freund
Hi,

FWIW, now that oids are removed, and the tuple table slot abstraction
got in, I'm working on rebasing the pluggable storage patchset ontop of
that.


On 2018-11-27 12:48:36 +1100, Haribabu Kommi wrote:

> On Thu, Nov 22, 2018 at 1:12 PM Asim R P <[hidden email]> wrote:
>
> > Ashwin (copied) and I got a chance to go through the latest code from
> > Andres' github repository.  We would like to share some
> > comments/questions:
> >
>
> Thanks for the review.
>
>
> > The TupleTableSlot argument is well suited for row-oriented storage.
> > For a column-oriented storage engine, a projection list indicating the
> > columns to be scanned may be necessary.  Is it possible to share this
> > information with the current interface?
> >
>
> Currently all the interfaces are designed for row-oriented storage; as you
> said, we need a new API for a projection list. The current patch set itself
> is big and needs to be stabilized first; in the next set of patches, new
> APIs useful for columnar storage will be added.

Precisely.


> > TupleDelete_function() accepts changingPart as a parameter to indicate
> > whether this deletion is part of a movement from one partition to another.
> > Partitioning is a higher-level abstraction than storage.
> > Ideally, the storage layer should have no knowledge of partitioning. The
> > tuple delete API should not accept any parameter related to
> > partitioning.
> >
>
> Thanks for your point; I will look into how to extract it.

I don't think that's actually a problem. The changingPart parameter is
just a marker that the deletion is part of moving a tuple across
partitions. For heap and everything compatible, that's used to attach
information to the tuple so that concurrent modifications error out when
reaching such a tuple via EPQ.
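For reference, roughly what heap does with it (an abridged sketch based on
heap_delete(); ItemPointerSetMovedPartitions is the helper in current sources):

    /* Abridged sketch from heap_delete(): flag the old tuple version so
     * EPQ followers can tell the row moved to another partition and
     * raise an error, rather than updating a stale version. */
    if (changingPart)
        ItemPointerSetMovedPartitions(&tp.t_data->t_ctid);
    else
        tp.t_data->t_ctid = tp.t_self;   /* normal delete: point to self */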


> Andres, now that the TupleTableSlot changes are committed, do you want me to
> share the rebased pluggable storage patch? Or are you already working on it?

Working on it.


Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

Amit Langote-2
In reply to this post by Haribabu Kommi-2
Hi,

On 2018/11/02 9:17, Haribabu Kommi wrote:
> Here I have attached the cumulative fixes to the patches, the new API
> additions for zheap, and a basic outline of the documentation.

I've read the documentation patch while also looking at the code and here
are some comments.

+   Each table is stored as its own physical
<firstterm>relation</firstterm> and so
+   is described by an entry in the <structname>pg_class</structname> catalog.

I think the "so" in "and so is described by an entry in..." is not necessary.

+   The contents of an table are entirely under the control of its access
method.

"a" table

+   (All the access methods furthermore use the standard page layout
described in
+   <xref linkend="storage-page-layout"/>.)

Maybe write the two sentences above as:

A table's content is entirely controlled by its access method, although
all access methods use the same standard page layout described in <xref
linkend="storage-page-layout"/>.

+    SlotCallbacks_function slot_callbacks;
+
+    SnapshotSatisfies_function snapshot_satisfies;
+    SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate;
+    SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum;

Like other functions, how about a one sentence comment for these, like:

/*
 * Function to get an AM-specific set of functions for manipulating
 * TupleTableSlots
 */
SlotCallbacks_function slot_callbacks;

/* AM-specific snapshot visibility determination functions */
SnapshotSatisfies_function snapshot_satisfies;
SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate;
SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum;

+    TupleFetchFollow_function tuple_fetch_follow;
+
+    GetTupleData_function get_tuple_data;

How about removing the empty line so that get_tuple_data can be seen as
part of the group /* Operations on physical tuples */

+    RelationVacuum_function relation_vacuum;
+    RelationScanAnalyzeNextBlock_function scan_analyze_next_block;
+    RelationScanAnalyzeNextTuple_function scan_analyze_next_tuple;
+    RelationCopyForCluster_function relation_copy_for_cluster;
+    RelationSync_function relation_sync;

Add /* Operations to support VACUUM/ANALYZE */ as a description for this
group?

+    BitmapPagescan_function scan_bitmap_pagescan;
+    BitmapPagescanNext_function scan_bitmap_pagescan_next;

Add /* Operations to support bitmap scans */ as a description for this group?

+    SampleScanNextBlock_function scan_sample_next_block;
+    SampleScanNextTuple_function scan_sample_next_tuple;

Add /* Operations to support sampling scans */ as a description for this
group?

+    ScanEnd_function scan_end;
+    ScanRescan_function scan_rescan;
+    ScanUpdateSnapshot_function scan_update_snapshot;

Move these two to be in the /* Operations on relation scans */ group?

+    BeginIndexFetchTable_function begin_index_fetch;
+    EndIndexFetchTable_function reset_index_fetch;
+    EndIndexFetchTable_function end_index_fetch;

Add /* Operations to support index scans */ as a description for this group?

+    IndexBuildRangeScan_function index_build_range_scan;
+    IndexValidateScan_function index_validate_scan;

Add /* Operations to support index build */ as a description for this group?

+    CreateInitFork_function CreateInitFork;

Add /* Function to create an init fork for unlogged tables */?

By the way, I can see the following two in the source code, but not in the
documentation.

    EstimateRelSize_function EstimateRelSize;
    SetNewFileNode_function SetNewFileNode;


+   The table construction and maintenance functions that an table access
+   method must provide in <structname>TableAmRoutine</structname> are:

"a" table access method

+  <para>
+<programlisting>
+TupleTableSlotOps *
+slot_callbacks (Relation relation);
+</programlisting>
+   API to access the slot specific methods;
+   Following methods are available;
+   <structname>TTSOpsVirtual</structname>,
+   <structname>TTSOpsHeapTuple</structname>,
+   <structname>TTSOpsMinimalTuple</structname>,
+   <structname>TTSOpsBufferTuple</structname>,
+  </para>

Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
its relation to the TableAmRoutine abstraction, I think the text
description could better be written as:

"API to get the slot operations struct for a given table access method"

It's not clear to me why the various TTSOps* structs are listed here.  Is the
point that different AMs may choose one of the listed alternatives?  For
example, I see that the heap AM implementation returns TTSOpsBufferTuple, so it
manipulates slots containing buffered tuples, right?  Other AMs are free
to return any one of these?  For example, some AMs may never use the buffer
manager and hence not use TTSOpsBufferTuple.  Is that understanding correct?
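For reference, the heap AM's implementation appears to boil down to something
like the following sketch (using the names from the quoted patch):

    /* Sketch: heap stores its tuples in shared buffers, so it hands
     * out the slot callbacks for buffer-pinned tuples. */
    static TupleTableSlotOps *
    heapam_slot_callbacks(Relation relation)
    {
        return &TTSOpsBufferTuple;
    }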

+  <para>
+<programlisting>
+bool
+snapshot_satisfies (TupleTableSlot *slot, Snapshot snapshot);
+</programlisting>
+   API to check whether the provided slot is visible to the current
+   transaction according the snapshot.
+  </para>

Do you mean:

"API to check whether the tuple contained in the provided slot is
visible...."?

+  <para>
+<programlisting>
+Oid
+tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid,
+              int options, BulkInsertState bistate);
+</programlisting>
+   API to insert the tuple and provide the <literal>ItemPointerData</literal>
+   where the tuple is successfully inserted.
+  </para>

It's not clear from the signature where you get the ItemPointerData.
Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think
this should mention it a bit differently, like:

API to insert the tuple contained in the provided slot and return its TID,
that is, the location where the tuple is successfully inserted
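i.e., presumably something like this at the end of heapam_tuple_insert (a
sketch based on the observation above):

    /* Sketch: the new tuple's location is published through the slot,
     * not through the function's return value. */
    slot->tts_tid = heaptup->t_self;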

+   API to insert the tuple with a speculative token. This API is similar
+   like <literal>tuple_insert</literal>, with additional speculative
+   information.

How about:

This API is similar to <literal>tuple_insert</literal>, although with
additional information necessary for speculative insertion

+  <para>
+<programlisting>
+void
+tuple_complete_speculative (Relation rel,
+                          TupleTableSlot *slot,
+                          uint32 specToken,
+                          bool succeeded);
+</programlisting>
+   API to complete the state of the tuple inserted by the API
<literal>tuple_insert_speculative</literal>
+   with the successful completion of the index insert.
+  </para>

How about:

API to complete the speculative insertion of a tuple started by
<literal>tuple_insert_speculative</literal>, invoked after finishing the
index insert

+  <para>
+<programlisting>
+bool
+tuple_fetch_row_version (Relation relation,
+                       ItemPointer tid,
+                       Snapshot snapshot,
+                       TupleTableSlot *slot,
+                       Relation stats_relation);
+</programlisting>
+   API to fetch and store the Buffered Heap tuple in the provided slot
+   based on the ItemPointer.
+  </para>

It seems that this description is based on what heapam_fetch_row_version()
does, but it should be more generic, maybe like:

API to fetch a buffered tuple given its TID and store it in the provided slot

+  <para>
+<programlisting>
+HTSU_Result
+TupleLock_function (Relation relation,
+                   ItemPointer tid,
+                   Snapshot snapshot,
+                   TupleTableSlot *slot,
+                   CommandId cid,
+                   LockTupleMode mode,
+                   LockWaitPolicy wait_policy,
+                   uint8 flags,
+                   HeapUpdateFailureData *hufd);

I guess you meant to write "tuple_lock" here, not "TupleLock_function".

+</programlisting>
+   API to lock the specified the ItemPointer tuple and fetches the newest
version of
+   its tuple and TID.
+  </para>

How about:

API to lock the specified tuple and return the TID of its newest version

+  <para>
+<programlisting>
+void
+tuple_get_latest_tid (Relation relation,
+                    Snapshot snapshot,
+                    ItemPointer tid);
+</programlisting>
+   API to get the the latest TID of the tuple with the given itempointer.
+  </para>

How about:

API to get TID of the latest version of the specified tuple

+  <para>
+<programlisting>
+bool
+tuple_fetch_follow (struct IndexFetchTableData *scan,
+                  ItemPointer tid,
+                  Snapshot snapshot,
+                  TupleTableSlot *slot,
+                  bool *call_again, bool *all_dead);
+</programlisting>
+   API to get the all the tuples of the page that satisfies itempointer.
+  </para>

IIUC, "all the tuples of the page" in the above sentence means all the
tuples in the HOT chain of a given heap tuple, making this description of
the API slightly specific to the heap AM.  Can we make the description
more generic, or is the API itself so specific that it cannot be
expressed in generic terms?  Ignoring that for a moment, I think the
sentence contains more "the"s than there need to be, so maybe write it as:

API to get all tuples on a given page that are linked to the tuple of the
given TID

+  <para>
+<programlisting>
+tuple_data
+get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags);
+</programlisting>
+   API to return the internal structure members of the HeapTuple.
+  </para>

I think this description doesn't mention enough details of both the
information that needs to be specified when calling the function (what's
in flags) and the information that's returned.

+  <para>
+<programlisting>
+bool
+scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin,
+                      double *liverows, double *deadrows, TupleTableSlot
*slot));
+</programlisting>
+   API to analyze the block and fill the buffered heap tuple in the slot
and also
+   provide the live and dead rows.
+  </para>

How about:

API to get the next tuple from the block being scanned, which also updates
the number of live and dead rows encountered

+  <para>
+<programlisting>
+void
+relation_copy_for_cluster (Relation NewHeap, Relation OldHeap, Relation
OldIndex,
+                       bool use_sort,
+                       TransactionId OldestXmin, TransactionId FreezeXid,
MultiXactId MultiXactCutoff,
+                       double *num_tuples, double *tups_vacuumed, double
*tups_recently_dead);
+</programlisting>
+   API to copy one relation to another relation eith using the Index or
table scan.
+  </para>

Typo: eith -> either

But maybe, rewrite this as:

API to make a copy of the content of a relation, optionally sorted using
either the specified index or by sorting explicitly

+  <para>
+<programlisting>
+TableScanDesc
+scan_begin (Relation relation,
+            Snapshot snapshot,
+            int nkeys, ScanKey key,
+            ParallelTableScanDesc parallel_scan,
+            bool allow_strat,
+            bool allow_sync,
+            bool allow_pagemode,
+            bool is_bitmapscan,
+            bool is_samplescan,
+            bool temp_snap);
+</programlisting>
+   API to start the relation scan for the provided relation and returns the
+   <structname>TableScanDesc</structname> structure.
+  </para>

How about:

API to start a scan of a relation using specified options, which returns
the <structname>TableScanDesc</structname> structure to be used for
subsequent scan operations

+    <para>
+<programlisting>
+void
+scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber
numBlks);
+</programlisting>
+   API to fix the relation scan range limits.
+  </para>


How about:

API to set scan range endpoints

+    <para>
+<programlisting>
+bool
+scan_bitmap_pagescan (TableScanDesc scan,
+                    TBMIterateResult *tbmres);
+</programlisting>
+   API to scan the relation and fill the scan description bitmap with
valid item pointers
+   for the specified block.
+  </para>

This says "to scan the relation", but seems to be concerned with only a
page worth of data as the name also says.  Also, it's not clear what "scan
description bitmap" means.  Maybe write as:

API to scan the relation block specified in the scan descriptor to collect
and return the tuples requested by the given bitmap

+    <para>
+<programlisting>
+bool
+scan_bitmap_pagescan_next (TableScanDesc scan,
+                        TupleTableSlot *slot);
+</programlisting>
+   API to fill the buffered heap tuple data from the bitmap scanned item
pointers and store
+   it in the provided slot.
+  </para>

How about:

API to select the next tuple from the set of tuples of a given page
specified in the scan descriptor and return in the provided slot; returns
false if no more tuples to return on the given page

+    <para>
+<programlisting>
+bool
+scan_sample_next_block (TableScanDesc scan, struct SampleScanState
*scanstate);
+</programlisting>
+   API to scan the relation and fill the scan description bitmap with
valid item pointers
+   for the specified block provided by the sample method.
+  </para>

Looking at the code, this API selects the next block using the sampling
method and nothing more, although I see that the heap AM implementation
also does heapgetpage thus collecting live tuples in the array known only
to heap AM.  So, how about:

API to select the next block of the relation using the given sampling
method and set its information in the scan descriptor

+    <para>
+<programlisting>
+bool
+scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState
*scanstate, TupleTableSlot *slot);
+</programlisting>
+   API to fill the buffered heap tuple data from the bitmap scanned item
pointers based on the sample
+   method and store it in the provided slot.
+  </para>

How about:

API to select the next tuple using the given sampling method from the set
of tuples collected from the block previously selected by the sampling method

+    <para>
+<programlisting>
+void
+scan_rescan (TableScanDesc scan, ScanKey key, bool set_params,
+             bool allow_strat, bool allow_sync, bool allow_pagemode);
+</programlisting>
+   API to restart the relation scan with provided data.
+  </para>

How about:

API to restart the given scan using provided options, releasing any
resources (such as buffer pins) already held by the scan

+  <para>
+<programlisting>
+void
+scan_update_snapshot (TableScanDesc scan, Snapshot snapshot);
+</programlisting>
+   API to update the relation scan with the new snapshot.
+  </para>

How about:

API to set the visibility snapshot to be used by a given scan

+  <para>
+<programlisting>
+IndexFetchTableData *
+begin_index_fetch (Relation relation);
+</programlisting>
+   API to prepare the <structname>IndexFetchTableData</structname> for
the relation.
+  </para>

This API is a bit vague.  As in, it's not clear from the name when it's to
be called and what's to be done with the returned struct.  How about at
least adding more details about what the returned struct is for, like:

API to get the <structname>IndexFetchTableData</structname> to be assigned
to an index scan on the specified relation

+  <para>
+<programlisting>
+void
+reset_index_fetch (struct IndexFetchTableData* data);
+</programlisting>
+   API to reset the prepared internal members of the
<structname>IndexFetchTableData</structname>.
+  </para>

This description seems wrong if I look at the code.  Its purpose seems to
be to reset the AM-specific members, such as releasing the buffer pin held
in xs_cbuf in the heap AM's case.

How about:

API to release AM-specific resources held by the
<structname>IndexFetchTableData</structname> of a given index scan

+  <para>
+<programlisting>
+void
+end_index_fetch (struct IndexFetchTableData* data);
+</programlisting>
+   API to clear and free the <structname>IndexFetchTableData</structname>.
+  </para>

Given above, how about:

API to release AM-specific resources held by the
<structname>IndexFetchTableData</structname> of a given index scan and
free the memory of <structname>IndexFetchTableData</structname> itself

+    <para>
+<programlisting>
+double
+index_build_range_scan (Relation heapRelation,
+                       Relation indexRelation,
+                       IndexInfo *indexInfo,
+                       bool allow_sync,
+                       bool anyvisible,
+                       BlockNumber start_blockno,
+                       BlockNumber end_blockno,
+                       IndexBuildCallback callback,
+                       void *callback_state,
+                       TableScanDesc scan);
+</programlisting>
+   API to perform the table scan with bounded range specified by the caller
+   and insert the satisfied records into the index using the provided
callback
+   function pointer.
+  </para>

This is a heavyweight API, and the above description lacks some details.
Also, isn't it a bit misleading to use the name end_blockno if it is
interpreted as num_blocks by the internal APIs?

How about:

API to scan the specified blocks of the given table and insert them into
the specified index using the provided callback function

+    <para>
+<programlisting>
+void
+index_validate_scan (Relation heapRelation,
+                   Relation indexRelation,
+                   IndexInfo *indexInfo,
+                   Snapshot snapshot,
+                   struct ValidateIndexState *state);
+</programlisting>
+   API to perform the table scan and insert the satisfied records into
the index.
+   This API is similar like <function>index_build_range_scan</function>.
This
+   is used in the scenario of concurrent index build.
+  </para>

This one's a complicated API too.  How about:

API to scan the table according to the given snapshot and insert tuples
satisfying the snapshot into the specified index, provided their TIDs are
also present in the <structname>ValidateIndexState</structname> struct;
this API is used as the last phase of a concurrent index build

+ <sect2>
+  <title>Table scanning</title>
+
+  <para>
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>Table insert/update/delete</title>
+
+  <para>
+  </para>
+  </sect2>
+
+ <sect2>
+  <title>Table locking</title>
+
+  <para>
+  </para>
+  </sect2>
+
+ <sect2>
+  <title>Table vacuum</title>
+
+  <para>
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>Table fetch</title>
+
+  <para>
+  </para>
+ </sect2>


Seems like you forgot to put the individual API descriptions under these
sub-headers.  Actually, I think it'd be better to try to format this page
to look more like the following:

https://www.postgresql.org/docs/devel/fdw-callbacks.html


-   Currently, only indexes have access methods.  The requirements for index
-   access methods are discussed in detail in <xref linkend="indexam"/>.
+   Currently, only <literal>INDEX</literal> and <literal>TABLE</literal> have
+   access methods.  The requirements for access methods are discussed in
detail
+   in <xref linkend="am"/>.

Hmm, I don't see why you decided to add literal tags to INDEX and TABLE.
Couldn't this have been written as:

Currently, only tables and indexes have access methods.  The requirements
for access methods are discussed in detail in <xref linkend="am"/>.

+        This variable specifies the default table access method using
which to create
+        objects (tables and materialized views) when a
<command>CREATE</command> command does
+        not explicitly specify a access method.

"variable" is not wrong, but "parameter" is used more often for GUCs.  "a
access method" should be "an access method".

Maybe you could write this as:

This variable specifies the default table access method to use when
creating tables or materialized views if the <command>CREATE</command>
does not explicitly specify an access method.

+        If the value does not match the name of any existing table access
methods,
+        <productname>PostgreSQL</productname> will automatically use the
default
+        table access method of the current database.

any existing table methods -> any existing table method

Although, shouldn't that cause an error instead of ignoring it and using
the database default access method?


Thank you for working on this.  Really looking forward to how this shapes
up. :)

Thanks,
Amit



Re: Pluggable Storage - Andres's take

Dmitry Dolgov
> On Fri, Nov 16, 2018 at 2:05 AM Haribabu Kommi <[hidden email]> wrote:
>
> I tried running the pgbench performance tests with a minimal number of
> clients on my laptop and didn't find any performance issues; maybe the issue
> is visible only with higher client counts. Even with the perf tool, I am not
> able to identify a clear problem function. As you said, combining all the
> changes may lead to some overhead.

Just out of curiosity I've also tried TPC-C from oltpbench (in the very same
simple environment); it doesn't show any significant difference from master
either.

> Here I have attached the cumulative patches with further fixes and also basic syntax regression tests.

While testing the latest version I've noticed that you didn't include the fix
for HeapTupleInvisible (so I see the error again); was that intentional?

> On Tue, Nov 27, 2018 at 2:55 AM Andres Freund <[hidden email]> wrote:
>
> FWIW, now that oids are removed, and the tuple table slot abstraction
> got in, I'm working on rebasing the pluggable storage patchset ontop of
> that.

Yes, please. I've tried it myself for reviewing purposes, but the rebasing
speed was not impressive. Also, I want to suggest moving it off GitHub and
making it a regular patchset, since it's already a bit confusing as to what
goes where and which patch to apply on top of which branch.


Re: Pluggable Storage - Andres's take

Andres Freund
In reply to this post by Haribabu Kommi-2
Hi,

Thanks for these changes. I've merged a good chunk of them.

On 2018-11-16 12:05:26 +1100, Haribabu Kommi wrote:

> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index c3960dc91f..3254e30a45 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1741,7 +1741,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>  {
>   HeapScanDesc scan = (HeapScanDesc) sscan;
>   Page targpage;
> - OffsetNumber targoffset = scan->rs_cindex;
> + OffsetNumber targoffset;
>   OffsetNumber maxoffset;
>   BufferHeapTupleTableSlot *hslot;
>  
> @@ -1751,7 +1751,9 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>   maxoffset = PageGetMaxOffsetNumber(targpage);
>  
>   /* Inner loop over all tuples on the selected page */
> - for (targoffset = scan->rs_cindex; targoffset <= maxoffset; targoffset++)
> + for (targoffset = scan->rs_cindex ? scan->rs_cindex : FirstOffsetNumber;
> + targoffset <= maxoffset;
> + targoffset++)
>   {
>   ItemId itemid;
>   HeapTuple targtuple = &hslot->base.tupdata;

I thought it was better to fix the initialization of rs_cindex - any
reason you didn't go for that?
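I.e. roughly (a sketch, not the committed change):

    /* Sketch: set the per-page tuple index once, when a new block is
     * selected for ANALYZE, so readers don't need to special-case 0. */
    scan->rs_cindex = FirstOffsetNumber;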


> diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
> index 8233475aa0..7bad246f55 100644
> --- a/src/backend/access/heap/heapam_visibility.c
> +++ b/src/backend/access/heap/heapam_visibility.c
> @@ -1838,8 +1838,10 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer)
>   case NON_VACUUMABLE_VISIBILTY:
>   return HeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer);
>   break;
> - default:
> + case END_OF_VISIBILITY:
>   Assert(0);
>   break;
>   }
> +
> + return false; /* keep compiler quiet */

I don't understand why END_OF_VISIBILITY is a good idea.  I've now removed
END_OF_VISIBILITY and the default case.



> @@ -593,6 +594,10 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
>   if (myState->rel->rd_rel->relhasoids)
>   slot->tts_tupleOid = InvalidOid;
>  
> + /* Materialize the slot */
> + if (!TTS_IS_VIRTUAL(slot))
> + ExecMaterializeSlot(slot);
> +
>   table_insert(myState->rel,
>   slot,
>   myState->output_cid,

What's the point of adding materialization here?

> @@ -570,6 +563,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
>   Assert(TTS_IS_HEAPTUPLE(scanslot) ||
>     TTS_IS_BUFFERTUPLE(scanslot));
>  
> + if (hslot->tuple == NULL)
> + ExecMaterializeSlot(scanslot);
> +
>   d = heap_getsysattr(hslot->tuple, attnum,
>   scanslot->tts_tupleDescriptor,
>   op->resnull);

Same?


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index e055c0a7c6..34ef86a5bd 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -2594,7 +2594,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
>   * datums that may be present in copyTuple).  As with the next step, this
>   * is to guard against early re-use of the EPQ query.
>   */
> - if (!TupIsNull(slot))
> + if (!TupIsNull(slot) && !TTS_IS_VIRTUAL(slot))
>   ExecMaterializeSlot(slot);


Same?


>  #if FIXME
> @@ -2787,16 +2787,7 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate)
>   if (isNull)
>   continue;
>  
> - elog(ERROR, "frak, need to implement ROW_MARK_COPY");
> -#ifdef FIXME
> - // FIXME: this should just deform the tuple and store it as a
> - // virtual one.
> - tuple = table_tuple_by_datum(erm->relation, datum, erm->relid);
> -
> - /* store tuple */
> - EvalPlanQualSetTuple(epqstate, erm->rti, tuple);
> -#endif
> -
> + ExecForceStoreHeapTupleDatum(datum, slot);
>   }
>   }
>  }

Cool.


> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 56880e3d16..36ca07beb2 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -224,6 +224,18 @@ BitmapHeapNext(BitmapHeapScanState *node)
>  
>   BitmapAdjustPrefetchIterator(node, tbmres);
>  
> + /*
> + * Ignore any claimed entries past what we think is the end of the
> + * relation.  (This is probably not necessary given that we got at
> + * least AccessShareLock on the table before performing any of the
> + * indexscans, but let's be safe.)
> + */
> + if (tbmres->blockno >= scan->rs_nblocks)
> + {
> + node->tbmres = tbmres = NULL;
> + continue;
> + }
> +

I moved this into the storage engine; there was just a minor bug
preventing the already existing check from taking effect. I don't think
we should expose this kind of thing outside of the storage
engine.


> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 54382aba88..ea48e1d6e8 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -4037,7 +4037,6 @@ CreateStatsStmt:
>   *
>   *****************************************************************************/
>  
> -// PBORKED: storage option
>  CreateAsStmt:
>   CREATE OptTemp TABLE create_as_target AS SelectStmt opt_with_data
>   {
> @@ -4068,14 +4067,16 @@ CreateAsStmt:
>   ;
>  
>  create_as_target:
> - qualified_name opt_column_list OptWith OnCommitOption OptTableSpace
> + qualified_name opt_column_list table_access_method_clause
> + OptWith OnCommitOption OptTableSpace
>   {
>   $$ = makeNode(IntoClause);
>   $$->rel = $1;
>   $$->colNames = $2;
> - $$->options = $3;
> - $$->onCommit = $4;
> - $$->tableSpaceName = $5;
> + $$->accessMethod = $3;
> + $$->options = $4;
> + $$->onCommit = $5;
> + $$->tableSpaceName = $6;
>   $$->viewQuery = NULL;
>   $$->skipData = false; /* might get changed later */
>   }
> @@ -4125,14 +4126,15 @@ CreateMatViewStmt:
>   ;
>  
>  create_mv_target:
> - qualified_name opt_column_list opt_reloptions OptTableSpace
> + qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
>   {
>   $$ = makeNode(IntoClause);
>   $$->rel = $1;
>   $$->colNames = $2;
> - $$->options = $3;
> + $$->accessMethod = $3;
> + $$->options = $4;
>   $$->onCommit = ONCOMMIT_NOOP;
> - $$->tableSpaceName = $4;
> + $$->tableSpaceName = $5;
>   $$->viewQuery = NULL; /* filled at analysis time */
>   $$->skipData = false; /* might get changed later */
>   }

Cool. I wonder if we should also somehow support SELECT INTO with USING?
You've apparently started to do so?


> diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
> index 47dd885c4e..a4094ca3f1 100644
> --- a/src/test/regress/expected/create_am.out
> +++ b/src/test/regress/expected/create_am.out
> @@ -99,3 +99,81 @@ HINT:  Use DROP ... CASCADE to drop the dependent objects too.
>  -- Drop access method cascade
>  DROP ACCESS METHOD gist2 CASCADE;
>  NOTICE:  drop cascades to index grect2ind2
> +-- Create a heap2 table am handler with heapam handler
> +CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
> +SELECT * FROM pg_am where amtype = 't';
> + amname |      amhandler       | amtype
> +--------+----------------------+--------
> + heap   | heap_tableam_handler | t
> + heap2  | heap_tableam_handler | t
> +(2 rows)
> +
> +CREATE TABLE tbl_heap2(f1 int, f2 char(100)) using heap2;
> +INSERT INTO tbl_heap2 VALUES(generate_series(1,10), 'Test series');
> +SELECT count(*) FROM tbl_heap2;
> + count
> +-------
> +    10
> +(1 row)
> +
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'tbl_heap2';
> +  relname  | relkind | amname
> +-----------+---------+--------
> + tbl_heap2 | r       | heap2
> +(1 row)
> +
> +-- create table as using heap2
> +CREATE TABLE tblas_heap2 using heap2 AS select * from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'tblas_heap2';
> +   relname   | relkind | amname
> +-------------+---------+--------
> + tblas_heap2 | r       | heap2
> +(1 row)
> +
> +--
> +-- select into doesn't support new syntax, so it should be
> +-- default access method.
> +--
> +SELECT INTO tblselectinto_heap from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'tblselectinto_heap';
> +      relname       | relkind | amname
> +--------------------+---------+--------
> + tblselectinto_heap | r       | heap
> +(1 row)
> +
> +DROP TABLE tblselectinto_heap;
> +-- create materialized view using heap2
> +CREATE MATERIALIZED VIEW mv_heap2 USING heap2 AS
> + SELECT * FROM tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'mv_heap2';
> + relname  | relkind | amname
> +----------+---------+--------
> + mv_heap2 | m       | heap2
> +(1 row)
> +
> +-- Try creating the unsupported relation kinds with using syntax
> +CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2;
> +ERROR:  syntax error at or near "USING"
> +LINE 1: CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2...
> +                              ^
> +CREATE SEQUENCE test_seq USING heap2;
> +ERROR:  syntax error at or near "USING"
> +LINE 1: CREATE SEQUENCE test_seq USING heap2;
> +                                 ^
> +-- Drop table access method, but fails as objects depends on it
> +DROP ACCESS METHOD heap2;
> +ERROR:  cannot drop access method heap2 because other objects depend on it
> +DETAIL:  table tbl_heap2 depends on access method heap2
> +table tblas_heap2 depends on access method heap2
> +materialized view mv_heap2 depends on access method heap2
> +HINT:  Use DROP ... CASCADE to drop the dependent objects too.
> +-- Drop table access method with cascade
> +DROP ACCESS METHOD heap2 CASCADE;
> +NOTICE:  drop cascades to 3 other objects
> +DETAIL:  drop cascades to table tbl_heap2
> +drop cascades to table tblas_heap2
> +drop cascades to materialized view mv_heap2
> diff --git a/src/test/regress/sql/create_am.sql b/src/test/regress/sql/create_am.sql
> index 3e0ac104f3..0472a60f20 100644
> --- a/src/test/regress/sql/create_am.sql
> +++ b/src/test/regress/sql/create_am.sql
> @@ -66,3 +66,49 @@ DROP ACCESS METHOD gist2;
>  
>  -- Drop access method cascade
>  DROP ACCESS METHOD gist2 CASCADE;
> +
> +-- Create a heap2 table am handler with heapam handler
> +CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
> +
> +SELECT * FROM pg_am where amtype = 't';
> +
> +CREATE TABLE tbl_heap2(f1 int, f2 char(100)) using heap2;
> +INSERT INTO tbl_heap2 VALUES(generate_series(1,10), 'Test series');
> +SELECT count(*) FROM tbl_heap2;
> +
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'tbl_heap2';
> +
> +-- create table as using heap2
> +CREATE TABLE tblas_heap2 using heap2 AS select * from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'tblas_heap2';
> +
> +--
> +-- select into doesn't support new syntax, so it should be
> +-- default access method.
> +--
> +SELECT INTO tblselectinto_heap from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'tblselectinto_heap';
> +
> +DROP TABLE tblselectinto_heap;
> +
> +-- create materialized view using heap2
> +CREATE MATERIALIZED VIEW mv_heap2 USING heap2 AS
> + SELECT * FROM tbl_heap2;
> +
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> + where a.oid = r.relam AND r.relname = 'mv_heap2';
> +
> +-- Try creating the unsupported relation kinds with using syntax
> +CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2;
> +
> +CREATE SEQUENCE test_seq USING heap2;
> +
> +
> +-- Drop table access method, but fails as objects depends on it
> +DROP ACCESS METHOD heap2;
> +
> +-- Drop table access method with cascade
> +DROP ACCESS METHOD heap2 CASCADE;
> --
> 2.18.0.windows.1

Nice!

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> FWIW, now that oids are removed, and the tuple table slot abstraction
> got in, I'm working on rebasing the pluggable storage patchset ontop of
> that.

I've pushed a version of that to the git tree, including a rebased
version of zheap:
https://github.com/anarazel/postgres-pluggable-storage
https://github.com/anarazel/postgres-pluggable-zheap

I'm still working on moving some of the out-of-access/zheap
modifications into pluggable storage (see e.g. the first commit of the
pluggable-zheap series). But this should allow others to start from a more
recent codebase.

My next steps are:
- make relation creation properly pluggable
- remove the typedefs from tableam.h, instead move them into the
  TableAmRoutine struct.
- Move rs_{nblocks, startblock, numblocks} out of TableScanDescData
- Move HeapScanDesc and IndexFetchHeapData out of relscan.h
- See if the slot in SysScanDescData can be avoided, it's not exactly
  free of overhead.
- remove ExecSlotCompare(), it's entirely unrelated to these changes imo
  (and in the wrong place)
- rename HeapUpdateFailureData et al to not reference Heap
- split pluggable storage patchset, to commit earlier:
  - EvalPlanQual slotification
  - trigger slotification
  - split of IndexBuildHeapScan out of index.c

I'm wondering whether we should add
table_beginscan/table_getnextslot/index_getnext_slot using the old API
in an earlier commit that could then be committed separately, allowing
the tablecmds.c changes to be committed soon.

I'm wondering whether we should change the table_beginscan* API so it
provides a slot - pretty much every caller has to create one, and it seems
just as easy to create/dispose of it via table_beginscan/endscan.
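I.e. callers could then look roughly like this (hypothetical shape, with
table_scan_create_slot and process() as made-up names for illustration):

    /* Hypothetical caller pattern if the scan owned its slot: */
    TableScanDesc scan = table_beginscan(rel, snapshot, nkeys, key);
    TupleTableSlot *slot = table_scan_create_slot(scan);  /* hypothetical */

    while (table_getnextslot(scan, ForwardScanDirection, slot))
        process(slot);      /* caller-supplied per-tuple work */

    table_endscan(scan);    /* would also dispose of the slot */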

Further tasks I'm not yet planning to tackle, that I'd welcome help on:
- pg_dump support
- pg_upgrade testing
- I think we should consider removing HeapTuple->t_tableOid, it should
  imo live entirely in the slot

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

Dmitry Dolgov
> On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <[hidden email]> wrote:
>
> Hi,
>
> On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > FWIW, now that oids are removed, and the tuple table slot abstraction
> > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > that.
>
> I've pushed a version to that to the git tree, including a rebased
> version of zheap:
> https://github.com/anarazel/postgres-pluggable-storage
> https://github.com/anarazel/postgres-pluggable-zheap

Great, thanks!

As a side note, I assume the last reference should be this, right?

https://github.com/anarazel/postgres-pluggable-storage/tree/pluggable-zheap

> Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> - pg_dump support
> - pg_upgrade testing
> - I think we should consider removing HeapTuple->t_tableOid, it should
>   imo live entirely in the slot

I would love to try help with pg_dump support.


Re: Pluggable Storage - Andres's take

Kyotaro HORIGUCHI-2
In reply to this post by Amit Langote-2
Hello.

At Tue, 27 Nov 2018 14:58:35 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>

> +  <para>
> +<programlisting>
> +TupleTableSlotOps *
> +slot_callbacks (Relation relation);
> +</programlisting>
> +   API to access the slot specific methods;
> +   Following methods are available;
> +   <structname>TTSOpsVirtual</structname>,
> +   <structname>TTSOpsHeapTuple</structname>,
> +   <structname>TTSOpsMinimalTuple</structname>,
> +   <structname>TTSOpsBufferTuple</structname>,
> +  </para>
>
> Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
> its relation to the TableAmRoutine abstraction, I think the text
> description could better be written as:
>
> "API to get the slot operations struct for a given table access method"
>
> It's not clear to me why various TTSOps* structs are listed here?  Is the
> point that different AMs may choose one of the listed alternatives?  For
> example, I see that heap AM implementation returns TTOpsBufferTuple, so it
> manipulates slots containing buffered tuples, right?  Other AMs are free
> to return any one of these?  For example, some AMs may never use buffer
> manager and hence not use TTOpsBufferTuple.  Is that understanding correct?

Yeah, I'm not sure why it should be a function rather than a pointer to
the struct itself. And the four structs don't seem relevant to table
AMs; perhaps clear, getsomeattrs and so on should be listed instead.
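
For reference, the heap AM side of this is essentially a one-liner
(a sketch, using the struct names listed in the docs above):

    /* heap keeps tuples in shared buffers, hence the buffer-tuple ops */
    static TupleTableSlotOps *
    heapam_slot_callbacks(Relation relation)
    {
        return &TTSOpsBufferTuple;
    }

An AM that never touches the buffer manager could presumably return
&TTSOpsVirtual or &TTSOpsMinimalTuple instead, which would be the one
argument for routing this through a callback.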

> +  <para>
> +<programlisting>
> +Oid
> +tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid,
> +              int options, BulkInsertState bistate);
> +</programlisting>
> +   API to insert the tuple and provide the <literal>ItemPointerData</literal>
> +   where the tuple is successfully inserted.
> +  </para>
>
> It's not clear from the signature where you get the ItemPointerData.
> Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think
> this should mention it a bit differently, like:
>
> API to insert the tuple contained in the provided slot and return its TID,
> that is, the location where the tuple is successfully inserted

It actually returns an OID, not a TID, in the current code; the TID is
handled internally.
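
Roughly, as I read the heapam implementation (a paraphrase for
illustration, not the actual code):

    static Oid
    heapam_tuple_insert(Relation relation, TupleTableSlot *slot,
                        CommandId cid, int options, BulkInsertState bistate)
    {
        bool        shouldFree;
        HeapTuple   tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);

        heap_insert(relation, tuple, cid, options, bistate);

        /* the location goes back to the caller through the slot ... */
        slot->tts_tid = tuple->t_self;

        if (shouldFree)
            pfree(tuple);

        /* ... while only the OID is returned directly (elided here) */
        return InvalidOid;
    }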

> +  <para>
> +<programlisting>
> +bool
> +tuple_fetch_follow (struct IndexFetchTableData *scan,
> +                  ItemPointer tid,
> +                  Snapshot snapshot,
> +                  TupleTableSlot *slot,
> +                  bool *call_again, bool *all_dead);
> +</programlisting>
> +   API to get the all the tuples of the page that satisfies itempointer.
> +  </para>
>
> IIUC, "all the tuples of of the page" in the above sentence means all the
> tuples in the HOT chain of a given heap tuple, making this description of
> the API slightly specific to the heap AM.  Can we make the description
> more generic or is the API itself very specific that it cannot be
> expressed in generic terms?  Ignoring that for a moment, I think the
> sentence contains more "the"s than there need to be, so maybe write as:
>
> API to get all tuples on a given page that are linked to the tuple of the
> given TID

Mmm. This is exposing MVCC matters to the index AM. I suppose we
should refactor this API.

> +  <para>
> +<programlisting>
> +tuple_data
> +get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags);
> +</programlisting>
> +   API to return the internal structure members of the HeapTuple.
> +  </para>
>
> I think this description doesn't mention enough details of both the
> information that needs to be specified when calling the function (what's
> in flags) and the information that's returned.

(I suppose it will be described in later sections.)

> +  <para>
> +<programlisting>
> +bool
> +scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin,
> +                      double *liverows, double *deadrows, TupleTableSlot
> *slot));
> +</programlisting>
> +   API to analyze the block and fill the buffered heap tuple in the slot
> and also
> +   provide the live and dead rows.
> +  </para>
>
> How about:
>
> API to get the next tuple from the block being scanned, which also updates
> the number of live and dead rows encountered

"live" and "dead" are MVCC terms. I suppose that we should stash
out the deadrows somwhere else. (But analyze code would need to
be modified if we do so.)

> +void
> +scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber
> numBlks);
> +</programlisting>
> +   API to fix the relation scan range limits.
> +  </para>
>
>
> How about:
>
> API to set scan range endpoints

This sets start point and the number of blocks.. Just "API to set
scan range" would be sifficient reffering to the parameter list.

> +    <para>
> +<programlisting>
> +bool
> +scan_bitmap_pagescan (TableScanDesc scan,
> +                    TBMIterateResult *tbmres);
> +</programlisting>
> +   API to scan the relation and fill the scan description bitmap with
> valid item pointers
> +   for the specified block.
> +  </para>
>
> This says "to scan the relation", but seems to be concerned with only a
> page worth of data as the name also says.  Also, it's not clear what "scan
> description bitmap" means.  Maybe write as:
>
> API to scan the relation block specified in the scan descriptor to collect
> and return the tuples requested by the given bitmap

"API to collect the tuples in a page requested by the given
bitmpap scan result." something? I think detailed explanation
would be required apart from the one-line description. Anyway the
name TBMIterateResult doesn't seem proper to expose.

> +    <para>
> +<programlisting>
> +bool
> +scan_sample_next_block (TableScanDesc scan, struct SampleScanState
> *scanstate);
> +</programlisting>
> +   API to scan the relation and fill the scan description bitmap with
> valid item pointers
> +   for the specified block provided by the sample method.
> +  </para>
>
> Looking at the code, this API selects the next block using the sampling
> method and nothing more, although I see that the heap AM implementation
> also does heapgetpage thus collecting live tuples in the array known only
> to heap AM.  So, how about:
>
> API to select the next block of the relation using the given sampling
> method and set its information in the scan descriptor

"block" and "page" seems randomly choosed here and there. I don't
mind that seen in the core but..

> +    <para>
> +<programlisting>
> +bool
> +scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState
> *scanstate, TupleTableSlot *slot);
> +</programlisting>
> +   API to fill the buffered heap tuple data from the bitmap scanned item
> pointers based on the sample
> +   method and store it in the provided slot.
> +  </para>
>
> How about:
>
> API to select the next tuple using the given sampling method from the set
> of tuples collected from the block previously selected by the sampling method

I'm not sure "from the set of tuples collected" is true. Just
"the state of sample scan" or something wouldn't be fine?

> +    <para>
> +<programlisting>
> +void
> +scan_rescan (TableScanDesc scan, ScanKey key, bool set_params,
> +             bool allow_strat, bool allow_sync, bool allow_pagemode);
> +</programlisting>
> +   API to restart the relation scan with provided data.
> +  </para>
>
> How about:
>
> API to restart the given scan using provided options, releasing any
> resources (such as buffer pins) already held by the scan

It looks too-detailed to me, but "with provided data" looks too
coarse..

> +  <para>
> +<programlisting>
> +void
> +scan_update_snapshot (TableScanDesc scan, Snapshot snapshot);
> +</programlisting>
> +   API to update the relation scan with the new snapshot.
> +  </para>
>
> How about:
>
> API to set the visibility snapshot to be used by a given scan

If so, the function name should be "scan_set_snapshot". Anyway
the name look like "the function to update a snapshot (itself)".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Pluggable Storage - Andres's take

Kyotaro HORIGUCHI-2
In reply to this post by Haribabu Kommi-2
Hello.

(in the next branch:)
At Tue, 27 Nov 2018 14:58:35 +0900, Amit Langote <[hidden email]> wrote in <[hidden email]>
> Thank you for working on this.  Really looking forward to how this shapes
> up. :)

+1.

I looked through the documentation part, as that's an area where I can
help.

am.html:
> 61.1. Overview of Index access methods
>  61.1.1. Basic API Structure for Indexes
>  61.1.2. Index Access Method Functions
>  61.1.3. Index Scanning
> 61.2. Overview of Table access methods
>  61.2.1. Table access method API
>  61.2.2. Table Access Method Functions
>  61.2.3. Table scanning

Aren't 61.1 and 61.2 better in the reverse order?

Is there a reason for the difference in the titles of 61.1.1 and
61.2.1? The contents are quite similar.


+ <sect2 id="table-api">
+  <title>Table access method API</title>

The member names of index AM struct begins with "am" but they
don't have an unified prefix in table AM. It seems a bit
incosistent.  Perhaps we might should rename some long and
internal names..


+ <sect2 id="table-functions">
+  <title>Table Access Method Functions</title>

Table AM functions are far finer-grained than the index AM ones. I
think AM developers need a more concrete description of what every API
function does, and an explanation of the various previously-internal
structs.

I suppose that how the functions are used in core code paths will be
described in the following sections.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Pluggable Storage - Andres's take

Haribabu Kommi-2
In reply to this post by Andres Freund

On Tue, Dec 11, 2018 at 12:47 PM Andres Freund <[hidden email]> wrote:
Hi,

Thanks for these changes. I've merged a good chunk of them.

Thanks.
 
On 2018-11-16 12:05:26 +1100, Haribabu Kommi wrote:
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index c3960dc91f..3254e30a45 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1741,7 +1741,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>  {
>       HeapScanDesc scan = (HeapScanDesc) sscan;
>       Page            targpage;
> -     OffsetNumber targoffset = scan->rs_cindex;
> +     OffsetNumber targoffset;
>       OffsetNumber maxoffset;
>       BufferHeapTupleTableSlot *hslot;

> @@ -1751,7 +1751,9 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>       maxoffset = PageGetMaxOffsetNumber(targpage);

>       /* Inner loop over all tuples on the selected page */
> -     for (targoffset = scan->rs_cindex; targoffset <= maxoffset; targoffset++)
> +     for (targoffset = scan->rs_cindex ? scan->rs_cindex : FirstOffsetNumber;
> +                     targoffset <= maxoffset;
> +                     targoffset++)
>       {
>               ItemId          itemid;
>               HeapTuple       targtuple = &hslot->base.tupdata;

I thought it was better to fix the initialization for rs_cindex - any
reason you didn't go for that?
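(For the record, the alternative fix amounts to something like this
sketch - reset the cursor when the next block is picked, presumably in
the scan_analyze_next_block step, so the tuple loop can use it
unconditionally:

    scan->rs_cindex = FirstOffsetNumber;

)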

No specific reason. Thanks for the correction.
 

> diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
> index 8233475aa0..7bad246f55 100644
> --- a/src/backend/access/heap/heapam_visibility.c
> +++ b/src/backend/access/heap/heapam_visibility.c
> @@ -1838,8 +1838,10 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer)
>               case NON_VACUUMABLE_VISIBILTY:
>                       return HeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer);
>                       break;
> -             default:
> +             case END_OF_VISIBILITY:
>                       Assert(0);
>                       break;
>       }
> +
> +     return false; /* keep compiler quiet */

I don't understand why END_OF_VISIBILITY is a good idea.  I've now
removed END_OF_VISIBILITY, and the default case.
 
OK.


> @@ -593,6 +594,10 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
>       if (myState->rel->rd_rel->relhasoids)
>               slot->tts_tupleOid = InvalidOid;

> +     /* Materialize the slot */
> +     if (!TTS_IS_VIRTUAL(slot))
> +             ExecMaterializeSlot(slot);
> +
>       table_insert(myState->rel,
>                                slot,
>                                myState->output_cid,

What's the point of adding materialization here?

In earlier testing I observed that the slot that is received is a
buffered slot pointing to the original tuple; when it is inserted into
the new table, the transaction id changes, which makes the original
tuple invisible. That is why I added the materialization here.


 
> @@ -570,6 +563,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
>                               Assert(TTS_IS_HEAPTUPLE(scanslot) ||
>                                          TTS_IS_BUFFERTUPLE(scanslot));

> +                             if (hslot->tuple == NULL)
> +                                     ExecMaterializeSlot(scanslot);
> +
>                               d = heap_getsysattr(hslot->tuple, attnum,
>                                                                       scanslot->tts_tupleDescriptor,
>                                                                       op->resnull);

Same?


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index e055c0a7c6..34ef86a5bd 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -2594,7 +2594,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
>        * datums that may be present in copyTuple).  As with the next step, this
>        * is to guard against early re-use of the EPQ query.
>        */
> -     if (!TupIsNull(slot))
> +     if (!TupIsNull(slot) && !TTS_IS_VIRTUAL(slot))
>               ExecMaterializeSlot(slot);


Same?

Earlier, materializing a virtual tuple was throwing an error; that's why
I added that check.
 
> index 56880e3d16..36ca07beb2 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -224,6 +224,18 @@ BitmapHeapNext(BitmapHeapScanState *node)

>                       BitmapAdjustPrefetchIterator(node, tbmres);

> +                     /*
> +                      * Ignore any claimed entries past what we think is the end of the
> +                      * relation.  (This is probably not necessary given that we got at
> +                      * least AccessShareLock on the table before performing any of the
> +                      * indexscans, but let's be safe.)
> +                      */
> +                     if (tbmres->blockno >= scan->rs_nblocks)
> +                     {
> +                             node->tbmres = tbmres = NULL;
> +                             continue;
> +                     }
> +

I moved this into the storage engine; there was just a minor bug
preventing the already-existing check from taking effect. I don't think
we should expose this kind of thing outside of the storage
engine.
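
(Roughly, the equivalent check now lives inside the AM's bitmap scan
callback - a sketch, exact placement per the branch:

    /* near the top of heapam's scan_bitmap_pagescan (sketch) */
    if (tbmres->blockno >= scan->rs_nblocks)
        return false;    /* claimed entry past the end of the relation */

so callers such as nodeBitmapHeapscan.c need no knowledge of it.)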

OK.
 

> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 54382aba88..ea48e1d6e8 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -4037,7 +4037,6 @@ CreateStatsStmt:
>   *
>   *****************************************************************************/

> -// PBORKED: storage option
>  CreateAsStmt:
>               CREATE OptTemp TABLE create_as_target AS SelectStmt opt_with_data
>                               {
> @@ -4068,14 +4067,16 @@ CreateAsStmt:
>               ;

>  create_as_target:
> -                     qualified_name opt_column_list OptWith OnCommitOption OptTableSpace
> +                     qualified_name opt_column_list table_access_method_clause
> +                     OptWith OnCommitOption OptTableSpace
>                               {
>                                       $$ = makeNode(IntoClause);
>                                       $$->rel = $1;
>                                       $$->colNames = $2;
> -                                     $$->options = $3;
> -                                     $$->onCommit = $4;
> -                                     $$->tableSpaceName = $5;
> +                                     $$->accessMethod = $3;
> +                                     $$->options = $4;
> +                                     $$->onCommit = $5;
> +                                     $$->tableSpaceName = $6;
>                                       $$->viewQuery = NULL;
>                                       $$->skipData = false;           /* might get changed later */
>                               }
> @@ -4125,14 +4126,15 @@ CreateMatViewStmt:
>               ;

>  create_mv_target:
> -                     qualified_name opt_column_list opt_reloptions OptTableSpace
> +                     qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
>                               {
>                                       $$ = makeNode(IntoClause);
>                                       $$->rel = $1;
>                                       $$->colNames = $2;
> -                                     $$->options = $3;
> +                                     $$->accessMethod = $3;
> +                                     $$->options = $4;
>                                       $$->onCommit = ONCOMMIT_NOOP;
> -                                     $$->tableSpaceName = $4;
> +                                     $$->tableSpaceName = $5;
>                                       $$->viewQuery = NULL;           /* filled at analysis time */
>                                       $$->skipData = false;           /* might get changed later */
>                               }

Cool. I wonder if we should also somehow support SELECT INTO w/ USING?
You've apparently started to do so?

I thought the same, but SELECT INTO is deprecated syntax - is it fine
to add new syntax to it?
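
For reference, the grammar change above enables things like the
following (my_am is just a placeholder access method name):

    -- CREATE TABLE ... AS now takes an access method clause:
    CREATE TABLE t2 USING my_am AS SELECT * FROM t1;

    -- and likewise for materialized views:
    CREATE MATERIALIZED VIEW mv USING my_am AS SELECT * FROM t1;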
 
Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

Dmitry Dolgov
> On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <[hidden email]> wrote:
>
> Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> - pg_dump support
> - pg_upgrade testing
> - I think we should consider removing HeapTuple->t_tableOid, it should
>   imo live entirely in the slot

I'm a bit confused - what kind of pg_dump support are you talking about?
After a quick glance I don't see any table-access-specific logic there
so far. To check, I created a test access method (a copy of heap, but
with some small differences), and pg_dump worked as expected.

As a side note, a table's description doesn't mention which access
method the table uses; it's probably useful to show that with \d+
(see the attached patch).
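
For anyone who wants to reproduce the experiment, the setup is roughly
as follows (the handler name is whatever the test module exports - mine
is hypothetical here):

    -- register the copy-of-heap implementation as a new table AM
    CREATE ACCESS METHOD test_heap TYPE TABLE HANDLER test_heap_handler;

    -- and use it for a table
    CREATE TABLE t (a int) USING test_heap;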

describe_am.patch (53K) Download Attachment

Re: Pluggable Storage - Andres's take

Andres Freund
Hi,

On 2018-12-15 20:15:12 +0100, Dmitry Dolgov wrote:

> > On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <[hidden email]> wrote:
> >
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_dump support
> > - pg_upgrade testing
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
>
> I'm a bit confused, but what kind of pg_dump support you're talking about?
> After a quick glance I don't see so far any table access specific logic there.
> To check it I've created a test access method (which is a copy of heap, but
> with some small differences) and pg_dump worked as expected.

We need to dump the table access method at dump time, otherwise we lose
that information.
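
That is, a dump presumably needs to emit something along these lines
(a sketch - the exact form is up for discussion):

    -- either per table:
    CREATE TABLE t (a int) USING test_heap;

    -- or via a session default ahead of the plain CREATE TABLE:
    SET default_table_access_method = test_heap;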

> As a side note, in a table description I haven't found any mention of which
> access method is used for this table, probably it's useful to show that with \d+
> (see the attached patch).

I'm not convinced that's really worth the cost of including it in \d
(rather than \d+ or such). When developing an alternative access method
it's extremely useful to be able to just change the default access
method, and run the existing tests, which this makes harder. It's also a
lot of churn.

Greetings,

Andres Freund
