Pluggable Storage - Andres's take


Re: Pluggable Storage - Andres's take

Rafia Sabih
On Mon, 6 May 2019 at 22:39, Ashwin Agrawal <[hidden email]> wrote:

>
> On Mon, May 6, 2019 at 7:14 AM Andres Freund <[hidden email]> wrote:
> >
> > Hi,
> >
> > On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <[hidden email]> wrote:
> > >I was trying the toyam patch and on make check it failed with
> > >segmentation fault at
> > >
> > >static void
> > >toyam_relation_set_new_filenode(Relation rel,
> > > char persistence,
> > > TransactionId *freezeXid,
> > > MultiXactId *minmulti)
> > >{
> > > *freezeXid = InvalidTransactionId;
> > >
> > >Basically, on running create table t (i int, j int) using toytable,
> > >leads to this segmentation fault.
> > >
> > >Am I missing something here?
> >
> > I assume you got compiler warnings compiling it? The API for some callbacks changed a bit.
>
> Attached patch gets the toy table AM implementation to match the latest
> master API. The patch builds on top of the patch from Heikki in [1].
> It compiles and works, but the test still continues to fail with the
> WARNING for the issue mentioned in [1].
>
Thanks Ashwin, this works fine with the mentioned warnings of course.

--
Regards,
Rafia Sabih
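
For reference, the callback change Andres refers to is the addition of a
newrnode parameter to relation_set_new_filenode in PostgreSQL 12's tableam.h;
an AM compiled against the old signature reads *freezeXid from a shifted
argument slot, hence the crash. A minimal sketch of the toy callback updated
to the new signature (the minmulti line is an assumption mirroring heap's
behavior; a real AM would normally also create its storage here):

#include "postgres.h"

#include "access/multixact.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "storage/relfilenode.h"
#include "utils/rel.h"

/*
 * Updated to the v12-era callback signature: the newrnode parameter is
 * what the old toy AM was missing, so its remaining arguments were
 * shifted and *freezeXid pointed at garbage.
 */
static void
toyam_relation_set_new_filenode(Relation rel,
                                const RelFileNode *newrnode,
                                char persistence,
                                TransactionId *freezeXid,
                                MultiXactId *minmulti)
{
    /* A toy AM has no frozen-xid or multixact horizon to track. */
    *freezeXid = InvalidTransactionId;
    *minmulti = InvalidMultiXactId;
}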



Re: Pluggable Storage - Andres's take

Ashwin Agrawal
In reply to this post by Ashwin Agrawal

On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <[hidden email]> wrote:

>
> Also wish to point out, while working on Zedstore, we realized that
> TupleDesc from Relation object can be trusted at AM layer for
> scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> catalog is updated first and hence the relation object passed to AM
> layer reflects new TupleDesc. For heapam its fine as it doesn't use
> the TupleDesc today during scans in AM layer for scan_getnextslot().
> Only TupleDesc which can trusted and matches the on-disk layout of the
> tuple for scans hence is from TupleTableSlot. Which is little
> unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> and not in scan_begin(). Means if AM wishes to do some initialization
> based on TupleDesc for scans can't be done in scan_begin() and forced
> to delay till has access to TupleTableSlot. We should at least add
> comment for scan_begin() to strongly clarify not to trust Relation
> object TupleDesc. Or maybe other alternative would be have separate
> API for rewrite case.

Just to correct my typo, I wish to say, TupleDesc from Relation object can't
be trusted at AM layer for scan_begin() API.

Andres, any thoughts on the above? I see you had proposed to "change the
table_beginscan* API so it provides a slot" in [1], but it seems that
received no response/comments at the time.

[1] https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de


Re: Pluggable Storage - Andres's take

Andres Freund
In reply to this post by Ashwin Agrawal
Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:

> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <[hidden email]> wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > thought. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
>
> I will provide my inputs; Heikki, please correct me or add your inputs.
>
> I am not sure how much gain this practically provides if the rest of the
> system continues to use the returned value in terms of blocks. I
> understand that being block-based (and not just block-based, but with
> all blocks of a relation storing data and full tuples) is ingrained in
> the system. So breaking out of it is, yes, a much larger change, and not
> just limited to the table AM API.

I don't think it's that ingrained in all that many parts of the
system. Outside of the places I listed upthread, and the one index case
that stashes extra info, which places are that "block based"?


> I feel most of the issues discussed here should be faced by zheap as
> well, as not all blocks/pages contain data; TPD pages, for example,
> should be excluded from sampling and TID scans, etc...

It's not a problem so far, and zheap works on tableam. You can just skip
such blocks during sampling / analyze, and return nothing for tidscans.


> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > >    This should imo be moved to the tableam callback. It's currently done
> > >    a bit weirdly imo, with FDWs computing relpages in the callback, but
> > >    then also returning the acquirefunc. Seems like it should entirely be
> > >    computed as part of calling acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
> > would be to return the desired sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
>
> Yes, just routing the relation size via the AM layer, still using the
> returned value in terms of blocks, and performing sampling based on
> those blocks, doesn't feel like it resolves the issue. Maybe we need to
> delegate sampling completely to the AM layer. Code duplication can be
> avoided by similar AMs (heap and zheap) using some common utility
> functions to achieve the intended result.

I don't know what this is actually proposing.



> > I suspect - as previously mentioned- that we're going to have to extend
> > statistics collection beyond the current approach at some point, but I
> > don't think that's now. At least to me it's not clear how to best
> > represent the stats, and how to best use them, if the underlying storage
> > is fundamentally not block based.  Nor how we'd avoid code duplication...
>
> Yes, I will have to give more thought to this.
>
> >
> > > 3) nodeTidscan, skipping over too large tids
> > >    I think this should just be moved into the AMs, there's no need to
> > >    have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >         /*
> >          * We silently discard any TIDs that are out of range at the time of scan
> >          * start.  (Since we hold at least AccessShareLock on the table, it won't
> >          * be possible for someone to truncate away the blocks we intend to
> >          * visit.)
> >          */
> >         nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
>
> Agreed, it's not nice to have that optimization performed based on the
> number of blocks in the generic layer. I feel it's not efficient for
> zheap either, due to TPD pages as mentioned above, since the number of
> blocks returned will be higher than the number of actual data blocks.

I don't think there's a problem for zheap. The blocks are just
interspersed.

Having pondered this a lot more, I think this is the way to go for v12;
then we can improve on it for v13, to make things nicer.


> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter.  But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
>
> Thinking out loud here, we can possibly tackle this in multiple ways.
> First, the above-mentioned check seems to me more of an optimization
> than functionally needed; correct me if wrong. If that's true, we can
> check with the AM whether it wishes to apply that relation-size-based
> optimization or not.

It'd be really expensive to check this differently for heap. We'd have
to check the relation size, which is out of the question imo.

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

Andres Freund
In reply to this post by Ashwin Agrawal
Hi,

On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:

> On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <[hidden email]> wrote:
> > Also wish to point out, while working on Zedstore, we realized that
> > TupleDesc from Relation object can be trusted at AM layer for
> > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > catalog is updated first and hence the relation object passed to AM
> > layer reflects new TupleDesc. For heapam its fine as it doesn't use
> > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > Only TupleDesc which can trusted and matches the on-disk layout of the
> > tuple for scans hence is from TupleTableSlot. Which is little
> > unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> > and not in scan_begin(). Means if AM wishes to do some initialization
> > based on TupleDesc for scans can't be done in scan_begin() and forced
> > to delay till has access to TupleTableSlot. We should at least add
> > comment for scan_begin() to strongly clarify not to trust Relation
> > object TupleDesc. Or maybe other alternative would be have separate
> > API for rewrite case.
>
> Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> be trusted at AM layer for scan_begin() API.
>
> Andres, any thoughts on the above? I see you had proposed to "change the
> table_beginscan* API so it provides a slot" in [1], but it seems that
> received no response/comments at the time.
> [1]
> https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de

I don't think passing a slot at beginscan time is a good idea. There's
several places that want to use different slots for the same scan, and
we probably want to increase that over time (e.g. for batching), not
decrease it.

What kind of initialization do you want to do based on the tuple desc at
beginscan time?

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

Ashwin Agrawal
On Wed, May 8, 2019 at 2:46 PM Andres Freund <[hidden email]> wrote:

>
> Hi,
>
> On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:
> > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <[hidden email]> wrote:
> > > Also wish to point out, while working on Zedstore, we realized that
> > > TupleDesc from Relation object can be trusted at AM layer for
> > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > > catalog is updated first and hence the relation object passed to AM
> > > layer reflects new TupleDesc. For heapam its fine as it doesn't use
> > > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > > Only TupleDesc which can trusted and matches the on-disk layout of the
> > > tuple for scans hence is from TupleTableSlot. Which is little
> > > unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> > > and not in scan_begin(). Means if AM wishes to do some initialization
> > > based on TupleDesc for scans can't be done in scan_begin() and forced
> > > to delay till has access to TupleTableSlot. We should at least add
> > > comment for scan_begin() to strongly clarify not to trust Relation
> > > object TupleDesc. Or maybe other alternative would be have separate
> > > API for rewrite case.
> >
> > Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> > be trusted at AM layer for scan_begin() API.
> >
> > Andres, any thoughts on the above? I see you had proposed to "change the
> > table_beginscan* API so it provides a slot" in [1], but it seems that
> > received no response/comments at the time.
> > [1]
> > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de
>
> I don't think passing a slot at beginscan time is a good idea. There's
> several places that want to use different slots for the same scan, and
> we probably want to increase that over time (e.g. for batching), not
> decrease it.
>
> What kind of initialization do you want to do based on the tuple desc at
> beginscan time?

For Zedstore (a column store) we need to allocate a map (array or bitmask)
marking which columns to project in the scan, and to allocate the AM-internal
scan descriptors corresponding to the number of attributes involved, so we
need access to the number of attributes. Since the Relation's TupleDesc
currently can't be trusted, in Zedstore we worked around this by allocating
these structures on the first call to getnextslot, when we have access to the
slot (switching to the memory context used during scan_begin()), roughly as
sketched below.
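
A minimal sketch of that workaround, with hypothetical names (this is not the
actual Zedstore code): per-column state is allocated lazily on the first
getnextslot call, in the scan's memory context, using the slot's descriptor
rather than the relation's:

#include "postgres.h"

#include "access/relscan.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/memutils.h"

typedef struct MyScanDescData
{
    TableScanDescData rs_base;  /* AM-independent part, must be first */
    MemoryContext scan_cxt;     /* created during scan_begin() */
    bool        initialized;
    bool       *proj_columns;   /* which columns to project */
    void      **attr_scans;     /* per-attribute scan state */
} MyScanDescData;
typedef MyScanDescData *MyScanDesc;

static bool
myam_scan_getnextslot(TableScanDesc sscan, ScanDirection direction,
                      TupleTableSlot *slot)
{
    MyScanDesc  scan = (MyScanDesc) sscan;

    if (!scan->initialized)
    {
        /*
         * The slot's descriptor matches the on-disk layout, unlike
         * RelationGetDescr(sscan->rs_rd) during an ALTER TABLE rewrite.
         * Allocate in the scan's context so it survives across calls.
         */
        TupleDesc   tupdesc = slot->tts_tupleDescriptor;
        MemoryContext oldcxt = MemoryContextSwitchTo(scan->scan_cxt);

        scan->proj_columns = palloc0(tupdesc->natts * sizeof(bool));
        scan->attr_scans = palloc0(tupdesc->natts * sizeof(void *));
        MemoryContextSwitchTo(oldcxt);
        scan->initialized = true;
    }

    /* ... actual column fetching elided ... */
    return false;
}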



Re: Pluggable Storage - Andres's take

Andres Freund
In reply to this post by Andres Freund
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> thought. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).

> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.
>
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.
I'm still reasonably happy with this.  I'll polish it a bit and push.
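
Under that contract - bytes returned, blocks computed by the caller,
InvalidForkNumber meaning the total - an md.c-backed size callback comes out
roughly as below. This is a sketch modeled on heap's smgr-based storage, not
the committed code; the name myam_relation_size is illustrative:

#include "postgres.h"

#include "storage/smgr.h"
#include "utils/rel.h"

static uint64
myam_relation_size(Relation rel, ForkNumber forkNumber)
{
    uint64      nblocks = 0;

    RelationOpenSmgr(rel);

    if (forkNumber == InvalidForkNumber)
    {
        /* InvalidForkNumber asks for the total size across all forks. */
        for (int i = 0; i <= MAX_FORKNUM; i++)
            nblocks += smgrnblocks(rel->rd_smgr, (ForkNumber) i);
    }
    else
        nblocks = smgrnblocks(rel->rd_smgr, forkNumber);

    /* AM returns bytes; RelationGetNumberOfBlocksForFork converts back. */
    return nblocks * (uint64) BLCKSZ;
}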


> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
> /*
> * We silently discard any TIDs that are out of range at the time of scan
> * start.  (Since we hold at least AccessShareLock on the table, it won't
> * be possible for someone to truncate away the blocks we intend to
> * visit.)
> */
> nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
>
>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.
Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that just means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
quite soon.

Greetings,

Andres Freund

Attachments: v2-0001-tableam-Don-t-assume-that-an-AM-uses-md.c-style-s.patch (7K),
v2-0002-tableam-Avoid-relying-on-relation-size-to-determi.patch (14K)
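
For heap, the new callback boils down to a bounds check against the block
count captured once at scan start - roughly as below (a sketch matching the
prototype's description; the committed version may differ in detail):

#include "postgres.h"

#include "access/heapam.h"
#include "access/tableam.h"
#include "storage/itemptr.h"

static bool
heapam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid)
{
    HeapScanDesc hscan = (HeapScanDesc) scan;

    /* rs_nblocks was filled in once, when the scan started. */
    return ItemPointerIsValid(tid) &&
        ItemPointerGetBlockNumber(tid) < hscan->rs_nblocks;
}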

Re: Pluggable Storage - Andres's take

Ashwin Agrawal

On Wed, May 15, 2019 at 11:54 AM Andres Freund <[hidden email]> wrote:
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:

> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
>       /*
>        * We silently discard any TIDs that are out of range at the time of scan
>        * start.  (Since we hold at least AccessShareLock on the table, it won't
>        * be possible for someone to truncate away the blocks we intend to
>        * visit.)
>        */
>       nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
>
>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that just means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?

Highlevel this looks good to me. Will look into full details tomorrow.

This aligns with the high-level thought I had raised, but implemented in a
much better way: consult the AM on whether to perform the optimization. So
now, using the new callback table_tuple_tid_valid(), the AM can either
implement some way to validate the TID based on the scan descriptor and
optimize the scan, or, if it has no such way, always return true and let
table_fetch_row_version() handle things.

Re: Pluggable Storage - Andres's take

Andres Freund
Hi,

On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> Highlevel this looks good to me. Will look into full details tomorrow.

Ping?

I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...

- Andres



Re: Pluggable Storage - Andres's take

Ashwin Agrawal

On Fri, May 17, 2019 at 12:54 PM Andres Freund <[hidden email]> wrote:
Hi,

On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> Highlevel this looks good to me. Will look into full details tomorrow.

Ping?

I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...

Sorry for the delay, I didn't get to it yesterday. I looked into both patches; they both look good to me, thank you.

The relation size API still doesn't address the analyze case, as you mentioned, but that's something we can improve on later.


Re: Pluggable Storage - Andres's take

Ashwin Agrawal
In reply to this post by Heikki Linnakangas
On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas <[hidden email]> wrote:
On 08/04/2019 20:37, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> There's a little bug in index-only scan executor node, where it mixes up the
>> slots to hold a tuple from the index, and from the table. That doesn't cause
>> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
>> uses a virtual slot, it caused warnings like this from index-only scans:
>
> Hm. That's another one that I think I had fixed previously :(, and then
> concluded that it's not actually necessary for some reason. Your fix
> looks correct to me.  Do you want to commit it? Otherwise I'll look at
> it after rebasing zheap, and checking it with that.

I found another slot type confusion bug, while playing with zedstore. In
an Index Scan, if you have an ORDER BY key that needs to be rechecked,
so that it uses the reorder queue, then it will sometimes use the
reorder queue slot, and sometimes the table AM's slot, for the scan
slot. If they're not of the same type, you get an assertion:

TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
"execExprInterp.c", Line: 1905)

Attached is a test for this, again using the toy table AM, extended to
be able to test this. And a fix.

It seems the two patches from email [1], fixing slot confusion in Index Scans, are pending to be committed.

[1] https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi


Re: Pluggable Storage - Andres's take

Ashwin Agrawal
In reply to this post by Andres Freund

On Wed, May 15, 2019 at 11:54 AM Andres Freund <[hidden email]> wrote:
Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that just means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

A question on the patch, if it's not too late:
Why call table_beginscan() in TidNext() and not in ExecInitTidScan()? It seems cleaner to have it in ExecInitTidScan().


Re: Pluggable Storage - Andres's take

Andres Freund
Hi,

On 2019-05-17 16:56:04 -0700, Ashwin Agrawal wrote:
> A question on the patch, if it's not too late:
> Why call table_beginscan() in TidNext() and not in ExecInitTidScan()?
> It seems cleaner to have it in ExecInitTidScan().

Largely because it's symmetrical to where most other scans are started (
c.f. nodeSeqscan.c, nodeIndexscan.c). But also, there's no need to incur
the cost of a smgrnblocks() etc when the node might never actually be
reached during execution.

Greetings,

Andres Freund
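
Condensed, the pattern being described looks roughly like this (a sketch of
nodeTidscan.c's shape at the time, not the verbatim source; TidNextSketch is
an illustrative stand-in for the real TidNext()):

#include "postgres.h"

#include "access/tableam.h"
#include "nodes/execnodes.h"

static TupleTableSlot *
TidNextSketch(TidScanState *node)
{
    EState     *estate = node->ss.ps.state;
    TableScanDesc scandesc = node->ss.ss_currentScanDesc;

    if (scandesc == NULL)
    {
        /*
         * Start the scan lazily on first execution, as nodeSeqscan.c
         * does; a node that is never reached never pays the
         * smgrnblocks() cost.
         */
        scandesc = table_beginscan(node->ss.ss_currentRelation,
                                   estate->es_snapshot,
                                   0, NULL);
        node->ss.ss_currentScanDesc = scandesc;
    }

    /* ... TID fetching elided ... */
    return NULL;
}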



Re: Pluggable Storage - Andres's take

Andres Freund
In reply to this post by Ashwin Agrawal
Hi,

On 2019-05-17 14:49:19 -0700, Ashwin Agrawal wrote:

> On Fri, May 17, 2019 at 12:54 PM Andres Freund <[hidden email]> wrote:
>
> > Hi,
> >
> > On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> > > Highlevel this looks good to me. Will look into full details tomorrow.
> >
> > Ping?
> >
> > I'll push the first of the patches soon, and unless you'll comment on
> > the second soon, I'll also push ahead. There's a beta upcoming...
> >
>
> Sorry for the delay, I didn't get to it yesterday. I looked into both
> patches; they both look good to me, thank you.

Pushed both now.


> The relation size API still doesn't address the analyze case, as you
> mentioned, but that's something we can improve on later.

I'm much less concerned about that. You can just return a reasonable
block size from the size callback, and it'll work for block sampling
(and you can just skip pages in the analyze callback if needed, e.g. for
zheap's tpd pages). And we assume that a reasonable block size is
returned by the size callback anyway, for planning purposes (both in
relpages and for estimate_rel_size).  We'll probably want to improve
that some day, but it doesn't strike me as hugely urgent.
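
Such skipping could look roughly like the sketch below, reusing the
hypothetical MyScanDesc from the earlier sketch plus an assumed
analyze_blockno field and my_block_is_metadata() helper (all illustrative
names): per analyze.c's contract, returning false from the callback makes
acquire_sample_rows() move on to the next sampled block.

/* Hypothetical helper: does this block hold AM metadata rather than tuples? */
static bool my_block_is_metadata(MyScanDesc scan, BlockNumber blockno);

static bool
myam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
                             BufferAccessStrategy bstrategy)
{
    MyScanDesc  myscan = (MyScanDesc) scan;

    /*
     * Pages holding AM metadata (in the spirit of zheap's TPD pages)
     * contain no tuples, so decline them for sampling.
     */
    if (my_block_is_metadata(myscan, blockno))
        return false;

    /* Remember the block for scan_analyze_next_tuple() to read. */
    myscan->analyze_blockno = blockno;
    return true;
}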

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

Heikki Linnakangas
In reply to this post by Ashwin Agrawal
On 18/05/2019 01:19, Ashwin Agrawal wrote:

> On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas <[hidden email]> wrote:
>
>> On 08/04/2019 20:37, Andres Freund wrote:
>>> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>>>> There's a little bug in index-only scan executor node, where it mixes
>>>> up the slots to hold a tuple from the index, and from the table. That
>>>> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
>>>> my toy AM, which uses a virtual slot, it caused warnings like this
>>>> from index-only scans:
>>>
>>> Hm. That's another one that I think I had fixed previously :(, and then
>>> concluded that it's not actually necessary for some reason. Your fix
>>> looks correct to me.  Do you want to commit it? Otherwise I'll look at
>>> it after rebasing zheap, and checking it with that.
>>
>> I found another slot type confusion bug, while playing with zedstore. In
>> an Index Scan, if you have an ORDER BY key that needs to be rechecked,
>> so that it uses the reorder queue, then it will sometimes use the
>> reorder queue slot, and sometimes the table AM's slot, for the scan
>> slot. If they're not of the same type, you get an assertion:
>>
>> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
>> "execExprInterp.c", Line: 1905)
>>
>> Attached is a test for this, again using the toy table AM, extended to
>> be able to test this. And a fix.
>
> It seems the two patches from email [1], fixing slot confusion in Index
> Scans, are pending to be committed.
>
> [1] https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi

Pushed the first patch now. Andres already fixed the second issue in
commit b8b94ea129.

- Heikki



Re: Pluggable Storage - Andres's take

Alvaro Herrera-9
On 2019-Jun-06, Heikki Linnakangas wrote:

> Pushed the first patch now. Andres already fixed the second issue in commit
> b8b94ea129.

Please don't omit the "Discussion:" tag in commit messages.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



default_table_access_method is not in sample config file

Heikki Linnakangas
In reply to this post by Andres Freund
On 11/04/2019 19:49, Andres Freund wrote:

> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
>> index f7f726b5aec..bbcab9ce31a 100644
>> --- a/src/backend/utils/misc/guc.c
>> +++ b/src/backend/utils/misc/guc.c
>> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>>   {"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
>>   gettext_noop("Sets the default table access method for new tables."),
>>   NULL,
>> - GUC_IS_NAME
>> + GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>>   },
>>   &default_table_access_method,
>>   DEFAULT_TABLE_ACCESS_METHOD,
>
> Hm, I think we should rather add it to sample. That's an oversight, not
> intentional.

I just noticed that this is still an issue. default_table_access_method
is not in the sample config file, and it's not marked with
GUC_NOT_IN_SAMPLE. I'll add this to the open items list so we don't forget.

- Heikki



Re: default_table_access_method is not in sample config file

Michael Paquier-2
On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> On 11/04/2019 19:49, Andres Freund wrote:
>> Hm, I think we should rather add it to sample. That's an oversight, not
>> intentional.
>
> I just noticed that this is still an issue. default_table_access_method is
> not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> I'll add this to the open items list so we don't forget.

I think that we should give it the same visibility as default_tablespace,
so adding it to the sample file sounds good to me.
--
Michael

Attachment: sample-tab-am.patch (600 bytes)

Re: default_table_access_method is not in sample config file

Andres Freund
On 2019-08-13 15:03:13 +0900, Michael Paquier wrote:
> On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> > On 11/04/2019 19:49, Andres Freund wrote:
> >> Hm, I think we should rather add it to sample. That's an oversight, not
> >> intentional.
> >
> > I just noticed that this is still an issue. default_table_access_method is
> > not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> > I'll add this to the open items list so we don't forget.

Thanks!


> I think that we should give it the same visibility as default_tablespace,
> so adding it to the sample file sounds good to me.

> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
> index 65a6da18b3..39fc787851 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -622,6 +622,7 @@
>  #default_tablespace = '' # a tablespace name, '' uses the default
>  #temp_tablespaces = '' # a list of tablespace names, '' uses
>   # only default tablespace
> +#default_table_access_method = 'heap'

Pushed, thanks.


>  #check_function_bodies = on
>  #default_transaction_isolation = 'read committed'
>  #default_transaction_read_only = off

Hm.  I find the current ordering there a bit weird. Unrelated to your
proposed change.  The header of the group is

#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------

# - Statement Behavior -

but I don't quite see GUCs like default_tablespace, search_path (due to
determining a created table's schema), temp_tablespace,
default_table_access_method fit reasonably well under that heading. They
all can affect persistent state. That seems pretty different from a
number of other settings (client_min_messages,
default_transaction_isolation, lock_timeout, ...) which only have
transient effects.

Should we perhaps split that group? Not that I have a good proposal for
better names.

Greetings,

Andres Freund



Re: default_table_access_method is not in sample config file

Michael Paquier-2
On Fri, Aug 16, 2019 at 03:29:30PM -0700, Andres Freund wrote:
> but I don't quite see GUCs like default_tablespace, search_path (due to
> determining a created table's schema), temp_tablespace,
> default_table_access_method fit reasonably well under that heading. They
> all can affect persistent state. That seems pretty different from a
> number of other settings (client_min_messages,
> default_transaction_isolation, lock_timeout, ...) which only have
> transient effects.

Agreed.

> Should we perhaps split that group? Not that I have a good proposal for
> better names.

We could have a section for transaction-related parameters, and move
the vacuum ones into the section for autovacuum so that they get
grouped, renaming the section "autovacuum and vacuum".  An idea of a
group for search_path, temp_tablespaces, default_tablespace & co would
be "object parameters", or "relation parameters", for all the
parameters which interfere with object definitions?
--
Michael
