WIP: a way forward on bootstrap data

WIP: a way forward on bootstrap data

John Naylor
Hi,

I was looking through the archives one day for a few topics that
interest me, and saw there was continued interest in making bootstrap
data less painful [1] [2]. There were quite a few good ideas thrown
around in those threads, but not much in the way of concrete results.
I took a few of them as a starting point and threw together the
attached WIP patchset.

==
An overview (warning: long):

1 through 3 are small tweaks worth doing even without a data format
change. 4 through 7 are separated for readability and flexibility, but
should be understood as one big patch. I tried to credit as many
people's ideas as possible. Some things are left undone until basic
agreement is reached.

--
Patch 1 - Minor corrections

--
Patch 2 - Minor cleanup

Be more consistent style-wise, change a confusing variable name, fix
perltidy junk.

--
Patch 3 - Remove hard-coded schema information about pg_attribute from
genbki.pl.

This means slightly less code maintenance, but more importantly it's a
proving ground for mechanisms used in later patches.

1. Label a column's default value in the catalog header [3].

Note: I implemented it in the simplest way possible for now, which
happens to prevent a column from having both FORCE_(NOT_)NULL and a
default at the same time, but I think in practice that would almost
never matter. If more column options are added in the future, this
will have to be rewritten.

2. Add a new function to Catalog.pm to fill in a tuple with default
values. It will complain loudly if it can't find either a default or a
given value, so change the signature of emit_pgattr_row() so we can
pass a partially built tuple to it.

3. Format the schema macro entries according to their types.

4. Commit 8137f2c32322c624e0431fac1621e8e9315202f9 labeled which
columns are variable length. Expose that label so we can exclude those
columns from schema macros in a general fashion.
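
To make the intent of points 1 and 2 above concrete, here is a rough sketch of the default-filling logic in Python (the real code is Perl in Catalog.pm, and all names here are illustrative):

```python
def fill_in_default_values(row, schema):
    """Fill missing columns from their declared defaults.

    row    -- dict mapping column name to value (possibly partial)
    schema -- list of (column_name, default_or_None) pairs, in
              catalog order, as extracted from the header
    """
    for colname, default in schema:
        if colname not in row:
            if default is None:
                # Complain loudly: no value given and no default declared.
                raise ValueError("no value or default for column '%s'"
                                 % colname)
            row[colname] = default
    return row
```

So a partially built pg_attribute tuple can be handed to emit_pgattr_row() and filled out along the way.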

--
Patch 4 - Infrastructure for the data conversion

1. Copy a modified version of Catalogs() from Catalog.pm to a new
script that turns DATA()/(SH)DESCR() statements into serialized Perl
data structures in pg_*.dat files, preserving comments along the way.
Toast and index statements are unaffected. Although it's a one-off as
far as the core project is concerned, I imagine third-parties might
want access to this tool, so it's in the patch and not separate.

2. Remove data parsing from the original Catalogs() function and
rename it to reflect its new, limited role of extracting the schema
info from a header. The data files are handled by a new function.

3. Introduce a script to rewrite pg_*.dat files in a standard format
[4], strip default values, preserve comments and normalize blank
lines. It can also change default values on the fly. It is intended to
be run when adding new data.

4. Add default values to a few catalog headers for demonstration
purposes. There might be some questionable choices here. Review is
appreciated.
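For reference, the header labeling in point 4 could look something like this fragment (the macro spelling and the particular defaults are just for illustration):

```c
/* hypothetical fragment of a catalog header with default labels */
CATALOG(pg_proc,1255)
{
	NameData	proname;					/* procedure name: no default */
	char		provolatile BKI_DEFAULT(v);	/* default: volatile */
	char		proparallel BKI_DEFAULT(u);	/* default: unsafe */
} FormData_pg_proc;
```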

Note: At this point, the build is broken.

TODO: See what pgindent does to the modified headers.
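
The default-stripping in point 3 is conceptually simple; as a Python sketch (the real script is Perl), it amounts to:

```python
def strip_defaults(entry, defaults):
    """Drop any field whose value matches the declared column default,
    keeping hand-maintained entries short.  'defaults' maps column
    name to default value, for columns that have one."""
    return {name: value for name, value in entry.items()
            if defaults.get(name) != value}
```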

--
Patch 5 - Mechanical data conversion

This is the result of:

cd src/include/catalog
perl convert_header2dat.pl  list-of-catalog-headers-from-the-Makefile
perl -I ../../backend/catalog  rewrite_dat.pl  *.dat
rm *.bak

Note: The data has been stripped of all double-quotes for readability,
since the Perl hash literals have single quotes everywhere. Patches 6
and 7 restore them where needed.

--
Patch 6 - Hand edits

Re-doublequote values that are macros expanded by initdb.c, remove
stray comments, fix up whitespace, and do a minimum of comment editing
to reflect the new data format.

At this point the files are ready to take a look at. Format is the
central issue, of course. I tried to structure things so it wouldn't
be a huge lift to bikeshed on the format. See pg_authid.dat for a
conveniently small example. Each entry is 1 or 2 lines long, depending
on whether oid or (sh)descr is present.

Regarding pg_proc.dat, I think readability is improved, and to some
extent line length, although it's not great:

pg_proc.h:   avg=175, stdev=25
pg_proc.dat: avg=159, stdev=43

(line lengths measured with, e.g.:
grep -E '^DATA' pg_proc.h | awk '{l=length($0); n++; s+=l; ss+=l*l}
  END {printf "avg=%d, stdev=%d\n", s/n, sqrt(ss/n-(s/n)^2)}'
and likewise for 'grep prosrc pg_proc.dat')

Many lines now fit in an editor window, but the longest line has
ballooned from 576 chars to 692. I made proparallel default to 'u' for
safety, but the vast majority are 's'. If we risked making 's' the
default, most entries would shrink by 20 chars. On the other hand,
human-readable types would inflate that again, but maybe patch 8 below
can help with editing. There was some discussion about abbreviated
labels that were mapped to column names - I haven't thought about that
yet.

--
Patch 7 - Distprep scripts

1. Teach genbki.pl and Gen_fmgrtab.pl to read the data files, and
arrange for the former to double-quote certain values so bootscanner.l
can read them.

2. Introduce Makefile dependencies on the data files.

The build works now.

Note: Since the original DATA() entries contained some (as far as I
can tell) useless quotes, the postgres.bki diff won't be zero, but it
will be close.

Performance note: On my laptop, running Make in the catalog dir with
no compilation goes from ~700ms on master to ~1000ms with the new data
files.

--
Patch 8 - SQL output

1. Write out postgres.sql, which allows you to insert all the source
BKI data into a development catalog schema for viewing and possible
bulk editing [5]. It keeps oids and (sh)descr fields in their own
columns, and implements default values.

2. Make it a distclean target.

TODO: Find a way to dump dev catalog tuples into the canonical format.

--
Not implemented yet:

-Gut the header files of DATA() statements. I'm thinking we should
keep the #defines in the headers, but see below:
-Update README and documentation

--
Future work:
-More lookups for human-readable types, operators, etc.
-Automate pg_type #defines a la pg_proc [6], which could also be used
to maintain ecpg's copy of pg_type #defines.

--
[1] https://www.postgresql.org/message-id/flat/20150220234142.GH12653%40awork2.anarazel.de

[2] https://www.postgresql.org/message-id/flat/CAGoxFiFeW64k4t95Ez2udXZmKA%2BtazUFAaSTtYQLM4Jhzw%2B-pg%40mail.gmail.com

[3] https://www.postgresql.org/message-id/20161113171017.7sgaqdeq6jslmsr3%40alap3.anarazel.de

[4] https://www.postgresql.org/message-id/D8F1D509-6498-43AC-BEFC-052DFE16847A%402ndquadrant.com

[5] https://www.postgresql.org/message-id/20150304150712.GV29780%40tamriel.snowman.net

[6] https://www.postgresql.org/message-id/15697.1479161432%40sss.pgh.pa.us

==

I'll add this to the next commitfest soon.

-John Naylor

0001_corrections_v1.patch (7K)
0002_cleanup_v1.patch (13K)
0003_pgattr_schema_isolation_v1.patch (24K)
0004_conversion_scripts_and_headers_v1.patch (73K)
0005_data_files_mechanical_v1.patch.tar.gz (145K)
0006_data_files_hand_edits_v1.patch (151K)
0007_update_distprep_scripts_v1.patch (19K)
0008_dev_sql_v1.patch (14K)

Re: WIP: a way forward on bootstrap data

John Naylor
There doesn't seem to be any interest in bootstrap data at the moment,
but rather than give up just yet, I've added a couple features to make
a data migration more compelling:

1. Human-readable types, operators, opfamilies, and access methods
2. Column abbreviations

For an example of both practices, an entry from pg_amop changes from

DATA(insert (   1976   21 21 1 s    95  403 0 ));

to

{ opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1',
oper => '<(int2,int2)', am => 'btree' },


3. Reduce redundancy in pg_proc data by
-computing pronargs from proargtypes and
-leaving out prosrc if it's just a copy of proname.

This, plus a few column abbreviations, drastically shrinks pg_proc.dat
line length, even with human-readable types:

pg_proc.h:   avg=175, stdev=25
pg_proc.dat: avg=92,  stdev=43

An example before:

DATA(insert OID = 300 (  float48ne         PGNSP PGUID 12 1 0 0 0 f f
f t t f i s 2 0 16 "700 701" _null_ _null_ _null_ _null_ _null_
float48ne _null_ _null_ _null_ ));

and after:

{ oid => '300',
  n => 'float48ne', lp => 't', p => 's', rt => 'bool', at => 'float4 float8' },
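
The two redundancy rules can be sketched as follows (Python for illustration only; the keys mirror the abbreviations above, with 'n' for proname, 'at' for proargtypes, and 's' for prosrc):

```python
def compact_entry(entry):
    """Drop pg_proc fields that can be reconstructed at build time."""
    entry = dict(entry)
    # pronargs is just the number of argument types
    entry.pop('pronargs', None)
    # prosrc is usually a copy of proname for built-in functions
    if entry.get('s') == entry.get('n'):
        entry.pop('s', None)
    return entry

def expand_entry(entry):
    """Reverse the compaction, as the build scripts would."""
    entry = dict(entry)
    entry['pronargs'] = str(len(entry.get('at', '').split()))
    entry.setdefault('s', entry['n'])
    return entry
```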

--
I've changed the numbering so that patches with the same number should
be taken as a unit, separated only for readability. When referring to
the previous email overview, they map like this:

1-3 : unchanged
4-7 : 4A-4D
8   : N/A - I've left out the SQL generation for now, but I can add it later.

New in this patch set:
Patch 5 rips out the DATA() and DESCR() lines from the headers and
updates the comments to reflect that.
Patches 6A and 6B implement human-readable types etc. as described above.


-John Naylor

0001_corrections_v1.patch (7K)
0002_cleanup_v1.patch (13K)
0003_pgattr_schema_isolation_v1.patch (24K)
0004A_conversion_scripts_and_headers_v2.patch (83K)
0004B_data_files_mechanical_v2.patch.tar.gz (128K)
0004C_data_files_hand_edits_v2.patch (111K)
0004D_update_distprep_scripts_v2.patch (20K)
0005_gut_headers_v2.patch.tar.gz (147K)
0006A_human_readable_entries_data_v2.patch.tar.gz (214K)
0006B_human_readable_entries_code_v2.patch (18K)

Re: WIP: a way forward on bootstrap data

Peter Eisentraut-6
On 12/13/17 04:06, John Naylor wrote:
> There doesn't seem to be any interest in bootstrap data at the moment,
> but rather than give up just yet, I've added a couple features to make
> a data migration more compelling:

I took a brief look at your patches, and there appear to be a number of
good cleanups in there at least.  But could you please send patches in
git format-patch format with commit messages, so we don't have to guess
what each patch does?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

John Naylor
On 12/13/17, Peter Eisentraut <[hidden email]> wrote:
> On 12/13/17 04:06, John Naylor wrote:
>> There doesn't seem to be any interest in bootstrap data at the moment,
>> but rather than give up just yet, I've added a couple features to make
>> a data migration more compelling:
>
> I took a brief look at your patches, and there appear to be a number of
> good cleanups in there at least.  But could you please send patches in
> git format-patch format with commit messages, so we don't have to guess
> what each patch does?

Thanks for taking a look and for pointing me to git format-patch.
That's much nicer than trying to keep emails straight. I've attached a
new patchset.

Note that 4-7 and 9-10 are units as far as the build is concerned:
once 4 is applied, the build is broken until 7 is applied.
Also, postgres.bki won't diff 100% clean with the master branch
because of some useless quotes in the latter.

One thing that occurred to me while looking over patch 0004 again: It's
now a bit uglier to handle indexing.h and toasting.h. I think it might
be cleaner to keep those statements in the header of the catalog they
refer to. That has the additional benefit of making the headers the
Single Point of Truth for a catalog schema.

TODO:
-Docs and README
-Finish SQL generation patch
-Consider generating pg_type #defines

-John Naylor

0001-Fix-typos-and-oversights-in-catalog-headers_v3.patch (5K)
0002-Cleanup-distprep-scripts_v3.patch (9K)
0003-Remove-hard-coded-schema-knowledge-about-pg_attri_v3.patch (19K)
0004-Data-conversion-infrastructure_v3.patch (73K)
0005-Mechanical-data-conversion_v3.patch.tar.gz (128K)
0006-Hand-edits-of-data-files_v3.patch (95K)
0007-Update-distprep-scripts_v3.patch (16K)
0008-Remove-data-from-catalog-headers_v3.patch.tar.gz (146K)
0009-Replace-oids-with-human-readable-names_v3.patch.tar.gz (167K)
0010-Type-aliases-for-oid-lookups_v3.patch (17K)

Re: WIP: a way forward on bootstrap data

David Fetter
On Thu, Dec 14, 2017 at 05:59:12PM +0700, John Naylor wrote:

> On 12/13/17, Peter Eisentraut <[hidden email]> wrote:
> > On 12/13/17 04:06, John Naylor wrote:
> >> There doesn't seem to be any interest in bootstrap data at the moment,
> >> but rather than give up just yet, I've added a couple features to make
> >> a data migration more compelling:
> >
> > I took a brief look at your patches, and there appear to be a number of
> > good cleanups in there at least.  But could you please send patches in
> > git format-patch format with commit messages, so we don't have to guess
> > what each patch does?
>
> Thanks for taking a look and for pointing me to git format-patch.
> That's much nicer than trying to keep emails straight. I've attached a
> new patchset.

Thanks for your hard work on this.  It'll really make developing this
part of the code a lot more pleasant.

Unfortunately, it no longer applies to master.  Can we get a rebase?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: WIP: a way forward on bootstrap data

John Naylor
On 12/20/17, David Fetter <[hidden email]> wrote:
> Thanks for your hard work on this.  It'll really make developing this
> part of the code a lot more pleasant.

I hope so, thanks.

> Unfortunately, it no longer applies to master.  Can we get a rebase?

Hmm, it still applied for me, except when I forgot to gunzip the
larger patches. In any case, I've attached version 4 which contains
some recent improvements. It was rebased against master as of
6719b238e8f0ea. If it doesn't apply for you please let me know the
details.
--

New in this patch set:
-Remove code duplication and improve modularity in data parsing
-Update README at appropriate points
-Shift some hunks around to make patches more focused and readable
-Comment and commit message editing
-Patch 11 reduces indentation
-Patch 12 moves all the toast/index commands into the individual
catalog headers and simplifies some #includes (Note: I failed to shave
the yak of selinux, so the #include changes for contrib/sepgsql are
untested)

At this point, I think it's no longer a WIP, and will only make
further changes based on review or if I find a mistake.

Left for future projects:
-SQL generation for querying source bootstrap data
-Generating pg_type #defines

-John Naylor

v4-0001-Fix-typos-and-oversights-in-catalog-headers.patch (7K)
v4-0002-Cleanup-distprep-scripts.patch (11K)
v4-0003-Remove-hard-coded-schema-knowledge-about-pg_attri.patch (20K)
v4-0004-Data-conversion-infrastructure.patch (70K)
v4-0005-Mechanical-data-conversion.patch.tar.gz (128K)
v4-0006-Hand-edits-of-data-files.patch (95K)
v4-0007-Update-catalog-scripts-to-read-data-files.patch (29K)
v4-0008-Remove-data-from-catalog-headers.patch.tar.gz (147K)
v4-0009-Replace-oids-with-human-readable-names.patch.tar.gz (167K)
v4-0010-Type-aliases-for-oid-lookups.patch (18K)
v4-0011-Reduce-indentation-level.patch (25K)
v4-0012-Move-toast-index-macros-to-the-pg_-catalog-header.patch (112K)

Re: WIP: a way forward on bootstrap data

Álvaro Herrera
Pushed 0001 and 0002.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

Álvaro Herrera
In reply to this post by John Naylor
Hmm, patch 0008 removes data lines from the .h but leaves the dependent
OID define lines around:

#define BTREE_AM_OID 403

This is not good, because then the define depends on data that appears
in a distant file.  Another consideration is that the current system has
the property that these OIDs are discoverable by a hacker by navigating
to the containing .h file; and a missing symbol is easily fixable if
they need to hardcode the OID for which there isn't a symbol yet.

Maybe a generated .h file containing defines for OIDs from all catalogs
is okay.  Of course, not all symbols are to be listed -- we should have
a special marker in the data lines for those that are.  Maybe something
like this

{ oid => '403', descr => 'b-tree index access method',
  amname => 'btree', amhandler => 'bthandler', amtype => 'i',
  cpp_symbol => 'BTREE_AM_OID' },

(where 'cpp_symbol' would be skipped by genbki explicitly).

Any better ideas?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

Robert Haas
On Thu, Dec 21, 2017 at 5:32 PM, Alvaro Herrera <[hidden email]> wrote:
> Hmm, patch 0008 removes data lines from the .h but leaves the dependent
> OID define lines around:

Just a question here -- do we actually have consensus on doing the
stuff that these patches do?  Because I'm not sure we do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: a way forward on bootstrap data

John Naylor
In reply to this post by Álvaro Herrera
On 12/22/17, Alvaro Herrera <[hidden email]> wrote:
> Hmm, patch 0008 removes data lines from the .h but leaves the dependent
> OID define lines around:
>
> #define BTREE_AM_OID 403
>
> This is not good, because then the define depends on data that appears
> in a distant file.

I see what you mean.

> Another consideration is that the current system has
> the property that these OIDs are discoverable by a hacker by navigating
> to the containing .h file; and a missing symbol is easily fixable if
> they need to hardcode the OID for which there isn't a symbol yet.

I'm not sure I follow you here.

> Maybe a generated .h file containing defines for OIDs from all catalogs
> is okay.  Of course, not all symbols are to be listed -- we should have
> a special marker in the data lines for those that are.  Maybe something
> like this
>
> { oid => '403', descr => 'b-tree index access method',
>   amname => 'btree', amhandler => 'bthandler', amtype => 'i',
>   cpp_symbol => 'BTREE_AM_OID' },
>
> (where 'cpp_symbol' would be skipped by genbki explicitly).

The last part makes sense and would be a fairly mechanical change. I'm
not sure of the best way to include those generated symbols back in
the code again, though. I think a single file might not be desirable
because of namespace pollution. The alternative would be to have, say,
pg_am.h include pg_am_sym.h. More complex but doable. Also, no need to
skip non-data values explicitly. The code knows where to find the
schema. :-)

Thanks for pushing 1 and 2, BTW.

-John Naylor


Re: WIP: a way forward on bootstrap data

Álvaro Herrera
In reply to this post by Robert Haas
Robert Haas wrote:
> On Thu, Dec 21, 2017 at 5:32 PM, Alvaro Herrera <[hidden email]> wrote:
> > Hmm, patch 0008 removes data lines from the .h but leaves the dependent
> > OID define lines around:
>
> Just a question here -- do we actually have consensus on doing the
> stuff that these patches do?  Because I'm not sure we do.

My reading of the old threads (links provided by John in his initial
email in this thread) is that we have a consensus that we want to get
rid of the old data representation, because it causes an endless amount of
pain.  The proposed format seems to satisfy the constraints that we all
discussed, namely

1. be easier to modify than the current format,
2. in particular, allow for default values on certain columns
3. not cause git merge problems because of too similar lines in every
record
4. not require onerous Perl modules

The one thing we seem to lack is a tool to edit the data files, as you
suggested[1].  Stephen Frost mentioned[2] that we could do this by
allowing the .data files to be loaded into a database table, have the changes
made via SQL, then have a way to create an updated .data file.  Tom
said[3] he didn't like that particular choice.

So we already have Catalog.pm that (after these patches) knows how to
load .data files; we could use that as a basis to enable easy oneliners
to do whatever editing is needed.

Also, the CPP symbols remaining in the pg_*.h files that I commented on
yesterday were already mentioned[4] before.

[1] https://www.postgresql.org/message-id/CA%2BTgmoa4%3D5oz7wSMcLNLh8h6cXzPoxxNJKckkdSQA%2BzpUG0%2B0A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20150304150712.GV29780%40tamriel.snowman.net
[3] https://www.postgresql.org/message-id/24766.1478821303%40sss.pgh.pa.us
[4] https://www.postgresql.org/message-id/15697.1479161432@...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

John Naylor
In reply to this post by John Naylor
I wrote:

> On 12/22/17, Alvaro Herrera <[hidden email]> wrote:
>> Maybe a generated .h file containing defines for OIDs from all catalogs
>> is okay.  Of course, not all symbols are to be listed -- we should have
>> a special marker in the data lines for those that are.  Maybe something
>> like this
>>
>> { oid => '403', descr => 'b-tree index access method',
>>   amname => 'btree', amhandler => 'bthandler', amtype => 'i',
>>   cpp_symbol => 'BTREE_AM_OID' },
>>
>> (where 'cpp_symbol' would be skipped by genbki explicitly).
>
> The last part makes sense and would be a fairly mechanical change. I'm
> not sure of the best way to include those generated symbols back in
> the code again, though. I think a single file might not be desirable
> because of namespace pollution.
[snip]

I've attached version 5 (rebased onto be2343221fb), which removes OID
#define symbols from the headers and stores them with the records they
refer to.

I went ahead with your suggestion to try a single generated header. I
believe there is no chance of namespace pollution since the symbols
already have a nomenclature that reflects what catalog they belong to.
This required some additional Makefile changes, since some code
outside the backend needs certain OID symbols to be visible. There are
probably bugs, but I wanted to share the initial design. The MSVC
changes are untested. In addition, FindDefinedSymbol() now doesn't
work for catalog headers, so I added a new function to search within
the data.

On the plus side, there is now a mechanism to generate pg_type OID
symbols, and ECPG's knowledge of types is now maintained
automatically.

Since I had to rebase over recent additions to SP-GiST opclasses
anyway, I restructured the patches to have a clean separation between
data migration and data compaction. This makes the patches easier to
follow.

The pg_proc defaults have been tweaked slightly to match those
suggested by Andrew Dunstan [1].

There are now human-readable entries for oprcom and oprnegate in
pg_operator.dat.

Finally, assorted cosmetic improvements and README/comment editing.

[1] https://www.postgresql.org/message-id/b76d153a-33d7-7827-746c-1109f7bf529d%40dunslane.net
--

-John Naylor

v5-0001-Remove-hard-coded-schema-knowledge-about-pg_attri.patch (20K)
v5-0002-Data-conversion-infrastructure.patch (33K)
v5-0003-Mechanical-data-conversion.patch.tar.gz (168K)
v5-0004-Hand-edits-of-data-files.patch (187K)
v5-0005-Update-catalog-scripts-to-read-data-files.patch (31K)
v5-0006-Remove-data-from-catalog-headers.patch.tar.gz (147K)
v5-0007-Remove-OID-define-symbols-from-catalog-headers.patch (47K)
v5-0008-Implement-data-compaction-strategies.patch (49K)
v5-0009-Data-file-compaction.patch.tar.gz (221K)
v5-0010-Replace-OIDs-with-human-readable-names.patch.tar.gz (175K)
v5-0011-Type-aliases-for-OID-lookups.patch (18K)
v5-0012-Reduce-indentation-level.patch (27K)
v5-0013-Move-toast-index-macros-to-the-pg_-catalog-header.patch (112K)

Re: WIP: a way forward on bootstrap data

Álvaro Herrera
Pushed 0001 with some changes of my own.  I'm afraid I created a few
conflicts for the later patches in your series; please rebase.

I don't think we introduced anything that would fail on old Perls, but
let's see what buildfarm has to say.

Others: Now is the time to raise concerns related to the proposed file
formats and tooling, so please do have a look when you have a moment.
At this stage, the proposed data format seems a good choice to me.

Thanks

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> Others: Now is the time to raise concerns related to the proposed file
> formats and tooling, so please do have a look when you have a moment.
> At this stage, the proposed data format seems a good choice to me.

It's not very clear to me what the proposed data format actually is,
and I don't really want to read several hundred KB worth of patches
in order to reverse-engineer that information.  Nor do I see
anything in the patch list that obviously looks like it updates
doc/src/sgml/bki.sgml to explain things.

So could we have an explanation of what it is we're agreeing to?

                        regards, tom lane


Re: WIP: a way forward on bootstrap data

David Fetter
On Fri, Jan 12, 2018 at 11:38:54AM -0500, Tom Lane wrote:

> Alvaro Herrera <[hidden email]> writes:
> > Others: Now is the time to raise concerns related to the proposed
> > file formats and tooling, so please do have a look when you have a
> > moment.  At this stage, the proposed data format seems a good
> > choice to me.
>
> It's not very clear to me what the proposed data format actually is,
> and I don't really want to read several hundred KB worth of patches
> in order to reverse-engineer that information.  Nor do I see
> anything in the patch list that obviously looks like it updates
> doc/src/sgml/bki.sgml to explain things.
>
> So could we have an explanation of what it is we're agreeing to?

That would be awesome.  A walk-through example or two would also help.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: WIP: a way forward on bootstrap data

Álvaro Herrera
In reply to this post by Tom Lane-2
Tom Lane wrote:

> Alvaro Herrera <[hidden email]> writes:
> > Others: Now is the time to raise concerns related to the proposed file
> > formats and tooling, so please do have a look when you have a moment.
> > At this stage, the proposed data format seems a good choice to me.
>
> It's not very clear to me what the proposed data format actually is,
> and I don't really want to read several hundred KB worth of patches
> in order to reverse-engineer that information.  Nor do I see
> anything in the patch list that obviously looks like it updates
> doc/src/sgml/bki.sgml to explain things.
>
> So could we have an explanation of what it is we're agreeing to?

Here's a small sample pg_proc entry:

 { oid => '2147', descr => 'number of input rows for which the input expression is not null',
  n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 'any', s => 'aggregate_dummy' },

A pg_amop entry:
{ opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => '<(int2,int2)', am => 'btree' },

Notes:
1. this is Perl data; it is read with 'eval' without any external modules.
2. the pg_proc entry has been compressed to two lines, to avoid
   content-free lines that would easily confuse git merge, but keep line
   length reasonable.
3. references to objects in other catalogs are by name, such as "int8"
   or "btree/integer_ops" rather than OID.
4. for each attribute, an abbreviation can be declared.  In the
   pg_proc sample we have "n" which stands for proname, because we have
   this line:
   +   NameData    proname BKI_ABBREV(n);

I think John has gone overboard with some of these choices, but we can
argue the specific choices once we decide that abbreviation is a good
idea.  (Prior discussion seems to suggest we already agreed on that.)
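
To illustrate note 3, the name lookups reduce to building reverse indexes over the catalog data itself; a toy sketch (Python here, the real code is Perl; data abridged):

```python
# Abridged pg_type data; the real lookup tables would be built from
# the full .dat files at genbki time.
pg_type_data = [
    {'oid': '21', 'typname': 'int2'},
    {'oid': '20', 'typname': 'int8'},
]

typname_to_oid = {row['typname']: row['oid'] for row in pg_type_data}

def resolve_type(name):
    """Turn a human-readable type reference back into the OID that
    bootstrap actually needs."""
    return typname_to_oid[name]
```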

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

Tom Lane-2
Alvaro Herrera <[hidden email]> writes:
> Tom Lane wrote:
>> So could we have an explanation of what it is we're agreeing to?

> Here's a small sample pg_proc entry:

>  { oid => '2147', descr => 'number of input rows for which the input expression is not null',
>   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 'any', s => 'aggregate_dummy' },

> A pg_amop entry:
> { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => '<(int2,int2)', am => 'btree' },

> Notes:
> 1. this is Perl data; it is read with 'eval' without any external modules.

Check.  Where is it coming from --- I suppose we aren't going to try to
store this in the existing .h files?  What provisions will there be for
comments?

> 2. the pg_proc entry has been compressed to two lines, to avoid
>    content-free lines that would easily confuse git merge, but keep line
>    length reasonable.

Seems like we would almost need a per-catalog convention on how to lay out
the entries, or else we're going to end up (over time) with lots of cowboy
coding leading to entries that look randomly different from the ones
around them.

> 3. references to objects in other catalogs are by name, such as "int8"
>    or "btree/integer_ops" rather than OID.

+1

> 4. for each attribute, an abbreviation can be declared.  In the
>    pg_proc sample we have "n" which stands for proname, because we have
>    this line:
>    +   NameData    proname BKI_ABBREV(n);

I think single-letter abbreviations here are a pretty bad space vs
readability tradeoff, particularly for wider catalogs where it risks
ambiguity.  The pg_amop sample you show looks noticeably more legible than
the other one.  Still, this is something we can debate on a case-by-case
basis, it's not a defect in the mechanism.

One other question is how we'll verify the conversion.  Is there an
expectation that the .bki file immediately after the conversion will
be identical to immediately before?

                        regards, tom lane


Re: WIP: a way forward on bootstrap data

John Naylor
In reply to this post by Tom Lane-2
Tom, everyone,
It's getting late in my timezone, but I wanted to give a few quick
answers. I'll follow up tomorrow. Thanks Alvaro for committing my
refactoring of pg_attribute data creation. I think your modifications
are sensible and I'll rebase soon.

On 1/13/18, Tom Lane <[hidden email]> wrote:
> It's not very clear to me what the proposed data format actually is,
> and I don't really want to read several hundred KB worth of patches
> in order to reverse-engineer that information.  Nor do I see
> anything in the patch list that obviously looks like it updates
> doc/src/sgml/bki.sgml to explain things.

Alvaro gave a good overview, so I'll just point out a few things.

- Patches 0002 through 0007 represent a complete one-to-one migration
of data entries. I didn't see much in bki.sgml specific to the current
format, so my documentation changes are confined largely to the
README, in patch 0005.
- Patches 0008 and 0009 implement techniques to make the data lines
shorter. My choices are certainly debatable. There is a brief addition
to the README in patch 0008. The abbreviation technique was only used
in three catalogs to demonstrate.
- Patches 0010 and 0011 implement human-readable OID references.
- Patches 0012 and 0013 are cosmetic, but invasive.

> Seems like we would almost need a per-catalog convention on how to lay out
> the entries, or else we're going to end up (over time) with lots of cowboy
> coding leading to entries that look randomly different from the ones
> around them.

If I understand your concern correctly, the convention is enforced by
a script (rewrite_dat.pl). At the very least this would be run at the
same time as pgindent and perltidy. To be sure, because of default
values many entries will look randomly different from the ones around
them regardless. I have a draft patch to load the source data into
tables for viewing, but it's difficult to rebase, so I thought I'd
offer that enhancement later.
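
As a sketch of what that enforcement can look like (hypothetical code,
not the actual rewrite_dat.pl, and value quoting/escaping is glossed
over for brevity):

```perl
# Normalize an entry so attributes always appear in catalog column
# order, regardless of how a committer happened to write them.
# @column_order stands in for the real per-catalog column list.
my @column_order = qw(oid descr proname provolatile prorettype prosrc);

sub canonicalize_entry
{
	my ($entry) = @_;
	# Emit only the keys the entry actually has, in the fixed order,
	# so defaulted columns simply stay absent.
	my @parts = map { "$_ => '$entry->{$_}'" }
	            grep { exists $entry->{$_} } @column_order;
	return '{ ' . join(', ', @parts) . ' },';
}
```

Running every .dat file through such a script makes key order a
non-issue, the same way pgindent makes brace style a non-issue.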

> One other question is how we'll verify the conversion.  Is there an
> expectation that the .bki file immediately after the conversion will
> be identical to immediately before?

Not identical. First, as part of the base migration, I stripped almost
all double quotes from the data entries since the new Perl hash values
are already single-quoted. (The exception is macros expanded by
initdb.c.) I made genbki.pl add quotes on output to match what
bootscanner.l expects. Where a simple rule made it possible, it also
matches the original .bki. The new .bki will only diff where the
current data has superfluous quotes (e.g. "0", "sql"). Second, if the
optional cosmetic patch 0013 is applied, the individual index and
toast commands will be in a different order.

> Check.  Where is it coming from --- I suppose we aren't going to try to
> store this in the existing .h files?  What provisions will there be for
> comments?

Yes, they're in ".dat" files. Perl comments (#) on their own line are
supported. I migrated all existing comments from the header files as
part of the conversion. This is scripted, so I can rebase over new
catalog entries that get committed.
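
A minimal sketch of the reading side (assumed code, not the patch's
exact reader), showing why '#' comments come for free:

```perl
# A .dat file is just a Perl array-of-hashes literal, so '#' lines are
# ordinary Perl comments and eval handles them with no extra parsing.
# The heredoc stands in for slurping the file from disk.
my $dat = <<'EOF';
# aggregates
[
	{ oid => '2147', proname => 'count', prosrc => 'aggregate_dummy' },
]
EOF

my $entries = eval $dat;
die "bad data: $@" if $@;
```

No external modules are needed; eval returns the arrayref directly, and
a syntax error in the data surfaces through $@.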

> I think single-letter abbreviations here are a pretty bad space vs
> readability tradeoff, particularly for wider catalogs where it risks
> ambiguity.

Ironically, I got that one from you [1] ;-), but if you have a
different opinion upon seeing concrete, explicit examples, I think
that's to be expected.

--
Now is probably a good time to disclose concerns of my own:
1. MSVC dependency tracking is certainly broken until such time as I
can shave that yak and test.
2. Keeping the oid symbols with the data entries required some
Makefile trickery to make them visible to .c files outside the backend
(patch 0007). It builds fine, but the dependency tracking might have
bugs.

--
[1] https://www.postgresql.org/message-id/15697.1479161432%40sss.pgh.pa.us

Thanks,
John Naylor


Re: WIP: a way forward on bootstrap data

Peter Eisentraut-6
In reply to this post by Álvaro Herrera
On 1/12/18 12:24, Alvaro Herrera wrote:

> Here's a small sample pg_proc entry:
>
>  { oid => '2147', descr => 'number of input rows for which the input expression is not null',
>   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 'any', s => 'aggregate_dummy' },
>
> A pg_amop entry:
> { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => '<(int2,int2)', am => 'btree' },
>
> Notes:
> 1. this is Perl data; it is read with 'eval' without any external modules.
> 2. the pg_proc entry has been compressed to two lines, to avoid
>    content-free lines that would easily confuse git merge, but keep line
>    length reasonable.

I don't think I like this.  I know pg_proc.h is a pain to manage, but at
least right now it's approachable programmatically.  I recently proposed
a patch to replace the columns proisagg and proiswindow with a combined
column prokind.  I could easily write a small Perl script to make that
change in pg_proc.h, because the format is easy to parse and has one
line per entry.  With this new format, that approach would no longer
work, and I don't know what would replace it.

> 3. references to objects in other catalogs are by name, such as "int8"
>    or "btree/integer_ops" rather than OID.

I think we could already do this by making more use of things like
regtype and regproc.  That should be an easy change to make.

> 4. for each attribute, an abbreviation can be declared.  In the
>    pg_proc sample we have "n" which stands for proname, because we have
>    this line:
>    +   NameData    proname BKI_ABBREV(n);

I'm afraid a key value system would invite writing the attributes in
random order and create a mess over time.

But if we want to do it, I think we could also add it to the current BKI
format.  The same goes for defining default values for some columns.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: a way forward on bootstrap data

David Fetter
On Fri, Jan 12, 2018 at 04:22:26PM -0500, Peter Eisentraut wrote:

> On 1/12/18 12:24, Alvaro Herrera wrote:
> > Here's a small sample pg_proc entry:
> >
> >  { oid => '2147', descr => 'number of input rows for which the input expression is not null',
> >   n => 'count', proisagg => 't', v => 'i', p => 's', rt => 'int8', at => 'any', s => 'aggregate_dummy' },
> >
> > An pg_amop entry:
> > { opf => 'btree/integer_ops', lt => 'int2', rt => 'int2', str => '1', oper => '<(int2,int2)', am => 'btree' },
> >
> > Notes:
> > 1. this is Perl data; it is read with 'eval' without any external modules.
> > 2. the pg_proc entry has been compressed to two lines, to avoid
> >    content-free lines that would easily confuse git merge, but keep line
> >    length reasonable.
>
> I don't think I like this.  I know pg_proc.h is a pain to manage,
> but at least right now it's approachable programmatically.  I
> recently proposed a patch to replace the columns proisagg and
> proiswindow with a combined column prokind.  I could easily write a
> small Perl script to make that change in pg_proc.h, because the
> format is easy to parse and has one line per entry.  With this new
> format, that approach would no longer work, and I don't know what
> would replace it.

How about ingesting with Perl, manipulating there, and spitting back
out as Perl data structures?
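
A sketch of that round trip for Peter's prokind example (hypothetical
helper, and the 'a'/'w'/'f' encoding is assumed here, not taken from
his patch):

```perl
# Fold proisagg/proiswindow into a single prokind column across all
# parsed entries; the caller then reserializes the data back to .dat.
sub add_prokind
{
	my ($entries) = @_;
	foreach my $e (@$entries)
	{
		# Absent keys mean the column was defaulted to 'f'.
		my $isagg = delete($e->{proisagg})    // 'f';
		my $iswin = delete($e->{proiswindow}) // 'f';
		$e->{prokind} = $isagg eq 't' ? 'a'
		              : $iswin eq 't' ? 'w'
		              : 'f';
	}
}
```

Because the data is already Perl, a whole-catalog schema change stays a
dozen lines of script, much like the one-line-per-entry edits today.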

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
