Ltree syntax improvement

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

Ltree syntax improvement

Dmitry Belyavsky
Dear all,

Please find attached the patch extending the sets of symbols allowed in ltree labels. The patch introduces 2 variants of escaping symbols, via backslashing separate symbols and via quoting the labels in whole for ltree, lquery and ltxtquery datatypes.

--
SY, Dmitry Belyavsky

ltree_20190129.diff (105K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Nikolay Shaplov
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
Belyavsky написал:

> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.

Hi!

Let's me join the review process...

I've just give a superficial look to the patch, read it through, and try  here
and there, so first I'll give feedback about exterior features of the patch.

So, the first question: why do we need all this? The answer seems to be
obvious, but it would be good say it explicitly. What problems would it solve?
Do these problems really exists?

The second question is about coding style. As far as I can see you've passed
your code through pg_indent, but it does not solve all problems.

As far as I know, postgres commiters are quite strict about code width, the
code should not be wider that 80 col, except places were string constants are
introduced (There you can legally exceed 80 col). So I would advice to use vim
with  tabstop=4  and colorcolumn=80 and make sure that new code text does not
cross red vertical line.

Comments. There is no fixed coding style for comments in postgres. Usually this
means that it is better to follow coding in the code around. But ltree does
not have much comments, so I would suggest to use the best style I've seen in
the code
/*
 * [function name]
 * < tab ><tab> [Short description, what does it do]
 *
 * [long explanation how it works, several subparagraph if needed
 */
(Seen this in src/include/utils/rel.h)
or something like that.
At least it would be good to have explicit separation between descriptions and
explanation of how it works.

And about comment
/*
 * Function calculating bytes to escape
 * to_escape is an array of "special" 1-byte symbols
 * Behvaiour:
 * If there is no "special" symbols, return 0
 * If there are any special symbol, we need initial and final quote, so return
2
 * If there are any quotes, we need to escape all of them and also initial and
final quote, so
 * return 2 + number of quotes
 */

It has some formatting. But it should not.  As far as I understand this
comment should be treated as single paragraph by reformatter, and refomatted.
I do not understand why pg_indent have not done it. Documentation (https://
www.postgresql.org/docs/11/source-format.html) claims that if you want to
avoid reformatting, you should add "-------------" at the beginning and at the
end of the comment. So it would be good either remove this formatting, or add
"----" if you are sure you want to keep it.

Third part is about error messages.
First I've seen errors like elog(ERROR, "internal error during splitting
levels");. I've never seen error like that in postgres. May be if this error
is in the part of the code, that should never be reached in real live, may be
it would be good to change it to Assert? Or if this code can be normally
reached, add some more explanation of what had happened...

About other error messages.

There are messages like
errmsg("syntax error"),
errdetail("Unexpected end of line.")));

or

errmsg("escaping syntax error")));


In postgres there is error message style guide
https://www.postgresql.org/docs/11/error-style-guide.html

Here I would expect messages like that

Primary:    Error parsing ltree path
Detail:     Curly quote is opened and not closed

Or something like that, I am not expert in English, but this way it would be
better for sure. Key points are: Primary message should point to general area
where error occurred (there can be a large SQL with a lot of things in it, so
it is good to point that problem is with ltree staff). And is is better to
word from the point of view of a user. What for you (and me) is unexpected end
of line, for him it is unclosed curly quote.

Also I am not sure, if it is easy, but if it is possible, it would be good to
move "^" symbol that points to the place of the error, to the place inside ' '
where unclosed curly quote is located. As a user I would appreciate that.

So that's what I've seen while giving it superficial look...


Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Nikolay Shaplov
In reply to this post by Dmitry Belyavsky
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
Belyavsky написал:
> Dear all,
>
> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.

Sorry I still did not do full revision, I've stumbled up on other things in
the code, which, as I think should be fixed before final checking through. (I
guess that code do what is expected, since it passes tests and so on, it needs
recheck, but I'd like to do this full recheck only once)

So my doubts about this code is that a lot parts of code is just a copy-paste
of the same code that was written right above. And it can be rewritten in a
way that the same lines (or parts of lines) is repeated only once...
This should make code more readable, and make code more similar to the code of
postgres core, I've seen.

Here I will speak about lquery_in function from ltree_io.c.

1. May be it is better here to switch from else if (state ==
LQPRS_[something]) to switch (stat) { case LQPRS_[something]: }
because because here the main case is to determine what is the current state
of your state machine, do something with it, and then go to the next step.

Now after you've done what should be done in this step, you should make sure
that no other code is executed writing more ifs... If you use switch/case you
will be able to say break wherever you like, it will save you some ifs.

Theoretical example

(cl is for character length fc is for fist char)

if (cl==1 && fc =='1') {do_case1();}
else if (cl==1 && fc =='2') {do_case2();}
else if (cl==1 && fc =='3') {do_case3();}
else {do_otherwise();}

If it is wrapped in switch/case it will be like

if (cl==1)
{
  if (fc =='1') {do_case1(); break;}
  if (fc =='2') {do_case2(); break;}
  if (fc =='3') {do_case3(); break;}
}
/*if nothing is found */
do_otherwise();

This for example will allow you to get rid of multiply charlen == 1 checks in

        else if ((lptr->flag & LVAR_QUOTEDPART) == 0)                                                                                                                                      
            {                                                                                                                                                                                  
                if (charlen == 1 && t_iseq(ptr, '@'))                                                                                                                                          
                {                                                                                                                                                                              
                    if (lptr->start == ptr)                                                                                                                                                    
                        UNCHAR;                                                                                                                                                                
                    lptr->flag |= LVAR_INCASE;                                                                                                                                                  
                    curqlevel->flag |= LVAR_INCASE;                                                                                                                                            
                }                                                                                                                                                                              
                else if (charlen == 1 && t_iseq(ptr, '*'))                                                                                                                                      
                {                                                                                                                                                                              
                    if (lptr->start == ptr)                                                                                                                                                    
                        UNCHAR;                                                                                                                                                                
                    lptr->flag |= LVAR_ANYEND;                                                                                                                                                  
                    curqlevel->flag |= LVAR_ANYEND;                                                                                                                                            
                }                                                                                                                                                                              
                else if (charlen == 1 && t_iseq(ptr, '%'))                                                                                                                                      
                {                                                                                                                                                                              
                    if (lptr->start == ptr)                                                                                                                                                    
                        UNCHAR;                                                                                                                                                                
                    lptr->flag |= LVAR_SUBLEXEME;                                                                                                                                              
                    curqlevel->flag |= LVAR_SUBLEXEME;                                                                                                                                          
                }                                                                                                                                                                              
                else if (charlen == 1 && t_iseq(ptr, '|'))                                                                                                                                      
                {                                                                                                                                                                              
                    real_nodeitem_len(lptr, ptr, escaped_count,
tail_space_bytes, tail_space_symbols);                                                                                          
                                                                                                                                                                                               
                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                        
                        adjust_quoted_nodeitem(lptr);                                                                                                                                          
                                                                                                                                                                                               
                    check_level_length(lptr, pos);                                                                                                                                              
                    state = LQPRS_WAITVAR;                                                                                                                                                      
                }                                                                                                                                                                              
                else if (charlen == 1 && t_iseq(ptr, '.'))                                                                                                                                      
                {                                                                                                                                                                              
                    real_nodeitem_len(lptr, ptr, escaped_count,
tail_space_bytes, tail_space_symbols);                                                                                          
                                                                                                                                                                                               
                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                        
                        adjust_quoted_nodeitem(lptr);              
 
                                                                                                                                                                                           
                    check_level_length(lptr, pos);                                                                                                                                              
                                                                                                                                                                                               
                    state = LQPRS_WAITLEVEL;                                                                                                                                                    
                    curqlevel = NEXTLEV(curqlevel);                                                                                                                                            
                }                                                                                                                                                                              
                else if (charlen == 1 && t_iseq(ptr, '\\'))                                                                                                                                    
                {                                                                                                                                                                              
                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                        
                        UNCHAR;                                                                                                                                                                
                    state = LQPRS_WAITESCAPED;                                                                                                                                                  
                }                                                                                                                                                                              
                else                                                                                                                                                                            
                {                                                                                                                                                                              
                    if (charlen == 1 && strchr("!{}", *ptr))                                                                                                                                    
                        UNCHAR;                                                                                                                                                                
                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                        
                    {                                                                                                                                                                          
                        if (t_isspace(ptr))                                                                                                                                                    
                        {                                                                                                                                                                      
                            ptr += charlen;                                                                                                                                                    
                            pos++;                                                                                                                                                              
                            tail_space_bytes += charlen;                                                                                                                                        
                            tail_space_symbols = 1;                                                                                                                                            
                            continue;                                                                                                                                                          
                        }                                                                                                                                                                      
                                                                                                                                                                                               
                        UNCHAR;                                                                                                                                                                
                    }                                                                                                                                                                          
                    if (lptr->flag & ~LVAR_QUOTEDPART)                                                                                                                                          
                        UNCHAR;                                                                                                                                                                
                }                                                                                                                                                                              
            }

There are a lot of them, I would suggest to reduce there number to one check
in the right place.

Similar thing is about "GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);" in this part of code.

            if (charlen == 1)                                                                                                                                                                  
            {                                                                                                                                                                                  
                if (t_iseq(ptr, '!'))                                                                                                                                                          
                {                                                                                                                                                                              
                    GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                  
                    lptr->start = ptr + 1;                                                                                                                                                      
                    state = LQPRS_WAITDELIM;                                                                                                                                                    
                    curqlevel->numvar = 1;                                                                                                                                                      
                    curqlevel->flag |= LQL_NOT;                                                                                                                                                
                    hasnot = true;                                                                                                                                                              
                }                                                                                                                                                                              
                else if (t_iseq(ptr, '*'))                                                                                                                                                      
                    state = LQPRS_WAITOPEN;                                                                                                                                                    
                else if (t_iseq(ptr, '\\'))                                                                                                                                                    
                {                                                                                                                                                                              
                    GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                  
                    lptr->start = ptr;                                                                                                                                                          
                    curqlevel->numvar = 1;                                                                                                                                                      
                    state = LQPRS_WAITESCAPED;                                                                                                                                                  
                }                                                                                                                                                                              
                else if (strchr(".|@%{}", *ptr))                                                                                                                                                
                {                                                                                                                                                                              
                    UNCHAR;                                                                                                                                                                    
                }                                                                                                                                                                              
                else                                                                                                                                                                            
                {                                                                                                                                                                              
                    GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                  
                    lptr->start = ptr;                                                                                                                                                          
                    state = LQPRS_WAITDELIM;                                                                                                                                                    
                    curqlevel->numvar = 1;                                                                                                                                                      
                    if (t_iseq(ptr, '"'))                                                                                                                                                      
                    {                                                                                                                                                                          
                        lptr->flag |= LVAR_QUOTEDPART;                                                                                                                                          
                    }                                                                                                                                                                          
                }                                                                                                                                                                              
            }
            else                                                                                                                                                                                
            {                                                                                                                                                                                  
                GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                      
                lptr->start = ptr;                                                                                                                                                              
                state = LQPRS_WAITDELIM;                                                                                                                                                        
                curqlevel->numvar = 1;                                                                                                                                                          
            }

It is repeated almost is every branch of this if-subtree... Can it be
rewritten the way it is repeated only once?

And so on.

So my request is to reduce multiply repetition of the same code...

PS. One other thing that I've been thinking over but did not came to final
conclusion...

It is often quite often case

if (t_iseq(ptr, 'a'){same_code(); code_for_a();}
else if (t_iseq(ptr, 'b'){same_code(); code_for_b();}
else if (t_iseq(ptr, 'c'){same_code(); code_for_c);}
else something_else();

I've been thinking if it is possible to move this to

const unsigned char c=TOUCHAR(ptr);
switch(c)
{
  case 'a':
  case 'b':
  case 'c':
    same_code();
    if (c=='a') {code_for_a();}
    if (c=='b') {code_for_b();}
    if (c=='c') {code_for_c();}
    break;
  default:
   something_else();
}  

I did not still get, how valid is replacement of t_iseq() with TOUCHAR() +
"==". It is quite equivalent now, but I do not know how the core code is going
to evolve, so can't get final understanding. I even tried to discuss it with
Teodor (you've seen it) but still did come to any conclusion.

So for now status of this idea is "my considerations about ways to deduplicate
the code".  

So this is all essential things I can say about this patch as it is now. May
be somebody can add something more...

Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Dmitry Belyavsky
Dear Nikolay, 

On Wed, Feb 20, 2019 at 12:28 PM Nikolay Shaplov <[hidden email]> wrote:
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
Belyavsky написал:
> Dear all,
>
> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.

Sorry I still did not do full revision, I've stumbled up on other things in
the code, which, as I think should be fixed before final checking through. (I
guess that code do what is expected, since it passes tests and so on, it needs
recheck, but I'd like to do this full recheck only once)

So my doubts about this code is that a lot parts of code is just a copy-paste
of the same code that was written right above. And it can be rewritten in a
way that the same lines (or parts of lines) is repeated only once...
This should make code more readable, and make code more similar to the code of
postgres core, I've seen.

Here I will speak about lquery_in function from ltree_io.c.

1. May be it is better here to switch from else if (state ==
LQPRS_[something]) to switch (stat) { case LQPRS_[something]: }
because because here the main case is to determine what is the current state
of your state machine, do something with it, and then go to the next step.

Now after you've done what should be done in this step, you should make sure
that no other code is executed writing more ifs... If you use switch/case you
will be able to say break wherever you like, it will save you some ifs.

I thought about it writing the code. But it significantly increased the size of the patch
so I decided to follow the original coding style here.
 

Theoretical example

(cl is for character length fc is for fist char)

if (cl==1 && fc =='1') {do_case1();}
else if (cl==1 && fc =='2') {do_case2();}
else if (cl==1 && fc =='3') {do_case3();}
else {do_otherwise();}

If it is wrapped in switch/case it will be like

if (cl==1)
{
  if (fc =='1') {do_case1(); break;}
  if (fc =='2') {do_case2(); break;}
  if (fc =='3') {do_case3(); break;}
}
/*if nothing is found */
do_otherwise();

This for example will allow you to get rid of multiply charlen == 1 checks in

        else if ((lptr->flag & LVAR_QUOTEDPART) == 0)                                                                                                                                       
            {                                                                                                                                                                                   
                if (charlen == 1 && t_iseq(ptr, '@'))                                                                                                                                           
                {                                                                                                                                                                               
                    if (lptr->start == ptr)                                                                                                                                                     
                        UNCHAR;                                                                                                                                                                 
                    lptr->flag |= LVAR_INCASE;                                                                                                                                                 
                    curqlevel->flag |= LVAR_INCASE;                                                                                                                                             
                }                                                                                                                                                                               
                else if (charlen == 1 && t_iseq(ptr, '*'))                                                                                                                                     
                {                                                                                                                                                                               
                    if (lptr->start == ptr)                                                                                                                                                     
                        UNCHAR;                                                                                                                                                                 
                    lptr->flag |= LVAR_ANYEND;                                                                                                                                                 
                    curqlevel->flag |= LVAR_ANYEND;                                                                                                                                             
                }                                                                                                                                                                               
                else if (charlen == 1 && t_iseq(ptr, '%'))                                                                                                                                     
                {                                                                                                                                                                               
                    if (lptr->start == ptr)                                                                                                                                                     
                        UNCHAR;                                                                                                                                                                 
                    lptr->flag |= LVAR_SUBLEXEME;                                                                                                                                               
                    curqlevel->flag |= LVAR_SUBLEXEME;                                                                                                                                         
                }                                                                                                                                                                               
                else if (charlen == 1 && t_iseq(ptr, '|'))                                                                                                                                     
                {                                                                                                                                                                               
                    real_nodeitem_len(lptr, ptr, escaped_count,
tail_space_bytes, tail_space_symbols);                                                                                         

                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                         
                        adjust_quoted_nodeitem(lptr);                                                                                                                                           

                    check_level_length(lptr, pos);                                                                                                                                             
                    state = LQPRS_WAITVAR;                                                                                                                                                     
                }                                                                                                                                                                               
                else if (charlen == 1 && t_iseq(ptr, '.'))                                                                                                                                     
                {                                                                                                                                                                               
                    real_nodeitem_len(lptr, ptr, escaped_count,
tail_space_bytes, tail_space_symbols);                                                                                         

                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                         
                        adjust_quoted_nodeitem(lptr);               


                    check_level_length(lptr, pos);                                                                                                                                             

                    state = LQPRS_WAITLEVEL;                                                                                                                                                   
                    curqlevel = NEXTLEV(curqlevel);                                                                                                                                             
                }                                                                                                                                                                               
                else if (charlen == 1 && t_iseq(ptr, '\\'))                                                                                                                                     
                {                                                                                                                                                                               
                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                         
                        UNCHAR;                                                                                                                                                                 
                    state = LQPRS_WAITESCAPED;                                                                                                                                                 
                }                                                                                                                                                                               
                else                                                                                                                                                                           
                {                                                                                                                                                                               
                    if (charlen == 1 && strchr("!{}", *ptr))                                                                                                                                   
                        UNCHAR;                                                                                                                                                                 
                    if (state == LQPRS_WAITDELIMSTRICT)                                                                                                                                         
                    {                                                                                                                                                                           
                        if (t_isspace(ptr))                                                                                                                                                     
                        {                                                                                                                                                                       
                            ptr += charlen;                                                                                                                                                     
                            pos++;                                                                                                                                                             
                            tail_space_bytes += charlen;                                                                                                                                       
                            tail_space_symbols = 1;                                                                                                                                             
                            continue;                                                                                                                                                           
                        }                                                                                                                                                                       

                        UNCHAR;                                                                                                                                                                 
                    }                                                                                                                                                                           
                    if (lptr->flag & ~LVAR_QUOTEDPART)                                                                                                                                         
                        UNCHAR;                                                                                                                                                                 
                }                                                                                                                                                                               
            }

Yes, this piece of code could be converted to the form you suggest, but
only partially, because the final else includes a strchr() call that,
being rewritten to the 'case' syntax will look ugly enough.
 

There are a lot of them, I would suggest to reduce there number to one check
in the right place.

Similar thing is about "GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);" in this part of code.

            if (charlen == 1)                                                                                                                                                                   
            {                                                                                                                                                                                   
                if (t_iseq(ptr, '!'))                                                                                                                                                           
                {                                                                                                                                                                               
                    GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                 
                    lptr->start = ptr + 1;                                                                                                                                                     
                    state = LQPRS_WAITDELIM;                                                                                                                                                   
                    curqlevel->numvar = 1;                                                                                                                                                     
                    curqlevel->flag |= LQL_NOT;                                                                                                                                                 
                    hasnot = true;                                                                                                                                                             
                }                                                                                                                                                                               
                else if (t_iseq(ptr, '*'))                                                                                                                                                     
                    state = LQPRS_WAITOPEN;                                                                                                                                                     
                else if (t_iseq(ptr, '\\'))                                                                                                                                                     
                {                                                                                                                                                                               
                    GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                 
                    lptr->start = ptr;                                                                                                                                                         
                    curqlevel->numvar = 1;                                                                                                                                                     
                    state = LQPRS_WAITESCAPED;                                                                                                                                                 
                }                                                                                                                                                                               
                else if (strchr(".|@%{}", *ptr))                                                                                                                                               
                {                                                                                                                                                                               
                    UNCHAR;                                                                                                                                                                     
                }                                                                                                                                                                               
                else                                                                                                                                                                           
                {                                                                                                                                                                               
                    GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                 
                    lptr->start = ptr;                                                                                                                                                         
                    state = LQPRS_WAITDELIM;                                                                                                                                                   
                    curqlevel->numvar = 1;                                                                                                                                                     
                    if (t_iseq(ptr, '"'))                                                                                                                                                       
                    {                                                                                                                                                                           
                        lptr->flag |= LVAR_QUOTEDPART;                                                                                                                                         
                    }                                                                                                                                                                           
                }                                                                                                                                                                               
            }
            else                                                                                                                                                                               
            {                                                                                                                                                                                   
                GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);                                                                                                     
                lptr->start = ptr;                                                                                                                                                             
                state = LQPRS_WAITDELIM;                                                                                                                                                       
                curqlevel->numvar = 1;                                                                                                                                                         
            }

It is repeated almost is every branch of this if-subtree... Can it be
rewritten the way it is repeated only once?

Well, I see the only one so long sequence of 'if'. 

And so on.

So my request is to reduce multiply repetition of the same code...

PS. One other thing that I've been thinking over but did not came to final
conclusion...

It is often quite often case

if (t_iseq(ptr, 'a'){same_code(); code_for_a();}
else if (t_iseq(ptr, 'b'){same_code(); code_for_b();}
else if (t_iseq(ptr, 'c'){same_code(); code_for_c);}
else something_else();

I've been thinking if it is possible to move this to

const unsigned char c=TOUCHAR(ptr);
switch(c)
{
  case 'a':
  case 'b':
  case 'c':
    same_code();
    if (c=='a') {code_for_a();}
    if (c=='b') {code_for_b();}
    if (c=='c') {code_for_c();}
    break;
  default:
   something_else();
}   

I did not still get, how valid is replacement of t_iseq() with TOUCHAR() +
"==". It is quite equivalent now, but I do not know how the core code is going
to evolve, so can't get final understanding. I even tried to discuss it with
Teodor (you've seen it) but still did come to any conclusion.

I agree with Teodor here. We do not know about various charsets so
the current syntax seems to be more safe. 

So for now status of this idea is "my considerations about ways to deduplicate
the code". 

So this is all essential things I can say about this patch as it is now. May
be somebody can add something more...



--
SY, Dmitry Belyavsky
Reply | Threaded
Open this post in threaded view
|

Re: Re: Ltree syntax improvement

David Steele
Hi Dmitry,

This patch appears too complex to be submitted to the last CF, as Andres
has also noted [1].

I have set the target version to PG13.

Regards,
--
-David
[hidden email]

[1]
https://www.postgresql.org/message-id/20190216054526.zss2cufdxfeudr4i%40alap3.anarazel.de

Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Chris Travers-7
In reply to this post by Nikolay Shaplov


On Tue, Feb 5, 2019 at 2:27 PM Nikolay Shaplov <[hidden email]> wrote:
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
Belyavsky написал:

> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.

Hi!

Let's me join the review process...

I've just give a superficial look to the patch, read it through, and try  here
and there, so first I'll give feedback about exterior features of the patch.

So, the first question: why do we need all this? The answer seems to be
obvious, but it would be good say it explicitly. What problems would it solve?
Do these problems really exists?

Yes, it does.  We maintain an extension (https://github.com/adjust/wltree) which has a fixed separator (::) and allows any utf-8 character in the tree.

In our case we currently use our extended tree to store user-defined hierarchies of labels, which might contain whitespace, Chinese, Japanese, or Korean characters, etc.

I would love to be able to deprecate our work on this extension and eventually use stock ltree.

The second question is about coding style. As far as I can see you've passed
your code through pg_indent, but it does not solve all problems.

As far as I know, postgres commiters are quite strict about code width, the
code should not be wider that 80 col, except places were string constants are
introduced (There you can legally exceed 80 col). So I would advice to use vim
with  tabstop=4  and colorcolumn=80 and make sure that new code text does not
cross red vertical line.

Comments. There is no fixed coding style for comments in postgres. Usually this
means that it is better to follow coding in the code around. But ltree does
not have much comments, so I would suggest to use the best style I've seen in
the code
/*
 * [function name]
 * < tab ><tab> [Short description, what does it do]
 *
 * [long explanation how it works, several subparagraph if needed
 */
(Seen this in src/include/utils/rel.h)
or something like that.
At least it would be good to have explicit separation between descriptions and
explanation of how it works.

And about comment
/*
 * Function calculating bytes to escape
 * to_escape is an array of "special" 1-byte symbols
 * Behvaiour:
 * If there is no "special" symbols, return 0
 * If there are any special symbol, we need initial and final quote, so return
2
 * If there are any quotes, we need to escape all of them and also initial and
final quote, so
 * return 2 + number of quotes
 */

It has some formatting. But it should not.  As far as I understand this
comment should be treated as single paragraph by reformatter, and refomatted.
I do not understand why pg_indent have not done it. Documentation (https://
www.postgresql.org/docs/11/source-format.html) claims that if you want to
avoid reformatting, you should add "-------------" at the beginning and at the
end of the comment. So it would be good either remove this formatting, or add
"----" if you are sure you want to keep it.

Third part is about error messages.
First I've seen errors like elog(ERROR, "internal error during splitting
levels");. I've never seen error like that in postgres. May be if this error
is in the part of the code, that should never be reached in real live, may be
it would be good to change it to Assert? Or if this code can be normally
reached, add some more explanation of what had happened...

About other error messages.

There are messages like
errmsg("syntax error"),
errdetail("Unexpected end of line.")));

or

errmsg("escaping syntax error")));


In postgres there is error message style guide
https://www.postgresql.org/docs/11/error-style-guide.html

Here I would expect messages like that

Primary:    Error parsing ltree path
Detail:     Curly quote is opened and not closed

Or something like that, I am not expert in English, but this way it would be
better for sure. Key points are: Primary message should point to general area
where error occurred (there can be a large SQL with a lot of things in it, so
it is good to point that problem is with ltree staff). And is is better to
word from the point of view of a user. What for you (and me) is unexpected end
of line, for him it is unclosed curly quote.

Also I am not sure, if it is easy, but if it is possible, it would be good to
move "^" symbol that points to the place of the error, to the place inside ' '
where unclosed curly quote is located. As a user I would appreciate that.

So that's what I've seen while giving it superficial look...




--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Filip Rembiałkowski-3
Good day.

Sorry to pop in, but if you are active users of ltree, please let me
know if you rely on the exclamation operator (negative matching)?
I have just filed a patch,  fixing very old bug - and it changes the
logic substantially.
https://commitfest.postgresql.org/23/2054/ - I'd be grateful for your
comments (please use the other thread).




On Thu, Mar 7, 2019 at 1:10 PM Chris Travers <[hidden email]> wrote:

>
>
>
> On Tue, Feb 5, 2019 at 2:27 PM Nikolay Shaplov <[hidden email]> wrote:
>>
>> В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
>> Belyavsky написал:
>>
>> > Please find attached the patch extending the sets of symbols allowed in
>> > ltree labels. The patch introduces 2 variants of escaping symbols, via
>> > backslashing separate symbols and via quoting the labels in whole for
>> > ltree, lquery and ltxtquery datatypes.
>>
>> Hi!
>>
>> Let's me join the review process...
>>
>> I've just give a superficial look to the patch, read it through, and try  here
>> and there, so first I'll give feedback about exterior features of the patch.
>>
>> So, the first question: why do we need all this? The answer seems to be
>> obvious, but it would be good say it explicitly. What problems would it solve?
>> Do these problems really exists?
>
>
> Yes, it does.  We maintain an extension (https://github.com/adjust/wltree) which has a fixed separator (::) and allows any utf-8 character in the tree.
>
> In our case we currently use our extended tree to store user-defined hierarchies of labels, which might contain whitespace, Chinese, Japanese, or Korean characters, etc.
>
> I would love to be able to deprecate our work on this extension and eventually use stock ltree.
>>
>>
>> The second question is about coding style. As far as I can see you've passed
>> your code through pg_indent, but it does not solve all problems.
>>
>> As far as I know, postgres commiters are quite strict about code width, the
>> code should not be wider that 80 col, except places were string constants are
>> introduced (There you can legally exceed 80 col). So I would advice to use vim
>> with  tabstop=4  and colorcolumn=80 and make sure that new code text does not
>> cross red vertical line.
>>
>> Comments. There is no fixed coding style for comments in postgres. Usually this
>> means that it is better to follow coding in the code around. But ltree does
>> not have much comments, so I would suggest to use the best style I've seen in
>> the code
>> /*
>>  * [function name]
>>  * < tab ><tab> [Short description, what does it do]
>>  *
>>  * [long explanation how it works, several subparagraph if needed
>>  */
>> (Seen this in src/include/utils/rel.h)
>> or something like that.
>> At least it would be good to have explicit separation between descriptions and
>> explanation of how it works.
>>
>> And about comment
>> /*
>>  * Function calculating bytes to escape
>>  * to_escape is an array of "special" 1-byte symbols
>>  * Behvaiour:
>>  * If there is no "special" symbols, return 0
>>  * If there are any special symbol, we need initial and final quote, so return
>> 2
>>  * If there are any quotes, we need to escape all of them and also initial and
>> final quote, so
>>  * return 2 + number of quotes
>>  */
>>
>> It has some formatting. But it should not.  As far as I understand this
>> comment should be treated as single paragraph by reformatter, and refomatted.
>> I do not understand why pg_indent have not done it. Documentation (https://
>> www.postgresql.org/docs/11/source-format.html) claims that if you want to
>> avoid reformatting, you should add "-------------" at the beginning and at the
>> end of the comment. So it would be good either remove this formatting, or add
>> "----" if you are sure you want to keep it.
>>
>> Third part is about error messages.
>> First I've seen errors like elog(ERROR, "internal error during splitting
>> levels");. I've never seen error like that in postgres. May be if this error
>> is in the part of the code, that should never be reached in real live, may be
>> it would be good to change it to Assert? Or if this code can be normally
>> reached, add some more explanation of what had happened...
>>
>> About other error messages.
>>
>> There are messages like
>> errmsg("syntax error"),
>> errdetail("Unexpected end of line.")));
>>
>> or
>>
>> errmsg("escaping syntax error")));
>>
>>
>> In postgres there is error message style guide
>> https://www.postgresql.org/docs/11/error-style-guide.html
>>
>> Here I would expect messages like that
>>
>> Primary:    Error parsing ltree path
>> Detail:     Curly quote is opened and not closed
>>
>> Or something like that, I am not expert in English, but this way it would be
>> better for sure. Key points are: Primary message should point to general area
>> where error occurred (there can be a large SQL with a lot of things in it, so
>> it is good to point that problem is with ltree staff). And is is better to
>> word from the point of view of a user. What for you (and me) is unexpected end
>> of line, for him it is unclosed curly quote.
>>
>> Also I am not sure, if it is easy, but if it is possible, it would be good to
>> move "^" symbol that points to the place of the error, to the place inside ' '
>> where unclosed curly quote is located. As a user I would appreciate that.
>>
>> So that's what I've seen while giving it superficial look...
>>
>>
>
>
> --
> Best Regards,
> Chris Travers
> Head of Database
>
> Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
> Saarbrücker Straße 37a, 10405 Berlin
>

Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Nikolay Shaplov
In reply to this post by Chris Travers-7
В письме от четверг, 7 марта 2019 г. 13:09:49 MSK пользователь Chris Travers
написал:

>  We maintain an extension (https://github.com/adjust/wltree)
> which has a fixed separator (::) and allows any utf-8 character in the tree.
>
> In our case we currently use our extended tree to store user-defined
> hierarchies of labels, which might contain whitespace, Chinese, Japanese,
> or Korean characters, etc.
>
> I would love to be able to deprecate our work on this extension and
> eventually use stock ltree.
I am afraid, that if would not be possible to have ltree with custom
separator. Or we need to pass a very long way to reach there.

To have custom separator we will need some place to store it.

We can use GUC variable, and set separator for ltree globally. It might solve
some of the problems. But will get some more. If for example we dump database,
and then restore it on instance where that GUC option is not set, everything
will be brocken.

It is not opclass option (that are discussed  here on the list), because
opclass options is no about parsing, but about comparing things that was
already parsed.

It is not option of an attribute (if we can have custom attribute option),
because we can have ltree as a variable, without creating an attribute...

It should be an option of ltree type itself. I have no idea how we can
implement this. And who will allow us to commit such strange thing ;-)


Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Nikolay Shaplov
In reply to this post by Dmitry Belyavsky
В письме от воскресенье, 24 февраля 2019 г. 14:31:55 MSK пользователь Dmitry
Belyavsky написал:

Hi! Am back here again.

I've been thinking about this patch a while... Come to some conclusions and
wrote some examples...

First I came to idea that the best way to simplify the code is change the
state machine from if to switch/case. Because in your code a lot of repetition
is done just because you can't say "Thats all, let's go to next symbol" in any
place in the code. Now it should follow all the branches of if-else tree that
is inside the state-machine "node" to get to the next symbol.

To show how simpler the things would be I changed the state machine processing
in lquery_in form if to switch/case, and changed the code for LQPRS_WAITLEVEL
state processing, removing duplicate code, using "break" wherever it is
possible

(The indention in this example is unproper to make diff more clear)

so from that much of code
=================
                if (state == LQPRS_WAITLEVEL)
                {
                        if (t_isspace(ptr))
                        {
                                ptr += charlen;
                                pos++;
                                continue;
                        }

                        escaped_count = 0;
                        real_levels++;
                        if (charlen == 1)
                        {
                                if (t_iseq(ptr, '!'))
                                {
                                        GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
                                        lptr->start = ptr + 1;
                                        state = LQPRS_WAITDELIM;
                                        curqlevel->numvar = 1;
                                        curqlevel->flag |= LQL_NOT;
                                        hasnot = true;
                                }
                                else if (t_iseq(ptr, '*'))
                                        state = LQPRS_WAITOPEN;
                                else if (t_iseq(ptr, '\\'))
                                {
                                        GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
                                        lptr->start = ptr;
                                        curqlevel->numvar = 1;
                                        state = LQPRS_WAITESCAPED;
                                }
                                else if (strchr(".|@%{}", *ptr))
                                {
                                        UNCHAR;
                                }
                                else
                                {
                                        GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
                                        lptr->start = ptr;
                                        state = LQPRS_WAITDELIM;
                                        curqlevel->numvar = 1;
                                        if (t_iseq(ptr, '"'))
                                        {
                                                lptr->flag |= LVAR_QUOTEDPART;
                                        }
                                }
                        }
                        else
                        {
                                GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) *
numOR);
                                lptr->start = ptr;
                                state = LQPRS_WAITDELIM;
                                curqlevel->numvar = 1;
                        }
                }
=====================
I came to this
=====================
                 case LQPRS_WAITLEVEL:
                        if (t_isspace(ptr))
                                break; /* Just go to next symbol */
                        escaped_count = 0;
                        real_levels++;

                        if (charlen == 1)
                        {
                                if (strchr(".|@%{}", *ptr))
                                        UNCHAR;
                                if (t_iseq(ptr, '*'))
                                {
                                        state = LQPRS_WAITOPEN;
                                        break;
                                }
                        }
                        GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) *
numOR);
                        lptr->start = ptr;
                        curqlevel->numvar = 1;
                        state = LQPRS_WAITDELIM;
                        if (charlen == 1)
                        {
                                if (t_iseq(ptr, '\\'))
                                {
                                        state = LQPRS_WAITESCAPED;
                                        break;
                                }
                                if (t_iseq(ptr, '!'))
                                {
                                        lptr->start += 1 /*FIXME explain why */;
                                        curqlevel->flag |= LQL_NOT;
                                        hasnot = true;
                                }
                                else if (t_iseq(ptr, '"'))
                                {
                                        lptr->flag |= LVAR_QUOTEDPART;
                                }
                        }
                        break;
=======
And this code is much more readable.... You can just read it aloud:
-Spaces we just skip
- on .|@%{} throws and exception
- for asterisk change state and nothing else
then for others allocate a buffer
- for slash we just set a flag
- for exclamation mark we set a flag and skip it
- for double quote set a flag.

All the logic are clear. And in your version of the code it is difficult to get
the idea from the code that simple.

So my suggestion is following:

1. Submit and commit the patch that changes state machines from ifs to switch/
cases, and nothing else. This will reindent the code, so in order to keep
further changes visible, we should change nothing else.

2. Change your code to switch/cases, use break to deduplicate code wherever is
possible, and commit it.

I can join you while working with this (but then I am not sure I would be able
to remain a reviewer...)

I've also attached an example I've quoted above, formatted as a diff, so it
would be easy to check how does it work. It should be applied after your
patch.
(I gave it a strange name ltree.not-a-patch hoping that commitfest software
will not treat it a new version of a patch)

ltree.not-a-patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Dmitry Belyavsky
Dear Nikolay,

Many thanks for your efforts!

On Sat, Apr 6, 2019 at 2:29 PM Nikolay Shaplov <[hidden email]> wrote:
В письме от воскресенье, 24 февраля 2019 г. 14:31:55 MSK пользователь Dmitry
Belyavsky написал:

Hi! Am back here again.

I've been thinking about this patch a while... Come to some conclusions and
wrote some examples...

First I came to idea that the best way to simplify the code is change the
state machine from if to switch/case. Because in your code a lot of repetition
is done just because you can't say "Thats all, let's go to next symbol" in any
place in the code. Now it should follow all the branches of if-else tree that
is inside the state-machine "node" to get to the next symbol.

To show how simpler the things would be I changed the state machine processing
in lquery_in form if to switch/case, and changed the code for LQPRS_WAITLEVEL
state processing, removing duplicate code, using "break" wherever it is
possible

(The indention in this example is unproper to make diff more clear)

so from that much of code
=================
                if (state == LQPRS_WAITLEVEL)
                {
                        if (t_isspace(ptr))
                        {
                                ptr += charlen;
                                pos++;
                                continue;
                        }

                        escaped_count = 0;
                        real_levels++;
                        if (charlen == 1)
                        {
                                if (t_iseq(ptr, '!'))
                                {
                                        GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
                                        lptr->start = ptr + 1;
                                        state = LQPRS_WAITDELIM;
                                        curqlevel->numvar = 1;
                                        curqlevel->flag |= LQL_NOT;
                                        hasnot = true;
                                }
                                else if (t_iseq(ptr, '*'))
                                        state = LQPRS_WAITOPEN;
                                else if (t_iseq(ptr, '\\'))
                                {
                                        GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
                                        lptr->start = ptr;
                                        curqlevel->numvar = 1;
                                        state = LQPRS_WAITESCAPED;
                                }
                                else if (strchr(".|@%{}", *ptr))
                                {
                                        UNCHAR;
                                }
                                else
                                {
                                        GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
                                        lptr->start = ptr;
                                        state = LQPRS_WAITDELIM;
                                        curqlevel->numvar = 1;
                                        if (t_iseq(ptr, '"'))
                                        {
                                                lptr->flag |= LVAR_QUOTEDPART;
                                        }
                                }
                        }
                        else
                        {
                                GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) *
numOR);
                                lptr->start = ptr;
                                state = LQPRS_WAITDELIM;
                                curqlevel->numvar = 1;
                        }
                }
=====================
I came to this
=====================
                 case LQPRS_WAITLEVEL:
                        if (t_isspace(ptr))
                                break; /* Just go to next symbol */
                        escaped_count = 0;
                        real_levels++;

                        if (charlen == 1)
                        {
                                if (strchr(".|@%{}", *ptr))
                                        UNCHAR;
                                if (t_iseq(ptr, '*'))
                                {
                                        state = LQPRS_WAITOPEN;
                                        break;
                                }
                        }
                        GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) *
numOR);
                        lptr->start = ptr;
                        curqlevel->numvar = 1;
                        state = LQPRS_WAITDELIM;
                        if (charlen == 1)
                        {
                                if (t_iseq(ptr, '\\'))
                                {
                                        state = LQPRS_WAITESCAPED;
                                        break;
                                }
                                if (t_iseq(ptr, '!'))
                                {
                                        lptr->start += 1 /*FIXME explain why */;
                                        curqlevel->flag |= LQL_NOT;
                                        hasnot = true;
                                }
                                else if (t_iseq(ptr, '"'))
                                {
                                        lptr->flag |= LVAR_QUOTEDPART;
                                }
                        }
                        break;
=======
And this code is much more readable.... You can just read it aloud:
-Spaces we just skip
- on .|@%{} throws and exception
- for asterisk change state and nothing else
then for others allocate a buffer
- for slash we just set a flag
- for exclamation mark we set a flag and skip it
- for double quote set a flag.

All the logic are clear. And in your version of the code it is difficult to get
the idea from the code that simple.

So my suggestion is following:

1. Submit and commit the patch that changes state machines from ifs to switch/
cases, and nothing else. This will reindent the code, so in order to keep
further changes visible, we should change nothing else.

2. Change your code to switch/cases, use break to deduplicate code wherever is
possible, and commit it.

I can join you while working with this (but then I am not sure I would be able
to remain a reviewer...)

I've also attached an example I've quoted above, formatted as a diff, so it
would be easy to check how does it work. It should be applied after your
patch.
(I gave it a strange name ltree.not-a-patch hoping that commitfest software
will not treat it a new version of a patch)

I've applied your patch. 
From my point of view, there is no major difference between case and chain if here.
Neither case nor ifs allow extracting the common code to separate function - just because there seem to be no identical pieces of code. 

So yes, your patch makes sense, but I do not see any advantages of applying it.

--
SY, Dmitry Belyavsky
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Thomas Munro-5
On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky <[hidden email]> wrote:
> I've applied your patch.
> From my point of view, there is no major difference between case and chain if here.
> Neither case nor ifs allow extracting the common code to separate function - just because there seem to be no identical pieces of code.

Hi Dmitry,

The documentation doesn't build[1], due to invalid XML.  Since I'm
here, here is some proof-reading of the English in the documentation:

   <para>
-   A <firstterm>label</firstterm> is a sequence of alphanumeric characters
-   and underscores (for example, in C locale the characters
-   <literal>A-Za-z0-9_</literal> are allowed).  Labels must be less
than 256 bytes
-   long.
+   A <firstterm>label</firstterm> is a sequence of characters. Labels must be
+   less than 256 symbols long. Label may contain any character
supported by Postgres

"fewer than 256 characters in length", and
"<productname>PostgreSQL</productname>"

+   except <literal>\0</literal>. If label contains spaces, dots,
lquery modifiers,

"spaces, dots or lquery modifiers,"

+   they may be <firstterm>escaped</firstterm>. Escaping can be done
either by preceeding
+   backslash (<literal>\\</literal>) symbol, or by wrapping the label
in whole in double
+   quotes (<literal>"</literal>). Initial and final unescaped
whitespace is stripped.

"Escaping can be done with either a preceding backslash [...] or by
wrapping the whole label in double quotes [...]."

   </para>

+    During converting text into internal representations, wrapping
double quotes

"During conversion to internal representation, "

+    and escaping backslashes are removed. During converting internal
+    representations into text, if the label does not contain any special

"During conversion from internal representation to text, "

+    symbols, it is printed as is. Otherwise, it is wrapped in quotes and, if
+    there are internal quotes, they are escaped with backslash. The
list of special

"escaped with backslashes."

+  <para>
+    Examples: <literal>42</literal>, <literal>"\\42"</literal>,
+    <literal>\\4\\2</literal>, <literal> 42 </literal> and <literal> "42"
+    </literal> will have the similar internal representation and, being

"will all have the same internal representation and,"

+    converted from internal representation, will become <literal>42</literal>.
+    Literal <literal>abc def</literal> will turn into <literal>"abc
+    def"</literal>.
   </para>

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/555571856

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Dmitry Belyavsky
Dear Thomas,

Thank you for your proofreading!

Please find the updated patch attached. It also contains the missing escaping.

On Mon, Jul 8, 2019 at 10:39 AM Thomas Munro <[hidden email]> wrote:
On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky <[hidden email]> wrote:
> I've applied your patch.
> From my point of view, there is no major difference between case and chain if here.
> Neither case nor ifs allow extracting the common code to separate function - just because there seem to be no identical pieces of code.

Hi Dmitry,

The documentation doesn't build[1], due to invalid XML.  Since I'm
here, here is some proof-reading of the English in the documentation:

   <para>
-   A <firstterm>label</firstterm> is a sequence of alphanumeric characters
-   and underscores (for example, in C locale the characters
-   <literal>A-Za-z0-9_</literal> are allowed).  Labels must be less
than 256 bytes
-   long.
+   A <firstterm>label</firstterm> is a sequence of characters. Labels must be
+   less than 256 symbols long. Label may contain any character
supported by Postgres

"fewer than 256 characters in length", and
"<productname>PostgreSQL</productname>"

+   except <literal>\0</literal>. If label contains spaces, dots,
lquery modifiers,

"spaces, dots or lquery modifiers,"

+   they may be <firstterm>escaped</firstterm>. Escaping can be done
either by preceeding
+   backslash (<literal>\\</literal>) symbol, or by wrapping the label
in whole in double
+   quotes (<literal>"</literal>). Initial and final unescaped
whitespace is stripped.

"Escaping can be done with either a preceding backslash [...] or by
wrapping the whole label in double quotes [...]."

   </para>

+    During converting text into internal representations, wrapping
double quotes

"During conversion to internal representation, "

+    and escaping backslashes are removed. During converting internal
+    representations into text, if the label does not contain any special

"During conversion from internal representation to text, "

+    symbols, it is printed as is. Otherwise, it is wrapped in quotes and, if
+    there are internal quotes, they are escaped with backslash. The
list of special

"escaped with backslashes."

+  <para>
+    Examples: <literal>42</literal>, <literal>"\\42"</literal>,
+    <literal>\\4\\2</literal>, <literal> 42 </literal> and <literal> "42"
+    </literal> will have the similar internal representation and, being

"will all have the same internal representation and,"

+    converted from internal representation, will become <literal>42</literal>.
+    Literal <literal>abc def</literal> will turn into <literal>"abc
+    def"</literal>.
   </para>

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/555571856

--
Thomas Munro
https://enterprisedb.com


--
SY, Dmitry Belyavsky

ltree_20190708.diff (105K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Alvaro Herrera-9
On 2019-Jul-08, Dmitry Belyavsky wrote:

> Dear Thomas,
>
> Thank you for your proofreading!
>
> Please find the updated patch attached. It also contains the missing
> escaping.

I think all these functions you're adding should have a short sentence
explaining what it does.

I'm not really convinced that we need this much testing.  It seems a bit
excessive.  Many of these very focused test SQL lines could be removed
with no loss of coverage, and many of the others could be grouped into
one.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Dmitry Belyavsky
Dear Alvaro,

On Mon, Jul 8, 2019 at 11:16 PM Alvaro Herrera <[hidden email]> wrote:
On 2019-Jul-08, Dmitry Belyavsky wrote:

> Dear Thomas,
>
> Thank you for your proofreading!
>
> Please find the updated patch attached. It also contains the missing
> escaping.

I think all these functions you're adding should have a short sentence
explaining what it does.

I'm not really convinced that we need this much testing.  It seems a bit
excessive.  Many of these very focused test SQL lines could be removed
with no loss of coverage, and many of the others could be grouped into
one.

I did not introduce any functions. I've just changed the parser.
I'm not sure that it makes sense to remove any tests as most of them were written to catch really happened bugs during the implementation.

--
SY, Dmitry Belyavsky
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Alvaro Herrera-9
On 2019-Jul-08, Dmitry Belyavsky wrote:

> I did not introduce any functions. I've just changed the parser.

I mean the C-level functions -- count_parts_ors() and so on.

> I'm not sure that it makes sense to remove any tests as most of them were
> written to catch really happened bugs during the implementation.

Well, I don't mean to decrease the coverage, only to condense a lot of
little tests in a small powerful test.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Dmitry Belyavsky


On Mon, Jul 8, 2019 at 11:33 PM Alvaro Herrera <[hidden email]> wrote:
On 2019-Jul-08, Dmitry Belyavsky wrote:

> I did not introduce any functions. I've just changed the parser.

I mean the C-level functions -- count_parts_ors() and so on.

Added a comment to  count_parts_ors()

The other functions introduced by me were already described.

--
SY, Dmitry Belyavsky

ltree_20190709.diff (107K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Thomas Munro-5
On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky <[hidden email]> wrote:
> [ltree_20190709.diff]

Hi Dmitry,

You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
otherwise check-world fails when built with Python support.  The good
news is that it looks like it fails because you fixed something!
(Though I didn't check the details).

 CREATE FUNCTION test2() RETURNS ltree
 LANGUAGE plpythonu
 TRANSFORM FOR TYPE ltree
 AS $$
 return ['foo', 'bar', 'baz']
 $$;
 -- plpython to ltree is not yet implemented, so this will fail,
 -- because it will try to parse the Python list as an ltree input
 -- string.
 SELECT test2();
-ERROR:  syntax error at position 0
-CONTEXT:  while creating return value
-PL/Python function "test2"
+          test2
+-------------------------
+ "['foo', 'bar', 'baz']"
+(1 row)
+

--
Thomas Munro
https://enterprisedb.com


Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Dmitry Belyavsky
Dear Thomas,

On Thu, Jul 11, 2019 at 11:20 AM Thomas Munro <[hidden email]> wrote:
On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky <[hidden email]> wrote:
> [ltree_20190709.diff]

Hi Dmitry,

You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
otherwise check-world fails when built with Python support.  The good
news is that it looks like it fails because you fixed something!
(Though I didn't check the details).

 CREATE FUNCTION test2() RETURNS ltree
 LANGUAGE plpythonu
 TRANSFORM FOR TYPE ltree
 AS $$
 return ['foo', 'bar', 'baz']
 $$;
 -- plpython to ltree is not yet implemented, so this will fail,
 -- because it will try to parse the Python list as an ltree input
 -- string.
 SELECT test2();
-ERROR:  syntax error at position 0
-CONTEXT:  while creating return value
-PL/Python function "test2"
+          test2
+-------------------------
+ "['foo', 'bar', 'baz']"
+(1 row)
+


See attached. I'm not familiar enough with python so I just removed the failing tests.
If the main patch is accepted, the ltree_python extension should be redesigned, I think...

--
SY, Dmitry Belyavsky

ltree_20190711.diff (109K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Ltree syntax improvement

Nikita Glukhov
Hi!

I have looked at the patch and found some problems.


1. I fixed some bugs (fixed patch with additional test cases is attached):

-- NULL 'lptr' pointer dereference at lquery_in()
=# SELECT '*'::lquery;
-- crash

-- '|' after '*{n}' is wrongly handled (LQPRS_WAITEND state)
=# SELECT '*{1}|2'::lquery;
WARNING:  problem in alloc set MessageContext: req size > alloc size for chunk 0x1c2d508 in block 0x1c2bc58
WARNING:  problem in alloc set MessageContext: req size > alloc size for chunk 0x1c2d508 in block 0x1c2bc58
   lquery    
-------------
 *{1}.*{,48}
(1 row)

-- wrong handling of trailing whitespace
=# SELECT '"a" '::ltree;
ERROR:  name of level is empty
LINE 1: SELECT '"a" '::ltree;
               ^
DETAIL:  Name length is 0 in position 4.


=# SELECT '"a" '::lquery;
ERROR:  name of level is empty
LINE 1: SELECT '"a" '::lquery;
               ^
DETAIL:  Name length is 0 in position 4.


-- backslashes are not escaped in ltree_out()/lquery_out(),
-- which is not consistent with ltree_in()/lquery_in()
=# SELECT '"\\"'::ltree;
 ltree 
-------
 "\"
(1 row)

=# SELECT '"\\"'::lquery;
 lquery 
--------
 "\"
(1 row)

=# SELECT '"\\"'::ltree::text::ltree;
ERROR:  syntax error
DETAIL:  Unexpected end of line.

=# SELECT '"\\"'::lquery::text::lquery;
ERROR:  syntax error
DETAIL:  Unexpected end of line.



2. There are inconsistencies in whitespace handling before and after *, @, %, {}
(I have not fixed this because I'm not sure how it is supposed to work):

-- whitespace before '*' is not ignored
=# SELECT '"a" *'::lquery;
 lquery 
--------
 "a\""*
(1 row)

=# SELECT 'a *'::lquery;
 lquery 
--------
 "a "*
(1 row)

-- whitespace after '*' and '{}' is disallowed
=# SELECT 'a* .b'::lquery;
ERROR:  syntax error at position 2
LINE 1: SELECT 'a* .b'::lquery;
               ^

=# SELECT 'a* |b'::lquery;
ERROR:  syntax error at position 2
LINE 1: SELECT 'a* |b'::lquery;
               ^

=# SELECT '*{1} .a'::lquery;
ERROR:  syntax error at position 4
LINE 1: SELECT '*{1} .a'::lquery;
               ^

-- but the whitespace after levels without '*@%{}' is allowed
=# SELECT 'a |b'::lquery;
 lquery 
--------
 a|b
(1 row)



3. Empty level names between '!' and '|' are allowed.  This behavior can be
seen on master, so it seems that we cannot fix it now:

-- master
=# SELECT '!|a'::lquery;
 lquery 
--------
 !|a
(1 row)

-- patched
=# SELECT '!|a'::lquery;
 lquery 
--------
 !""|a
(1 row)

-- empty level names in other places are disallowed
=# SELECT '!a|'::lquery;
ERROR:  syntax error
LINE 1: SELECT '!a|'::lquery;
               ^
DETAIL:  Unexpected end of line.

=# SELECT '|a'::lquery;
ERROR:  syntax error at position 0
LINE 1: SELECT '|a'::lquery;
               ^


4. It looks strange to me that leading and trailing unquoted whitespace is
ignored, but the internal whitespace is treated like a quoted:

=# SELECT ' a b   .  c     d     '::lquery;
     lquery      
-----------------
 "a b"."c     d"
(1 row)

I would prefer unquoted unescaped whitespace to be a delimiter always.


5. It seems wrong to me that ltree and lquery have different special character
sets now.  This leads to the fact that arbitrary ltree text cannot be used
directly as lquery text, as it seemed to be before the syntax improvements:

=# SELECT 'a|b'::ltree::text::lquery;
 lquery 
--------
 a|b
(1 row)

=# SELECT '"a|b"'::ltree::text::lquery;
 lquery 
--------
 a|b
(1 row)

=# SELECT '"a|b"'::lquery;
 lquery 
--------
 "a|b"
(1 row)

There might not be a problem if we had ltree::lquery cast.

Also I think that text[]::ltree/ltree::text[] casts for ltree
construction/deconstruction from text level names can be very useful.


6. ltree and escpecially lquery parsing code still look too complicated for me,
and I believe that the bugs described above are a direct consequence of this.
So the code needs to be refactored, maybe even without using of state machines.



On 11.07.2019 20:49, Dmitry Belyavsky wrote:

On Thu, Jul 11, 2019 at 11:20 AM Thomas Munro <[hidden email]> wrote:
On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky <[hidden email]> wrote:
> [ltree_20190709.diff]

You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
otherwise check-world fails when built with Python support.  The good
news is that it looks like it fails because you fixed something!
(Though I didn't check the details).

 CREATE FUNCTION test2() RETURNS ltree
 LANGUAGE plpythonu
 TRANSFORM FOR TYPE ltree
 AS $$
 return ['foo', 'bar', 'baz']
 $$;
 -- plpython to ltree is not yet implemented, so this will fail,
 -- because it will try to parse the Python list as an ltree input
 -- string.
 SELECT test2();
-ERROR:  syntax error at position 0
-CONTEXT:  while creating return value
-PL/Python function "test2"
+          test2
+-------------------------
+ "['foo', 'bar', 'baz']"
+(1 row)
+


See attached. I'm not familiar enough with python so I just removed the failing tests.
If the main patch is accepted, the ltree_python extension should be redesigned, I think...

7. ltree_plpython test does not fail now because Python list is converted to a
text and then to a ltree, and the textual representation of a Python list has
become a valid ltree text:

SELECT $$['foo', 'bar', 'baz']$$::ltree;
          ltree          
-------------------------
 "['foo', 'bar', 'baz']"
(1 row)

So Python lists can be now successfully converted to ltrees without a transform,
but the result is not that is expected ('foo.bar.baz'::ltree).


-- 
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

0001-Ltree-syntax-improvements-20190717.patch (84K) Download Attachment