to_date() validation

Previous Topic Next Topic
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view

to_date() validation

Brendan Jurd
Hi all,

Well, it's been a long time coming, but I've finally got a patch to
improve the error handling of to_date() and to_timestamp(), as
previously discussed on -hackers. [1][2]

The patch is against HEAD, and compiles cleanly and passes all
regression tests on gentoo amd64.

I did some testing to see whether the extra error checks had an
adverse effect on performance.  It turns out that my patch actually
performs materially *better*, somehow.  About 14.5% better over 1M
calls to to_date in a single statement.  Go figure (C compiler
optimisations are completely arcane to me).

What the patch does

With this patch I've tried to make DCH_from_char a bit smarter with
regard to parsing and validation of the user input.  The goal here is
that if the user makes a mistake and enters something invalid,
to_date() will throw an meaningful error instead of just acting like
everything is okay and then producing an absurd answer.

I've modified the code to error out usefully in a few different scenarios:

 * Non-integer value for an integer field
 * Out-of-range value for an integer field
 * Not enough characters in the input string to provide for all the
format fields
 * Two conflicting values given for a field
 * Illegal mixture of date conventions (ISO week date and Gregorian)

I've included new regression tests to check that these errors are
occurring when appropriate.

How I went about it

In order to detect an illegal mixture of Gregorian and ISO week date
conventions, the code needs to know which fomatting fields are
Gregorian, which are ISO WD and which are date-convention-neutral.  To
get this working I added a new enum called FromCharDateMode, and added
a field to both KeyWord and TmFromChar to hold the date mode.  In
KeyWord, it tells you which convention the keyword belongs to (YYYY is
Gregorian, IYYY is ISO WD and HH24 has nothing to do with dates).  In
TmFromChar it keeps track of which convention is currently in use.

Because TmFromChar now tracks the date mode independently, it is not
necessary to maintain separate fields for the ISO WD values, so
'iyear' and 'iw' are removed.

In order to get some consistent treatment on pulling integers out of a
string and stuffing them into a TmFromChar, I had to refactor
DCH_from_char somewhat.  The function is basically a giant switch
statement, with branches for each of the formatting fields.  In HEAD
these branches are highly duplicative.  The integer fields all perform
a sscanf() on the source string, putting the result straight into the
TmFromChar struct, and then skipping over the number of characters
consumed by sscanf().  I moved this logic into functions:

 * from_char_parse_int() uses strtol() rather than sscanf() to parse
an integer from the string, and produces meaningful errors if
something is wrong with the input,
 * from_char_set_int() saves an integer into a TmFromChar field, but
throws an error if the field has already been set to some other
non-zero value.

What the patch doesn't do

It doesn't consider to_number() at all.  I don't have any interest in
the number formatting stuff.

There is plenty of room to educate do_to_timestamp() further about
dates; it's still very naive.  Currently you can do something like
to_date('2008-50', 'YYYY-MM') and it won't even bat an eyelid.  It'll
just advance the date by 50 months.  I suppose you could consider that
a bug or a feature, depending on your point of view.  Personally, I
think of it as a bug.  If you're giving a month value outside of 1-12
to to_date(), chances are good that it's a user error in the query,
not a deliberate attempt to perform some date arithmetic.

do_to_timestamp() also doesn't try to detect logical mistakes like
'YYYY-MM-DD DDD' where the DDD part conflicts with the MM-DD part.  I
have some thoughts about how I would improve on that, but I think it's
best left for a separate patch.

The patch has been added to the September commitfest.  Thanks for your time.



Sent via pgsql-patches mailing list ([hidden email])
To make changes to your subscription:

to-date-validation-1.diff.gz (12K) Download Attachment
Reply | Threaded
Open this post in threaded view

Re: to_date() validation

Alex Hunsaker
On Fri, Aug 29, 2008 at 7:39 PM, Brendan Jurd <[hidden email]> wrote:
> Hi all,
> Well, it's been a long time coming, but I've finally got a patch to
> improve the error handling of to_date() and to_timestamp(), as
> previously discussed on -hackers. [1][2]

BTW -patches is obsolete just submit pathces to -hackers.

Im just following this: so lets get started.

Submission review: Everything looks good. Tests seem reasonable patch
applies etc.

Usability review: I think both linked threads in the parent mail give
sufficient justification.

Feature test: Everything seems to work as advertised. However before
via sscanf() most limited the max length of the input.  Before they
were just silently ignored now they get added to the result:

# SELECT to_timestamp('11111', 'HHMM');
 0009-03-19 11:00:00-06:59:56

# SELECT to_timestamp('11111', 'HHMM');
 0001-11-01 11:00:00-07 BC

Performance review: simple pgbench of to_timestamp did not show any
noticeable differences

Coding review: seemed fine to me, code matched surrounding style,
comments made sense etc

Code review: one minor nit
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
*** 781,787 **** static const KeyWord DCH_keywords[] = {

  /* last */
! {NULL, 0, 0, 0}

  /* ----------
--- 781,787 ----

  /* last */
! {NULL, 0, 0, 0, 0}

  /* ----------

Sent via pgsql-patches mailing list ([hidden email])
To make changes to your subscription: