Quantcast

Palle Girgensohn's ICU patch

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
37 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Palle

> 26 nov 2014 kl. 15:56 skrev Neil Tiffin <[hidden email]>:
>
>
> On Nov 26, 2014, at 8:21 AM, Greg Stark <[hidden email]> wrote:
>
>> I find it hard to believe the original premise of this thread. We knew
>> there were some problems with OSX and FreeBSD but surely they can't be
>> completely broken?
>
> Ever tried to use Spotlight for searching (English) on the Mac, not completely broken, just not reliable.  This does not surprise me in the least for OSX.  The Mac has, in recent history, become a “looks good", but the details may or may not be really correct platform.
>
> I thought FreeBSD was a preferred OS for PostgreSQL?  This does surprise me.

It works fine if you use the English language, or if you don't use utf-8. And it works fine with utf-8 if you don't care about "real world sorting", or if you do the sorting in your application anyway (most OS:es collations are really broken for non-english locales anyway).

So for a great number of people, it works great. For the rest of us, well, I use ICU... :)

>
>> What happens if you run "ls" with your locale set
>> to something like fr_FR.UTF8 ? Does Apple not sell Macs in countries
>> other than the US?
>
> Neil
> Daily Mac user for a long time.
>
> --
> Sent via pgsql-packagers mailing list ([hidden email])
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-packagers



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Palle
In reply to this post by Palle

> 26 nov 2014 kl. 14:06 skrev Palle Girgensohn <[hidden email]>:
>
> Well, this discussion actually pushes the priority quite a bit for me -- someone else actually beeing interested about the patch... I thought it was just me... :)=

By "pushes the priority", I mean it gets more prioritized, in case that was unclear. :)

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Geoff Montee
On Wed, Nov 26, 2014 at 10:17 AM, Palle Girgensohn <[hidden email]> wrote:
>
>
> > 26 nov 2014 kl. 14:06 skrev Palle Girgensohn <[hidden email]>:
> >
> > Well, this discussion actually pushes the priority quite a bit for me -- someone else actually beeing interested about the patch... I thought it was just me... :)=
>
> By "pushes the priority", I mean it gets more prioritized, in case that was unclear. :)

This topic reminds me of a thread from a couple months ago:

http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@...

It sounds like adding ICU support to core may also allow for adding
collation versioning to indexes.

Geoff


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Peter Geoghegan-3
In reply to this post by Greg Stark
On Wed, Nov 26, 2014 at 6:21 AM, Greg Stark <[hidden email]> wrote:
> There were a number of problems with using ICU including the large
> dependency and the limitations of the iterator model but the main
> issue was that it's fundamentally a choice between being consistent
> with every other application on your system and being consistent with
> other Postgres databases running on other OSes. Most people run
> multiple applications on one OS, not many databases on many OSes on
> their own with no other applications. If Postgres used ICU then its
> output would be inconsistent with things like "sort" or "ls" or your
> application programming language's comparison operators.

Unless your application programming language is written in Java, as many are.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Peter Geoghegan-3
In reply to this post by Dave Page-3
On Wed, Nov 26, 2014 at 2:05 AM, Dave Page <[hidden email]> wrote:
> You may want to bear in mind that postgres.app is on the main PG
> downloads page on the website. If you're patching Postgres to add a
> feature like this, it would become a fork and would have to be moved
> out of the "PostgreSQL Core Distribution" section of the download area
> as we only include "pure" distributions there.

Doesn't the existing FreeBSD link go to the ports collection? And
doesn't the PostgreSQL package automatically use this very ICU patch?

It seems like the FreeBSD people were working around their poor OS
locale support here. While I think we should officially adopt ICU, it
seems a little unfair to call what they've done a fork.
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Peter Eisentraut-2
On 11/26/14 12:46 PM, Peter Geoghegan wrote:

> On Wed, Nov 26, 2014 at 2:05 AM, Dave Page <[hidden email]> wrote:
>> You may want to bear in mind that postgres.app is on the main PG
>> downloads page on the website. If you're patching Postgres to add a
>> feature like this, it would become a fork and would have to be moved
>> out of the "PostgreSQL Core Distribution" section of the download area
>> as we only include "pure" distributions there.
>
> Doesn't the existing FreeBSD link go to the ports collection? And
> doesn't the PostgreSQL package automatically use this very ICU patch?
>
> It seems like the FreeBSD people were working around their poor OS
> locale support here. While I think we should officially adopt ICU, it
> seems a little unfair to call what they've done a fork.

I would welcome the addition of support for ICU and possibly other
locale libraries.  The features were designed with that in mind.

But I think what is being proposed here needs to be reigned in from time
to time.  Search the archives at various times for "debian", "gentoo",
or even "mandrake" for examples of what can happen when this goes too far.

It's a sliding scale.  FreeBSD ports are notionally a build-from-source
system targeted as experts.  Someone who installs a port has a chance to
look at the port definition and learn what will be installed.  (A build
option and a more explicit warning might be nice.)  Postgres.app is a
binary distribution apparently targeted at inexperienced or casual users
at a much bigger scale.  Users won't have an option to learn about this
unofficial feature or a chance to disable it.  Also, Postgres.app is not
the only distribution for this platform, so this could create a lot of
confusion.

It's open source, and I don't want to discourage people from
experimenting and sharing.  But I'm with Dave: listing a distribution
among the primary download options should imply that the software is as
pristine as possible.



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Palle

> 26 nov 2014 kl. 20:42 skrev Peter Eisentraut <[hidden email]>:
>
>  (A build
> option and a more explicit warning might be nice.)

In the freebsd ports, it is an option, default is off. :-)


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Peter Eisentraut-2
On 11/26/14 3:42 PM, Palle Girgensohn wrote:
>
>> 26 nov 2014 kl. 20:42 skrev Peter Eisentraut <[hidden email]>:
>>
>>  (A build
>> option and a more explicit warning might be nice.)
>
> In the freebsd ports, it is an option, default is off. :-)

That's even better.

Sorry, I looked at the port sources and couldn't identify that it was an
option.




--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Jakob Egger-2
In reply to this post by Geoff Montee
Am 26.11.2014 um 17:46 schrieb Geoff Montee <[hidden email]>:
> This topic reminds me of a thread from a couple months ago:
>
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@...
>
> It sounds like adding ICU support to core may also allow for adding
> collation versioning to indexes.

Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)

I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.

MINIMUM TODO
============

* Add support for per-column collations in varstr_comp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()

* There's a new feature in trunk for faster sorting using SortSupport, so we would also need to also patch bttextfastcmp_locale() in varlena.c

These two changes would allow using ICU for collation. This has two major advantages:
1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function


NEXT STEPS: Support for more collations
=======================================

ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook". Adding support for these is a bit more involved.

* initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.

* A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in initdb to support collations provided by ICU

* pg_perm_setlocale() must get a special case to handle ICU collations

* the local handling code in pgperl must be modified (when using a ICU collation as default collation, we must decide what collation to send to perl)

* convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases

These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.

NEXT STEPS: Collation versioning in indices
===========================================

Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.

Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.


I'll take a shot at the MINIMUM TODO as outlined above.



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Dave Page-7


On Thu, Nov 27, 2014 at 9:09 AM, Jakob Egger <[hidden email]> wrote:
Am 26.11.2014 um 17:46 schrieb Geoff Montee <[hidden email]>:
> This topic reminds me of a thread from a couple months ago:
>
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@...
>
> It sounds like adding ICU support to core may also allow for adding
> collation versioning to indexes.

Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)

I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.

MINIMUM TODO
============

* Add support for per-column collations in varstr_comp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()

* There's a new feature in trunk for faster sorting using SortSupport, so we would also need to also patch bttextfastcmp_locale() in varlena.c

These two changes would allow using ICU for collation. This has two major advantages:
1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function


NEXT STEPS: Support for more collations
=======================================

ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook". Adding support for these is a bit more involved.

* initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.

* A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in initdb to support collations provided by ICU

* pg_perm_setlocale() must get a special case to handle ICU collations

* the local handling code in pgperl must be modified (when using a ICU collation as default collation, we must decide what collation to send to perl)

* convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases

These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.

NEXT STEPS: Collation versioning in indices
===========================================

Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.

Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.


I'll take a shot at the MINIMUM TODO as outlined above.


We've already included ICU support in our Postgres Plus Advanced Server product. Before you spend too much time on this, give me a few days to see if we can get that change contributed back. The people I need to speak to are OOO for Thanksgiving at the moment though, so it may be a few days.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Greg Stark
In reply to this post by Jakob Egger-2


On 27 Nov 2014 09:09, "Jakob Egger" <[hidden email]> wrote:
>
> ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook". Adding support for these is a bit more involved.
>
> * initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.

Hm. Actually the pg_collation catalog might give a handy way out for the issue of being inconsistent with the system collation. We could support both sets of collations and let the user select an ICU collation or system collation at runtime.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Tom Lane-2
Greg Stark <[hidden email]> writes:
> Hm. Actually the pg_collation catalog might give a handy way out for the
> issue of being inconsistent with the system collation. We could support
> both sets of collations and let the user select an ICU collation or system
> collation at runtime.

+1 ... this seems like a nice end-run around the backwards compatibility
problem.

Another issue is that (AFAIK) ICU doesn't support any non-Unicode
encodings, which means that a build supporting *only* ICU collations is a
nonstarter IMO.  So we really need a way to deal with both system and ICU
collations, and treating the latter as a separate subset of pg_collation
seems like a decent way to do that.  (ISTR some discussion about forcibly
converting strings in other encodings to Unicode to compare them, but
I sure don't want to do that.  I think it'd be saner just to mark the
ICU collations as only compatible with UTF8 database encoding.)

                        regards, tom lane

PS: I've removed pgsql-packagers from the cc, this thread is no
longer relevant to them.


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Peter Geoghegan-3
On Thu, Nov 27, 2014 at 7:03 AM, Tom Lane <[hidden email]> wrote:

> +1 ... this seems like a nice end-run around the backwards compatibility
> problem.
>
> Another issue is that (AFAIK) ICU doesn't support any non-Unicode
> encodings, which means that a build supporting *only* ICU collations is a
> nonstarter IMO.  So we really need a way to deal with both system and ICU
> collations, and treating the latter as a separate subset of pg_collation
> seems like a decent way to do that.  (ISTR some discussion about forcibly
> converting strings in other encodings to Unicode to compare them, but
> I sure don't want to do that.  I think it'd be saner just to mark the
> ICU collations as only compatible with UTF8 database encoding.)

I would like to see ICU become the defacto standard set of collations,
with support for *versioning*, in the same way that UTF-8 might be
considered the defacto standard encoding.

It seems likely that we'll want to store sort keys (strxfrm() blobs)
in indexes at some point in the future. I now believe that that's more
problematic than just using strcoll() in B-Tree support function 1.
Although that isn't the most compelling reason to pursue ICU support.
--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Tatsuo Ishii-4
In reply to this post by Tom Lane-2
> Another issue is that (AFAIK) ICU doesn't support any non-Unicode
> encodings, which means that a build supporting *only* ICU collations is a
> nonstarter IMO.  So we really need a way to deal with both system and ICU
> collations, and treating the latter as a separate subset of pg_collation
> seems like a decent way to do that.  (ISTR some discussion about forcibly
> converting strings in other encodings to Unicode to compare them, but
> I sure don't want to do that.  I think it'd be saner just to mark the
> ICU collations as only compatible with UTF8 database encoding.)

+1. Forcing only Unicode collation is totally unacceptable.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Palle
In reply to this post by Tom Lane-2

> 27 nov 2014 kl. 16:03 skrev Tom Lane <[hidden email]>:
>
> Another issue is that (AFAIK) ICU doesn't support any non-Unicode
> encodings, which means that a build supporting *only* ICU collations is a
> nonstarter IMO.

The patch I originally wrote replaces strwcoll but for keeps the original behaviour for 8-bit charsets' encodings.

--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Palle
In reply to this post by Dave Page-7

> 27 nov 2014 kl. 10:15 skrev Dave Page <[hidden email]>:
>
>
>
> On Thu, Nov 27, 2014 at 9:09 AM, Jakob Egger <[hidden email]> wrote:
> Am 26.11.2014 um 17:46 schrieb Geoff Montee <[hidden email]>:
> > This topic reminds me of a thread from a couple months ago:
> >
> > http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@...
> >
> > It sounds like adding ICU support to core may also allow for adding
> > collation versioning to indexes.
>
> Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)
>
> I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.
>
> MINIMUM TODO
> ============
>
> * Add support for per-column collations in varstr_comp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()
>
> * There's a new feature in trunk for faster sorting using SortSupport, so we would also need to also patch bttextfastcmp_locale() in varlena.c
>
> These two changes would allow using ICU for collation. This has two major advantages:
> 1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
> 2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function
>
>
> NEXT STEPS: Support for more collations
> =======================================
>
> ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH@collation=phonebook". Adding support for these is a bit more involved.
>
> * initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.
>
> * A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in initdb to support collations provided by ICU
>
> * pg_perm_setlocale() must get a special case to handle ICU collations
>
> * the local handling code in pgperl must be modified (when using a ICU collation as default collation, we must decide what collation to send to perl)
>
> * convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases
>
> These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.
>
> NEXT STEPS: Collation versioning in indices
> ===========================================
>
> Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.
>
> Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.
>
>
> I'll take a shot at the MINIMUM TODO as outlined above.
>
>
> We've already included ICU support in our Postgres Plus Advanced Server product. Before you spend too much time on this, give me a few days to see if we can get that change contributed back. The people I need to speak to are OOO for Thanksgiving at the moment though, so it may be a few days.
>
> --

Hi,

Just poking this old thread again. What happened here, is anyone putting work into this area at the moment?

Palle



signature.asc (506 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [pgsql-packagers] Palle Girgensohn's ICU patch

Peter Eisentraut-2
On 4/19/15 7:46 AM, Palle Girgensohn wrote:
> Just poking this old thread again. What happened here, is anyone putting work into this area at the moment?

I plan to look at it for 9.6.



--
Sent via pgsql-hackers mailing list ([hidden email])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
12
Previous Thread Next Thread
Loading...