Unicode support

Unicode support

Dave Page
Hi Anoop and anyone else who might be interested,

I've been thinking about how the Unicode support might be improved to
allow the old 07.xx non-unicode style behaviour to work for those that
need it. At the moment, when we connect using one of the wide connect
functions, the CC->unicode flag is set to true. This only affects a few
options, such as pgtype_to_concise_type()'s mapping of PG types to SQL
types.

It seems to me that perhaps we should set CC->unicode = 1, only upon
connection to a Unicode database. Anything else, we leave it set to 0 so
that it always maps varchars etc to ANSI types, and handles other
encodings in single byte or non-unicode multibyte mode (which worked
fine in 07.xx where those encodings were appropriate, such as SJIS in
Japan). This should also help BDE based apps, which further research has
shown me are broken with Unicode columns in SQL Server and Oracle as
well as PostgreSQL (search unicode + BDE on Google Groups for more).
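
As an illustration only, the proposed rule could be sketched in C roughly as follows. This is not the driver's code: `demo_conn`, `demo_set_unicode_flag()` and `demo_pgtype_to_concise_type()` are hypothetical names, the `DEMO_SQL_*` constants copy the numeric values of the corresponding ODBC type codes so the sketch builds without ODBC headers, and treating a server encoding named `UNICODE` or `UTF8` as "a Unicode database" is an assumption.

```c
#include <stdbool.h>
#include <string.h>

/* Numeric values copied from the ODBC headers; redefined under DEMO_
 * names so this sketch compiles without an ODBC SDK installed. */
enum {
    DEMO_SQL_CHAR         = 1,
    DEMO_SQL_VARCHAR      = 12,
    DEMO_SQL_LONGVARCHAR  = -1,
    DEMO_SQL_WCHAR        = -8,
    DEMO_SQL_WVARCHAR     = -9,
    DEMO_SQL_WLONGVARCHAR = -10
};

/* PostgreSQL type OIDs for the character types. */
enum { PG_OID_TEXT = 25, PG_OID_BPCHAR = 1042, PG_OID_VARCHAR = 1043 };

/* Hypothetical stand-in for the driver's connection object. */
struct demo_conn {
    bool unicode;               /* plays the role of CC->unicode */
};

/* Proposed rule: the flag follows the database encoding, not the
 * connect function (wide or ANSI) that the application happened to use. */
void demo_set_unicode_flag(struct demo_conn *conn, const char *server_encoding)
{
    conn->unicode = strcmp(server_encoding, "UNICODE") == 0 ||
                    strcmp(server_encoding, "UTF8") == 0;
}

/* Much-simplified analogue of pgtype_to_concise_type() for character
 * types. Driver options such as "text as LongVarchar" are ignored here.
 * Returns 0 for types this sketch does not cover. */
int demo_pgtype_to_concise_type(const struct demo_conn *conn, int pg_oid)
{
    switch (pg_oid) {
    case PG_OID_BPCHAR:
        return conn->unicode ? DEMO_SQL_WCHAR : DEMO_SQL_CHAR;
    case PG_OID_VARCHAR:
        return conn->unicode ? DEMO_SQL_WVARCHAR : DEMO_SQL_VARCHAR;
    case PG_OID_TEXT:
        return conn->unicode ? DEMO_SQL_WLONGVARCHAR : DEMO_SQL_LONGVARCHAR;
    default:
        return 0;
    }
}
```

With the flag derived this way, a connection to an SJIS or LATIN1 database would report varchar as the ANSI type even when the application connected through a wide function.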

Am I seeing a possible improvement where in fact there isn't one, or
missing some obvious downside?

Regards, Dave.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Re: Unicode support

Hiroshi Saito
From: "Dave Page"

> Hi Anoop and anyone else who might be interested,
>
> I've been thinking about how the Unicode support might be improved to
> allow the old 07.xx non-unicode style behaviour to work for those that

Yes, I think the libpq version is great. However, legacy environments are
crying out over such a rapid change, so I think multibyte needs to be
supported.

Regards,
Hiroshi Saito



Re: Unicode support

Anoop Kumar
In reply to this post by Dave Page
Hi Dave,

Checking for the database encoding and calling the functions using the
appropriate flag seems to be fine.

Regards

Anoop

> -----Original Message-----
> From: Dave Page [mailto:[hidden email]]
> Sent: Wednesday, August 31, 2005 3:07 AM
> To: Anoop Kumar
> Cc: [hidden email]
> Subject: Unicode support
>
> <snip>


Re: Unicode support

Dave Page
In reply to this post by Dave Page
 

> -----Original Message-----
> From: Hiroshi Saito [mailto:[hidden email]]
> Sent: 31 August 2005 02:56
> To: Dave Page; Anoop Kumar
> Cc: [hidden email]
> Subject: Re: [ODBC] Unicode support
>
> From: "Dave Page"
>
> > Hi Anoop and anyone else who might be interested,
> >
> > I've been thinking about how the Unicode support might be improved to
> > allow the old 07.xx non-unicode style behaviour to work for those that
>
> Yea, I think that a libpq version is very great. However, Legacy
> environment is raising the scream for a rapid change. Then, I think
> that a multibyte needs to be supported.

OK, I'll prepare a patch, and because it's an odd problem, a test build
to go with it. Any volunteers to test? It really needs people with a
reproducible encoding error that doesn't exist in 07.xx, or people
using BDE (which barfs bigtime on SQL_C_Wxxx data).

Regards, Dave.


Re: Unicode support

Johann Zuschlag
Dave Page schrieb:

>OK, I'll prepare a patch, and because it's an odd problem, a test build
>to go with it. Any volunteers to test? It really needs people with a
>reproducible encoding error that doesn't exist in 07.xx, or people
>using BDE (which barfs bigtime on SQL_C_Wxxx data).
>
>Regards, Dave.

Hi Dave,

Just send it to me (the Windows DLL), even though I just switched my
Linux server to UTF-8. :-)
I'll test it with the old environment.

Regards,
Johann





Re: Unicode support

Dave Page
In reply to this post by Dave Page
 

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Dave Page
> Sent: 31 August 2005 08:20
> To: Hiroshi Saito; Anoop Kumar
> Cc: [hidden email]
> Subject: Re: [ODBC] Unicode support
>
>
> OK, I'll prepare a patch, and because it's an odd problem, a test build
> to go with it. Any volunteers to test? It really needs people with a
> reproducible encoding error that doesn't exist in 07.xx, or people
> using BDE (which barfs bigtime on SQL_C_Wxxx data).

OK, patch attached. This works slightly differently than I envisaged,
because simply switching off Unicode isn't that straightforward,
especially if the DM is using the *W functions.

Basically what this does is only offer wide character types if the
database is unicode, and, in that case, sets the client encoding to
unicode. For anything else, it will report non-wide character types as
per the 07 driver, and let the user set their own encoding as required.
From what I can tell of the BDE missing fields problem, this should
almost certainly fix it.

Please look at this carefully - as most of you know, MB/Unicode issues
aren't exactly my strong point!

I'll forward test DLLs to volunteer victims privately.

Regards, Dave.


Attachment: widetypes.diff (9K)

Re: Unicode support

Marko Ristola
In reply to this post by Dave Page

LATIN1 and UCS share one property by design: the values 0x00 - 0xFF
denote the same characters in both. Because SQL_ASCII performs no
encoding conversion, LATIN1 characters pass through unchanged!

So this means that
0xE4 in ISO-8859-1 is the same character as
0x00E4 in UCS-2. Only the number of bytes needed changes.

Reference: "man 7 unicode"
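
The point above can be sketched in a few lines of C. The function names `latin1_to_ucs2()` and `ucs2_to_latin1()` are made up for this illustration and are not part of psqlodbc. Widening is a plain zero-extension of each byte, and narrowing only works for code units up to 0xFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen ISO-8859-1 bytes to UCS-2 code units. Because the first 256
 * code points coincide, each byte is simply zero-extended.
 * dst must have room for n code units. Returns the number written. */
size_t latin1_to_ucs2(const unsigned char *src, size_t n, uint16_t *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return n;
}

/* Narrow UCS-2 back to ISO-8859-1. Returns the number of bytes written,
 * or (size_t)-1 if a code unit above 0xFF has no Latin-1 equivalent. */
size_t ucs2_to_latin1(const uint16_t *src, size_t n, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0xFF)
            return (size_t)-1;
        dst[i] = (unsigned char)src[i];
    }
    return n;
}
```

Note that this shortcut holds only for Latin-1. Other single-byte and multibyte encodings, SJIS included, need a real mapping table.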

Marko Ristola




Re: Unicode support

Hiroshi Saito
In reply to this post by Dave Page
From: "Dave Page"

Thanks!!

> Please look at this carefully - as most of you know, MB/Unicode issues
> aren't exactly my strong point!

Ok, I am going to try the specification in multibyte. :-)

>
> I'll forward test DLLs to volunteer victims privately.

Regards,
Hiroshi Saito



Re: Unicode support

Hiroshi Saito
Hi Dave.

I tried your patch with Japanese SJIS. It seems to need some additional
corrections. Moreover, it seems necessary to make the driver distinct from
the UNICODE (wide character) one. It seems that I have to catch up further.

BTW, I remembered the original discussion about pgAdmin III. I said then
that it should support multibyte. And how is it now? It is wonderful.
I feel that having so many choices of character encoding complicates the
problem further, although the external environment is different.

Regards,
Hiroshi Saito

Attachment: CONVERT_PATCH.txt (3K)

Re: Unicode support

Miguel Juan
In reply to this post by Dave Page
hello,

I will try it with BDE environment if you want.

Regards,

Miguel Juan



Re: Unicode support

Miguel Juan
In reply to this post by Dave Page

----- Original Message -----
From: "Dave Page" <[hidden email]>
To: "Miguel Juan" <[hidden email]>
Cc: <[hidden email]>
Sent: Thursday, September 01, 2005 10:32 AM
Subject: RE: [ODBC] Unicode support


Yes please - attached.


Re: Unicode support

Dave Page
In reply to this post by Dave Page
Did this miss something?

:-)

/D

> -----Original Message-----
> From: Miguel Juan [mailto:[hidden email]]
> Sent: 01 September 2005 10:06
> To: Dave Page
> Cc: [hidden email]
> Subject: Re: [ODBC] Unicode support
>
>
> ----- Original Message -----
> From: "Dave Page" <[hidden email]>
> To: "Miguel Juan" <[hidden email]>
> Cc: <[hidden email]>
> Sent: Thursday, September 01, 2005 10:32 AM
> Subject: RE: [ODBC] Unicode support
>
>
> Yes please - attached.

Re: Unicode support

Dave Page
In reply to this post by Dave Page
 

> -----Original Message-----
> From: Hiroshi Saito [mailto:[hidden email]]
> Sent: 31 August 2005 21:00
> To: Hiroshi Saito; Dave Page; Anoop Kumar
> Cc: [hidden email]
> Subject: Re: [ODBC] Unicode support
>
> Hi Dave.
>
> I tried your patch by SJIS of Japan. It seems that it needs some
> additional correction. Moreover, it is necessary to make the driver
> different from UNICODE (WideCharacter). It seems that I have to catch
> up further.

Hmmm, well I can't remove the Unicode functions. Do your apps request
SQL_C_WCHAR etc even if the driver doesn't offer it?

> BTW, I remembered the discussion original by pgAdminIII. I said that I
> should support MultiByte then. However, How is it now? It is very
> wonderful. I feel that there are many choices of a character code
> complicates a problem more. but, it is although external environment
> is different.

Hmm, I hate multibyte :-(!!

Regards, Dave


Re: Unicode support

Miguel Juan
In reply to this post by Dave Page
Hello Dave,

I'm just trying the last fix (for BDE) and I can see some odd behavior.

- It shows TEXT fields as MEMO, although you can see the data if you
double-click the field. It looks like it doesn't use the "text as
LongVarchar" option (this works in version 7.x).

- After a "SELECT * FROM table", the Borland SQL Explorer shows an error
('Invalid Blob Handle') for empty (NULL) TEXT fields when you try to view
them. This works fine in table view.

- After an error inserting a row ('not null' constraint), it closes the
connection (dead connection error).


Regards,

Miguel Juan





Re: Unicode support

Dave Page
In reply to this post by Dave Page
 

> -----Original Message-----
> From: Miguel Juan [mailto:[hidden email]]
> Sent: 01 September 2005 11:06
> To: Dave Page
> Cc: [hidden email]
> Subject: Re: [ODBC] Unicode support
>
> Hello Dave,
>
> I'm just trying the last fix (for BDE) and I can see some odd behavior.
>
> - It shows the TEXT fields as MEMO. But you can see the data if you
> make a double click on it. It looks like it doesn't use the "text as
> LongVarchar" option (this works in version 7.x).

Right, I'll look at that.

> - After a "SELECT * FROM table" The Borland SQL Explorer shows an error
> ('Invalid Blob Handle') for empty TEXT fields (NULL) when you try to
> view them. This works fine for table view.

Strange.

> - After an error inserting a row ('not null' constraint) it closes the
> connection (dead connection error)

I'll test that as well.

To be honest though, I've been researching BDE on Google Groups and
there are lots of people reporting similar problems with SQL Server and
Oracle - apparently BDE fails to work properly with any Unicode data.
I'm happy to spend a little time trying to work around that, but I can't
spend masses of time on it.

Regards, Dave.


Re: Unicode support

Marko Ristola
In reply to this post by Dave Page

Hi all.

How about creating a charset conversion interface
and taking UTF-8 as an internal format for ODBC?:

At least the following functions might be needed:

Internal2WChar()
WChar2Internal()

Internal2Char()
Char2Internal()

Backend would talk only UTF-8.

Here is a minimum set of interface
(Object oriented design term) functions:

cvt_FromUTF8()
cvt_ToUTF8()
cvt_Free()

Interface implementation:

struct CvtInterface {
  char *(*cvt_FromUTF8)(void *internalData, char *source, size_t bytes);
  char *(*cvt_ToUTF8)(void *internalData, char *source, size_t bytes);
  void (*cvt_Free)(void *internalData);

  void *internalData;
};

Object creation:

struct Env {
    struct CvtInterface char_cvt; // C program char conversions
    struct CvtInterface wchar_cvt; // C program wchar_t conversions
};

struct CvtInterface utf8_to_utf8_New();
env->char_cvt = utf8_to_utf8_New();

These are some interface implementation functions:
(I don't know, how many are needed, but at least
supporting of char, wchar and multibyte is needed).

sjis_new()
sjis_FromUTF8()
sjis_ToUTF8()
sjis_Free()

wchar_FromUTF8()
wchar_ToUTF8()
wchar_Free()

char_FromUTF8()
char_ToUTF8()
char_Free()

utf8_FromUTF8()
utf8_ToUTF8()
utf8_Free()

ascii8_FromUTF8()
ascii8_ToUTF8()
ascii8_Free()

So, there would be a single internal UTF-8 format inside psqlodbc.
The backend could always deliver UTF-8, so no internal format <->
backend format layer would be needed.

This would be easy to implement.

Examples:

A C program calls SQLExecDirectW.
AllocEnv has found out that the wchar format is UCS-2,
so it has created an object:
env->wchar_cvt = cvt_ucs2_UTF8_New();

The PGAPI function needs to convert from WCHAR into the internal format
(querybytes being the length of the query in bytes):
sqlquery = (*env->wchar_cvt.cvt_ToUTF8)(env->wchar_cvt.internalData,
                                        (char *) wcharquery, querybytes);
Then sqlquery is in UTF-8, and the query is in
an easily manageable format!

A C program uses SQLGetData with SQL_C_WCHAR to get a string.
So when the data is converted in convert.c, psqlodbc calls
(internalbytes being the length of the internal string in bytes):
result = (*env->wchar_cvt.cvt_FromUTF8)(env->wchar_cvt.internalData,
                                        internalformat, internalbytes);

I don't know whether the ENV handle is the best place to put the converter
objects.
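
To make the proposal more concrete, here is a minimal sketch of one converter behind such an interface, using ISO-8859-1 because its mapping to UTF-8 needs no tables. It is an illustration under assumptions, not psqlodbc code: `latin1_New()` and the static helpers are hypothetical names, the `source` parameters are declared `const` here (the proposal above uses plain `char *`), results are returned as freshly allocated NUL-terminated strings that the caller frees, and characters with no Latin-1 equivalent are replaced by '?'.

```c
#include <stddef.h>
#include <stdlib.h>

/* Converter interface following the proposal above. */
struct CvtInterface {
    char *(*cvt_FromUTF8)(void *internalData, const char *source, size_t bytes);
    char *(*cvt_ToUTF8)(void *internalData, const char *source, size_t bytes);
    void  (*cvt_Free)(void *internalData);
    void  *internalData;
};

/* Latin-1 -> UTF-8: bytes below 0x80 are copied, the rest become two bytes. */
static char *latin1_ToUTF8(void *internalData, const char *source, size_t bytes)
{
    const unsigned char *s = (const unsigned char *)source;
    char *out = malloc(2 * bytes + 1);   /* worst case: every byte doubles */
    size_t o = 0;

    (void)internalData;                  /* this converter keeps no state */
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < bytes; i++) {
        if (s[i] < 0x80) {
            out[o++] = (char)s[i];
        } else {
            out[o++] = (char)(0xC0 | (s[i] >> 6));
            out[o++] = (char)(0x80 | (s[i] & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}

/* UTF-8 -> Latin-1: anything above U+00FF (or malformed) becomes '?'. */
static char *latin1_FromUTF8(void *internalData, const char *source, size_t bytes)
{
    const unsigned char *s = (const unsigned char *)source;
    char *out = malloc(bytes + 1);       /* output never exceeds the input */
    size_t i = 0, o = 0;

    (void)internalData;
    if (out == NULL)
        return NULL;
    while (i < bytes) {
        unsigned char b = s[i++];
        if (b < 0x80) {
            out[o++] = (char)b;
        } else if ((b == 0xC2 || b == 0xC3) && i < bytes &&
                   (s[i] & 0xC0) == 0x80) {
            out[o++] = (char)(((b & 0x03) << 6) | (s[i++] & 0x3F));
        } else {
            out[o++] = '?';
            while (i < bytes && (s[i] & 0xC0) == 0x80)
                i++;                     /* skip the rest of the sequence */
        }
    }
    out[o] = '\0';
    return out;
}

static void latin1_Free(void *internalData)
{
    (void)internalData;                  /* nothing was allocated */
}

/* Hypothetical constructor in the style of utf8_to_utf8_New(). */
struct CvtInterface latin1_New(void)
{
    struct CvtInterface cvt = { latin1_FromUTF8, latin1_ToUTF8, latin1_Free, NULL };
    return cvt;
}
```

A caller would obtain the object once with `latin1_New()`, call `cvt.cvt_ToUTF8(cvt.internalData, buf, len)` on incoming client data, and `free()` each returned string. A converter for SJIS or wchar_t would plug in behind the same three function pointers.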

What I like about this implementation:
- Simplifies support for clients using different charsets.
- Simplifies psqlodbc internally, because of the internal UTF-8 assumption.
- Easy to implement and to test.
- Easy to add more converters once the initial implementation works.
- Enables usage of advanced lexers and parsers when needed to improve
  performance.
- psqlodbc would support well every charset that can be converted to and
  from UTF-8.
 
I have not suggested this before, because of the following reasons:
- psqlodbc charset conversion implementation seems to work most times.
- Avoiding unnecessary charset conversions is good for performance.
- It takes time to implement and test this.
- Unnecessary malloc + free is bad for performance.

What do you think about this?
Would this solve the problems?
Is this implementable?
Would the performance be good enough?
Would this simplify things (that's the Goal)?

Regards,
Marko Ristola



Re: Unicode support

Dave Page
In reply to this post by Dave Page
 

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Marko Ristola
> Sent: 01 September 2005 18:21
> Cc: Hiroshi Saito; Anoop Kumar; [hidden email]
> Subject: Re: [ODBC] Unicode support
>
>
> Hi all.

Hi Marko,

> How about creating a charset conversion interface
> and taking UTF-8 as an internal format for ODBC?:
>

<snip>

>
> So, there would be a single internal UTF-8 format inside PsqlODBC.
> The backend could always deliver UTF-8, so the need for internal
> format <-> backend format layer is not needed.
>
> This implementation would be easy to implement.

This is what already happens (if you ignore my recent experimental
patch).

If the connection is made using one of the *W connect functions, then
the ConnectionClass->unicode flag is set to true, and SET
client_encoding = 'UTF-8' is sent to the backend. From then on, data
going out to the client is fed through utf8_to_ucs2_lf() *if* the data
type is specified as SQL_C_WCHAR, and data coming in to *W functions is
fed through ucs2_to_utf8().
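
For readers unfamiliar with the two encodings, the kind of transformation those functions perform can be sketched as below. This is not the driver's implementation: `demo_utf8_to_ucs2()` and `demo_ucs2_to_utf8()` are hypothetical stand-ins that handle only the Basic Multilingual Plane (which is all UCS-2 can hold), do not reject overlong sequences or surrogate values, and do no line-ending handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 into UCS-2 code units. Returns the number of code units
 * written, or (size_t)-1 on malformed input, on a character outside the
 * BMP (4-byte sequence), or if dst (capacity cap) is too small. */
size_t demo_utf8_to_ucs2(const unsigned char *src, size_t n,
                         uint16_t *dst, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < n) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t need;                     /* continuation bytes expected */

        if (b < 0x80)                { cp = b;        need = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
        else return (size_t)-1;          /* not representable in UCS-2 */

        if (n - i - 1 < need || o >= cap)
            return (size_t)-1;
        for (size_t k = 1; k <= need; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (src[i + k] & 0x3F);
        }
        dst[o++] = (uint16_t)cp;
        i += need + 1;
    }
    return o;
}

/* Encode UCS-2 code units as UTF-8 (1 to 3 bytes each). Returns the
 * number of bytes written, or (size_t)-1 if dst is too small. */
size_t demo_ucs2_to_utf8(const uint16_t *src, size_t n,
                         unsigned char *dst, size_t cap)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        uint16_t c = src[i];
        size_t len = c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);

        if (cap - o < len)
            return (size_t)-1;
        if (len == 1) {
            dst[o++] = (unsigned char)c;
        } else if (len == 2) {
            dst[o++] = (unsigned char)(0xC0 | (c >> 6));
            dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            dst[o++] = (unsigned char)(0xE0 | (c >> 12));
            dst[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The asymmetry in buffer sizes is worth noting: one UCS-2 code unit can expand to three UTF-8 bytes, so byte lengths reported to the application differ depending on which C type it asked for.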

Afaict, Unicode mode works exactly as it should.

If the connection is made using a non-wide function, the
ConnectionClass->unicode flag is not set. The client is then
expected to continue using non-wide functions, the client encoding is
left at its default, and the driver never reports data types
as SQL_C_WCHAR.

This is where I believe the major problem occurs - if the ODBC Driver
Manager sees that SQLConnectW (iirc) exists, it will automatically map
ANSI calls (e.g. SQLConnect()) to Unicode (e.g. SQLConnectW()). This then
causes the driver to report text/char columns as SQL_C_WCHAR. Less
well-written apps then fall over because they aren't clever enough to
request data as SQL_C_CHAR instead of SQL_C_WCHAR.

My recent experimental patch aims to address this, by forcing the driver
to report SQL_C_CHAR instead of SQL_C_WCHAR for non-unicode databases.
This should (and seems to, with minor side effects yet to be fully
investigated) fix the BDE problem.

As for multibyte (non-unicode) data such as Hiroshi's, my understanding
is that in the presence of a Unicode driver, apps are expected to use
Unicode (and in fact, are forced to by the driver manager's mapping of
ANSI function calls to Unicode calls).

Anoop, do you or any of your guys (or anyone else) know
unicode/multibyte/encoding well? I'm learning as I go at the moment, so
some more experienced help would be *really* appreciated.

Regards, Dave.


Re: Unicode support

Anoop Kumar
In reply to this post by Dave Page
> Anoop, do you or any of your guys (or anyone else) know
> unicode/multibyte/encoding well? I'm learning as I go at the moment, so
> some more experienced help would be *really* appreciated.

Sorry Dave, no one on my list! :-(
 
Regards

Anoop


Re: Unicode support

Marko Ristola
In reply to this post by Hiroshi Saito

So, I don't have much experience with Windows ODBC. That's true.
Is it possible to compile psqlodbc with MinGW tools for Windows?

After using Google, I found out that the GLib libraries are able to convert
UTF-8 into multibyte under Windows. Windows should be able to convert
UTF-8 into multibyte and vice versa with its character set conversion
functions.

After using Google, I found out, that Windows XP had a problem with
Korean multibyte:


  "Windows XP Device Driver Does Not Convert Multibyte Data to Korean"

 Article ID: 817522.

That was fixed in Service Pack 2.

So I ask you how you have thought about these things:

If I have understood Windows correctly, it uses UCS-2 as its internal
Unicode character set. Linux prefers UTF-8. So treating UCS-2 and UTF-8
as equivalent inside psqlodbc makes sense. That is what has already been
implemented in psqlodbc for Windows.

Then there is the world before Unicode existed. There were DOS codepages,
character sets for groups of countries and Multibyte character sets.

JIS X 0208 is a character set (see man 7 charsets).
Shift_JIS is an encoding that can contain JIS X 0208 multibyte
characters (see man 7 charsets).
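
The difference between a character set and an encoding can be sketched
in C as follows. The same JIS X 0208 character, given as its two 7-bit
JIS bytes, is written out in two different encodings, EUC-JP and
Shift_JIS. The function names are hypothetical and only the double-byte
JIS X 0208 range is handled (no ASCII, half-width katakana or vendor
extensions).

```c
#include <assert.h>
#include <stdint.h>

/* Both helpers take the 7-bit JIS X 0208 bytes, each in 0x21..0x7E. */

/* EUC-JP simply sets the high bit of each JIS byte. */
static void jis_to_eucjp(uint8_t j1, uint8_t j2, uint8_t out[2])
{
    assert(j1 >= 0x21 && j1 <= 0x7E && j2 >= 0x21 && j2 <= 0x7E);
    out[0] = (uint8_t) (j1 | 0x80);
    out[1] = (uint8_t) (j2 | 0x80);
}

/* Shift_JIS packs two JIS rows into one lead byte. */
static void jis_to_sjis(uint8_t j1, uint8_t j2, uint8_t out[2])
{
    uint8_t s1;
    uint8_t s2;

    assert(j1 >= 0x21 && j1 <= 0x7E && j2 >= 0x21 && j2 <= 0x7E);

    /* Lead byte: 0x81..0x9F, then skip ahead to 0xE0..0xEF. */
    s1 = (uint8_t) (((j1 - 0x21) >> 1) + 0x81);
    if (s1 > 0x9F)
        s1 = (uint8_t) (s1 + 0x40);

    if (j1 & 1)
    {
        /* Odd row: trail byte 0x40..0x9E, skipping 0x7F. */
        s2 = (uint8_t) (j2 + 0x1F);
        if (s2 >= 0x7F)
            s2++;
    }
    else
    {
        /* Even row: trail byte 0x9F..0xFC. */
        s2 = (uint8_t) (j2 + 0x7E);
    }

    out[0] = s1;
    out[1] = s2;
}
```

Hiragana "a" is JIS 0x24 0x22; the sketch turns it into A4 A2 for EUC-JP
and 82 A0 for Shift_JIS: one character of one character set, two
encodings.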

So it seems that one working implementation would use a UTF-8
PostgreSQL server together with UTF-8 to multibyte conversions.

However, according to the Samba team's descriptions of Unicode problems,
there are some pitfalls: the UTF-8 to EUC_JP conversion may differ
between Linux and Windows, and between conversion library implementations.

Some multibyte character sets contradict each other.

If we drop the *W() functions, we might get a working implementation,
but then we might not support the full ODBC API.

So it works if and only if one single conversion library does all the
conversions.
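
A minimal sketch in C of why mixing conversion libraries can break a
round trip. The two tiny tables below are illustrative excerpts, not
complete mappings: one follows the JIS X 0208 style mapping, the other
follows Microsoft code page 932, and they disagree on a few Shift_JIS
codes. The struct, table and function names are hypothetical, and real
libraries may add best-fit fallbacks that this strict lookup omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One Shift_JIS double-byte code and the Unicode code point it maps to. */
struct sjis_pair
{
    uint16_t sjis;
    uint16_t ucs;
};

#define N_PAIRS 3

/* Excerpt in the style of the JIS X 0208 mapping. */
static const struct sjis_pair jis_style[N_PAIRS] = {
    {0x8160, 0x301C},   /* WAVE DASH */
    {0x8161, 0x2016},   /* DOUBLE VERTICAL LINE */
    {0x817C, 0x2212},   /* MINUS SIGN */
};

/* Excerpt in the style of Microsoft code page 932. */
static const struct sjis_pair cp932_style[N_PAIRS] = {
    {0x8160, 0xFF5E},   /* FULLWIDTH TILDE */
    {0x8161, 0x2225},   /* PARALLEL TO */
    {0x817C, 0xFF0D},   /* FULLWIDTH HYPHEN-MINUS */
};

/* Shift_JIS -> Unicode; returns 0 when the table has no entry. */
static uint16_t sjis_decode(const struct sjis_pair *t, uint16_t sjis)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < N_PAIRS; i++)
        if (t[i].sjis == sjis)
            return t[i].ucs;
    return 0;
}

/* Unicode -> Shift_JIS; returns 0 when the table has no entry. */
static uint16_t sjis_encode(const struct sjis_pair *t, uint16_t ucs)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < N_PAIRS; i++)
        if (t[i].ucs == ucs)
            return t[i].sjis;
    return 0;
}
```

Decoding 0x8160 with jis_style and encoding the result with cp932_style
returns 0 (no mapping), while using the same table in both directions
gives back 0x8160. That is the single-library condition in miniature.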

That is, psqlodbc should work with both multibyte encodings and UTF-8
as long as either the PostgreSQL backend alone or the psqlodbc side
alone does the needed conversions. If the PostgreSQL server runs in the
same kind of Windows environment as the clients, it should work fully
with UTF-8 and the multibyte character sets. This should be the best
working option.

Windows does have a working UCS-2 to multibyte conversion implementation
on the psqlodbc client (since Service Pack 2).

Unfortunately pg_dump + restore from SJIS into UTF-8 might not work,
because iconv on Linux might not do the conversion correctly.

The conversion to UTF-8 must be done with fully working Windows
conversion functions. One way might be to run pg_dump under Windows,
so that the multibyte to UTF-8 conversion happens on the Windows side.

How about the following implementation:
ODBC against the backend:
- Backend has multibyte characters.
- Windows uses multibyte characters.
- psqlodbc has UTF-8 as its internal format.

=> A fully working implementation:
- Backend delivers multibyte characters.
- PSQLODBC converts them into UTF-8.
- PSQLODBC delivers multibyte characters to the client
  using utf8_to_locale Windows functions, when necessary.

So the solution here might be to do all conversions on the client side.
The reasoning is that two separate conversion libraries might contradict
each other, at least with the Asian character sets.
(On Macs, the UTF-8 implementation differs from the standard.)

Alternatively, Asian users could move to UTF-8 as their PostgreSQL
server's backend encoding. That is the other solution to the same
problem. Then the PostgreSQL server doesn't have to do the conversion.

It does not seem possible to do all the conversions inside the
PostgreSQL server under Windows, because of the xx() -> xxW() mapping
inside the Windows ODBC driver manager. We can't control that.

What do you think about these thoughts?

Marko Ristola

Hiroshi Saito wrote:

>Hi Dave.
>
>I tried your patch by SJIS of Japan. It seems that it needs some additional
>correction. Moreover, it is necessary to make the driver different from
>UNICODE (WideCharacter). It seems that I have to catch up further.
>
>BTW, I remembered the original discussion around pgAdmin III. I said then
>that I should support MultiByte. However, how is it now? It is very wonderful.
>I feel that having many choices of character code complicates the problem
>further, although the external environment is different.
>
>Regards,
>Hiroshi Saito
>
>------------------------------------------------------------------------
>
>--- convert.c.orig Thu Aug  4 21:26:57 2005
>+++ convert.c Thu Sep  1 04:38:45 2005
>@@ -762,7 +762,7 @@
> {
> BOOL lf_conv = conn->connInfo.lf_conversion;
>
>- if (fCType == SQL_C_WCHAR)
>+ if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
> {
> len = utf8_to_ucs2_lf(neut_str, -1, lf_conv, NULL, 0);
> len *= WCLEN;
>@@ -778,7 +778,7 @@
> }
> else
> #ifdef WIN32
>- if (fCType == SQL_C_CHAR)
>+ if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_CHAR))
> {
> wstrlen = utf8_to_ucs2_lf(neut_str, -1, lf_conv, NULL, 0);
> allocbuf = (SQLWCHAR *) malloc(WCLEN * (wstrlen + 1));
>@@ -810,7 +810,7 @@
> pgdc->ttlbuflen = len + 1;
> }
>
>- if (fCType == SQL_C_WCHAR)
>+ if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
> {
> utf8_to_ucs2_lf(neut_str, -1, lf_conv, (SQLWCHAR *) pgdc->ttlbuf, len / WCLEN);
> }
>@@ -824,7 +824,7 @@
> }
> else
> #ifdef WIN32
>- if (fCType == SQL_C_CHAR)
>+ if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_CHAR))
> {
> len = WideCharToMultiByte(CP_ACP, 0, allocbuf, wstrlen, pgdc->ttlbuf, pgdc->ttlbuflen, NULL, NULL);
> free(allocbuf);
>@@ -871,7 +871,7 @@
>
> copy_len = (len >= cbValueMax) ? cbValueMax - 1 : len;
>
>- if (fCType == SQL_C_WCHAR)
>+ if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
> {
> copy_len /= WCLEN;
> copy_len *= WCLEN;
>@@ -911,7 +911,7 @@
> memcpy(rgbValueBindRow, ptr, copy_len);
> /* Add null terminator */
>
>- if (fCType == SQL_C_WCHAR)
>+ if ((conn->unicode && conn->report_wide_types) && (fCType == SQL_C_WCHAR))
> memset(rgbValueBindRow + copy_len, 0, WCLEN);
> else
>
>@@ -942,7 +942,7 @@
> break;
> }
>
>- if (SQL_C_WCHAR == fCType && ! wchanged)
>+ if ((conn->unicode && conn->report_wide_types) && (SQL_C_WCHAR == fCType && ! wchanged))
> {
> if (cbValueMax > (SDWORD) (WCLEN * (len + 1)))
> {
>@@ -2629,6 +2629,8 @@
> case SQL_WCHAR:
> case SQL_WVARCHAR:
> case SQL_WLONGVARCHAR:
>+ if (conn->unicode && conn->report_wide_types)
>+ {
> if (SQL_NTS == used)
> used = strlen(buffer);
> allocbuf = malloc(WCLEN * (used + 1));
>@@ -2637,6 +2639,11 @@
> buf = ucs2_to_utf8((SQLWCHAR *) allocbuf, used, (UInt4 *) &used, FALSE);
> free(allocbuf);
> allocbuf = buf;
>+ }
>+ else
>+ {
>+ buf = buffer;
>+ }
> break;
> default:
> buf = buffer;
>@@ -2647,10 +2654,17 @@
> break;
>
> case SQL_C_WCHAR:
>+ if (conn->unicode && conn->report_wide_types)
>+ {
>             if (SQL_NTS == used)
>                 used = WCLEN * wcslen((SQLWCHAR *) buffer);
> buf = allocbuf = ucs2_to_utf8((SQLWCHAR *) buffer, used / WCLEN, (UInt4 *) &used, FALSE);
> used *= WCLEN;
>+ }
>+ else
>+ {
>+ buf = buffer;
>+ }
> break;
>
> case SQL_C_DOUBLE:
>--- psqlodbc_win32.def.orig Thu Sep  1 04:41:37 2005
>+++ psqlodbc_win32.def Thu Sep  1 04:42:08 2005
>@@ -78,31 +78,3 @@
> DllMain @201
> ConfigDSN @202
>
>-SQLColAttributeW @101
>-SQLColumnPrivilegesW @102
>-SQLColumnsW @103
>-SQLConnectW @104
>-SQLDescribeColW @106
>-SQLExecDirectW @107
>-SQLForeignKeysW @108
>-SQLGetConnectAttrW @109
>-SQLGetCursorNameW @110
>-SQLGetInfoW @111
>-SQLNativeSqlW @112
>-SQLPrepareW @113
>-SQLPrimaryKeysW @114
>-SQLProcedureColumnsW @115
>-SQLProceduresW @116
>-SQLSetConnectAttrW @117
>-SQLSetCursorNameW @118
>-SQLSpecialColumnsW @119
>-SQLStatisticsW @120
>-SQLTablesW @121
>-SQLTablePrivilegesW @122
>-SQLDriverConnectW @123
>-SQLGetDiagRecW @124
>-SQLGetStmtAttrW @125
>-SQLSetStmtAttrW @126
>-SQLSetDescFieldW @127
>-SQLGetTypeInfoW @128
>-SQLGetDiagFieldW @129
>  
>
>------------------------------------------------------------------------



Re: Unicode support

Marko Ristola
Marko Ristola wrote:

>However, according to Samba team's UNICODE problem descriptions,
>there are some problems: UTF-8 to EUC_JP conversion may be different
>on Linux and Windows, and on different conversion library implementations.
>

This is the Samba reference. I recommend reading the applicable parts.
http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/unicode.html

I hope that converting multibyte to UTF-8 and back is possible.
If not, disabling UTF-8 and UCS-2 seems to be the only workable choice :(

Regards,
Marko

>Some multibyte character sets contradict each other.

