Subject: Re: blessing db data as utf8
From: Gaal Yahas (ga@forum2.org)
Date: 06/10/2004 11:49:35 AM
List: com.mysql.lists.perl

[I hope nobody minds that I'm moving this thread to the DBD::mysql list, because it seems like the best place for it. Please drop cdbi-talk from replies.]

On Thu, Jun 10, 2004 at 07:01:30PM +0100, Tim Bunce wrote:

On Thu, Jun 10, 2004 at 12:18:42PM +0300, Gaal Yahas wrote:

On Thu, Jun 10, 2004 at 09:51:06AM +0100, Tim Bunce wrote:

This isn't a good way to check for utf8:

+int is_high_bit_set(char *val) {
+    while (*val++)
+        if (*val & 0x80) return 1;
+    return 0;
+}

because it makes it hard for any latin-1 data to coexist. The perl guts probably have a function to check for well-formed utf8, and that should be used instead.

This function is only used as an optimization. The actual decision is here:

+    if (imp_dbh->enable_utf8 &&
+        is_high_bit_set(col) && is_utf8_string(col, len))
+        SvUTF8_on(sv);
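
For anyone following along without the perl headers, the two-step decision can be sketched in standalone C. This is only an illustration under assumptions, not the patch itself: `has_high_bit`, `looks_like_utf8` and `should_flag_utf8` are hypothetical names, and `looks_like_utf8` is a rough stand-in for perl's `is_utf8_string` that checks byte structure only. Note that `has_high_bit` takes a length and tests each byte before advancing; the quoted loop advances first, so it never tests the first byte.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if any of the len bytes has its high bit
 * set. Each byte is tested before moving on, so the first byte is included. */
static int has_high_bit(const char *val, size_t len) {
    size_t i;
    for (i = 0; i < len; i++)
        if ((unsigned char)val[i] & 0x80) return 1;
    return 0;
}

/* Hypothetical stand-in for perl's is_utf8_string(). It only checks that
 * lead bytes are followed by the right number of continuation bytes; it
 * does not reject every overlong or surrogate sequence. */
static int looks_like_utf8(const char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = (unsigned char)s[i];
        size_t extra;
        if (c < 0x80) extra = 0;                      /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;   /* 2-byte lead */
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;   /* 3-byte lead */
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;   /* 4-byte lead */
        else return 0;                                /* stray or invalid byte */
        if (extra > len - i - 1) return 0;            /* truncated sequence */
        i++;
        while (extra--) {
            if (((unsigned char)s[i] & 0xC0) != 0x80) return 0;
            i++;
        }
    }
    return 1;
}

/* Mirrors the quoted condition: flag only when the option is enabled, a
 * high bit was seen, and the bytes are structurally valid UTF-8. */
static int should_flag_utf8(int enable_utf8, const char *col, size_t len) {
    return enable_utf8 && has_high_bit(col, len) && looks_like_utf8(col, len);
}
```

With these definitions, a latin-1 "é" (the single byte 0xE9) has its high bit set but fails the structure check, so it is not flagged, while its utf8 encoding (0xC3 0xA9) is.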

Ah, okay.

That said, bad things are going to happen sooner or later if a table has both latin-1 and utf8 data.

I'm thinking more about different fields having either latin-1 or utf8 data.

But now that I think of it, I'm not sure the call to is_high_bit_set is a good idea there, since SvUTF8_on() on a pure (7 bit) ASCII string shouldn't do any harm

It does add overhead (and is actually harmful on 5.6.x where many utf8 bugs lurk) so the check is worthwhile.

and may even be more correct if the string is later concatenated with utf8 data.

No, perl will do-the-right-thing.

So all in all it sounds like this patch is simple, but correct? Steve Hay mentioned that another, similar patch had been written but didn't reach CPAN; I'd like to encourage the maintainers to release either version :-)

I'm not sure what the cleanest way would be to go about this in the long run (whose responsibility it is to say what is and what isn't utf8) but the patch addresses an immediate need for people with utf8-only data. Maybe this problem would go away in mysql 4.1; I'd prefer not to wait.

Something along these lines is needed. But it does require careful thought.

Perhaps the application, or Class::DBI::mysql (which already has some provisions for similar things), should be responsible for keeping track of which fields use which charset, with no policy (except a default one) being enforced at the DBD level. In this scheme the current approach becomes part of the default handling, so it still makes sense to put it in now.
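
To make the idea concrete, here is a minimal sketch in C (to match the patch) of a per-field charset map with a default fallback. Everything in it is hypothetical: the `charset` enum, `field_charset`, `charset_for` and the example field names are made up for illustration and are not part of DBD::mysql or Class::DBI::mysql.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy values. CS_AUTODETECT stands for the default
 * handling, i.e. the high-bit plus well-formedness check from the patch. */
enum charset { CS_AUTODETECT, CS_LATIN1, CS_UTF8 };

/* Hypothetical per-field override supplied by the application. */
struct field_charset {
    const char *field;
    enum charset cs;
};

/* Returns the charset the application declared for a field, or the
 * default policy when the field has no explicit entry. */
static enum charset charset_for(const struct field_charset *map, size_t n,
                                const char *field) {
    size_t i;
    for (i = 0; i < n; i++)
        if (strcmp(map[i].field, field) == 0) return map[i].cs;
    return CS_AUTODETECT;
}

/* Example map with made-up field names. */
static const struct field_charset example_map[] = {
    { "title",        CS_UTF8   },
    { "legacy_notes", CS_LATIN1 },
};
```

Fields the application says nothing about fall through to `CS_AUTODETECT`, which is where the current patch's behaviour would live.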