Subject: Re: UTF-8 support in DBD::mysql
From: Dominic Mitchell (Domi@semantico.com)
Date: 02/25/2006 11:12:37 AM
List: com.mysql.lists.perl

Jan Kratochvil said:

Hi,

On Sat, 25 Feb 2006 11:33:08 +0100, Dominic Mitchell wrote: ...

It is a hack, but it's a useful one. ... You could get into long details about the correct API for transcoding automatically into the desired charset from whatever charset the database has stored your data in. But it smacks of overengineering, and not making the common case simple.

I have just been checking whether UTF-8 really is complicated enough that random data will not be mistaken for it as a "false positive". I can report that my engine was getting MMSE (MMS Encapsulation, a binary format used by mobile phones) data marked as UTF-8, which then broke the byte-oriented binary decoding of MMSE.

Well, UTF-8 is designed so that the longer the string, the smaller the chance of something which is not UTF-8 being identified as UTF-8. The Wikipedia article explains this better.

http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
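To make the length argument concrete, here is a minimal sketch in Python (not Perl, and not anything from DBD::mysql). The helper names `looks_like_utf8` and `false_positive_rate` are made up for illustration. It estimates how often uniformly random bytes happen to decode as valid UTF-8 at a few lengths, under the assumption that "binary data" behaves roughly like random bytes, which real formats such as MMSE need not do.

```python
import random


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def false_positive_rate(length: int, samples: int = 2000, seed: int = 0) -> float:
    """Fraction of random byte strings of the given length that pass as UTF-8."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = 0
    for _ in range(samples):
        blob = bytes(rng.getrandbits(8) for _ in range(length))
        if looks_like_utf8(blob):
            hits += 1
    return hits / samples


if __name__ == "__main__":
    # The estimated rate should fall quickly as the strings get longer.
    for n in (1, 4, 16, 64):
        print(n, false_positive_rate(n))
```

Note that short binary values can still pass: the two bytes `0xC3 0xA9` are valid UTF-8 whether or not they were ever meant as text, so a validity check alone cannot rule out a false positive on short or structured binary data.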

Also, are you talking about data going into MySQL? I'm not actually concerned about that, only about retrieving it on the way out.

Maybe I made some other mistake there, or that previously attached patch of mine is broken (it still does not look so to me). Still, the hassle and the unpredictable behaviour of a service that might fail at random for its clients kept me from using it for real.

Well, the behaviour should be hidden behind a flag, so it can be turned off if needed. I'm not proposing that it's enabled by default.
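As a rough illustration of that opt-in behaviour, the sketch below is written in Python rather than Perl, and both `fetch_value` and its `auto_utf8` parameter are hypothetical names, not part of DBD::mysql. Python's `str` versus `bytes` distinction stands in for Perl's UTF-8 marker: a value is only returned as text when the flag is on and the bytes are valid UTF-8, and is otherwise passed through untouched.

```python
from typing import Union


def fetch_value(raw: bytes, auto_utf8: bool = False) -> Union[str, bytes]:
    """Return a column value, marking it as text only when asked to.

    Hypothetical sketch: with auto_utf8 off (the default) the raw bytes
    come back unchanged. With it on, bytes that are valid UTF-8 come back
    as str, and anything else stays as bytes.
    """
    if not auto_utf8:
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so leave it as binary data.
        return raw


if __name__ == "__main__":
    print(repr(fetch_value(b"caf\xc3\xa9")))
    print(repr(fetch_value(b"caf\xc3\xa9", auto_utf8=True)))
    print(repr(fetch_value(b"\x8c\x84", auto_utf8=True)))
```

Keeping the flag off by default means existing callers that handle binary columns see no change in behaviour.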

...

In fact I gave up and instead mark it as UTF-8 by hand from Perl when appropriate.

That's exactly what I *don't* want to be doing. I gave it UTF-8 -- it should be able to give me UTF-8 back.

You gave it a UTF-8 marker when it really was UTF-8. It should give back a UTF-8 marker when it really is UTF-8.

I think that's what I'm proposing. :-)

-Dom