Subject: Re: Fw: MySQL UTF8 Serbian (and other) charset support: basic implementation
From: Alexander Barkov (ba@mysql.com)
Date: Aug 16, 2002 5:15:21 am
List: com.mysql.lists.internals

Hello, Danilo!

Thanks for the contribution!

I'm the person who maintains charset-related things in MySQL. I took a look at your sources and I liked them. Please note that the 4.0 branch is already frozen, so only bug fixes will be done there; new features are not accepted in 4.0 anymore. Your code will be incorporated into version 4.1. We already have UTF8 support in 4.1, so I will use only the collation-related code from your file. I also have a closely related contribution for the Czech language, and I'm going to think about how to join the two while avoiding code duplication. I will give you feedback if I have news or questions for you.

Regards! And thanks again!

Begin forwarded message:

Date: Tue, 13 Aug 2002 03:48:04 +0200
From: Danilo Šegan <dse@gmx.net>
To: internals <inte@lists.mysql.com>
Subject: MySQL UTF8 Serbian (and other) charset support: basic implementation

Note: Attachments are not allowed, so you can find the files referred to in the text at http://alas.matf.bg.ac.yu/~mm01142/mysql-srpski/

The reasoning:
- I have a home library of some 2-3 thousand books, with titles in Russian, English, German, Serbian and other languages. The best encoding for all of these is UTF-8 (because it is transparently accepted anywhere null-terminated strings are accepted).
- I use the Serbian language, with its default Cyrillic writing system and a Latin transcription (with almost a 1-1 correspondence between the two). For me, Cyrillic A is the same as Latin A, and so is any other letter that has an equivalent in both. So I need a flexible sorting order.
- I didn't want to spend too much time developing the code, so there are some "intentional" (resulting from laziness) limits: the format of the data file is pretty strict, and speed was not a major concern (some straightforward optimizations could be made easily).
- UTF-16 and IBM ICU (which uses the former as its base encoding) are not suitable for my purposes.
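To make the second point concrete, here is a minimal sketch in C of what "Cyrillic A is the same as Latin A" can mean for sorting: decode a UTF-8 character from an ordinary null-terminated string, then map it to a primary weight that both scripts share. The function names (utf8_decode, primary_weight, first_primary_weight) and the weight values are hypothetical and chosen for illustration; they are not taken from ctype-srpski.c.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s and store the code point in *cp.
   Returns the number of bytes consumed, or 0 at the terminating NUL or on a
   malformed/truncated sequence. Minimal sketch: overlong forms and
   surrogates are not rejected. */
static size_t utf8_decode(const char *s, unsigned long *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (p[0] == 0)
        return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else return 0;

    for (i = 1; i < len; i++) {
        /* A NUL here fails the test too, so we never read past the end. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}

/* Hypothetical primary weights: a Cyrillic letter and its Latin
   counterpart share one weight. Only two letters are covered. */
static int primary_weight(unsigned long cp)
{
    switch (cp) {
    case 0x0041: case 0x0061:   /* Latin A, a */
    case 0x0410: case 0x0430:   /* Cyrillic A, a */
        return 1;
    case 0x0042: case 0x0062:   /* Latin B, b */
    case 0x0411: case 0x0431:   /* Cyrillic Be, be */
        return 2;
    default:
        return 0;               /* not covered by this sketch */
    }
}

/* Primary weight of the first character of a UTF-8 string,
   or -1 if it cannot be decoded (including the empty string). */
static int first_primary_weight(const char *s)
{
    unsigned long cp;
    return utf8_decode(s, &cp) ? primary_weight(cp) : -1;
}
```

With this table, first_primary_weight("A") and first_primary_weight("\xD0\x90") (Cyrillic A, U+0410) return the same value, so a comparison based on these weights treats the two letters as equal at the primary level. Because no byte of a multi-byte UTF-8 sequence is zero, the decoder can rely on the usual NUL terminator.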

Features:
- Any resemblance to the Unicode Collation Algorithm is purely coincidental :) I do not claim to have implemented the UCA, in part or in whole.
- "Contractions" in UCA terminology are supported (required for NJ, LJ, DZ: combinations of two or more letters acting as one).
- 4 levels of differences (something like the primary, secondary and tertiary differences in UCA, but not fixed).
- The significance of each level of differences can be changed.
- LIKE is not yet working as I want it to (apparently it does not use my_strcoll or my_strxfrm; I'll have to check the sql/opt_range.cc code).
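As an illustration of the contraction and level features listed above, here is a rough sketch in C. It is not the code from ctype-srpski.c: the contraction table, the weight scheme, and the names next_element, collate, BASE_FIRST and CASE_FIRST are all assumptions made for this example, and it handles ASCII only. Each string is read as a sequence of elements with one weight per level, and the order[] argument decides which level is the most significant.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define LEVELS 4   /* the post mentions 4 levels; this sketch fills only 2 */

/* Hypothetical contraction table: letter pairs that sort as one letter
   placed right after their first letter (ASCII only, for brevity). */
static const char *const contractions[] = { "lj", "nj", "dz" };

/* Two example level orders: base letters first, or case first. */
static const int BASE_FIRST[LEVELS] = { 0, 1, 2, 3 };
static const int CASE_FIRST[LEVELS] = { 1, 0, 2, 3 };

/* Read one collation element from *s, advance *s, and fill w[].
   w[0] = base letter (doubled so a contraction fits between two letters),
   w[1] = case (0 lower, 1 upper). Returns 0 at the end of the string. */
static int next_element(const char **s, int w[LEVELS])
{
    const unsigned char *p = (const unsigned char *)*s;
    size_t i;

    memset(w, 0, LEVELS * sizeof w[0]);
    if (p[0] == '\0')
        return 0;
    w[0] = 2 * tolower(p[0]);
    w[1] = isupper(p[0]) ? 1 : 0;
    for (i = 0; i < sizeof contractions / sizeof contractions[0]; i++) {
        const char *c = contractions[i];
        if (tolower(p[0]) == c[0] && p[1] != '\0' && tolower(p[1]) == c[1]) {
            w[0] += 1;          /* sorts after the plain first letter */
            *s += 2;
            return 1;
        }
    }
    *s += 1;
    return 1;
}

/* Compare a and b level by level; order[] lists the levels from most to
   least significant. Returns <0, 0 or >0, like strcmp. */
static int collate(const char *a, const char *b, const int order[LEVELS])
{
    int k;

    assert(a != NULL && b != NULL && order != NULL);
    for (k = 0; k < LEVELS; k++) {
        const char *pa = a, *pb = b;
        int wa[LEVELS], wb[LEVELS];

        assert(order[k] >= 0 && order[k] < LEVELS);
        for (;;) {
            int ha = next_element(&pa, wa);
            int hb = next_element(&pb, wb);
            if (!ha || !hb) {
                if (ha != hb)
                    return ha ? 1 : -1;   /* the shorter string sorts first */
                break;                    /* equal on this level, try the next */
            }
            if (wa[order[k]] != wb[order[k]])
                return wa[order[k]] - wb[order[k]];
        }
    }
    return 0;
}
```

With BASE_FIRST, "luk" sorts before "ljubav" because "lj" is treated as a single letter that follows "l", whereas strcmp orders them the other way round. Swapping the level order changes the result for mixed-case input: "Ana" sorts before "bob" with BASE_FIRST but after it with CASE_FIRST.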

For an example of the data file, see the attached srpski.txt. Also attached is ctype-srpski.c (the source code), which has a hard-coded file location of "/usr/local/mysql/share/mysql/charsets/srpski.txt".

Tested with MySQL 3.23.51 on Linux 2.2.19. Some client programs needed recompilation (like PHP).

Warning: the code is really, really ugly. Comments might be misleading! I used some code I wrote quite a bit earlier. I'm sending it out just so anyone can see whether there's any use for it. If there is, I might clean it up a bit (or you could do it :). It's not ready for general use. It could be a base for a future UCA implementation, and it shows that UTF-8 can be used with MySQL (contrary to superstition).

Though I put the code in the public domain (no restrictions), I'd like to be notified of any improvements you make. If you need help understanding the code (I would too, had I not written it), complain, and I'll add better comments.

Flames will be read, and then responded to :) All other mail will be forwarded to /dev/null (just kidding, please post to either inte@lists.mysql.com or dse@gmx.net or the address mentioned in the source).

The ctype, to_lower, to_upper and sort_order tables are copied verbatim from some other ctype-*.c file and haven't been checked. Any suggestions about them?

One more question before I browse the sql/* code: how do I make LIKE and RLIKE use strcoll?

Hoping someone (at least one :) will find this useful, with best regards

Files:
http://alas.matf.bg.ac.yu/~mm01142/mysql-srpski/ctype-srpski.c
http://alas.matf.bg.ac.yu/~mm01142/mysql-srpski/srpski.txt

Danilo Šegan


Regards,