| From | Sent On | Attachments |
|---|---|---|
| Danilo Šegan | Aug 12, 2002 6:47 pm | |
| Alexander Barkov | Aug 16, 2002 5:15 am | |
| Subject: | MySQL UTF8 Serbian (and other) charset support: basic implementation | |
|---|---|---|
| From: | Danilo Šegan (dse...@gmx.net) | |
| Date: | Aug 12, 2002 6:47:42 pm | |
| List: | com.mysql.lists.internals | |
Note: Attachments are not allowed, so you can find the files referred to in the text at http://alas.matf.bg.ac.yu/~mm01142/mysql-srpski/
The reasoning:
- I have a home library of some 2-3 thousand books, with titles in Russian, English, German, Serbian and other languages. The best encoding for all of these is UTF-8 (because it is transparently accepted anywhere null-terminated strings are accepted).
- I use the Serbian language, with its default Cyrillic writing system and a Latin transcription (with almost a 1-1 correspondence between the two). For me, a Cyrillic A is the same as a Latin one, and so is any other letter that has an equivalent in both. So I need a flexible sorting order.
- I don't want to spend too much time developing the code, so there are some "intentional" (resulting from laziness) limits: the format of the data file is pretty strict, and speed was not a major concern (some straightforward optimizations could easily be made).
- UTF-16 and IBM ICU (which uses the former as its base encoding) are not suitable for my purposes.
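To illustrate the first point, here is a minimal sketch (not taken from ctype-srpski.c; the function names `utf8_bytes` and `utf8_codepoints` are made up for this example). It relies only on the fact that UTF-8 never produces a 0x00 byte for any character other than U+0000, so the ordinary C string functions keep working on UTF-8 data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length: plain strlen() is enough, because UTF-8 never emits a
   0x00 byte for any character other than U+0000. */
size_t utf8_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Character count: count every byte that is NOT a continuation byte
   (continuation bytes have the bit pattern 10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the three Cyrillic letters U+0410 U+043D U+0430, written as the C literal `"\xD0\x90\xD0\xBD\xD0\xB0"`, `utf8_bytes` gives 6 and `utf8_codepoints` gives 3.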
Features:
- Any resemblance to the Unicode Collation Algorithm is purely coincidental :) I do not claim to have implemented the UCA, in part or in whole.
- "Contractions" in UCA terminology are supported (required for NJ, LJ, DZ: combinations of two or more letters acting as one).
- 4 levels of differences (something like the primary, secondary and tertiary differences in UCA, but not fixed).
- The significance of each level of differences can be changed.
- "LIKE" is not yet working as I want it to (apparently it does not use my_strcoll or my_strxfrm; I'll have to check the sql/opt_range.cc code).
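The idea behind contractions and levels can be sketched roughly as follows. This is an illustration only, not the code in ctype-srpski.c: the table, the weights, the two-level limit and the names `coll_table`, `match`, `make_key` and `sketch_compare` are all assumptions made for the example. Each entry maps a UTF-8 sequence to one weight per level, the longest matching sequence wins (so "lj" is taken as one letter), and levels are compared one after another.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LEVELS 2   /* the post describes 4; two keep the sketch short */

/* One collation element: a UTF-8 sequence plus one weight per level.
   All weights are invented for illustration. */
struct coll_entry {
    const char *seq;               /* may cover several letters (contraction) */
    unsigned char weight[LEVELS];  /* [0] = base letter, [1] = script */
};

static const struct coll_entry coll_table[] = {
    { "a",  { 1, 0 } }, { "\xD0\xB0", { 1, 1 } },  /* a,  U+0430 */
    { "j",  { 2, 0 } }, { "\xD1\x98", { 2, 1 } },  /* j,  U+0458 */
    { "l",  { 3, 0 } }, { "\xD0\xBB", { 3, 1 } },  /* l,  U+043B */
    { "lj", { 4, 0 } }, { "\xD1\x99", { 4, 1 } },  /* lj, U+0459 */
    { "m",  { 5, 0 } }, { "\xD0\xBC", { 5, 1 } },  /* m,  U+043C */
};

/* Longest table entry that is a prefix of p, or NULL if none matches. */
static const struct coll_entry *match(const char *p)
{
    const struct coll_entry *best = NULL;
    size_t best_len = 0, i;

    for (i = 0; i < sizeof coll_table / sizeof coll_table[0]; i++) {
        size_t len = strlen(coll_table[i].seq);
        if (len > best_len && strncmp(p, coll_table[i].seq, len) == 0) {
            best = &coll_table[i];
            best_len = len;
        }
    }
    return best;
}

/* Collect the weights of s for one level; returns the key length. */
static size_t make_key(const char *s, int level, unsigned char *key, size_t cap)
{
    size_t n = 0;

    while (*s != '\0' && n < cap) {
        const struct coll_entry *e = match(s);
        if (e == NULL) { s++; continue; }  /* sketch: skip unknown bytes */
        key[n++] = e->weight[level];
        s += strlen(e->seq);
    }
    return n;
}

/* Compare a and b using only the first nlevels levels. */
int sketch_compare(const char *a, const char *b, int nlevels)
{
    unsigned char ka[64], kb[64];
    int level;

    assert(nlevels >= 1 && nlevels <= LEVELS);
    for (level = 0; level < nlevels; level++) {
        size_t la = make_key(a, level, ka, sizeof ka);
        size_t lb = make_key(b, level, kb, sizeof kb);
        int c = memcmp(ka, kb, la < lb ? la : lb);
        if (c != 0) return c;
        if (la != lb) return la < lb ? -1 : 1;
    }
    return 0;
}
```

With this table, `sketch_compare("lma", "lja", 2)` is negative, because "lj" is one letter that sorts after "l". Comparing Cyrillic U+0430 with Latin "a" gives 0 when only the first level is used and a positive value when the script level is included, which is one way to read "the significance of each level can be changed".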
For an example of the data file, check the attached srpski.txt. Also attached is ctype-srpski.c (the source code), which has the location of the data file hard-coded as "/usr/local/mysql/share/mysql/charsets/srpski.txt".
Tested with MySQL 3.23.51 on Linux 2.2.19. Some client programs needed recompilation (like PHP).
Warning: the code is really, really ugly. Comments might be misleading! I used some code I wrote quite a bit earlier. I'm sending it out just so anyone can see if there's any use for it. If there is, I might clean it up a bit (or you could do it :). It's not ready for general use. It could serve as a base for a future UCA implementation, and it shows that UTF-8 can be used with MySQL (contrary to superstitions).
Though I put the code in the public domain (no restrictions), I'd like to be notified of any improvements you make. If you need help understanding the code (I'd need it myself if I hadn't written it), complain, and I'll add better comments.
Flames will be read, and then responded to :) All other mail will be forwarded to /dev/null (just kidding, please post to either inte...@lists.mysql.com or dse...@gmx.net or the mail mentioned in the source).
The ctype, to_lower, to_upper and sort_order tables are copied verbatim from some other ctype-*.c file, and haven't been checked. Any suggestions about them?
And one more question before I browse the sql/* code: how do I make LIKE and RLIKE use strcoll?
Hoping someone (at least one :) will find this useful. With best regards,
Files:
http://alas.matf.bg.ac.yu/~mm01142/mysql-srpski/ctype-srpski.c
http://alas.matf.bg.ac.yu/~mm01142/mysql-srpski/srpski.txt
Danilo Šegan





