|Subject:||Re: [PD] UTF-8 for pd-devel WAS: locales for Pd WAS: japanese encoded chars in PD|
|From:||Hans-Christoph Steiner (ha...@eds.org)|
|Date:||Feb 19, 2009 9:19:55 pm|
On Feb 19, 2009, at 4:13 PM, Bryan Jurish wrote:
moin Hans, moin list,
On 2009-02-19 18:43:49, Hans-Christoph Steiner <ha...@eds.org> appears to have written:
This is good news! While the C changes aren't dead simple, they are not bad. I think they could be slightly simplified. One thing that would make it much easier to read the diff is if you create it without whitespace changes. So like this:
svn diff -x -w
oops, sorry... duly noted for future diffs ... I also set my emacs' tcl-indent-width to 8 ... sorry sorry sorry ...
As for the Tcl changes, I think we can include those now in Pd-devel, as long as they work OK with unchanged C code.
Then once the new Tcl GUI is included we can refactor the C side of things with things like this.
One other thing: it seems that the ASCII chars are handled differently than the UTF-8 chars in g_rtext.c. I think you could instead use wcswidth(), mbstowcs() or other UTF-8 functions as described in the UTF-8 FAQ.
Certainly, but (A) we already have the UTF-8 byte string in keysym, and we need to append that whole string to the buffer anyways, and (B) using wcswidth() & co requires forcing the locale to have a UTF-8 LC_CTYPE. I know I did this in m_pd.c, but I think that was a HACK and that using locale functions here is the Wrong Way To Do It, because it's dangerous, unportable, and slow (warning: rant follows):
__dangerous__: setting the locale is global for all threads of a process; in forcing the locale, we could conceivably mess with desired behavior elsewhere (e.g. in externals).
__unportable__: we don't even know if all users' machines *have* a UTF-8 locale installed, and even if they do, we don't know what it's called. If we don't force the encoding, we're stuck with either "C" (i.e. ASCII; what we've got now in Pd-vanilla), or whatever the user is currently employing (after setlocale(LC_ALL,"")), which makes patches' appearance dependent on the user's encoding (e.g. what we've got now in Pd-vanilla), and doesn't even work in the case of variable-length encodings such as UTF-8.
__slow__: many locale-based conversion functions are known to be pretty darned slow. If we assume we're always dealing with (valid) UTF-8, we can speed things up considerably. Going straight to wchar_t is another option, but would require many more changes on the C side, likely break the C API, and wouldn't solve the locale-dependency of patches' appearances, which I think is a really good argument for UTF-8.
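To make the "dangerous" and "unportable" points concrete, here is a minimal sketch of what forcing a UTF-8 LC_CTYPE tends to look like. The helper name force_utf8_ctype() and the candidate locale names are assumptions for illustration, not code from m_pd.c; which names exist differs from system to system, so the only option is to probe.

```c
#include <locale.h>
#include <stddef.h>

/* Candidate names are guesses (an assumption of this sketch): the name
 * of a UTF-8 locale, if one is installed at all, varies by platform. */
static const char *const utf8_candidates[] = {
    "C.UTF-8", "en_US.UTF-8", "en_US.utf8", NULL
};

/* Hypothetical helper: try each candidate in turn and return the first
 * name setlocale() accepted, or NULL if none was accepted.
 * Side effect: on success LC_CTYPE is changed for the whole process,
 * i.e. for every thread and every loaded external. */
static const char *force_utf8_ctype(void)
{
    size_t i;
    for (i = 0; utf8_candidates[i] != NULL; i++)
        if (setlocale(LC_CTYPE, utf8_candidates[i]) != NULL)
            return utf8_candidates[i];
    return NULL;
}
```

A NULL return means wcswidth() and mbstowcs() cannot be relied on to interpret UTF-8 on that machine, which is the portability problem described above.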
Isn't it pretty safe to assume these days that UTF-8 is supported? One thing I just found out is that Windows uses a 2-byte char natively (UCS-2?), I think Mac OS X uses UTF-8 natively. I think that most Linux tools should work with UTF-8 too, especially since it can work as ASCII.
So you think we can have full UTF-8 support without using those functions?
(rant finished now, sorry)
That said, a faster implementation would probably result from mixing (something like) wcswidth() and strncpy(...,keysym). Functions like wcswidth() and mbstowcs() are pretty easy to cook up if we assume wchar_t is UCS-4 and the multibyte encoding is UTF-8.
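As a rough illustration of how little code a locale-free mbstowcs() replacement needs under those assumptions, here is a minimal sketch. The names u8_decode() and u8_to_ucs4() are hypothetical and not taken from the diff or from Pd; uint32_t stands in for a UCS-4 wchar_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes
 * available) into *cp. Returns the number of bytes consumed, or 0 if
 * the input is empty, truncated, or malformed. */
static size_t u8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80)                { len = 1; min = 0; }
    else if ((c & 0xE0) == 0xC0) { len = 2; min = 0x80;    c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800;   c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; c &= 0x07; }
    else return 0;           /* stray continuation or invalid lead byte */

    if (len > n) return 0;   /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and values beyond U+10FFFF */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* mbstowcs()-like: convert a NUL-terminated UTF-8 string into at most
 * dstlen UCS-4 code points. Returns the number of code points written,
 * or (size_t)-1 on invalid input. No locale is consulted. */
static size_t u8_to_ucs4(const char *src, uint32_t *dst, size_t dstlen)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = strlen(src), out = 0;

    while (n > 0 && out < dstlen) {
        size_t used = u8_decode(s, n, &dst[out]);
        if (used == 0) return (size_t)-1;
        s += used;
        n -= used;
        out++;
    }
    return out;
}
```

Because the encoding is fixed, nothing here depends on setlocale(), so the result is the same on every machine.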
It seems to me that wcswidth() would be used for measuring the length of the text for display in boxes. I suppose strlen() could still be used for allocating and freeing memory, but I think that we should aim for clean code. If you think the current way in your diff is the best, that's fine by me.
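The split between byte length (for memory) and character count (for box width) can be sketched as follows. The helper u8_charcount() is hypothetical and assumes valid UTF-8 input. It counts code points only, so unlike wcswidth() it does not account for double-width or combining characters.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of code points in a NUL-terminated,
 * valid UTF-8 string. Every byte that is not a continuation byte
 * (10xxxxxx) starts a new code point. */
static size_t u8_charcount(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

For a three-character Japanese string encoded as nine bytes, strlen() gives 9 (the size to allocate, plus one for the terminator) while u8_charcount() gives 3 (closer to what the box layout needs).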
There are a number of libraries and code snippets floating about in the net making just such assumptions. In this context: are there any licensing restrictions on code included in pd-devel? So far, I've found one useful-looking (.c,.h) pair in the public domain, as well as some LGPL code from gnulib, which could be linked in statically. There's also code from the Unicode Consortium themselves, but it's pretty monstrous (read "pedantic") and limited to string-to-string conversions.
Well, Pd-vanilla is BSD licensed, and Pd-extended is GPL'ed. For this stage of Pd-devel, it would be good to keep it to something that can be BSD licensed.
On Feb 17, 2009, at 5:53 PM, Bryan Jurish wrote:
So I've tried to get the pd-devel 0.41.4 branch to use UTF-8 across the board. The TK side was easy (as Hans predicted); [snip] The C side is much hairier.
--
Bryan Jurish, jur...@ling.uni-potsdam.de
"There is *always* one more bug." - Lubarsky's Law of Cybernetic Entomology
Access to computers should be unlimited and total. - the hacker ethic