Subject: Re: [PD] locales for Pd WAS: japanese encoded chars in PD
From: Hans-Christoph Steiner (ha...@eds.org)
Date: Feb 12, 2009 11:21:59 am
On Feb 12, 2009, at 4:40 AM, Bryan Jurish wrote:
moin Hans, moin all,
On 2009-02-12 06:24:44, Hans-Christoph Steiner <ha...@eds.org> appears to have written:
On Feb 11, 2009, at 6:34 AM, Bryan Jurish wrote:
for me, pd *does* display utf-8 strings correctly in message boxes (tested with umlauts äöü, as well as Greek πδ
Hmm, I am not sure that UTF-8 really is well supported. Some chars get thru, but many don't. Here's an example. I typed these chars in a UTF-8 text editor; they are attached as a png and a pd patch. Not quite the same.
... I'm not really sure what (if anything) we can conclude from this. Maybe the text editor is making UTF-8 out of the keyboard input? The Pd patch itself is most certainly not UTF-8 encoded, which makes me suspect that either (a) Pd is dropping non-printing shift bytes (IOhannes has pointed out similar goofiness in t_binbuf, but I thought it was only restricted to NUL bytes) or (b) Tk isn't receiving UTF-8 character codes at all (whether this is Tk's fault or a system configuration issue is another question). At least the latter should be testable with a few quick wish hacks...
Pd does seem to measure the bytes of the string, counting the UTF-8 shift bytes as chars. For example, in barf-both.pd, the message box of the utf-8 example is much longer than the text inside, while with the latin1, it is the correct size.
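To illustrate why a box sized by byte count comes out too wide for UTF-8 text, here is a minimal Python sketch. It is not Pd's code; the helper name `count_utf8_chars` and the sample string are assumptions made for illustration only.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points by skipping UTF-8 continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# Sample text: the umlauts and Greek letters mentioned in the thread,
# written as escapes so the snippet does not depend on the file's encoding.
text = "\u00e4\u00f6\u00fc \u03c0\u03b4"
as_utf8 = text.encode("utf-8")
as_latin1 = "\u00e4\u00f6\u00fc".encode("latin-1")

print(len(as_utf8))               # bytes in the UTF-8 form
print(count_utf8_chars(as_utf8))  # characters actually displayed
print(len(as_latin1))             # latin-1: one byte per character
```

In the UTF-8 form the byte count is nearly double the character count, which would match a message box that is much longer than its text; in latin-1 the two numbers agree.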
I don't know if you have followed Pd-devel 0.41.4 at all, but I have gotten to the point where the GUI is 100% Tcl/Tk so playing with this stuff should be a lot easier. Check out the branch, if you would like to try things.
Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd error message from Pd though:
Pd: buffer space wasn't sufficient for long GUI string (repeated 3 times)
I am guessing that the above error comes from the fact that Pd is written for latin1 where every char is always 1 byte, so sending UTF-8 could confuse things, since UTF-8 can have multi-byte chars.
Kinda; but why is it only the presence of *latin-1* message boxes that cause complaints about "long GUI strings" (try deleting the utf-8 message box & reloading: the error disappears). I think an error is certainly justified in this case (we're feeding a latin-1 encoded message box to a Pd using a UTF-8 locale); I was just surprised by the form the error took ;-)
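As a rough illustration of why latin-1 data is a problem for anything expecting UTF-8, the Python sketch below feeds latin-1 bytes to a strict UTF-8 decoder. This only shows the general mismatch; whether it is the actual cause of the "long GUI string" message is not established in this thread.

```python
# The umlauts from the test patch, encoded as latin-1: b"\xe4\xf6\xfc".
latin1_bytes = "\u00e4\u00f6\u00fc".encode("latin-1")

# A strict UTF-8 decoder rejects these bytes: 0xE4 announces a multi-byte
# sequence, but the bytes that follow are not valid continuation bytes.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# A lenient decoder substitutes U+FFFD for whatever it cannot interpret.
lossy = latin1_bytes.decode("utf-8", errors="replace")
print(decoded_ok, repr(lossy))
```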
I think that Tcl/Tk tries to guess the locale of the data coming in from the network socket, then translate it to UTF-8 and back. Some of the weirdness we are seeing could be related to that. In Pd-devel, it's much clearer, so it would be straightforward to play with this encoding translation stuff, and perhaps turn it off. Ideally we could have UTF-8 coming from Pd so that Tk doesn't need to do any translation. That could speed up things like array/graph redrawing.
I don't know for sure, but I suspect one problem might be in the interpretation of user input
I don't know about the pd side, but Tcl/Tk is all UTF-8 natively, so that is no problem.
Hmm... not sure what you mean by "natively" here... I mean, Perl uses UTF-8 as its "native" string encoding, but you can still manipulate byte strings, read & write files etc in other encodings too.
Yes, same idea. Internally, Tcl/Tk is using UTF-8, but it can freely translate between other encodings.
If we're talking about user input and the Pd GUI, I think the main issue is how keyboard input is captured by Tk and passed on to Pd. If the keyboard input is being grabbed by Tk bind()ing KeyPress events, then maybe we just need to edit that bind() call... looks like the KeyPress relevant "%"-substitutions are (from the Tk bind() manpage):
%k - The keycode field from the event. Valid only for KeyPress and KeyRelease events.
%A - Substitutes the UNICODE character corresponding to the event, or the empty string if the event does not correspond to a UNICODE character (e.g. the shift key was pressed). XmbLookupString (or XLookupString when input method support is turned off) does all the work of translating from the event to a UNICODE character. Valid only for KeyPress and KeyRelease events.
%K - The keysym corresponding to the event, substituted as a textual string. Valid only for KeyPress and KeyRelease events.
%N - The keysym corresponding to the event, substituted as a decimal number. Valid only for KeyPress and KeyRelease events.
... so if we're lucky, we can just replace "%k" with "%A" and all will be good... except for file I/O, which will likely still be done at a raw byte level. At this point, all "pure" latin-1 patches will proceed to break (maybe just display problems, maybe more serious). If we say we're going whole-hog utf-8, we can say that it's the user's problem to recode any such files (e.g. with iconv or recode; I'm happy to help out with a few scripts); otherwise we might want to do something paranoid and try to guess a patch's encoding when it's loaded. Or we use locale-dependent functions, but that makes sharing patches harder between people using different locales. Or we use the XML-style solution and just save the encoding to use in the patch header ;-)
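The "guess a patch's encoding when it's loaded" and "recode" options could look roughly like the following Python sketch. The function names `guess_decode` and `recode_to_utf8` are hypothetical, and the heuristic (try strict UTF-8, fall back to latin-1) is an assumption, not something Pd or any existing script is known to do.

```python
def guess_decode(data: bytes):
    """Return (text, encoding_name); try strict UTF-8 first, then latin-1."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails.
        return data.decode("latin-1"), "latin-1"


def recode_to_utf8(data: bytes) -> bytes:
    """Re-encode patch bytes as UTF-8, whatever the guessed input encoding."""
    text, _ = guess_decode(data)
    return text.encode("utf-8")


# Hypothetical patch line containing umlauts, saved as latin-1.
old_patch = "#X msg 10 10 \u00e4\u00f6\u00fc;".encode("latin-1")
print(guess_decode(old_patch)[1])
print(recode_to_utf8(old_patch))
```

The guess is not reliable in general: a latin-1 file whose high bytes happen to form valid UTF-8 sequences would be misread as UTF-8, which is one argument for recording the encoding explicitly in the patch header instead.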
Yeah, this would be a good thing to rewrite. The canvas_key code is definitely in need of refactoring anyway. Pd has never really supported latin1 or any encoding besides ASCII, so I think we should just aim to make everything UTF-8, then make conversion utilities like you mentioned.
bash$ export LC_CTYPE=en_DK.UTF-8
bash$ pd uselocale.pd barf-both.pd ##-- latin-1 displays incorrectly
bash$ export LC_CTYPE=en_DK.ISO-8859-1
bash$ pd uselocale.pd barf-both.pd ##-- all displays ok
If it turns out to work well, we can of course make a trivial "dummy" external out of it for use with "-lib" ...
Hmm, I tried this on Mac OS X and it didn't seem to make a difference. Perhaps it's a platform issue, though on this level, Mac OS X is very much BSD, so I think it should work.
The locale strategy also depends on what locales your system has installed. Here (linux/debian), I can see which locales are installed with:
bash$ locale -a
... I would expect goofiness trying to use "en_DK.UTF-8" if it's not been installed ...
I was using en_US.UTF-8. It seems to me that there is an extra dash in your locale. On Mac OS X, 'locale -a' tells me en_US.ISO8859-1; on debian/stable, it tells me en_US.iso88591. Does every system have a different name for latin1? Arg.... I tried a bunch of variations of the locale, LANG, and LC_CTYPE on Mac OS X, but I couldn't get barf-both.pd to look different.
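One way to compare the differently spelled locale names above is to reduce the codeset part to a canonical key before matching. The Python sketch below does only string manipulation; the function name `codeset_key` is hypothetical, and it does not check whether a locale is actually installed on the system.

```python
def codeset_key(locale_name: str) -> str:
    """Reduce the codeset part of a locale name to lowercase alphanumerics."""
    # "en_US.ISO8859-1" -> codeset "ISO8859-1"; an "@modifier" suffix is dropped.
    _, _, codeset = locale_name.partition(".")
    codeset = codeset.partition("@")[0]
    return "".join(ch for ch in codeset.lower() if ch.isalnum())


# The three spellings of latin1 that appear in this thread.
for name in ("en_DK.ISO-8859-1", "en_US.ISO8859-1", "en_US.iso88591"):
    print(name, "->", codeset_key(name))
```

All three spellings reduce to the same key, so a script could pick whichever installed name from `locale -a` matches the key it wants.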
-- Bryan Jurish "There is *always* one more bug." jur...@ling.uni-potsdam.de -Lubarsky's Law of Cybernetic Entomology
As we enjoy great advantages from inventions of others, we should be glad of an opportunity to serve others by any invention of ours; and this we should do freely and generously. - Benjamin Franklin