|Subject:||Re: [PD] japanese encoded chars in PD|
|From:||Bryan Jurish (moo...@ling.uni-potsdam.de)|
|Date:||Feb 11, 2009 3:33:46 am|
barf-both.pd - 0.5k
uselocale.pd - 0.4k
On 2009-02-11 03:04:34, Hans-Christoph Steiner <ha...@eds.org> appears to have written:
On Feb 10, 2009, at 3:14 PM, august wrote:
are there also objects for handling conversions between character encodings? Or, an object to convert between utf8 or UCS-2 and the unicode char code numbers that GEM takes?
Well, there are [bytes2wchars] and [wchars2bytes] in the newest [pdstring] library, which convert between multibyte encodings such as utf8 and your C library's wchar_t, which if I'm not entirely mistaken is a system-dependent encoding, but at least here (linux, glibc), it looks a heckuva lot like UCS-4.
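To make the bytes-versus-code-points distinction concrete, here is a minimal sketch of what a UTF-8 to code point conversion involves. It is not the [pdstring] implementation: the function names `utf8_decode_one` and `utf8_to_codepoints` are made up for illustration, and the code deliberately avoids wchar_t and the C locale so that it behaves the same on every system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from Pd or pdstring): decode one UTF-8 sequence
 * starting at s, with n bytes available. Stores the code point in *cp and
 * returns the number of bytes consumed, or 0 if the input is malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }              /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                    /* stray continuation or invalid lead byte */

    if (n < len) return 0;            /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and values beyond Unicode */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return len;
}

/* Decode a whole buffer into out[0..outmax-1]. Returns the number of code
 * points written, or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_codepoints(const unsigned char *s, size_t n,
                          uint32_t *out, size_t outmax)
{
    size_t i = 0, count = 0;

    while (i < n) {
        uint32_t cp;
        size_t used = utf8_decode_one(s + i, n - i, &cp);
        if (used == 0 || count == outmax) return (size_t)-1;
        out[count++] = cp;
        i += used;
    }
    return count;
}
```

The resulting numbers are plain Unicode code points, which is the kind of value the original question asks about for GEM. Note that a lone latin-1 byte such as 0xE4 is rejected here as malformed UTF-8.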
Is there a default character encoding for PD messages? I assume it is LATIN1 because I have seen umlauts in comments before (I think). It doesn't look like I can make comments in UTF8 encoded chars.
I have my char problems solved right now, but now as I discover more about the difficulties of character encodings and the treachery that ASCII has caused... I am just curious.
It's a weird bastard mix currently of Latin1 and UTF-8. The Tk GUI can handle UTF-8 and uses UTF-8 natively. The C side is basically Latin1 but doesn't really check:
Out of curiosity, I just checked with a variant of 'unibarf.pd' (attached as "barf-both.pd"), and for me, pd *does* display utf-8 strings correctly in message boxes (tested with umlauts äöü, as well as Greek πδ -- other characters can be tested with the [pdstring] help patches). Surprisingly (to me), I don't have to do anything special to get UTF-8 characters displayed correctly, but setting LC_CTYPE=en_US.UTF-8 causes a latin-1 message to be displayed improperly (characters disappear, but are still passed and present in raw byte form).
Setting LC_CTYPE=en_US.UTF-8 and re-loading "unibarf.pd" got me an odd error message from Pd though:
Pd: buffer space wasn't sufficient for long GUI string (repeated 3 times)
... this appears on stderr, rather than the console. I get the same message once for "barf-both.pd"; assumedly due to mis-parsing of the latin-1 message box(es).
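One plausible reading of the above is that latin-1 bytes at or above 0x80 are simply not valid UTF-8 on their own, so a UTF-8-aware display drops them. If that is the cause, the fix on the data side is to transcode latin-1 to UTF-8 before handing it to the GUI. The following is a minimal sketch of such a transcoder; `latin1_to_utf8` is a hypothetical name, not an existing Pd function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Pd): transcode n latin-1 bytes from src
 * into NUL-terminated UTF-8 in dst. Returns the number of bytes written
 * (excluding the NUL), or (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const unsigned char *src, size_t n,
                      unsigned char *dst, size_t dstsize)
{
    size_t i, o = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        unsigned char b = src[i];
        if (b < 0x80) {
            /* ASCII is identical in both encodings */
            if (o + 1 >= dstsize) return (size_t)-1;
            dst[o++] = b;
        } else {
            /* U+0080..U+00FF always need exactly two UTF-8 bytes */
            if (o + 2 >= dstsize) return (size_t)-1;
            dst[o++] = (unsigned char)(0xC0 | (b >> 6));
            dst[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    if (o >= dstsize) return (size_t)-1;   /* no room for the terminator */
    dst[o] = '\0';
    return o;
}
```

For example, the latin-1 bytes E4 F6 FC ("äöü") come out as C3 A4 C3 B6 C3 BC. The catch is that this only helps when the input is known to be latin-1; applying it to text that is already UTF-8 would double-encode it.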
This is something that I would really like to have working properly in Pd-devel. Tcl/Tk is natively UTF-8, so it seems that we should support UTF-8 in Pd. Anyone feel like trying to fix it? I don't understand encodings so well.
I don't know for sure, but I suspect one problem might be in the interpretation of user input -- I use latin-1 myself, so I can't judge whether the Tk GUI accepts UTF-8 input or not (I use [pdstring] or just hack the .pd file for my tests). If we want to be paranoid about things, we're likely to run into problems with symbols too; symbol identity (hash value and raw byte string) can change depending on whether the C internals use UTF-8 strings or not: this depends not only on what they get from the GUI, but also on how file data is interpreted, netsend/netreceive, etc etc... (mostly t_binbuf, I guess). UTF-8 should be largely safe for pd symbols, although I'm not sure whether backslash or brackets can appear as shift bytes for any characters: that could certainly cause problems.
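On the question of whether backslash or brackets can turn up inside a multibyte character: in UTF-8 they cannot, because every byte of a multibyte sequence has its high bit set (lead bytes are 0xC0 and up, continuation bytes are 0x80 to 0xBF). The sketch below checks this by brute force over all non-ASCII code points. The helper names are made up for illustration, and the loop does not bother to skip the surrogate range, which does not affect the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode cp (assumed <= 0x10FFFF) as UTF-8 into buf.
 * Returns the number of bytes written (1 to 4). */
size_t utf8_encode(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Count bytes below 0x80 (the range containing '\\', '[', ']', ';', ',',
 * '$' and so on) that occur inside the encoding of any non-ASCII code point. */
unsigned long count_ascii_bytes_in_multibyte(void)
{
    unsigned long hits = 0;
    uint32_t cp;

    for (cp = 0x80; cp <= 0x10FFFF; cp++) {
        unsigned char buf[4];
        size_t len = utf8_encode(cp, buf), i;
        for (i = 0; i < len; i++)
            if (buf[i] < 0x80) hits++;
    }
    return hits;
}
```

The count is zero, so a byte-wise scan for Pd's special characters stays safe on UTF-8 data. The same is not true of some other multibyte encodings (Shift-JIS, for instance, can contain 0x5C as a trail byte), which may matter for the Japanese text this thread started with.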
As an experiment, you could try calling the following on Pd startup:
  setlocale(LC_ALL,"");       /*-- set locale from environment --*/
  setlocale(LC_NUMERIC,"C");  /*-- ... but leave floats alone! --*/
... and see what breaks (or doesn't) ;-) Alternatively, you can achieve pretty much the same effect with the "locale" external in userspace (see attached "uselocale.pd"). Of course, to test UTF-8 you should have your environment variables set accordingly (in particular LC_CTYPE, potentially via LANG):
  bash$ export LC_CTYPE=en_DK.UTF-8
  bash$ pd uselocale.pd barf-both.pd  ##-- latin-1 displays incorrectly

  bash$ export LC_CTYPE=en_DK.ISO-8859-1
  bash$ pd uselocale.pd barf-both.pd  ##-- all displays ok
If it turns out to work well, we can of course make a trivial "dummy" external out of it for use with "-lib" ...
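For anyone who wants to try the startup experiment outside of Pd first, the two setlocale() calls can be wrapped in a small helper like the one below. This is only a sketch: `use_env_locale` is a hypothetical name, and where exactly such a call would belong in Pd's startup code is left open.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Pd): adopt the locale named by the
 * environment (LC_ALL, LC_CTYPE, LANG, ...) but keep number formatting in
 * the "C" locale so floats still read and print with a '.' decimal point.
 * Returns the name of the LC_CTYPE locale now in effect. */
const char *use_env_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr,
                "warning: locale from environment not supported; "
                "locale left unchanged\n");

    /* the "C" locale is always available, so this call cannot fail */
    setlocale(LC_NUMERIC, "C");
    assert(strcmp(localeconv()->decimal_point, ".") == 0);

    /* passing NULL queries the current setting without changing it */
    return setlocale(LC_CTYPE, NULL);
}
```

Printing the returned string on startup is a quick way to confirm which LC_CTYPE actually took effect, since a misspelled or uninstalled locale name makes the first call fail silently otherwise.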
--
Bryan Jurish                           "There is *always* one more bug."
jur...@ling.uni-potsdam.de      -Lubarsky's Law of Cybernetic Entomology