Subject: Re: [groovy-user] File encoding problem
From: Barzilai Spinak (bar@creacion.com.uy)
Date: Mar 23, 2007 6:39:07 pm
List: org.codehaus.groovy.user

Let's see if I can put some order into all this, and I hope I don't make many conceptual mistakes, but I have "fought" a lot with i18n issues and platform differences. Most of you may already know most of what I'll say but it's always good to remember some details.

Windows XP *can* use Unicode if the program asks for it in some way or (I think) if it was installed with Unicode as the default character encoding. But *normally* most Windows installations will be set up to use a local character encoding by default. In Windows, the most common one for Western Europe/America is technically Windows-1252 (also known as code page 1252), which is more or less a superset of ISO-8859-1, which in turn coincides with Latin-1 (the first "page" or "chart" in the Unicode set, from U+0000 to U+00FF). Win-1252 is a one-byte-per-char encoding.
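To make the "more or less a superset" part concrete, here is a minimal Java sketch. It assumes the runtime ships the windows-1252 charset (mainstream JDKs do, but only a handful of charsets are strictly guaranteed), and the class and method names are just made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VersusLatin1 {
    // Decode a single byte with the given charset and return the resulting code point.
    static int decodeByte(int b, Charset cs) {
        String s = new String(new byte[] { (byte) b }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this runtime provides windows-1252.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // 0xE4 is the letter a-umlaut (U+00E4) in both encodings.
        System.out.printf("0xE4: cp1252=U+%04X latin1=U+%04X%n",
                decodeByte(0xE4, cp1252), decodeByte(0xE4, latin1));
        // 0x80 is the euro sign (U+20AC) in Windows-1252
        // but a control character (U+0080) in ISO-8859-1.
        System.out.printf("0x80: cp1252=U+%04X latin1=U+%04X%n",
                decodeByte(0x80, cp1252), decodeByte(0x80, latin1));
    }
}
```

The two encodings agree on the accented letters and differ only in the 0x80 to 0x9F range, which is why they are so often confused with each other.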

Now, if you go to the Windows console (aka "DOS prompt") you'll find that it's probably using a *different* encoding/charset than the graphical part of Windows. Go to the prompt and type "CHCP". If you are using an English Windows, it will probably show "code page 437", which is more or less the old IBM Extended ASCII set. If you are in Western Europe you'll probably see "code page 850" or something similar. These are also one-byte-per-char encodings, but the upper 128 chars are different from Win-1252. That's why sometimes (in Java) the accented characters you println will look different from the same String displayed in a Swing component, for example.
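A rough Java sketch of that mismatch: encode a character the way the graphical side of Windows would (Windows-1252), then decode the same byte the way the console would (cp850 or cp437). It assumes the runtime provides the windows-1252, IBM850 and IBM437 charsets; the class and method names are hypothetical. Unicode escapes are used so the example does not depend on the encoding of its own source file.

```java
import java.nio.charset.Charset;

class ConsoleCodePageDemo {
    // Encode text with one charset, then decode the same bytes with another.
    static String reinterpret(String text, String from, String to) {
        byte[] bytes = text.getBytes(Charset.forName(from));
        return new String(bytes, Charset.forName(to));
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // e with acute accent, byte 0xE9 in Windows-1252

        String on850 = reinterpret(eAcute, "windows-1252", "IBM850");
        String on437 = reinterpret(eAcute, "windows-1252", "IBM437");

        // Print code points rather than the characters themselves,
        // so the result is readable on any console.
        System.out.printf("cp850 shows U+%04X%n", on850.codePointAt(0)); // U+00DA, capital U with acute
        System.out.printf("cp437 shows U+%04X%n", on437.codePointAt(0)); // U+0398, Greek capital theta
    }
}
```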

Now Java: Internally, all Java chars/Strings are Unicode. But when you have to translate text from/to the outside world, you have to specify in which encoding your data will be coming in (or going out). Your source .java file can also be written in any encoding (with some quirks...). If you don't tell the compiler (javac) what encoding your source code is in, it will use the platform default. Please note that even though you may be running javac from the command prompt (DOS prompt), the "platform default encoding" will be the one Windows reports it to be (the one used in the graphical Windows), and not the one reported in the console/command prompt!!!!!

In any case, if you give the wrong encoding (or your file is in a different encoding and you don't tell javac about it) the file will be wrongly interpreted, and if you're lucky the compilation may even fail. After compilation, the .class file will store all String constants in your code as UTF-8 (or whatever the compiler thought was the correct translation of your source text from the default/given encoding into UTF-8).

Now, imagine your Java program only outputs text using System.out.println() and you run your program from the console. What will happen now? The PrintStream behind System.out will convert the internal Unicode String into the sequence of bytes for your platform default encoding (for instance Windows-1252). But since your console is probably in cp437 or something else from the DOS days, you will see strange characters where your accented letters should be.
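As a sketch of that last step, the following Java snippet pushes the same String through PrintStreams with different encodings and dumps the bytes that would reach the console. It assumes the windows-1252 and IBM850 charsets are available in the runtime; the class and helper names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamEncodingDemo {
    // Print text through a PrintStream using the named encoding and
    // return the emitted bytes as space-separated hex.
    static String printedHex(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + encoding, e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlauts = "\u00E4\u00F6\u00FC"; // the three umlauts from the original script
        System.out.println("windows-1252: " + printedHex(umlauts, "windows-1252")); // E4 F6 FC
        System.out.println("IBM850:       " + printedHex(umlauts, "IBM850"));       // 84 94 81
        System.out.println("UTF-8:        " + printedHex(umlauts, "UTF-8"));        // C3 A4 C3 B6 C3 BC
    }
}
```

If the program writes the Windows-1252 bytes and the console reads them as cp850, the umlauts come out as other symbols, even though the String inside the JVM was perfectly fine.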

Now Groovy: This is where I don't have enough experience. I took the utf8.groovy attachment that Michael sent. First, it is correctly encoded in UTF-8 for the 3 umlauted chars, without BOM. All the tests below were done in Windows using Windows-1252 as platform encoding and cp850 as console encoding.

If you compile it with "groovyc utf8.groovy", the .class file will have a "garbage string" in it, with 6 UTF-8 chars (12 bytes). Without further information, the 3 UTF-8 chars (6 bytes) in the source file are interpreted using the platform default (Win-1252), which is a one-byte-per-char encoding. Therefore, the 6 bytes are interpreted as 6 chars and translated into 12 bytes of UTF-8 "garbage".
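The byte arithmetic can be reproduced in plain Java, without involving groovyc at all. This is only a sketch of the decoding mistake, not of what the Groovy compiler does internally; it assumes the runtime provides windows-1252, and the class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisreadUtf8Demo {
    // Take the UTF-8 bytes of a string and decode them with some other charset,
    // which is what happens when a UTF-8 source file is read with the wrong encoding.
    static String misread(String original, Charset assumedCharset) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumedCharset);
    }

    public static void main(String[] args) {
        String umlauts = "\u00E4\u00F6\u00FC"; // 3 chars
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available

        String garbage = misread(umlauts, cp1252);

        System.out.println("bytes in the UTF-8 source:  "
                + umlauts.getBytes(StandardCharsets.UTF_8).length);   // 6
        System.out.println("chars after misreading:     "
                + garbage.length());                                   // 6
        System.out.println("bytes when stored as UTF-8: "
                + garbage.getBytes(StandardCharsets.UTF_8).length);   // 12
        System.out.println("chars when read as UTF-8:   "
                + misread(umlauts, StandardCharsets.UTF_8).length()); // 3
    }
}
```

The "6" that Michael sees on Windows is the second number: each two-byte UTF-8 sequence has been split into two separate characters.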

However if you compile using "groovyc --encoding utf-8 utf8.groovy", the .class file will have a correct UTF-8 constant String in it.

The problem comes when executing the utf8.class (the correctly compiled one)

1) First test, execute without specifying encoding.

D:\temp>groovy utf8
6

In any case, it should print a "3", since THREE is the number of characters in the internal Java String. However, I think the "groovy" command is using the GroovyClassLoader, and "recompiling" the .groovy file in memory again, and since this time I did not specify an encoding, it will use the platform encoding as in the first of my examples using "groovyc". This seems to be a case of the Groovy behaviour that tries to recompile everything again. I thought it only happened when the .groovy file had changed, but it seems to be doing it always. As a test, I renamed the source file and tried again.

D:\temp>ren utf8.groovy utf8.groovy.OLD
D:\temp>groovy utf8
Caught: java.io.FileNotFoundException: utf8 (D:\temp\utf8)

See? *EVEN THOUGH* it has a completely valid .class file, it tries to look for the source .groovy file and fails when it doesn't find it. This problem is completely unrelated to the email in question, but I'd like to know why Groovy needs the source file even though it has all the necessary .class files. Or is there any way to override this behaviour?

2) Second test, execute the "groovy" command specifying the encoding.

D:\temp>ren utf8.groovy utf8.groovy.otra

D:\temp>groovy -c utf8 utf8
Caught: BUG! exception in phase 'parsing' in source unit 'utf8.groovy' charsetName

D:\temp>groovy --encoding utf8 utf8
Caught: BUG! exception in phase 'parsing' in source unit 'utf8.groovy' charsetName

(And other attempts to specify an encoding)

The problem here:
a) If it didn't try to parse/compile anything (since it's already compiled and up to date!), this problem would not happen.
b) The --encoding or -c options are *ignored* or not passed down to "groovyc" (maybe groovyStarter doesn't know what to do with it?).

As Michael discovered, it can be "fixed" by setting JAVA_OPTS=-Dfile.encoding=UTF-8 since that will change the default encoding for the compiler AND the Java Virtual Machine.
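To check what that setting actually changes, a tiny Java program can report the default encoding the JVM ended up with. This is only a sketch with a made-up class name; how JAVA_OPTS reaches the JVM depends on the launcher script being used.

```java
import java.nio.charset.Charset;

class DefaultEncodingReport {
    // Report the file.encoding system property and the charset the JVM
    // uses when no encoding is given explicitly.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as "java DefaultEncodingReport" and once as "java -Dfile.encoding=UTF-8 DefaultEncodingReport" on the same machine should show whether the option took effect.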

Yes, i18n is difficult, and charsets/encodings are worse! It doesn't help that a lot of implementations don't handle things correctly or completely. Everyone should speak Spanish!

(this became longer than I thought)

BarZ

Michael Baehr wrote:

I'm not sure about Windows XP - isn't it Unicode anyways? It is set to German though. My Linux is en_US.UTF-8.

But I'm not even talking about printing the String to the console (this is a problem with my Linux installation as well, as it seems to have some problems with Unicode), but the interpretation of the Groovy file by the Groovy interpreter/compiler.

How are the -c / --encoding switches supposed to be used?

Michael

Guillaume Laforge schrieb:

What are the default system charsets of your OSes?

On 3/23/07, Michael Baehr <code@googlemail.com> wrote:

Hi there,

I created a Groovy script with the following content:

def s = "äöü" // three German umlaut characters
println s.size()

The script is saved as UTF-8.

If I run it on Linux, it correctly prints "3", but on Windows it prints "6", interpreting each double-byte UTF-8 character as two distinct characters.

I tried to play with the -c / --encoding command line parameters, but got error messages like the following:

$ groovy -c utf-8 test.groovy
Caught: BUG! exception in phase 'parsing' in source unit 'test.groovy' charsetName

Any clues what's going on?

cu