5

I have been trying to retrieve "unicode user input" in my Java application for a small utility snippet. The problem is, it seems to be working on Ubuntu "out of the box" which has I guess OS wide encoding at UTF-8 but doesn't work on Windows when run from "cmd". The code in consideration is as follows:

public class SerTest {

    public static void main(String[] args) throws Exception {
        testUnicode();
    }

    public static void testUnicode() throws Exception {
        System.out.println("Default charset: " +
           Charset.defaultCharset().name());
        BufferedReader in  =
           new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        System.out.printf("Enter 'абвгд эюя': ");
        String line = in.readLine();
        String s = "абвгд эюя";
        byte[] sBytes = s.getBytes();
        System.out.println("strg bytes: " + Arrays.toString(sBytes));
        byte[] lineBytes = line.getBytes();
        System.out.println("line bytes: " + Arrays.toString(lineBytes));
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print("--->" + s + "<----\n");
        out.print("--->" + line + "<----\n");
    }

}

Output on Ubuntu (without any changes to configuration):

me@host> javac SerTest.java  && java SerTest
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

Output on windows CMD prompt (in no way affected by JAVA_TOOL_OPTIONS):

E:\>chcp 65001
Active code page: 65001

E:\>java -Dfile.encoding=utf8 SerTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
Default charset: UTF-8
Enter 'абвгд эюя': юя': ': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Exception in thread "main" java.lang.NullPointerException
        at SerTest.testUnicode(SerTest.java:26) # byte[] lineBytes = line.getBytes();
        at SerTest.main(SerTest.java:15)

Output in Eclipse console (after using JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

On Eclipse console, it is working because I have added a system wide environment variable (JAVA_TOOL_OPTIONS) which if possible I would like to avoid.

Output in Eclipse console (after removing JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-61, -112, -62, -80, -61, -112, -62, -79, -61, -112, -62, -78, -61, -112, -62, -77, -61, -112, -62, -76, 32, -61, -111, -17, -65, -67, -61, -111, -59, -67, -61, -111, -17, -65, -67]
--->абвгд эюя<----
--->абвгд �ю�<----

So my question is: what exactly is going on here? What code changes would be required to ensure that this snippet works for all sorts of "Unicode" input?

Sorry for the long winded question and thanks in advance,
Sasuke

2 Answers 2

4

Some notes:

  • -Dfile.encoding=utf8 is not supported and may cause unintended side-effects:

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

  • The Console class will detect and use the terminal encoding but doesn't support 65001 (UTF-8) on Windows - at least, it didn't the last time I tried it

I believe that the correct, documented way to use Unicode with cmd.exe is to use WriteConsoleW and ReadConsoleW.

I wrote a couple of blog posts when I was looking at this:

Sign up to request clarification or add additional context in comments.

3 Comments

Ah, so basically no sane way of reading/writing unicode stuff when writing windows command line apps? And here I was debugging UTFEncoder/Decoder from sun.* packages...
As far as I am aware, there is no cross-platform way. There are a number of 3rd party console libraries out there that may give you a common interface to write to for all platforms but I don't know what level of I18N support they have.
Thanks. I guess I'll have to look into the few curses implementations floating around (like this one: slashie.net/libjcsi) and hope they handle unicode in a sane way. Accepted!
3

NPE is throws when you are trying to call Arrays.toString(lineBytes), that means that lineBytes is null.

lineBytes holds value: line.getBytes(). getBytes() can return null only if UnsupportedEncodingException is throws inside.

It happens on windows because windows command prompt does not support unicode by default. This works on Ubuntu because its command prompt is fully unicode enabled. It works partially with eclipse because Eclipse's console window is a java component that supports unicode for input and does it for output with JAVA_TOOL_OPTIONS.

The bottom line is that you wish to configure windows command prompt to be able to use unicode characters. I saw several discussions on this topic. Please take a look on this one: Unicode characters in Windows command line - how?

I hope this will help you.

4 Comments

That's the way to go. I don't think anyone could add anything to this answer.
Thanks for the reply. A couple of clarifications: The NPE is because of calling getBytes() on line which means line is NULL which doesn't make a lot of sense. I can confirm that there is no UnsupportedEncodingException thrown (at least I don't see it). Lastly, I tried out the suggestion mentioned in the linked thread, same result. Any idea what might be going bad here?
@sasuke, I think you are wrong. See your stack trace: at SerTest.testUnicode(SerTest.java:26)line.getBytes(); at SerTest.main(SerTest.java:15) that means that there are 11 lines between main() and point where NPE is thrown. And this is exactly byte[] lineBytes = line.getBytes();.
Hi Alex, I can tell it's line.getBytes() because I added a new line System.out.println(line) and it gave me null. Also, if you are on Windows, I would appreciate if you could run the same code and let me know if it works for you. Thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.