Unicode input in a console application in Java

Question

I have been trying to read Unicode user input in a small Java console utility. It works out of the box on Ubuntu, which I guess uses UTF-8 as its OS-wide encoding, but it doesn't work on Windows when run from cmd. The code in question is as follows:

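import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.util.Arrays;

public class SerTest {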

    public static void main(String[] args) throws Exception {
        testUnicode();
    }

    public static void testUnicode() throws Exception {
        System.out.println("Default charset: " +
           Charset.defaultCharset().name());
        BufferedReader in  =
           new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        System.out.printf("Enter 'абвгд эюя': ");
        String line = in.readLine();
        String s = "абвгд эюя";
        byte[] sBytes = s.getBytes();
        System.out.println("strg bytes: " + Arrays.toString(sBytes));
        byte[] lineBytes = line.getBytes();
        System.out.println("line bytes: " + Arrays.toString(lineBytes));
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print("--->" + s + "<----\n");
        out.print("--->" + line + "<----\n");
    }

}

Output on Ubuntu (without any changes to configuration):

me@host> javac SerTest.java  && java SerTest
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

Output on the Windows cmd prompt (in no way affected by JAVA_TOOL_OPTIONS):

E:\>chcp 65001
Active code page: 65001

E:\>java -Dfile.encoding=utf8 SerTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
Default charset: UTF-8
Enter 'абвгд эюя': юя': ': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Exception in thread "main" java.lang.NullPointerException
        at SerTest.testUnicode(SerTest.java:26) # byte[] lineBytes = line.getBytes();
        at SerTest.main(SerTest.java:15)

Output in Eclipse console (after using JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

In the Eclipse console it works because I have added a system-wide environment variable (JAVA_TOOL_OPTIONS), which I would like to avoid if possible.

Output in Eclipse console (after removing JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-61, -112, -62, -80, -61, -112, -62, -79, -61, -112, -62, -78, -61, -112, -62, -77, -61, -112, -62, -76, 32, -61, -111, -17, -65, -67, -61, -111, -59, -67, -61, -111, -17, -65, -67]
--->абвгд эюя<----
--->Ð°Ð±Ð²Ð³Ð´ Ñ�ÑŽÑ�<----

So my question is: what exactly is going on here? What code changes would be required to ensure that this snippet works for all sorts of "Unicode" input?

Sorry for the long winded question and thanks in advance,
Sasuke

Accepted Answer

Some notes:

-Dfile.encoding=utf8 is not supported and may cause unintended side-effects:

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

The Console class will detect and use the terminal encoding, but it doesn't support code page 65001 (UTF-8) on Windows; at least, it didn't the last time I tried it.

I believe that the correct, documented way to use Unicode with cmd.exe is to use WriteConsoleW and ReadConsoleW.
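The first two notes can be combined into a small sketch: use the Console when one is attached, and otherwise read System.in with an explicitly named charset instead of relying on -Dfile.encoding. ConsoleLineReader is a hypothetical helper, not something from this answer, and it does not work around the code page 65001 limitation mentioned above.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (an assumption for illustration, not part of the answer).
class ConsoleLineReader {
    private final Console console;       // null when no console is attached
    private final BufferedReader reader; // fallback that names its charset explicitly

    ConsoleLineReader(Console console, InputStream in, Charset charset) {
        this.console = console;
        this.reader = new BufferedReader(new InputStreamReader(in, charset));
    }

    // Returns the next line, or null once the end of input is reached.
    String readLine() throws IOException {
        return console != null ? console.readLine() : reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        ConsoleLineReader in =
            new ConsoleLineReader(System.console(), System.in, StandardCharsets.UTF_8);
        String line = in.readLine();
        System.out.println(line == null
            ? "(end of input)"
            : "read " + line.length() + " chars");
    }
}
```

Passing the charset as a parameter keeps the decision visible at the call site, so the result does not depend on the JVM's default charset.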

I wrote a couple of blog posts when I was looking at this:

edited Nov 13, 2023 at 19:27

answered Dec 30, 2011 at 10:48

McDowell

3 Comments

sasuke Over a year ago

Ah, so basically no sane way of reading/writing unicode stuff when writing windows command line apps? And here I was debugging UTFEncoder/Decoder from sun.* packages...

McDowell Over a year ago

As far as I am aware, there is no cross-platform way. There are a number of 3rd party console libraries out there that may give you a common interface to write to for all platforms but I don't know what level of I18N support they have.

sasuke Over a year ago

Thanks. I guess I'll have to look into the few curses implementations floating around (like this one: slashie.net/libjcsi) and hope they handle unicode in a sane way. Accepted!

Answer

The NPE is thrown when you try to call Arrays.toString(lineBytes), which means that lineBytes is null.

lineBytes holds the value of line.getBytes(). getBytes() can return null only if an UnsupportedEncodingException is thrown inside it.

It happens on Windows because the Windows command prompt does not support Unicode by default. It works on Ubuntu because its terminal is fully Unicode-enabled. It works partially in Eclipse because Eclipse's console window is a Java component that supports Unicode for input, and supports it for output when JAVA_TOOL_OPTIONS is set.
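The garbled "line bytes" from the second Eclipse run in the question are consistent with UTF-8 bytes being decoded with a single-byte Windows charset and then re-encoded as UTF-8. The following is a minimal sketch of that effect, assuming windows-1252 is the charset involved (the thread does not say which one it was); MojibakeDemo and misdecode are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MojibakeDemo {
    // Encodes text as UTF-8, then decodes those bytes with the wrong charset,
    // as a mismatched console/JVM pairing might do.
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // The question's test string, written with Unicode escapes so that
        // the encoding of this source file does not matter.
        String s = "\u0430\u0431\u0432\u0433\u0434 \u044d\u044e\u044f";

        // Assumption: windows-1252 stands in for the mismatched charset.
        String garbled = misdecode(s, Charset.forName("windows-1252"));

        // Bytes of the garbled string, for comparison with the question's output.
        System.out.println(Arrays.toString(garbled.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Bytes that have no windows-1252 mapping (0x8D and 0x8F here) decode to U+FFFD, which would account for the replacement characters in the garbled line.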

The bottom line is that you need to configure the Windows command prompt so that it can use Unicode characters. I have seen several discussions on this topic. Please take a look at this one: Unicode characters in Windows command line - how?

I hope this will help you.

edited May 23, 2017 at 12:08

answered Dec 29, 2011 at 14:50

AlexR

4 Comments

Milad Naseri Over a year ago

That's the way to go. I don't think anyone could add anything to this answer.

sasuke Over a year ago

Thanks for the reply. A couple of clarifications: The NPE is because of calling getBytes() on line which means line is NULL which doesn't make a lot of sense. I can confirm that there is no UnsupportedEncodingException thrown (at least I don't see it). Lastly, I tried out the suggestion mentioned in the linked thread, same result. Any idea what might be going bad here?

AlexR Over a year ago

@sasuke, I think you are wrong. Look at your stack trace: "at SerTest.testUnicode(SerTest.java:26)" followed by "at SerTest.main(SerTest.java:15)". That means there are 11 lines between main() and the point where the NPE is thrown, and that is exactly byte[] lineBytes = line.getBytes();.

sasuke Over a year ago

Hi Alex, I can tell it's line.getBytes() because I added a new line System.out.println(line) and it gave me null. Also, if you are on Windows, I would appreciate if you could run the same code and let me know if it works for you. Thanks.
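A null line is what BufferedReader.readLine() is documented to return once the end of the stream has been reached, so the observation above fits a console stream that ended before any line was delivered. Why that would happen under code page 65001 is not established in this thread. The sketch below only shows how to guard against the null; SafeReadLine and nextLine are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;

class SafeReadLine {
    // Wraps readLine() so that end of input shows up as an empty Optional
    // instead of a null that is dereferenced later.
    static Optional<String> nextLine(BufferedReader in) throws IOException {
        return Optional.ofNullable(in.readLine());
    }

    public static void main(String[] args) throws IOException {
        // An empty reader stands in for a console stream that ends immediately.
        BufferedReader empty = new BufferedReader(new StringReader(""));
        Optional<String> line = nextLine(empty);
        System.out.println(line.isPresent()
            ? "got a line"
            : "end of input, nothing to decode");
    }
}
```

With a guard like this, the snippet from the question would report the missing input instead of failing on line.getBytes().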
