
How to read a UTF8 encoded file in Java into a String accurately?

When I change the encoding of this .java file to UTF-8 (Eclipse > Right-click on App.java > Properties > Resource > Text file encoding), it works fine from within Eclipse but not from the command line. It seems Eclipse is setting the file.encoding parameter when running App.

Why should the encoding of the source file have any impact on creating a String from bytes? What is the fool-proof way to create a String from bytes when the encoding is known? I may have files with different encodings. Once the encoding of a file is known, I should be able to read it into a String regardless of the value of file.encoding, right?
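To make the decode step concrete, here is a minimal sketch of what I expect to work, assuming the bytes really are UTF-8 (a temp file stands in for my real utf8text.txt):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DecodeSketch {
    public static String readUtf8(Path file) throws IOException {
        byte[] fileBytes = Files.readAllBytes(file);          // raw bytes, no interpretation
        return new String(fileBytes, StandardCharsets.UTF_8); // decode with the known charset
    }

    public static void main(String[] args) throws IOException {
        // Temp file as a stand-in for C:/sources/TestUtfRead/utf8text.txt
        Path tmp = Files.createTempFile("utf8text", ".txt");
        Files.write(tmp, "Korean 안녕하세요.".getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8(tmp));
        Files.delete(tmp);
    }
}
```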

The content of the UTF-8 file is below:

English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.

-end of file-

The code is below. My observations are in the comments within.

public class App {
public static void main(String[] args) {
    String slash = System.getProperty("file.separator");
    File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
    File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
    File outputUtfByteWrittenFile = new File(
            "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
    outputUtfFile.delete();
    outputUtfByteWrittenFile.delete();

    try {

        /*
         * read a utf8 text file with internationalized strings into bytes.
         * there should be no information loss here, when read into raw bytes.
         * We are sure that this file is UTF-8 encoded. 
         * Input file created using Notepad++. Text copied from Google translate.
         */
        byte[] fileBytes = readBytes(inputUtfFile);

        /*
         * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
         */
        String str = new String(fileBytes, StandardCharsets.UTF_8);

        /*
         * The console is incapable of displaying this string.
         * So we write into another file. Open in notepad++ to check.
         */
        ArrayList<String> list = new ArrayList<>();
        list.add(str);
        writeLines(list, outputUtfFile);

        /*
         * Works fine when I read bytes and write bytes. 
         * Open the other output file in notepad++ and check. 
         */
        writeBytes(fileBytes, outputUtfByteWrittenFile);

        /*
         * I am using JDK 8u60.
         * I tried running this on command line instead of eclipse. Does not work.
         * I tried using apache commons io library. Does not work. 
         *  
         * This means that new String(bytes, charset); does not work correctly. 
         * There is no real effect of specifying charset to string.
         */
    } catch (IOException e) {
        e.printStackTrace();
    }

}

public static void writeLines(List<String> lines, File file) throws IOException {
    BufferedWriter writer = null;
    OutputStreamWriter osw = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        osw = new OutputStreamWriter(fos);
        writer = new BufferedWriter(osw);
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            writer.write(line);
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    } catch (IOException e) {
        throw e;
    } finally {
        close(writer);
        close(osw);
        close(fos);
    }
}

public static byte[] readBytes(File file) {
    FileInputStream fis = null;
    byte[] b = null;
    try {
        fis = new FileInputStream(file);
        b = readBytesFromStream(fis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fis);
    }
    return b;
}

public static void writeBytes(byte[] inBytes, File file) {
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        writeBytesToStream(inBytes, fos);
        fos.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fos);
    }
}

public static void close(InputStream inStream) {
    try {
        inStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    inStream = null;
}

public static void close(OutputStream outStream) {
    try {
        outStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    outStream = null;
}

public static void close(Writer writer) {
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        writer = null;
    }
}

public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
    int bytesread = -1;
    byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
    long count = 0;
    bytesread = readStream.read(b);
    while (bytesread != -1) {
        writeStream.write(b, 0, bytesread);
        count += bytesread;
        bytesread = readStream.read(b);
    }
    return count;
}
public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
    ByteArrayOutputStream writeStream = null;
    byte[] byteArr = null;
    writeStream = new ByteArrayOutputStream();
    try {
        copy(readStream, writeStream);
        writeStream.flush();
        byteArr = writeStream.toByteArray();
    } finally {
        close(writeStream);
    }
    return byteArr;
}
public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
    ByteArrayInputStream bis = null;
    bis = new ByteArrayInputStream(inBytes);
    try {
        copy(bis, writeStream);
    } finally {
        close(bis);
    }
}
}

Edit: For @JB Nizet, And Everyone :)

//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work. 
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

I need to specify the encoding of the bytes when reading them into a String, and again when writing a String out to a file as bytes.

Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?

When I write to a file, it should convert the String into the default charset of my machine (be it UTF-8, ASCII, or cp1252). That is failing. It fails for UTF-16BE too. Why does it fail for some charsets?

3 Comments:
  • "Seems like eclipse is setting file.encoding parameter when running App." - No, I think it's much more likely that changing the encoding Eclipse understands for the file is changing the bytes stored on disk. (Oct 8, 2015 at 12:39)
  • But note that writeLines is using the platform default encoding... that sounds like a bad idea to me. (Oct 8, 2015 at 12:40)
  • Consider using the NIO.2 File API. And try-with-resources. (Oct 8, 2015 at 12:50)
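For reference, the NIO.2 plus try-with-resources approach the last comment suggests could look like this (a sketch; a temp file stands in for the real path, and the charset is passed explicitly so file.encoding never matters):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class Nio2Sketch {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8text", ".txt"); // stand-in for the real path
        List<String> lines = Arrays.asList("English Hello World.", "Korean 안녕하세요.");

        // The writer is closed automatically by try-with-resources
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        }

        List<String> readBack = Files.readAllLines(file, StandardCharsets.UTF_8);
        System.out.println(readBack.equals(lines)); // true
        Files.delete(file);
    }
}
```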

1 Answer


The Java source file encoding is indeed irrelevant. And the reading part of your code is correct (although inefficient). What is incorrect is the writing part:

osw = new OutputStreamWriter(fos);

should be changed to

osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you use the platform default encoding (which doesn't appear to be UTF-8 on your system) instead of UTF-8.
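With that one-line change, the writing path can be sketched like this (same logic as the question's writeLines, condensed with try-with-resources, and taking the charset as a parameter):

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.List;

public class WriteLinesFixed {
    public static void writeLines(List<String> lines, File file, Charset charset) throws IOException {
        // The charset is explicit, so the platform default is never consulted
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), charset))) {
            String sep = System.getProperty("line.separator");
            for (int i = 0; i < lines.size(); i++) {
                writer.write(lines.get(i));
                if (i < lines.size() - 1) {
                    writer.write(sep);
                }
            }
        }
    }
}
```

Called as `writeLines(list, outputUtfFile, StandardCharsets.UTF_8);`, the output file will contain UTF-8 bytes regardless of file.encoding.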

Note that Java allows using forward slashes in file paths, even on Windows. You could simply write

File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");

EDIT:

Once I have a String in JVM, I do not need to remember the source byte encoding, am I right?

Yes, you're right.
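A String holds characters, not bytes, so once decoded it can be re-encoded in any charset that covers its characters. A small sketch of that round trip:

```java
import java.nio.charset.StandardCharsets;

public class CharsetIndependence {
    public static void main(String[] args) {
        // Bytes in one encoding...
        byte[] utf8 = "世界こんにちは。".getBytes(StandardCharsets.UTF_8);
        // ...decoded into a String (chars, with no source encoding attached)...
        String s = new String(utf8, StandardCharsets.UTF_8);
        // ...can be re-encoded losslessly in a completely different charset.
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(new String(utf16, StandardCharsets.UTF_16BE).equals(s)); // true
    }
}
```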

When I write to file, it should convert the String into the default Charset of my machine (be it UTF8 or ASCII or cp1252). That is failing.

If you don't specify any encoding, Java will indeed use the platform default encoding to transform the characters into bytes. If you specify an encoding (as suggested in the beginning of this answer), then it uses the encoding you tell it to use.

But not all encodings can represent every Unicode character the way UTF-8 can. ASCII, for example, only supports 128 different characters, and cp1252 only supports 256. So the encoding succeeds, but it replaces each unencodable character with a replacement character (for single-byte charsets in Java, typically '?'), which means: I can't encode this Thai or Russian character because it's not part of my supported character set.
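This replacement behavior is easy to observe directly: `String.getBytes(Charset)` always substitutes the charset's default replacement byte (0x3F, '?', for US-ASCII) rather than throwing. A sketch:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String russian = "Привет мир.";
        // ASCII cannot represent Cyrillic; each such character becomes '?'
        byte[] ascii = russian.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // "?????? ???."
        // The round trip through UTF-8, by contrast, is lossless
        byte[] utf8 = russian.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(russian)); // true
    }
}
```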

UTF-16 encoding should be fine. But make sure to also configure your text editor to use UTF-16 when reading and displaying the content of the file; if it's configured to use another encoding, the displayed content won't be correct.


2 Comments

Thanks. this works. However I have another question. Will write another post.
Thanks! Works well now. UTF16LE works but not BE. My editor might be buggy (haven't updated in a while).
