Java FileReader encoding issue

Question

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.

Here's my environment:

Windows 2003, OS encoding: CP1252
Java 5.0

My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.

I use the following code to do my work:

   private static String readFileAsString(String filePath)
    throws java.io.IOException{
        StringBuffer fileData = new StringBuffer(1000);
        FileReader fileReader = new FileReader(filePath);
        //System.out.println(fileReader.getEncoding());
        BufferedReader reader = new BufferedReader(fileReader);
        char[] buf = new char[1024];
        int numRead=0;
        while((numRead=reader.read(buf)) != -1){
            String readData = String.valueOf(buf, 0, numRead);
            fileData.append(readData);
            buf = new char[1024];
        }
        reader.close();
        return fileData.toString();
    }

The above code doesn't work. I found the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.

Does this mean that I am not required to set the character encoding myself if I am using FileReader? I am currently getting wrongly decoded data, so what is the correct way to deal with my situation? Thanks.

You should also lose the String.valueOf() inside the loop and use StringBuffer.append(char[], int, int) directly. This saves a lot of copying of the char[]. Also replace StringBuffer with StringBuilder. None of this is about your question, though.
– Joachim Sauer, Commented Mar 30, 2009 at 12:01
I hate to say it, but have you read the JavaDoc right after the part you pasted? You know, the part that says "To specify these values yourself, construct an InputStreamReader on a FileInputStream."?
– Powerlord, Commented Mar 30, 2009 at 13:55
Thanks for your comment. I did read the JavaDoc, but what I am not sure about is whether I should specify these values myself and switch to "construct an InputStreamReader on a FileInputStream".
– nybon, Commented Mar 31, 2009 at 1:05
Yes, if you know the file is in something other than the platform default encoding, you have to tell the InputStreamReader which one to use.
– Alan Moore, Commented Mar 31, 2009 at 4:46

Joachim Sauer · Accepted Answer · 2020-04-28 11:39:39Z

272

Yes, you need to specify the encoding of the file you want to read.

Yes, this means that you have to know the encoding of the file you want to read.

No, there is no general way to guess the encoding of any given "plain text" file.

The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.

Since Java 11 FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).

In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
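
As a minimal sketch of the InputStreamReader approach, the method from the question could be rewritten roughly as follows. The class name ExplicitCharsetReader and the extra charset parameter are illustrative assumptions, not part of the original code, and try-with-resources assumes Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
final class ExplicitCharsetReader {

    // Reads the whole file and decodes its bytes with the charset the caller names,
    // instead of relying on the platform default encoding.
    static String readFileAsString(String filePath, Charset charset) throws IOException {
        StringBuilder fileData = new StringBuilder(1000);
        // try-with-resources closes the reader (and the underlying stream) on every path.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), charset))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                // Append straight from the buffer; no intermediate String is needed.
                fileData.append(buf, 0, numRead);
            }
        }
        return fileData.toString();
    }
}
```

A caller would pass StandardCharsets.UTF_8 for the UTF-8 files and Charset.forName("windows-1252") for the CP1252 ones. The caller still has to know which is which.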

edited Apr 28, 2020 at 11:39

answered Mar 30, 2009 at 9:58

Joachim Sauer


10 Comments

Bhanu Sharma Over a year ago

InputStream is = new FileInputStream(filename); here I got a "file not found" error with a Russian file name

Ferrybig Over a year ago

+1 for the suggestion of using InputStreamReader, however using links in code blocks makes it hard to copy and paste the code, if this can be changed, thx

NobleUplift Over a year ago

Would it be "UTF-8" or "UTF8" in the encodings. According to the Java SE reference on encoding, since InputStreamReader is a java.io class, it would be "UTF8"?

Joachim Sauer Over a year ago

@NobleUplift: the safest bet is StandardCharsets.UTF_8, there's no chance of mistyping there ;-) But yes, if you go with a string, "UTF8" would be correct (although I seem to remember that it will accept both ways).

Stijn de Witt Over a year ago

@JoachimSauer Actually, this is one of the purposes of the Byte Order Mark, along with.. well.. establishing the byte order! :) As such I find it weird that Java's FileReader is not able to automatically detect UTF-16 that has such a BOM... In fact I once wrote a UnicodeFileReader that does exactly that. Unfortunately closed source, but Google has its UnicodeReader which is very similar.
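
The BOM idea in the last comment can be sketched roughly as below. This is not the UnicodeFileReader or UnicodeReader mentioned above; the class BomSniffer and its detect method are hypothetical names for illustration. It only recognizes the UTF-8 and UTF-16 marks and returns a caller-supplied fallback for files without a BOM, so it is not general encoding detection.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named here only for illustration.
final class BomSniffer {

    // Returns the charset indicated by a leading byte order mark,
    // or the supplied fallback when the file does not start with one.
    static Charset detect(String filePath, Charset fallback) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = new FileInputStream(filePath)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r == -1) break;
                n += r;
            }
        }
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return StandardCharsets.UTF_8;
        if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        // Assumption: FF FE is treated as UTF-16LE; a UTF-32LE BOM starts the same way.
        if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        return fallback;
    }
}
```

Note that decoding with the returned UTF_8, UTF_16BE or UTF_16LE charset leaves the mark in the text as a leading '\uFEFF' character, so the caller would still need to skip it.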


Michael Borgwardt · Answer · 2009-03-30 10:07:25Z

80

FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.

If this "best guess" is not correct then you have to specify the encoding explicitly. Unfortunately, before Java 11 FileReader did not allow this (major oversight in the API). Instead, you have to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.
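
To see what a wrong "best guess" does to the text, here is a small sketch. The class MojibakeDemo and its method names are made up for illustration. It encodes a string as UTF-8 and decodes the same bytes with another charset, which is effectively what the one-argument FileReader does on a CP1252 system reading a UTF-8 file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named here only for illustration.
final class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those same bytes with a different charset.
    // Any non-ASCII character comes back as two or more unrelated characters.
    static String decodeUtf8BytesAs(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    // The charset that the one-argument FileReader constructors pick up implicitly.
    static Charset platformDefault() {
        return Charset.defaultCharset();
    }
}
```

For example, decodeUtf8BytesAs("é", Charset.forName("windows-1252")) yields the two characters "Ã©", because the UTF-8 bytes C3 A9 are each mapped to a separate CP1252 character. Printing platformDefault() on the machine in question shows which charset FileReader would assume.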

answered Mar 30, 2009 at 10:07

Michael Borgwardt


8 Comments

monojohnny Over a year ago

"major oversight in the API" - thanks for this explanation - I was wondering why I couldn't find the constructor I was after ! Cheers John

Michael Borgwardt Over a year ago

@Bhanu Sharma: that's an encoding issue at a different level, check where you're getting the filename from, and if it's hardcoded what encoding the compiler uses.

bobince Over a year ago

@BhanuSharma: filename encoding issues are nothing to do with this question. See one of the many existing “why don't Unicode filenames work in Java” questions. Spoiler: java.io APIs like FileReader use C standard library filesystem calls, which can't support Unicode on Windows; consider using java.nio instead.

Stijn de Witt Over a year ago

"FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale." I wouldn't say that. At least on Windows. For some weird technical/historical reasons, the JVM ignores the fact that Unicode is the recommended encoding on Windows for 'all new applications' and instead always acts as if the legacy encoding configured as fallback for legacy apps is the 'platform default'.

Stijn de Witt Over a year ago

I would even go as far as saying that if your Java app does not explicitly specify encodings every time it's reading or writing to files/streams/resources, it's broken, because it can not ever work reliably then.


Andreas Gelever · Answer · 2019-06-05 08:04:22Z

15

For Java 7+ you can use this:

BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);

The documentation lists all the standard charsets.

For example, if your file is in CP1252, use this charset:

Charset.forName("windows-1252");

The documentation on supported encodings lists the other canonical names for Java encodings, for both the java.io and java.nio APIs.

If you do not know exactly which encoding a file uses, you may use a third-party library, such as this tool from Google, which works fairly well.
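
For the specific situation in the question, where each file is either UTF-8 or CP1252, one possible heuristic that needs no third-party library is to decode strictly as UTF-8 and fall back only when the bytes are not valid UTF-8. The sketch below is an assumption-laden illustration, not the Google tool mentioned above; Utf8OrFallback and its methods are hypothetical names. It cannot tell the two encodings apart for pure-ASCII files (where it does not matter), and a CP1252 file could in rare cases happen to form valid UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, named here only for illustration.
final class Utf8OrFallback {

    // Decodes the bytes as UTF-8 if they are well-formed UTF-8,
    // otherwise decodes them with the fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException notUtf8) {
            return new String(bytes, fallback);
        }
    }

    // Convenience wrapper that loads the whole file into memory first (Java 7+).
    static String readFile(String filePath, Charset fallback) throws IOException {
        return decode(Files.readAllBytes(Paths.get(filePath)), fallback);
    }
}
```

A call such as Utf8OrFallback.readFile(path, Charset.forName("windows-1252")) would then cover both kinds of file from the question, at the cost of reading each file fully into memory.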

edited Jun 5, 2019 at 8:04

answered Jun 5, 2019 at 7:44

Andreas Gelever


Radoslav Ivanov · Answer · 2019-01-21 23:48:53Z

8

Since Java 11 you can use this constructor:

public FileReader(String fileName, Charset charset) throws IOException;
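
A short usage sketch of that constructor, assuming a Java 11 or later runtime. The class Java11FileReaderExample and its readLines method are illustrative names only.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, named here only for illustration.
final class Java11FileReaderExample {

    // Reads all lines of the file, decoding with the given charset.
    // Requires Java 11+, where FileReader(String, Charset) was added.
    static List<String> readLines(String fileName, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

On older runtimes this does not compile, because FileReader only has the constructors without a charset parameter there.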

edited Jan 21, 2019 at 23:48

answered Dec 7, 2018 at 4:36

Radoslav Ivanov


Guangtong Shen · Answer · 2019-09-20 22:01:45Z

Combining FileInputStream with InputStreamReader is better than using FileReader directly, because before Java 11 the latter doesn't allow you to specify the charset.

Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you could read lines from a file.

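import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Fields and method of the enclosing class
List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();

public void readAll() throws IOException {
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                new FileInputStream(fileName), charset))) {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;
            // Each line is expected to be "word<TAB>meaning"; skip lines without a tab
            int idx = line.indexOf("\t");
            if (idx < 0) continue;
            words.add(line.substring(0, idx));
            meanings.add(line.substring(idx + 1));
        }
    }
}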

Iefimenko Ievgen · Answer · 2019-09-27 12:36:37Z

0

For non-Latin scripts, for example Cyrillic, you can use something like this (Java 11 or later):

FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);

Also make sure that your .txt file is saved as UTF-8 (not in the default ANSI format). Cheers!

edited Sep 27, 2019 at 12:36

marc_s


answered Sep 10, 2019 at 20:31

Iefimenko Ievgen


1 Comment

user5875755 Over a year ago

there's no other parameter than the file path!
