I have a file in ISO-8859-1 encoding. I do the following:

InputStreamReader isr = new InputStreamReader(new FileInputStream(fileLocation));
System.out.println(isr.getEncoding());

And I get UTF8... It looks like FileInputStream or InputStreamReader converts it to UTF-8.

Yes, I know about this approach:

BufferedReader br = new BufferedReader(
     new InputStreamReader(
     new FileInputStream(fileLocation), "ISO-8859-1"));

But I don't know beforehand what encoding my file will have.

How can I read the file while preserving its encoding?

  • You need to guess what the encoding was. There is no way to know for sure unless this is recorded somewhere as well. Commented Sep 6, 2018 at 9:20
  • If you write a file containing only ASCII text, it will be the same regardless of whether you use ASCII-7, UTF-8, ISO-8859-1, Windows-1252 so there is no way to guess which was used from reading the file alone (nor does it matter in that case) Commented Sep 6, 2018 at 9:25
  • You can specify the default encoding to your favourite using the -Dfile.encoding=ISO-8859-1 JVM argument. In this way, you don't have to specify any encoding. Commented Sep 6, 2018 at 9:31
  • In the first example you get UTF-8 because it is configured as your system default encoding, and the default is used every time you don't specify an encoding. Using the system default encoding is a great way to introduce bugs into your program. I would go as far as to say that your program cannot be considered ready until you have had to fix at least one bug related to using the default encoding. :) Commented Sep 6, 2018 at 9:52

1 Answer

Text files are just bytes in some encoding; unfortunately, they do not store that encoding (charset) anywhere.

Sometimes the encoding is recorded somewhere: Unicode text can have an optional BOM character at the beginning of the file, and HTML and XML can declare the charset.
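As a rough illustration of the BOM case, here is a minimal sketch that inspects the first bytes of a file's content. The class name BomSniffer and the method detectBom are made up for this example, and only the UTF-8 and UTF-16 byte order marks are checked; a missing BOM proves nothing about the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or null if no known BOM is present. */
    static Charset detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        // Limitation of this sketch: a UTF-32LE BOM (FF FE 00 00) also matches here.
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(sample));
    }
}
```

If a BOM is found, remember to skip it before handing the remaining bytes to a decoder.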

If you downloaded the file from the internet, the charset may be given in the HTTP header lines. Say it is an HTML file served with Content-Type: text/html; charset=Windows-1251. Then you could read the file as Windows-1251 and always store it as UTF-8, modifying or adding a <meta charset="UTF-8">.
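A possible way to pick the charset out of such a header value is sketched below. ContentTypeCharset and fromContentType are hypothetical names, and the regular expression is a simplification that does not cover every form the header grammar allows.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {

    // Matches charset=NAME or charset="NAME", case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) return fallback;
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both an illegal charset name and a charset this JVM does not support.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(
                fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```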

But in general there is no reliable way to determine a file's encoding. You could do:

  • read the bytes
  • if the bytes decode as UTF-8 without errors in the multibyte sequences, treat it as UTF-8
  • otherwise assume a single-byte encoding, defaulting to Windows-1252 (rather than ISO-8859-1)
  • maybe use word frequency tables of some languages together with encodings, and try those
  • decode the bytes with the determined encoding and write the text to a file as UTF-8
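The steps above, minus the word frequency tables, could be sketched like this. EncodingGuess, guess and convertToUtf8 are names invented for the example. The strict decoder is what makes invalid multibyte sequences fail instead of being silently replaced. Note that pure ASCII content also passes the UTF-8 check, which is harmless because the bytes are identical in both encodings.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
final class EncodingGuess {

    /** Guesses UTF-8 if the bytes decode strictly, otherwise falls back to Windows-1252. */
    static Charset guess(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume a single-byte encoding.
            return Charset.forName("windows-1252");
        }
    }

    /** Reads the source with the guessed encoding and writes the text out as UTF-8. */
    static void convertToUtf8(Path source, Path target) throws IOException {
        byte[] bytes = Files.readAllBytes(source);
        String text = new String(bytes, guess(bytes));
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java EncodingGuess <input file> <output file>
        convertToUtf8(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

This is only a guess: a file in another single-byte encoding, such as Windows-1251, will be read as Windows-1252 and come out garbled.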

There might be a library doing such a thing; combining language recognition and charset recognition.
