How to read a file while preserving its encoding?

Question

I have a file in ISO-8859-1 encoding. I do the following:

InputStreamReader isr = new InputStreamReader(new FileInputStream(fileLocation));
System.out.println(isr.getEncoding());

And I get UTF8. It looks like FileInputStream or InputStreamReader converts it to UTF-8.

Yes, I know about this approach:

BufferedReader br = new BufferedReader(
     new InputStreamReader(
     new FileInputStream(fileLocation), "ISO-8859-1"));

But I don't know in advance which encoding my file will have.

How can I read a file while preserving its encoding?

You need to guess what the encoding was. There is no way to know for sure unless this is recorded somewhere as well.
– Peter Lawrey, Commented Sep 6, 2018 at 9:20
If you write a file containing only ASCII text, it will be the same regardless of whether you use ASCII-7, UTF-8, ISO-8859-1 or Windows-1252, so there is no way to guess which was used from reading the file alone (nor does it matter in that case).
– Peter Lawrey, Commented Sep 6, 2018 at 9:25
You can set the default encoding to the one you prefer with the -Dfile.encoding=ISO-8859-1 JVM argument. That way, you don't have to specify an encoding explicitly.
– m4gic, Commented Sep 6, 2018 at 9:31
In the first example you get UTF-8 because it is configured as your system default encoding and it is used... by default... every time you don't specify any encoding. Using system default encoding is a great way to introduce bugs to your program. I would go as far as to say that your program can not be considered ready until you have had to fix at least one bug related to using default encoding. :)
– Torben, Commented Sep 6, 2018 at 9:52

Joop Eggen · Accepted Answer · 2018-09-06 09:35:19Z

Files are just bytes. When those bytes are text in some encoding, the file unfortunately does not store the encoding (charset) anywhere.

Sometimes the encoding is recorded somewhere: Unicode text can have an optional BOM character at the beginning of the file, and HTML and XML can specify the charset.
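To illustrate the BOM case, here is a minimal sketch that checks the first bytes of a file for a UTF-8 or UTF-16 byte order mark. The class name BomSniffer and its methods are made up for this example, and it assumes Java 11 or later for readNBytes. No BOM proves nothing: most UTF-8 files are written without one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of any library.
final class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or empty if none is found. */
    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        // Caveat: a UTF-32LE BOM (FF FE 00 00) also matches this check.
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    /** Reads only the first three bytes of the file and checks them for a BOM. */
    static Optional<Charset> detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(3));
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An empty result only means that no BOM was present, so you still need a fallback guess in that case.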

If you downloaded the file from the internet, the charset may be mentioned in the header lines. Say it is an HTML file with Content-Type: text/html; charset=Windows-1251. Then you could read the file as Windows-1251 and always store it as UTF-8, modifying or adding a <meta charset="UTF-8">.
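As a rough sketch of the header case, the following pulls the charset parameter out of a Content-Type value and re-encodes the raw bytes as UTF-8. The class ContentTypeCharset and its methods are hypothetical names for this example. The parsing is deliberately naive (it just splits on ';'), and it does not touch any <meta> tag inside the document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset fromContentType(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the raw bytes with the source charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] raw, Charset source) {
        return new String(raw, source).getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, fromContentType("text/html; charset=Windows-1251", StandardCharsets.UTF_8) gives the Windows-1251 charset, which you can then pass to toUtf8 together with the downloaded bytes.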

But in general there is no reliable way to determine a file's encoding. You could do the following:

read the bytes
if they decode as UTF-8 without errors in the multibyte sequences, treat the file as UTF-8
otherwise assume a single-byte encoding, defaulting to Windows-1252 (rather than ISO-8859-1)
optionally use word frequency tables for some languages together with candidate encodings, and try those
decode the bytes with the determined encoding and write the text to a file as UTF-8
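The steps above, minus the word frequency part, could be sketched roughly like this. The class name EncodingGuess is made up for the example. It relies on a strict CharsetDecoder, which reports malformed input instead of silently replacing it. Keep in mind that this is a heuristic: pure ASCII input is reported as UTF-8, and a Windows-1252 file can happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class EncodingGuess {

    // Assumed fallback for anything that is not valid UTF-8.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    /** Returns UTF-8 if the bytes decode strictly as UTF-8, otherwise the fallback. */
    static Charset guess(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Invalid multibyte sequence, so this is not UTF-8.
            return FALLBACK;
        }
    }

    /** Decodes with the guessed charset and re-encodes the text as UTF-8. */
    static byte[] toUtf8(byte[] bytes) {
        return new String(bytes, guess(bytes)).getBytes(StandardCharsets.UTF_8);
    }
}
```

You would typically feed it Files.readAllBytes(path) and write the result back with Files.write. For large files, checking only a leading chunk may be enough, but a chunk boundary can cut a multibyte sequence in half and cause a false negative.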

There might be a library that does this, combining language recognition and charset recognition.
