I'm trying to read a file from the SD card, and I've been told it's in Unicode format. However, when I read the file I get the following:

Encoded file

This is the code I'm using to read the file:

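InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath() + "/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
int count = fw.read(buf);  // number of chars actually read, or -1 at end of file
String readString = new String(buf, 0, Math.max(count, 0));  // skip the unused tail of buf
Log.d("courierread", readString);
fw.close();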

If I write that output to a file, this is what I get when I open it in a hex editor: Hex info

Any thoughts on what I need to do to read the file correctly?

2 Answers

Does the file have a byte-order mark (BOM)? If so, see "Reading UTF-8 - BOM marker".
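One way to check for a BOM is to look at the first few raw bytes of the file. The following is only a sketch: the class and method names are made up for illustration, and it covers only the UTF-8 and UTF-16 marks, not UTF-32.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class BomSniffSketch {
    /** Returns the charset implied by a leading BOM, or null if none is recognised. */
    static Charset charsetFromBom(byte[] head) {
        // EF BB BF marks UTF-8
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // FF FE marks little-endian UTF-16
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // FE FF marks big-endian UTF-16
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null; // no BOM: the encoding has to be known some other way
    }

    public static void main(String[] args) {
        // Stand-in bytes: a UTF-16LE BOM followed by the letter "c"
        byte[] head = {(byte) 0xFF, (byte) 0xFE, 0x63, 0x00};
        System.out.println(charsetFromBom(head));
    }
}
```

A file without a BOM returns null here, which says nothing about its encoding either way.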

EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".
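As a rough sketch of that suggestion, the reader can be opened with UTF-16LE and drained in a loop. The class and method names below are invented for the example, and the byte array stands in for the real file, so swap in a FileInputStream for actual use. Java's "UTF-16LE" decoder does not consume a BOM, so the sketch drops a leading U+FEFF by hand. StandardCharsets needs Android API 19 or later; on older versions Charset.forName("UTF-16LE") should work instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf16LeReadSketch {
    /** Decodes the whole stream as little-endian UTF-16 and drops a leading BOM if present. */
    static String readUtf16Le(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE)) {
            char[] buf = new char[255];
            int n;
            // read() reports how many chars it filled; append only those
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        // The UTF-16LE decoder passes a BOM through as the character U+FEFF
        if (sb.length() > 0 && sb.charAt(0) == '\uFEFF') {
            sb.deleteCharAt(0);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file: BOM (FF FE), then "Hi" in UTF-16LE
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};
        System.out.println(readUtf16Le(new ByteArrayInputStream(sample)));
    }
}
```

Plain "UTF-16" may fail on such a file because, when there is no BOM, Java's decoder for that name assumes big-endian byte order.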

7 Comments

Not sure, but I tried applying the BOM removal code and it seemed to make it worse! I suppose the easiest solution is to strip out all those weird A characters. Unfortunately, I don't know the Unicode character I'd need to match to do so.
Stripping out those characters wouldn't be solving the problem. Are you sure it's a UTF-8 file? Can you look at the file in a hex editor and post a screen shot or the hex codes of the first few bytes?
All I know is that it's Unicode. I tried UTF-16 and it was completely unreadable; it was just made up of lots of dodgy characters. As requested, I've output the hex codes for each character (see the original post). It appears that there is a 0 in between every character.
A single 0 between the characters doesn't make much sense. If there really were a 0 byte, it would be 00. The problem with your output is that it has already been processed by (possibly wrong) Java code, so looking at the file in an "independent" hex editor would be better.
Thanks. That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".

The file you show in the hex editor is not UTF-8 encoded; it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).

If it were UTF-8 encoded, every character that exists in ASCII would take just a single byte, with no 00 bytes in between.
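To see the difference in bytes, the same ASCII text can be encoded both ways and printed as hex. This is just an illustration: the class name, the hex helper and the sample string "cmd" are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class EncodingWidthSketch {
    /** Formats bytes as space-separated two-digit hex, e.g. "63 00". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "cmd";
        // UTF-8: one byte per ASCII character
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));    // 63 6D 64
        // UTF-16LE: two bytes per character, low byte first
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // 63 00 6D 00 64 00
    }
}
```

The 00 after each letter in the second line is the pattern that points to UTF-16LE when it shows up in a hex editor.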

1 Comment

Interesting tip, thanks for that. I'll try creating different files with different types of encoding.. I guess that is the easiest way to learn the difference
