1

I have an issue with a UTF-8 file structured as follows:

FIELD1§FIELD2§FIELD3§FIELD4

Looking at the hexadecimal values in the file, it uses A7 to encode §. So according to this encoding it should be UTF-8, but that's strange because A7 > 7F, so 1 byte shouldn't be enough to encode § in UTF-8.

So I tried using a BufferedReader directly with a specified charset:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(input), "UTF-8"));

but when I try to tokenize the string with

SmartTokenizer st = new SmartTokenizer(toTokenize, "§");

(the SmartTokenizer is a modified version of the StringTokenizer that keeps empty tokens)

no splitting occurs, and if I try to print the string I obtain

FIELD1?FIELD2?FIELD3?...

so the § used in the file is different from the one specified as the delimiter, and it can't be printed correctly either.

So what's the problem here? Maybe the original file should use 2 bytes to store §?

2 Answers

6

The UTF-8 encoding of § is 0xC2 0xA7.

If the file uses A7 to represent §, then it's probably written in ISO-8859-1 (or another ISO-8859-* encoding or one of their derivatives).
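To make the byte-level difference concrete, here is a minimal sketch that prints how § is encoded in each charset and what happens when a lone A7 byte is decoded as UTF-8. The class name SectionSignBytes and the helper toHex are made up for this illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class SectionSignBytes {
    // Formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes a single A7 byte as UTF-8; a lone continuation byte is malformed input.
    static String decodeLoneA7AsUtf8() {
        return new String(new byte[] {(byte) 0xA7}, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Written as an escape so this source file's own encoding does not matter.
        String sign = "\u00A7";
        System.out.println(toHex(sign.getBytes(StandardCharsets.UTF_8)));      // prints "C2 A7"
        System.out.println(toHex(sign.getBytes(StandardCharsets.ISO_8859_1))); // prints "A7"
        // The malformed byte is replaced with U+FFFD, which is not equal to §.
        System.out.println(decodeLoneA7AsUtf8().equals("\uFFFD"));             // prints "true"
    }
}
```

The replacement character U+FFFD is most likely what showed up as ? in the printed string, assuming the console could not display it, and it also explains why the "§" delimiter never matched.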

1 Comment

Yes, I was looking in the wrong direction, trying to convert between standards and so on. I just told BufferedReader to read using the ISO-8859-1 charset. Thanks!
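The fix described in this comment could look roughly like the sketch below. It is not the asker's actual code: the class Latin1FieldReader and method readFields are hypothetical names, and String.split with a limit of -1 stands in for the asker's SmartTokenizer (which is not shown) because it also keeps empty tokens.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Latin1FieldReader {
    // Reads every line as ISO-8859-1 and splits it on §, keeping empty fields.
    static List<String[]> readFields(InputStream in) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = br.readLine()) != null) {
                // A limit of -1 keeps empty tokens, including trailing ones.
                rows.add(line.split("\u00A7", -1));
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep this sketch short.
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample bytes as they would appear in the file: § stored as the single byte A7.
        byte[] sample = "FIELD1\u00A7\u00A7FIELD3\n".getBytes(StandardCharsets.ISO_8859_1);
        for (String[] row : readFields(new ByteArrayInputStream(sample))) {
            System.out.println(Arrays.toString(row)); // prints "[FIELD1, , FIELD3]"
        }
    }
}
```

For a real file, a FileInputStream would replace the in-memory stream used here.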
1

Looking at the hexadecimal values in the file, it uses A7 to encode §. So according to this encoding it should be UTF-8

Uh, why? It's ISO-8859-1 (also called Latin-1) or a related encoding: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
