Java parsing UTF8

Question

I have the following issue with a UTF-8 file structured as follows:

FIELD1§FIELD2§FIELD3§FIELD4

Looking at the hexadecimal values in the file, it uses A7 to encode §. So according to this encoding it should be UTF-8, but that's strange because A7 > 7F, so 1 byte shouldn't be enough to encode §.

So I tried using a BufferedReader directly with a specified charset:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(input), utf8))

but when I try to tokenize the string with

SmartTokenizer st = new SmartTokenizer(toTokenize, "§")

(the SmartTokenizer is a modified version of the StringTokenizer that keeps empty tokens)

no splitting occurs, and if I try to print the string I obtain

FIELD1?FIELD2?FIELD3?...

so the § used in the file is different from the one specified as the delimiter, and it can't be printed either.

So what's the problem here? Maybe the original file should use 2 bytes to store §?

Joachim Sauer · Accepted Answer · 2010-04-06 16:41:54Z

6

The UTF-8 encoding of § is 0xC2 0xA7.

If the file uses A7 to represent §, then it's probably written in ISO-8859-1 (or another ISO-8859-* encoding, or one of their derivatives).
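To check the byte values above, here is a minimal sketch. The class name `SectionSignBytes` and the helper `hex` are made up for illustration. It encodes § in both charsets, then decodes a lone 0xA7 byte each way. As far as I know, `new String(bytes, charset)` replaces malformed input with U+FFFD, which would explain the `?` in the printed output.

```java
import java.nio.charset.StandardCharsets;

public class SectionSignBytes {
    // Hypothetical helper: formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The section sign as a Unicode escape, so the encoding of this
        // source file does not affect the result.
        String section = "\u00A7";

        System.out.println(hex(section.getBytes(StandardCharsets.UTF_8)));      // C2 A7
        System.out.println(hex(section.getBytes(StandardCharsets.ISO_8859_1))); // A7

        // A lone 0xA7 is a UTF-8 continuation byte with no lead byte, so it is
        // malformed as UTF-8 and decodes to the replacement character U+FFFD.
        byte[] loneA7 = { (byte) 0xA7 };
        String asUtf8 = new String(loneA7, StandardCharsets.UTF_8);
        String asLatin1 = new String(loneA7, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.equals("\uFFFD"));  // true
        System.out.println(asLatin1.equals(section)); // true
    }
}
```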


1 Comment

Jack Over a year ago

Yes, I was looking in the wrong direction, after trying to convert between standards and so on. I just told BufferedReader to read using the ISO-8859-1 charset. Thanks!
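The fix described in this comment might look roughly like the sketch below. It assumes the data really is ISO-8859-1. Since `SmartTokenizer` is the asker's own class and isn't shown, this uses `String.split` with a limit of -1 instead, which also keeps empty tokens. The class name `Latin1FieldReader`, the method `readFirstRecord`, and the in-memory sample standing in for the file are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1FieldReader {
    // Reads the first line of the stream as ISO-8859-1 and splits it on the
    // section sign. The -1 limit keeps empty fields, including trailing ones.
    static String[] readFirstRecord(InputStream in) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line = br.readLine();
            return line == null ? new String[0] : line.split("\u00A7", -1);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the file: each delimiter is the single byte 0xA7.
        byte[] sample = "FIELD1\u00A7FIELD2\u00A7\u00A7FIELD4"
                .getBytes(StandardCharsets.ISO_8859_1);

        String[] fields = readFirstRecord(new ByteArrayInputStream(sample));
        System.out.println(Arrays.toString(fields)); // [FIELD1, FIELD2, , FIELD4]
    }
}
```

For a real file, a `FileInputStream` would be passed in place of the `ByteArrayInputStream`. Writing the delimiter as `\u00A7` means the split doesn't depend on the encoding the compiler assumes for the source file.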

leonbloy · Accepted Answer · 2010-04-06 16:41:52Z

1

"Looking at the hexadecimal values in the file, it uses A7 to encode §. So according to this encoding it should be UTF-8"

Uh, why? It's ISO-8859-1 (Latin-1) or a related encoding: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
