We are trying to download the source of web pages, but some characters (such as ü, ö, ş, ç) do not display properly because of character encoding. We tried the following code to convert the encoding of the string (the "text" variable):

byte[] xyz = text.getBytes();
text = new String(xyz,"windows-1254"); 

We observed that when the encoding is UTF-8, the pages still do not display correctly. What should we do?

  • You'll need to show the code that actually reads the data, including the declaration of the input stream and/or reader you use. Also, some sample input (or a link to the page you're trying to read). Commented Jan 26, 2010 at 17:09

2 Answers

Tell the String constructor to use the UTF-8 encoding to interpret the bytes, if you know the page encodes its contents as UTF-8.

However, I am not sure this is the full extent of your problem. You already have "text" before you try to "convert" it. This means something has already interpreted the bytes of the page as a String, according to some encoding. If that was the wrong encoding, the original bytes may already be lost, and nothing you do later is guaranteed to fix it.

Instead you need to fix this upstream.

byte[] bytesOfThePage = ...;
String text = new String(bytesOfThePage, "UTF-8");
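To illustrate why the fix has to happen upstream, here is a minimal sketch. The class and method names (DecodeOnce, decodeUtf8, survivesWrongDecode) are hypothetical, and US-ASCII stands in for "some wrong encoding" only because it makes the data loss easy to see; the asker's actual wrong encoding is unknown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodeOnce {
    // Decode the raw page bytes exactly once, with the charset the page really uses.
    static String decodeUtf8(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    // Simulates the upstream mistake: decode with the wrong charset (US-ASCII is an
    // assumption chosen for illustration), then try to get the original bytes back.
    static boolean survivesWrongDecode(byte[] rawBytes) {
        String wrong = new String(rawBytes, StandardCharsets.US_ASCII);
        byte[] roundTripped = wrong.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(rawBytes, roundTripped);
    }

    public static void main(String[] args) {
        // Unicode escapes for "üöşç", so the example does not depend on the source file encoding.
        String sample = "\u00fc\u00f6\u015f\u00e7";
        byte[] raw = sample.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(raw).equals(sample)); // true
        System.out.println(survivesWrongDecode(raw));       // false: the bytes were replaced during decoding
    }
}
```

Once the wrong decode has replaced the unmappable bytes, a later getBytes/new String pair has nothing left to recover, which is why the String must be built from the raw bytes with the right charset in the first place.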

The problem most likely lies exactly where you read, write, and/or display those characters.

If you're reading those characters using a Reader, then you need to construct an InputStreamReader first using the 2-argument constructor wherein you can pass the correct encoding (thus, UTF-8) as 2nd argument. E.g.

reader = new InputStreamReader(url.openStream(), "UTF-8");
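A fuller sketch of the reading side might look like the following. The class name PageReader and the helper readAll are hypothetical, and the code assumes the page really is UTF-8. The helper takes any InputStream, so the stream from url.openStream() could be passed in; the demo uses an in-memory stream instead so that it runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PageReader {
    // Reads the whole stream as text, decoding the bytes as UTF-8 (assumed page encoding).
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a downloaded page: the UTF-8 bytes of "üöşç".
        byte[] pageBytes = "\u00fc\u00f6\u015f\u00e7".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(pageBytes));
        System.out.println(text.length()); // 4 characters decoded from 8 bytes
    }
}
```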

If you're writing those characters to a file, for example, then you need to construct an OutputStreamWriter using the 2-argument constructor, wherein you can pass the correct encoding (thus, UTF-8) as the 2nd argument. E.g.

writer = new OutputStreamWriter(new FileOutputStream("/page.html"), "UTF-8");
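The writing side can be sketched the same way. PageWriter and writeUtf8 are hypothetical names; the helper accepts any OutputStream, so a FileOutputStream could be passed in, while the demo writes to memory to show which bytes are produced.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class PageWriter {
    // Writes the text to the stream, encoding the characters as UTF-8.
    static void writeUtf8(String text, OutputStream out) {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeUtf8("\u00fc\u00f6\u015f\u00e7", sink); // "üöşç"
        System.out.println(sink.size()); // 8: four characters, two bytes each in UTF-8
    }
}
```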

If you're writing it all plain vanilla to the stdout, for example (System.out.println(line) and so on), then you need to ensure that the stdout itself is using the correct encoding (thus, UTF-8). In an IDE such as Eclipse you can configure it by Window > Preferences > General > Workspace > Text file encoding.
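Separately from the IDE setting, and as a different technique than the one described above, the program itself can wrap stdout in a PrintStream that encodes as UTF-8. This is only a sketch: Utf8Stdout and utf8PrintStream are hypothetical names, and the characters will still look wrong if the console that receives the bytes does not interpret them as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Stdout {
    // Wraps a byte stream in a PrintStream that encodes characters as UTF-8.
    static PrintStream utf8PrintStream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8"); // true = auto-flush on println
        } catch (UnsupportedEncodingException e) {
            // UTF-8 must be supported by every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8PrintStream(System.out);
        out.println("\u00fc\u00f6\u015f\u00e7"); // "üöşç", written to stdout as UTF-8 bytes
    }
}
```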
