
I have a program that downloads web pages and processes the body, and I am having trouble detecting the encoding of some pages, especially when no charset is declared in the HTTP headers or in the HTML content. Is there a way in Java to automatically detect the character encoding of a String or of the HTML body of a response?

2 Answers


Have a look at juniversalchardet, a Java port of Mozilla's encoding detection library.

Here is a sample method that checks whether the detected encoding is UTF-8.

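import org.mozilla.universalchardet.UniversalDetector;

// Returns true if juniversalchardet detects the bytes as UTF-8.
protected static boolean validUTF8(byte[] input) {
  UniversalDetector detector = new UniversalDetector(null);
  detector.handleData(input, 0, input.length);  // feed the raw bytes
  detector.dataEnd();                           // signal end of input
  // getDetectedCharset() returns null when no encoding could be determined
  return "UTF-8".equals(detector.getDetectedCharset());
}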
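If pulling in a library is not an option, a rough JDK-only check is possible too. The sketch below is not what juniversalchardet does: it performs no statistical detection and only tests whether the bytes decode as well-formed UTF-8, using a strict `CharsetDecoder`. The class and method names (`Utf8Check`, `isWellFormedUtf8`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // This is a validity check, not encoding detection.
    static boolean isWellFormedUtf8(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Fail on bad input instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same text encoded two ways; \u00e9 is a non-ASCII letter.
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isWellFormedUtf8(utf8));
        System.out.println(isWellFormedUtf8(latin1));
    }
}
```

Keep in mind that plain ASCII input also passes this check, and that a pass only means the bytes could be UTF-8, not that the page author intended it.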

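As an alternative, I would suggest URLConnection.guessContentTypeFromStream(InputStream is), which requires a stream that supports marking, or URLConnection.guessContentTypeFromName(String fname) (yes, I know it sounds silly, but it is very efficient). Note that both methods guess the MIME content type (such as text/html) rather than the character encoding.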

Of course, you first have to get the stream for the body from the HttpURLConnection, something like this: InputStream is = response.getInputStream();
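A minimal sketch of how that could look, assuming the body stream may not support marking and therefore needs to be wrapped in a `BufferedInputStream` first. The class and helper names (`ContentTypeGuess`, `markable`, `guessType`) are hypothetical, and a `ByteArrayInputStream` stands in for the real response body.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class ContentTypeGuess {

    // Returns a stream that supports mark/reset, wrapping the original if needed.
    // Keep reading from the returned stream, not from the original one.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Asks the JDK to guess the MIME type from the leading bytes of the stream.
    // The result may be null if the type cannot be determined.
    static String guessType(InputStream markableStream) {
        try {
            return URLConnection.guessContentTypeFromStream(markableStream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for response.getInputStream().
        byte[] sample = "<html><body>hi</body></html>".getBytes(StandardCharsets.US_ASCII);
        InputStream body = markable(new ByteArrayInputStream(sample));
        System.out.println(guessType(body));
    }
}
```

The guess is based only on the first few bytes, so treat it as a hint about whether the body is HTML at all, and combine it with a charset detector for the encoding itself.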

