Java: auto-detect the encoding of an HTTP response body

Question

I have a program that downloads web pages and processes the body. I am having trouble detecting the encoding of some pages, especially when neither the HTTP headers nor the HTML content declares one. Is there a way in Java to auto-detect the character encoding of a String or of the HTML body of a response?

Dhruvan Ganesh · Accepted Answer · 2016-10-12 09:32:31Z

1

Have a look at juniversalchardet, a Java port of Mozilla's encoding detection library.

Here is a sample method that checks whether the input is detected as UTF-8.

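import org.mozilla.universalchardet.UniversalDetector;

// Returns true if juniversalchardet detects the bytes as UTF-8.
protected static boolean validUTF8(byte[] input) {
  UniversalDetector detector = new UniversalDetector(null);
  detector.handleData(input, 0, input.length);
  detector.dataEnd();
  // getDetectedCharset() returns null when no charset could be determined.
  return "UTF-8".equals(detector.getDetectedCharset());
}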
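If adding a dependency is not an option, a narrower check is possible with the standard library alone. The sketch below is not part of juniversalchardet and does not detect arbitrary encodings; it only reports whether the bytes are well-formed UTF-8 by decoding them strictly. The class name StrictUtf8Check and the method name isWellFormedUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation using only java.nio.
final class StrictUtf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown on the first byte sequence that is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is "hello" with an accented e, written as an escape
        // so the example does not depend on the source file encoding.
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isWellFormedUtf8(utf8));
        System.out.println(isWellFormedUtf8(latin1));
    }
}
```

Keep in mind that pure ASCII input also passes this check, since ASCII is a subset of UTF-8, and that a pass only means the bytes could be UTF-8. When the check fails, a detector such as juniversalchardet is still needed to guess which other encoding was used.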

tur11ng · Accepted Answer · 2017-03-22 23:22:41Z

0

As an alternative, I would suggest URLConnection.guessContentTypeFromStream(InputStream is), which requires a stream that supports marking, or URLConnection.guessContentTypeFromName(String fname) (yes, I know it sounds silly, but it is very efficient). Note that both return a guessed MIME content type such as text/html, not a character encoding.

Of course, you first have to get the stream for the body of the HttpURLConnection, for example: InputStream is = response.getInputStream();
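The marking requirement matters in practice: guessContentTypeFromStream returns null for a stream that does not support mark/reset, and a network stream may not support it. A minimal sketch of one way to handle this is to wrap the stream in a BufferedInputStream first. The class name ContentTypeGuess and the helper markable are made up for illustration, and an in-memory stream stands in for the HTTP response body.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper around URLConnection.guessContentTypeFromStream.
final class ContentTypeGuess {

    // Returns a stream that supports mark/reset, wrapping the original if needed.
    // Keep using the returned stream afterwards, because a wrapper may have
    // buffered bytes that are no longer available from the raw stream.
    static InputStream markable(InputStream raw) {
        return raw.markSupported() ? raw : new BufferedInputStream(raw);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body; a real program would use the
        // stream obtained from the connection instead.
        byte[] body = "<html><body>hi</body></html>"
                .getBytes(StandardCharsets.US_ASCII);

        InputStream in = markable(new ByteArrayInputStream(body));

        // Peeks at the first bytes and resets the stream afterwards,
        // so the full body can still be read from 'in'.
        String type = URLConnection.guessContentTypeFromStream(in);
        System.out.println("Guessed content type: " + type);
    }
}
```

The guess is based on a few leading bytes and yields a content type only, so the character encoding still has to come from the Content-Type header, a meta tag, or a detector library.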
