MalformedByteSequenceException when parsing XML from URL in Java

Question

I'm trying to parse an XML document with the following code:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL("http://www.cinemark.com.br/mobile/xml/films/").openStream());

But I get the following error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at Programacao.main(Programacao.java:53)

If you open the URL, you can see that the document contains some Portuguese characters. Looking at the response, I saw that the first line of the XML file is:

<?xml version="1.0" encoding="iso-8859-1"?>

So I tried doing this:

URL url = new URL("http://www.cinemark.com.br/mobile/xml/films/");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

InputStream ism = url.openStream();
InputSource is = new InputSource(ism);
is.setEncoding("iso-8859-1");

Document doc = db.parse(is.getByteStream());

But I still got the EXACT same error. How can I parse the XML using a different encoding?

Also, how can I check whether the XML is really in the encoding declared in the file?

I'm using JDK 1.7.0_51 on Fedora Linux 20.

Thanks

SOLUTION

What I did to solve the problem, based on Seelenvirtuose's answer:

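URL url = new URL("http://www.cinemark.com.br/mobile/xml/films/");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

InputStream ism = url.openStream();
// The response body is gzip-compressed, so decompress it first.
GZIPInputStream gis = new GZIPInputStream(ism);
// Name the charset explicitly; otherwise the platform default is used,
// which may not match the encoding declared in the XML.
Reader decoder = new InputStreamReader(gis, "ISO-8859-1");
InputSource is = new InputSource(decoder);

Document doc = db.parse(is);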

I suppose setting the encoding has nothing to do with getByteStream. The latter just returns bytes. The encoding is meta-information about how to interpret those bytes, but with getByteStream there is no such interpretation at all.
– kirilloid, Commented May 4, 2014 at 5:13
Parse the input source directly instead of getting the byte stream, as @kirilloid mentioned.
– Praba, Commented May 4, 2014 at 5:15
Thanks for the answers, but if I just passed the InputSource directly, I would get: org.xml.sax.SAXParseException; Content is not allowed in prolog. I also had to do what @Seelenvirtuose said.
– luislhl, Commented May 4, 2014 at 14:56
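
To illustrate the point made in the comments, here is a minimal sketch of how the encoding set on an InputSource reaches the parser only when the InputSource itself is passed to parse(). It assumes the bytes are already uncompressed (which was not the case for the URL in the question), and the class name, method name, and sample document are made up for the illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name and the sample document are illustrative only.
final class InputSourceEncodingDemo {

    /** Parses the bytes, passing the encoding hint to the parser via the InputSource. */
    static String rootText(byte[] xml, String encoding) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        source.setEncoding(encoding);
        // Pass the InputSource itself so the encoding hint travels with it.
        // db.parse(source.getByteStream()) would hand over only the raw bytes.
        Document doc = db.parse(source);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // No encoding declaration here, so without a hint the parser assumes UTF-8.
        byte[] xml = "<?xml version=\"1.0\"?><title>Programa\u00e7\u00e3o</title>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(xml, "ISO-8859-1"));
    }
}
```

When the document carries its own encoding declaration, as the one in the question does, the hint is usually unnecessary: a parser given the raw (decompressed) byte stream reads the declaration itself.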

Seelenvirtuose · Accepted Answer · 2014-05-04 06:57:06Z

The difference in behavior is as follows:

When accessing the URL in a browser, after some time it displays:

<?xml version="1.0" encoding="iso-8859-1"?>
<cinemark>
  <films>
    <film ...>...</film>
    ...
  </films>
</cinemark>

However, when simply running curl (for example), then you get an output similar to:

‹      ¬YMsÛ6½ûW`xôT¨Oªc) [...]

So what is actually happening? The server is sending a compressed response; this is called HTTP compression. If you run the following command

curl -o films.zip http://www.cinemark.com.br/mobile/xml/films/

you will get a file called films.zip that contains a single file called films, which in turn contains the expected XML document.

So what you should do is treat the stream returned by the URL as a compressed stream, decompress it, and parse the decompressed content.
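
One way to sketch this in Java, assuming the compression is gzip (as the asker later confirmed), is to peek at the first two bytes of the response and wrap the stream in a GZIPInputStream only when they match the gzip magic number 0x1f 0x8b. The class and method names below are hypothetical, and main uses an in-memory document rather than the real URL:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; names are illustrative, not from any library.
final class GzipAwareXml {

    /** Wraps the stream in a GZIPInputStream only if it starts with the gzip magic bytes. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset(); // rewind so the consumer sees the stream from the start
        boolean gzipped = b1 == 0x1f && b2 == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Parses from bytes so the parser can honor the encoding in the XML declaration. */
    static Document parse(InputStream raw) throws Exception {
        try (InputStream in = maybeGunzip(raw)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    /** Convenience wrapper: returns the text content of the root element. */
    static String rootText(byte[] body) throws Exception {
        return parse(new ByteArrayInputStream(body)).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
                + "<films><film>S\u00e3o Paulo</film></films>")
                .getBytes(StandardCharsets.ISO_8859_1);
        // Both the compressed and the plain body parse to the same document.
        System.out.println(rootText(gzip(xml)));
        System.out.println(rootText(xml));
    }
}
```

For the URL in the question, the call would be something like GzipAwareXml.parse(url.openStream()). Handing the parser bytes rather than a Reader lets it pick up the iso-8859-1 declaration on its own, so no charset has to be hard-coded.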

luislhl Over a year ago

You are right, it was compressed in gzip format. Thank you very much.
