I'm trying to parse an XML document with the following code:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL("http://www.cinemark.com.br/mobile/xml/films/").openStream());

But I get the following error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:196)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at Programacao.main(Programacao.java:53)

If you open the URL, you can see that the document contains some Portuguese characters. Looking at the response, the first line of the XML file is:

<?xml version="1.0" encoding="iso-8859-1"?>

So I tried doing this:

URL url = new URL("http://www.cinemark.com.br/mobile/xml/films/");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

InputStream ism = url.openStream();
InputSource is = new InputSource(ism);
is.setEncoding("iso-8859-1");

Document doc = db.parse(is.getByteStream());

But I still got the EXACT same error. How can I parse the XML using a different encoding?

Also, how can I know whether the XML is really in the encoding declared in the file?

I'm using JDK 1.7.0_51 on Fedora Linux 20

Thanks

SOLUTION

What I did to solve the problem, based on Seelenvirtuose's answer:

URL url = new URL("http://www.cinemark.com.br/mobile/xml/films/");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

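InputStream ism = url.openStream();
// The server sends the body gzip-compressed, so decompress it first.
GZIPInputStream gis = new GZIPInputStream(ism);
// Decode with the charset declared in the XML prolog, not the platform default.
Reader decoder = new InputStreamReader(gis, "ISO-8859-1");
InputSource is = new InputSource(decoder);

Document doc = db.parse(is);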
  • I suppose setting the encoding has nothing to do with getByteStream. The latter just returns bytes. Encoding is meta-information about how to interpret these bytes, but with getByteStream there's no such interpretation at all. Commented May 4, 2014 at 5:13
  • Parse the input source directly instead of getting the byte stream, as @kirilloid mentioned. Commented May 4, 2014 at 5:15
  • Maybe the encoding is wrong. Check the format first. Commented May 4, 2014 at 5:17
  • Thanks for the answers, but if I just passed the InputSource directly, I would get: org.xml.sax.SAXParseException; Content is not allowed in prolog. I also had to do what @Seelenvirtuose said. Commented May 4, 2014 at 14:56
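
The point about getByteStream made in the comments can be illustrated with a small sketch. It assumes plain, uncompressed input, so it does not by itself fix the case in the question, where the payload turned out to be compressed. The class name InputSourceEncodingSketch, its helper methods, and the sample bytes are made up for illustration. The sample deliberately has no XML declaration, so the only encoding hint is the one set on the InputSource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class InputSourceEncodingSketch {

    // Hypothetical sample: ISO-8859-1 bytes with no XML declaration,
    // so the parser falls back to UTF-8 unless it is told otherwise.
    public static byte[] sampleBytes() {
        return "<film>Programa\u00e7\u00e3o</film>".getBytes(StandardCharsets.ISO_8859_1);
    }

    // Builds an InputSource over the bytes and attaches the encoding hint.
    public static InputSource source(byte[] bytes) {
        InputSource is = new InputSource(new ByteArrayInputStream(bytes));
        is.setEncoding("ISO-8859-1");
        return is;
    }

    // Passing the InputSource itself lets the parser see the encoding hint.
    public static String parseViaInputSource(byte[] bytes) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(source(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    // getByteStream() hands back the raw bytes only, so the hint is lost
    // and the parser tries (and fails) to read the bytes as UTF-8.
    public static boolean parseViaByteStreamFails(byte[] bytes) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            db.parse(source(bytes).getByteStream());
            return false;
        } catch (IOException | SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleBytes();
        System.out.println("via InputSource: " + parseViaInputSource(bytes));
        System.out.println("via getByteStream fails: " + parseViaByteStreamFails(bytes));
    }
}
```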

1 Answer

The difference in behavior is as follows:

When accessing the URL in a browser, after some time it displays:

<?xml version="1.0" encoding="iso-8859-1"?>
<cinemark>
  <films>
    <film ...>...</film>
    ...
  </films>
</cinemark>

However, when you simply run curl (for example), you get output similar to:

‹      ¬YMsÛ6½ûW`xôT¨Oªc) [...]

So, what is actually happening? Easy: this is called HTTP compression. So when you run the following command

curl -o films.zip http://www.cinemark.com.br/mobile/xml/films/

you will get a file called films.zip (the data is actually in gzip format) that contains a single file called films, which in turn contains the expected XML document.

So, what you should do is: treat the response stream as a compressed stream, decompress it, and parse the result.
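
A minimal sketch of that approach, assuming the response may or may not be gzip-compressed: it peeks at the first two bytes for the gzip magic number (0x1f 0x8b), decompresses if they match, and passes the raw bytes to the parser so that the encoding declared in the XML prolog is honored. The class name GzipXmlSketch, its helper methods, and the sample document are made up for illustration, and an in-memory byte array stands in for url.openStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class GzipXmlSketch {

    // Wraps the stream in a GZIPInputStream only if it starts with the
    // gzip magic bytes; otherwise returns the stream with nothing consumed.
    public static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        int b1 = in.read();
        int b2 = (b1 == -1) ? -1 : in.read();
        // Push the peeked bytes back in reverse order so b1 is read first again.
        if (b2 != -1) in.unread(b2);
        if (b1 != -1) in.unread(b1);
        boolean gzip = (b1 == 0x1f && b2 == 0x8b);
        return gzip ? new GZIPInputStream(in) : in;
    }

    // Parses XML from a possibly gzip-compressed byte stream. Giving the
    // parser bytes (not a Reader) lets it apply the declared encoding itself.
    public static Document parse(InputStream raw) throws Exception {
        try (InputStream in = maybeGunzip(raw)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }

    // Test helper: gzip-compresses a byte array in memory.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, encoded as ISO-8859-1 and then gzipped.
        String xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
                + "<cinemark><films><film>Programa\u00e7\u00e3o</film></films></cinemark>";
        byte[] compressed = gzip(xml.getBytes(StandardCharsets.ISO_8859_1));

        Document doc = parse(new ByteArrayInputStream(compressed));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

With a real URL, the same parse method could be called on url.openStream(). Checking the magic bytes rather than assuming compression keeps the code working if the server ever sends the document uncompressed.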

1 Comment

You are right, it was compressed in gzip format. Thank you very much.
