
Note: there are countless questions in this general subject on here, but I couldn't find anything targeted toward my specific problem.

I'm parsing XML from http://rss.cnn.com/rss/cnn_latest.rss. My parser had been working fine and I was getting everything I was looking for. Then, out of the blue, after hours of working without problems, I started getting encoding errors.

Now, what I've been doing is writing the source XML to a file and then parsing that file, as below.

File xmlfile = new File("cnnxml.txt");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();

DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(xmlfile);

What's weird is that this is the first line of the XML file, so it would seem the encoding is, in fact, UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

Below are the errors I'm getting in Eclipse.

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 4-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanData(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at getRSS.main(getRSS.java:87)

And, again, this was working all day and then entirely out of nowhere I started getting problems. What is going on?

  • I have an IOException try-catch around it and it still produces this issue. When you say the sequence is in an entity, do you mean within each individual item (in this case, each story linked on the RSS) of the XML? That would lend some credibility to my theory that, for whatever reason, something strange got added to the website in the middle of my coding and broke what was once working. Commented Nov 10, 2016 at 23:19
  • @JoopEggen your comment should be an answer. Commented Nov 11, 2016 at 8:08

2 Answers

2

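Get the InputStream of the file, decode it to a String using the specified character encoding (UTF-8), and parse an InputSource built from that string. The parser then receives characters instead of raw bytes, so it no longer performs its own strict UTF-8 decoding; Java's stream decoding substitutes the replacement character U+FFFD for malformed sequences rather than throwing. Example code: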

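        // IOUtils is presumably org.apache.commons.io.IOUtils (Apache Commons IO);
        // xmlInputStream is the InputStream of the downloaded file.
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        // Decode the bytes ourselves so the parser is handed characters, not bytes.
        String content = IOUtils.toString(xmlInputStream, "UTF-8");
        InputSource is = new InputSource(new StringReader(content));
        Document doc = dBuilder.parse(is);
        doc.getDocumentElement().normalize();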
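
If pulling in a third-party library is not an option, the same idea can probably be sketched with the JDK alone. The class and method names below (`LenientXmlParse`, `parseLeniently`, `sampleWithBadBytes`) are made up for illustration, and the sample bytes are a hand-built stand-in for the broken feed, not actual CNN data. The sketch relies on `InputStreamReader` replacing malformed byte sequences with U+FFFD instead of throwing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LenientXmlParse {

    // Hypothetical helper: decode the stream as UTF-8 ourselves, then let the
    // parser work on characters. Malformed bytes become U+FFFD.
    static Document parseLeniently(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return builder.parse(new InputSource(reader));
        }
    }

    // Hypothetical sample: a CDATA section holding a 4-byte UTF-8 sequence
    // whose third byte (0x41) is not a continuation byte.
    static byte[] sampleWithBadBytes() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title><![CDATA[x"
                .getBytes(StandardCharsets.UTF_8);
        byte[] tail = "]]></title>".getBytes(StandardCharsets.UTF_8);
        out.write(head, 0, head.length);
        out.write(0xF0);
        out.write(0x9F);
        out.write(0x41);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseLeniently(new ByteArrayInputStream(sampleWithBadBytes()));
        // The text now contains U+FFFD where the broken sequence was.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Note that this hides the data error rather than fixing it: the affected characters are lost and show up as U+FFFD in the parsed text.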
0

You will have to explore the solution yourself, but @MichaelKay suggested that a better answer is unlikely.

The file declares itself to be UTF-8 but is not. Use a programmer's editor like JEdit or Notepad++ to play around with encodings. As this is a data error, catch the exception and make a copy of the file for examination. It might just be an error message from the server, in which case a solution would be to check the response status. Note: the sequence may be inside an entity; see the stack trace.
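
To see where the file stops being valid UTF-8 without an editor, a strict decoder can report the offset of the first bad sequence. This is only a sketch: `Utf8Check` and `firstInvalidOffset` are names invented here, and the built-in sample bytes are hand-made rather than taken from the feed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {

    // Returns the byte offset of the first malformed UTF-8 sequence,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On an error the input position sits at the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) throws Exception {
        // With no argument, check a hand-made sample; otherwise check the given
        // file (for example the cnnxml.txt from the question).
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {'a', 'b', 'c', (byte) 0xF0, (byte) 0x9F, 'A'};
        int offset = firstInvalidOffset(data);
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "first invalid sequence starts at byte offset " + offset);
    }
}
```

With the offset in hand, a hex viewer or the editor's "go to offset" function shows which item of the feed contains the bad bytes.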

My conjecture is that some of the XML is corrupt, so the try-catch should do something with the data, such as storing it together with the stack trace. Ideally the failure would be repeatable.

It could be that the data relates to an "out of order" message, or some boundary case.
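
The suggestions above (check the response status, and keep the offending data when parsing fails) might look roughly like the following. Everything named here (`FeedDebug`, `fetch`, `parseOrKeepCopy`, the output file name) is hypothetical, and `readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class FeedDebug {

    // Downloads the raw bytes, refusing anything other than HTTP 200 so that a
    // server error page is never mistaken for the feed.
    static byte[] fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes(); // Java 9+
            }
        } finally {
            conn.disconnect();
        }
    }

    // Parses the bytes; if parsing fails, writes them unchanged to failedCopy
    // for later examination and rethrows the original exception.
    static Document parseOrKeepCopy(byte[] raw, Path failedCopy)
            throws IOException, SAXException, ParserConfigurationException {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(raw));
        } catch (IOException | SAXException e) {
            Files.write(failedCopy, raw);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: FeedDebug <feed-url>");
            return;
        }
        Document doc = parseOrKeepCopy(fetch(args[0]), Paths.get("failed-feed.xml"));
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

Keeping the exact bytes that failed makes the problem repeatable even after the feed has been repopulated with new items.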

1 Comment

Thank you for the assistance. As it turned out, yes, the page and resulting XML file had some non-UTF-8 content in there and was causing problems. However, a few hours later, that page had basically been completely re-populated with new items and the problem was solved without me having to change anything. Though I have no reason to try it now, I imagine perhaps your solution and Manas Maji's solution would also probably help. Thanks!
