XML parser not parsing UTF-8 despite correct encoding

Question

Note: there are countless questions in this general subject on here, but I couldn't find anything targeted toward my specific problem.

I'm parsing XML from http://rss.cnn.com/rss/cnn_latest.rss. My parser was working fine and I was getting everything I was looking for. Then, out of the blue, after hours of working without problems, I started getting encoding errors.

Now, what I've been doing is writing the source XML to a file and then parsing that file, as below.

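// Needs java.io.File, javax.xml.parsers.DocumentBuilder,
// javax.xml.parsers.DocumentBuilderFactory and org.w3c.dom.Document
File xmlfile = new File("cnnxml.txt");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
// The parser reads the raw bytes and decodes them itself
Document doc = dBuilder.parse(xmlfile);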

What's weird is that this is the first line of the XML file, so it would seem the encoding is, in fact, UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

Below are the errors I'm getting in Eclipse.

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 4-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanData(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at getRSS.main(getRSS.java:87)

And, again, this was working all day and then entirely out of nowhere I started getting problems. What is going on?

I have an IOException try-catch around it and it still produces this issue. When you say the sequence is in an entity, do you mean within each individual item (in this case, each story linked on the RSS) of the XML? That would lend some credibility to my theory that, for whatever reason, something strange got added to the website in the middle of my coding and broke what was once working.
– MP12389, commented Nov 10, 2016 at 23:19

Manas Maji · Accepted Answer · 2016-11-11 11:29:08Z

2

Get an InputStream for the file, convert it to a String using the declared character encoding (UTF-8), and parse an InputSource built from that string. Example code (xmlInputStream is the stream opened on the file; IOUtils comes from Apache Commons IO):

        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        String content = IOUtils.toString(xmlInputStream, "UTF-8");
        InputSource is = new InputSource(new StringReader(content));
        Document doc = dBuilder.parse(is);
        doc.getDocumentElement().normalize();
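For readers without Commons IO, here is a minimal JDK-only sketch of the same idea. The class name, method names and the sample bytes are made up for illustration. It relies on the fact that decoding with `new String(bytes, UTF_8)` substitutes U+FFFD for malformed sequences instead of throwing, which is probably why this approach avoids the exception: the parser receives characters and never runs its own strict UTF-8 decoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LenientXmlLoader {

    // Decode the bytes as UTF-8 ourselves, then give the parser characters.
    // new String(bytes, UTF_8) replaces malformed sequences with U+FFFD
    // rather than throwing, so the parser never sees the bad bytes.
    static Document parseLenient(byte[] raw) throws Exception {
        String content = new String(raw, StandardCharsets.UTF_8);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(content)));
        doc.getDocumentElement().normalize();
        return doc;
    }

    // Hypothetical sample: a tiny feed whose title holds a broken 4-byte sequence.
    static byte[] sampleWithBadBytes() {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss><title>ab"
                .getBytes(StandardCharsets.UTF_8);
        byte[] tail = "cd</title></rss>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(0xF0); // lead byte of a 4-byte sequence
        out.write(0x9F); // valid second byte
        // The next byte is 'c' from tail, which is not a continuation byte,
        // so byte 3 of the sequence is invalid.
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseLenient(sampleWithBadBytes());
        String title = doc.getElementsByTagName("title").item(0).getTextContent();
        // Reports whether the damaged bytes were turned into replacement characters.
        System.out.println("title contains U+FFFD: " + title.contains("\uFFFD"));
    }
}
```

Note that this hides the data error rather than repairing it: the affected characters are lost and show up as U+FFFD in the parsed text.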


Joop Eggen · Accepted Answer · 2016-11-11 11:21:44Z

0

You will have to explore the solution yourself, but @MichaelKay suggested that a better answer is unlikely.

The file declares itself to be UTF-8 but is not. Use a programmer's editor like JEdit or Notepad++ to play around with encodings. As this is a data error, catch the exception and make a copy of the file for examination. It might just be an error message from the server, in which case a solution would be to check the response status. Note: maybe the sequence is in an entity; see the stack trace.
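As an alternative to trying encodings by hand in an editor, the check can be done in code. The following is a sketch, not part of the original answer: it uses a strict `CharsetDecoder` to report the byte offset of the first malformed UTF-8 sequence, which can then be inspected in a hex viewer. The class and method names are hypothetical, and the default file name is taken from the question.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    // Returns the byte offset of the first malformed UTF-8 sequence,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On an error the input position sits at the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                // All input consumed without errors.
                return -1;
            }
            // Overflow: the decoded characters are not needed, so reuse the buffer.
            out.clear();
        }
    }

    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "cnnxml.txt";
        int offset = firstInvalidOffset(Files.readAllBytes(Paths.get(name)));
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "first malformed sequence starts at byte offset " + offset);
    }
}
```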

My conjecture is that some of the XML is corrupt, so the try-catch should do something with the data, such as storing it along with the stack trace. Ideally the failure would be repeatable.

It could be that the data relates to an "out of order" message, or some boundary case.
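The advice to keep the failing data could look roughly like the sketch below. It is an illustration under assumptions: the class name, the method name and the `.failed-<millis>` file naming are invented here. Both `SAXException` and `IOException` are caught because, depending on the parser, a malformed byte sequence may surface as either.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class ParseAndKeepEvidence {

    // Parses the file. On failure, keeps a copy of the input and the stack
    // trace next to it for later examination, and returns null.
    static Document parseOrPreserve(Path xml) throws IOException, ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            return builder.parse(xml.toFile());
        } catch (SAXException | IOException e) {
            // Hypothetical naming scheme: <name>.failed-<millis>
            String stamp = String.valueOf(System.currentTimeMillis());
            Path copy = xml.resolveSibling(xml.getFileName() + ".failed-" + stamp);
            Files.copy(xml, copy, StandardCopyOption.REPLACE_EXISTING);

            // Store the stack trace alongside the copied data.
            StringWriter trace = new StringWriter();
            e.printStackTrace(new PrintWriter(trace, true));
            Files.write(xml.resolveSibling(copy.getFileName() + ".trace.txt"),
                    trace.toString().getBytes(StandardCharsets.UTF_8));
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseOrPreserve(Paths.get("cnnxml.txt"));
        System.out.println(doc == null ? "parse failed, evidence saved" : "parsed OK");
    }
}
```

With the offending file preserved, the failure stays reproducible even after the feed has been repopulated with new items.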


1 Comment

MP12389 Over a year ago

Thank you for the assistance. As it turned out, yes, the page and the resulting XML file had some non-UTF-8 content in them that was causing problems. However, a few hours later, the page had been almost completely repopulated with new items and the problem went away without me having to change anything. Though I have no reason to try it now, I imagine your solution and Manas Maji's solution would probably also help. Thanks!
