Note: there are countless questions in this general subject on here, but I couldn't find anything targeted toward my specific problem.
I'm working on parsing XML from http://rss.cnn.com/rss/cnn_latest.rss and my parser was working just fine and I was getting everything I was looking for. No problems. And then out of the blue, after hours of working just fine...I started getting some encoding errors.
Now, what I've been doing is writing the source XML to a file and then parsing that file, as below.
File xmlfile = new File("cnnxml.txt");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(xmlfile);
What's weird is this is the first line of the XML file, so it would seem the encoding is, in fact, UTF-8
<?xml version="1.0" encoding="UTF-8"?>
Below are the errors I'm getting in Eclipse.
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:Invalid byte 3 of 4-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanData(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at getRSS.main(getRSS.java:87)
And, again, this was working all day and then entirely out of nowhere I started getting problems. What is going on?