I'm writing a program which parses a web page (one which I don't have access to so I can't modify it).

First I connect and use getContent() to get an InputStream for the page. There's no trouble there.

But then when parsing:

    public static int[] parseMoveGameList(InputStream is) throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document doc = builder.parse(is);
        /*...*/
    }

Here builder.parse throws:

    org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 64; The system identifier must begin with either a single or double quote character.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:253)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:288)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at cs.ualberta.lgadapter.LGAdapter.parseMoveGameList(LGAdapter.java:78)
    ...

The page that I'm parsing (but can't change) looks like

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >









    <html>
    <head>
    <META http-equiv="Expires" content="0" />
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <!-- ...  -->
    </head>
    <body>
    <!-- ...  -->
    </body>
    </html>

How can I get past this exception?

  • I don't think it's a good idea to use an XML parser to parse HTML. Commented Aug 10, 2012 at 17:01
  • stackoverflow.com/questions/9071568/… Commented Aug 10, 2012 at 17:07

1 Answer

HTML is generally not well-formed XML. Using an XML parser to parse HTML will probably result in a lot of errors, as you have already discovered.

Your page fails because of its DOCTYPE declaration:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" >

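In XML, a 'PUBLIC' doctype declaration must carry a quoted system identifier (the location of the DTD) after the public identifier, so XML parsers expect it to look like the following: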

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "FALLBACK PATH TO DTD" >
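To see the difference for yourself, here is a minimal sketch that feeds both forms of the declaration to the JDK's DocumentBuilder. The class name DoctypeCheck and the method parses are made up for illustration, and the entity resolver is only there so the parser does not try to download the DTD; http://www.w3.org/TR/html4/loose.dtd is the usual system identifier for HTML 4.01 Transitional.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the question's code.
final class DoctypeCheck {

    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean parses(String markup) throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Resolve every external entity (including the DTD) to empty content,
        // so the parser never fetches anything over the network.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Covers SAXParseException, e.g. the missing system identifier.
            return false;
        }
    }
}
```

Calling parses with the page's original declaration should return false, while the same markup with "http://www.w3.org/TR/html4/loose.dtd" added after the public identifier should return true, provided the rest of the markup is well-formed.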

If you can't change the HTML page, there is not much you can do about the page itself. You may be able to wrap or preprocess your input stream, either adding a dummy system identifier so the declaration conforms to what the parser expects or removing the doctype declaration altogether.
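A rough sketch of the "remove the doctype declaration" variant could look like the following. The names DoctypeStripper, stripDoctype and parseWithoutDoctype are invented for this example. It assumes the page is UTF-8 (as its Content-Type meta tag says), that the declaration has no internal subset in square brackets, and that Java 9+ is available for InputStream.readAllBytes(). It only gets you past this one exception; the rest of the page still has to be well-formed XML for the parse to succeed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the question's code.
final class DoctypeStripper {

    // Matches a simple <!DOCTYPE ...> declaration (no internal subset).
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Removes the first doctype declaration from the markup, if there is one. */
    static String stripDoctype(String markup) {
        return DOCTYPE.matcher(markup).replaceFirst("");
    }

    /** Reads the whole stream, drops the doctype, and parses what is left. */
    static Document parseWithoutDoctype(InputStream is)
            throws ParserConfigurationException, IOException, SAXException {
        // Assumption: the page is encoded as UTF-8.
        String markup = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        byte[] cleaned = stripDoctype(markup).getBytes(StandardCharsets.UTF_8);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(cleaned));
    }
}
```

In parseMoveGameList you would then call DoctypeStripper.parseWithoutDoctype(is) instead of builder.parse(is).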

You should use an HTML parsing library instead. I do not know of any off the top of my head, but this (older) post lists a couple: http://www.benmccann.com/blog/java-html-parsing-library-comparison/ . A Google search also turns up http://jsoup.org/
