I have this method, which loads an XHTML document from a java.io.InputStream and returns an org.w3c.dom.Document.

private Document loadDocFrom(InputStream is) throws SAXException,
        IOException, ParserConfigurationException {
    DocumentBuilderFactory domFactory = DocumentBuilderFactory
            .newInstance();
    domFactory.setNamespaceAware(true); // never forget this
    DocumentBuilder builder = domFactory.newDocumentBuilder();

    Document doc = builder.parse(is);
    is.close();
    return doc;
}

This method works: I have tested it with some XHTML documents (e.g. http://pastebin.com/L2kHwggU) and XHTML websites.

However, for some documents, such as http://pastebin.com/v675yWSJ, or even websites like www.w3.org, it enters an infinite loop at Document doc = builder.parse(is);.

EDIT:

@Michael Kay found the problem, but I am waiting for his solution.

One of the other possible solutions is to ignore the DTD:

domFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

Thank you for your help.

  • Are you sure the loop is infinite? Most template processing languages emit events for each token of interest. If you have a break point on the event list you could have a lot of tokens to go through. Commented May 23, 2012 at 22:58
  • I didn't add any "event" to that "event list", in fact, I've never heard of it. So how do you explain that I can parse some XHTML documents like pastebin.com/L2kHwggU? Also, I've debugged the source code and, step by step, it is always stuck at that line "next()". Commented May 23, 2012 at 23:53
  • Does it run forever without the debug statement in the source code? Commented May 24, 2012 at 0:18
  • Yes. It looks like @Michael Kay found the problem. Thank you anyway :) Commented May 24, 2012 at 14:54

1 Answer

I think your diagnosis that it's an infinite loop is incorrect; it's just taking a very long time, which isn't the same thing.

The usual reason for this is that the document contains a reference to the XHTML DTD on the W3C web site, and the parser is going to the web to fetch it rather than using a local copy. About a year ago, the W3C started "throttling" requests for these common DTDs because it could no longer handle the volume of traffic.

The usual solution is to use a Resolver to redirect the requests to a local copy.
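
As a rough illustration of the Resolver idea with the DOM builder from the question, here is a minimal sketch using the standard org.xml.sax.EntityResolver hook on DocumentBuilder. The class name LocalDtdLoader, the register method, and the in-memory map of DTD text are assumptions made for this example only; in practice the local copies would more likely be files or classpath resources, and the real XHTML DTD also pulls in entity files that would need their own entries.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses XHTML while serving known DTDs from local copies.
class LocalDtdLoader {
    // Assumed mapping: absolute system ID of a DTD -> text of the local copy.
    private final Map<String, String> localDtds = new HashMap<>();

    void register(String systemId, String dtdText) {
        localDtds.put(systemId, dtdText);
    }

    Document load(InputStream is) throws SAXException, IOException,
            ParserConfigurationException {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();

        // The parser calls this for every external entity, including the DTD.
        builder.setEntityResolver((publicId, systemId) -> {
            String dtd = localDtds.get(systemId);
            if (dtd == null) {
                // Unknown entity: null means "use default resolution",
                // which may still go to the network.
                return null;
            }
            InputSource source = new InputSource(new StringReader(dtd));
            source.setPublicId(publicId);
            source.setSystemId(systemId); // keeps relative references resolvable
            return source;
        });

        // Close the stream even if parsing fails.
        try (InputStream in = is) {
            return builder.parse(in);
        }
    }

    public static void main(String[] args) throws Exception {
        LocalDtdLoader loader = new LocalDtdLoader();
        // Stand-in for a real local copy of the DTD; it declares one entity only.
        loader.register("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
                "<!ENTITY nbsp \"&#160;\">");
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>a&nbsp;b</p></body></html>";
        Document doc = loader.load(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Because the resolver hands back the local text for the registered system ID, the parser has no reason to contact the W3C server for that DTD, and entity references such as &nbsp; can still be expanded from the local declarations. Returning null for unregistered IDs is a design choice in this sketch; a stricter variant could throw instead, so that a missing local copy fails fast rather than falling back to a slow download.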

2 Comments

Can you show me how to use a Resolver to redirect requests to a local copy? A friend told me I could "ignore" the validation using this line: domFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false); It works, but is it "correct" to just ignore it?
That approach might work with some parsers and some DTDs. In general, though, the DTD may contain entity definitions, and if the main document references those entities then the definitions need to be loaded. I'm afraid I never use the DOM builder, so I'm not sure offhand exactly how to make it work with a Resolver, but I know it can be done.
