Infinite loop while parsing XHTML using DocumentBuilder "parse"

Question

I have this method, which loads an XHTML document from a java.io.InputStream and returns an org.w3c.dom.Document.

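import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

private Document loadDocFrom(InputStream is) throws SAXException,
        IOException, ParserConfigurationException {
    DocumentBuilderFactory domFactory = DocumentBuilderFactory
            .newInstance();
    domFactory.setNamespaceAware(true); // never forget this
    DocumentBuilder builder = domFactory.newDocumentBuilder();
    try {
        Document doc = builder.parse(is);
        return doc;
    } finally {
        is.close(); // close the stream even if parsing throws
    }
}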

This method works: I have tested it with some XHTML documents (e.g. http://pastebin.com/L2kHwggU) and XHTML websites.

But for some documents, such as http://pastebin.com/v675yWSJ, or even websites like www.w3.org, it appears to enter an infinite loop at Document doc = builder.parse(is);.

EDIT:

@Michael Kay found the problem, but I am waiting for his solution.

One of the other possible solutions is to ignore the DTD:

domFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

Thank you for your help.

Are you sure the loop is infinite? Most template processing languages emit events for each token of interest. If you have a break point on the event list you could have a lot of tokens to go through.
– nsfyn55, Commented May 23, 2012 at 22:58
I didn't add any "event" to that "event list", in fact, I've never heard of it. So how do you explain that I can parse some XHTML documents like pastebin.com/L2kHwggU? Also, I've debugged the source code and, step by step, it is always stuck at that line "next()".
– anahnarciso, Commented May 23, 2012 at 23:53
Does it run forever without the debug statement in the source code?
– nsfyn55, Commented May 24, 2012 at 0:18
Yes. It looks like @Michael Kay found the problem. Thank you anyway :)
– anahnarciso, Commented May 24, 2012 at 14:54

Michael Kay · Accepted Answer · 2012-05-24 07:14:50Z

I think your diagnosis that it's an infinite loop is incorrect; it's just taking a very long time, which isn't the same thing.

The usual reason for this is that the document contains a reference to the XHTML DTD on the W3C web site, and the parser is going to the web to fetch it rather than using a local copy. About a year ago, W3C started "throttling" requests for these common DTDs because they could no longer handle the volume of traffic.

The usual solution is to use a Resolver to redirect the requests to a local copy.
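As a rough sketch of what such a resolver could look like with the DOM builder from the question: the class below implements org.xml.sax.EntityResolver and is installed with DocumentBuilder.setEntityResolver. The class name LocalDtdResolver, its parse helper, and the map from public IDs to local files are illustrative assumptions, not something given in this thread; the local DTD files are assumed to have been downloaded beforehand.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: serves known DTDs from local files instead of the web.
final class LocalDtdResolver implements EntityResolver {
    // Assumed mapping: public ID -> local copy of that DTD or entity set.
    private final Map<String, Path> localCopies;

    LocalDtdResolver(Map<String, Path> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        Path local = (publicId == null) ? null : localCopies.get(publicId);
        if (local == null) {
            return null; // unknown ID: let the parser resolve it as usual
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        source.setPublicId(publicId);
        // Report the local location as the system ID, so that relative
        // references inside the DTD can be resolved against the local copy.
        source.setSystemId(local.toUri().toString());
        return source;
    }

    // Same steps as loadDocFrom in the question, plus the resolver.
    static Document parse(InputStream is, EntityResolver resolver)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        try (InputStream in = is) {
            return builder.parse(in);
        }
    }
}
```

For XHTML 1.0 Strict, the map would need an entry for the public ID "-//W3C//DTD XHTML 1.0 Strict//EN" pointing at a local xhtml1-strict.dtd. That DTD in turn pulls in the entity sets xhtml-lat1.ent, xhtml-symbol.ent and xhtml-special.ent, so local copies of those should sit next to it or be mapped by their own public IDs.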

2 Comments

anahnarciso Over a year ago

Can you show me how to use a Resolver to redirect requests to a local copy? A friend told me I could "ignore" the validation using this line: domFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false); It works, but is it "correct" to just ignore it?

Michael Kay Over a year ago

That approach might work with some parsers and some DTDs. In general, though, the DTD may contain entity definitions, and if the main document references those entities then the definitions need to be loaded. I'm afraid I never use the DOM builder so I'm not sure offhand exactly how to make it work with a Resolver, but I know it can be done.
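Since the load-external-dtd feature is specific to Xerces-derived parsers, a parser-independent way to skip the DTD is an EntityResolver that answers every request with empty content. This is only a sketch: the class and method names (SkipDtdParser, parseIgnoringDtd) are made up for illustration, and it assumes the document does not rely on entities such as &nbsp; that only the DTD defines, which is exactly the limitation described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses without ever fetching an external DTD.
final class SkipDtdParser {
    static Document parseIgnoringDtd(InputStream is)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Every external entity (the DTD included) is replaced by empty
        // content, so no network request is made. Entities that would have
        // been declared in the DTD are therefore not available.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        try (InputStream in = is) {
            return builder.parse(in);
        }
    }
}
```

Note that this resolver blanks out all external entities, not only the DOCTYPE's DTD, so it is best suited to documents that are known to be self-contained apart from the DOCTYPE declaration.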
