4

I am using a SAX parser to parse XML passed as a string in my application. When my code sends an HTML body as the string, the SAX parser gets stuck for a very long time (more than 5 hours).

The page I want to parse is "http://www.cityam.com/taxonomy/term/1/all/feed". This URL returns an HTML page instead of XML. How can I handle this kind of problem, or how can I get out of my SAX parser with an appropriate exception? My code is here:

public List<RssEntry> parseDocument(String body) {
    // The body is expected to be XML, but parsing hangs when it is an HTML page.
    SAXParserFactory factory = SAXParserFactory.newInstance();
    try {
        SAXParser parser = factory.newSAXParser();
        XMLReader reader = parser.getXMLReader();
        parser.parse(new ByteArrayInputStream(body.getBytes("UTF-8")), this);
    }
    // ... catch blocks omitted ...

Please help me. Thanks.

16
  • 1
    There's a good chance that the HTML isn't valid XML. Could that be the problem? Commented Mar 8, 2013 at 11:27
  • Can you expand on what you mean by stuck? Are your callbacks in your Handler actually being called? Are there any exceptions being thrown? Commented Mar 8, 2013 at 11:29
  • @sven - But how do I get out of the parser if the HTML is not valid? Commented Mar 8, 2013 at 11:30
  • @Dave - My application stops responding, and control never returns from the parser code. Commented Mar 8, 2013 at 11:31
  • 1
    Can you try taking out all the code from your handler callbacks and replacing it with simple logging messages, to check whether any of your callbacks are actually being called? Start with startDocument just to see if the parser is even starting at all. Commented Mar 8, 2013 at 14:03

2 Answers

1

When my code sends an HTML body as a string, the SAX parser gets stuck for a very long time (more than 5 hours). The HTML page declares an external DTD in its DOCTYPE (at the start of the page), so the SAX parser was busy trying to load that external DTD. This behaviour is controlled by the feature "http://apache.org/xml/features/nonvalidating/load-external-dtd".

So I set this feature to false. The SAX parser now throws an error if the XML is not well-formed.

XMLReader reader = parser.getXMLReader();
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

Thanks, everybody, for helping me.
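For readers who want a complete version of this fix, here is a minimal sketch. The class name SafeFeedParser and the method isWellFormedXml are made up for illustration and are not part of the original code. The feature URI is Xerces-specific; the sketch assumes the parser in use (for example the one bundled with the JDK) recognizes it, and fails loudly if it does not.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class SafeFeedParser {

    // Xerces-specific feature: when false, a non-validating parser
    // does not fetch the external DTD named in the DOCTYPE.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Returns true if the body parses as well-formed XML, false otherwise. */
    public static boolean isWellFormedXml(String body) {
        XMLReader reader;
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            reader = parser.getXMLReader();
            reader.setFeature(LOAD_EXTERNAL_DTD, false);
        } catch (ParserConfigurationException | SAXException e) {
            // Configuration problems are reported separately from bad input.
            throw new IllegalStateException("Cannot configure SAX parser", e);
        }

        // DefaultHandler ignores content and rethrows fatal errors.
        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);

        try {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            reader.parse(new InputSource(new ByteArrayInputStream(bytes)));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. an HTML page with unclosed tags.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the real parser you would pass your own handler instead of a bare DefaultHandler and let the SAXException propagate, or wrap it in your own exception type.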


0
// The body is expected to be XML, but parsing hangs when it is an HTML page.
SAXParserFactory factory = SAXParserFactory.newInstance();
if(!body.startsWith("<?xml")){
    throw new NotXmlInputException(message); //your exception
}

Or create a schema file for your XML and use validation:

SchemaFactory constraintFactory =
        SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source constraints = new StreamSource(/* your schema */);
Schema schema = constraintFactory.newSchema(constraints);
Validator validator = schema.newValidator();

try {
    validator.validate(/* convert your string to a Source */);
} catch (org.xml.sax.SAXException e) {
    log("Validation error: " + e.getMessage());
}
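As a rough illustration of the schema approach, the sketch below converts the string to a Source with a StringReader and validates it. The class FeedSchemaCheck, the method matchesSchema and the tiny inline schema (a root element named rss with arbitrary content) are assumptions made up for this example; a real feed would need a proper schema. Blocking external DTD access on the Validator through the JAXP 1.5 property is also my addition and assumes a JDK that supports that property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
public class FeedSchemaCheck {

    // Deliberately tiny, made-up schema: the root must be <rss>;
    // its children and attributes are not checked.
    private static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='rss'><xs:complexType>"
          + "<xs:sequence>"
          + "<xs:any minOccurs='0' maxOccurs='unbounded' processContents='skip'/>"
          + "</xs:sequence>"
          + "<xs:anyAttribute processContents='skip'/>"
          + "</xs:complexType></xs:element>"
          + "</xs:schema>";

    /** Returns true if the body is XML that satisfies the schema above. */
    public static boolean matchesSchema(String body) {
        Validator validator;
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            validator = schema.newValidator();
            // Assumption: JAXP 1.5 property; an empty value denies external DTD access.
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        } catch (SAXException e) {
            throw new IllegalStateException("Cannot set up schema validation", e);
        }

        try {
            // Convert the string to a Source and validate it.
            validator.validate(new StreamSource(new StringReader(body)));
            return true;
        } catch (SAXException e) {
            // Malformed input or a document that does not match the schema.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With no ErrorHandler set, the Validator throws on the first error, so both malformed input and a wrong root element end up in the SAXException branch.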

Or it may help to use:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
