36

I've been trying to figure out how to check the syntax of an XML file: make sure all tags are closed, there are no stray characters, and so on. All I care about at this point is making sure there is no broken XML in the file.

I've been looking at some SO posts like these...

... but I realized that I don't want to validate the structure of the XML file; I don't want to validate against an XML Schema (XSD)... I just want to check the XML syntax and determine if it is correct.

3 Answers

51

You can check if an XML document is well-formed using the following code:

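import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);    // no DTD validation, only a well-formedness check
factory.setNamespaceAware(true);

DocumentBuilder builder = factory.newDocumentBuilder();

builder.setErrorHandler(new SimpleErrorHandler());
// parse() also checks well-formedness and will throw an exception if the document is malformed
Document document = builder.parse(new InputSource("document.xml"));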

The SimpleErrorHandler class referred to in the above code is as follows:

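import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Prints every problem the parser reports to standard output.
public class SimpleErrorHandler implements ErrorHandler {
    public void warning(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    // Recoverable errors, such as validity errors when validation is enabled.
    public void error(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    // Well-formedness violations are reported here.
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }
}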

This came from this website, which provides various methods for validating XML with Java. Note also that this method loads an entire DOM tree into memory; see the comments for alternatives if you want to save on RAM.
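
For readers who only need a yes/no answer and don't want the DOM tree in memory, here is a minimal SAX-based sketch of the same check. It is not the code from the linked site: the class name SaxWellFormedCheck and the method isWellFormed are made up for illustration, and it relies on the stock DefaultHandler, whose fatalError() rethrows the exception the parser reports.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the original answer): a well-formedness
// check that streams the document through SAX instead of building a DOM.
final class SaxWellFormedCheck {

    static boolean isWellFormed(InputSource source) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);   // no DTD validation, syntax only
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();

            // DefaultHandler ignores all content events; its fatalError()
            // rethrows, so malformed input surfaces as a SAXParseException.
            parser.parse(source, new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // The document is not well-formed; report where parsing stopped.
            System.err.println("line " + e.getLineNumber() + ": " + e.getMessage());
            return false;
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parser failure", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling SaxWellFormedCheck.isWellFormed(new InputSource("document.xml")) mirrors the DOM example above, with the file name again being a placeholder.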


24 Comments

So will this check the syntax of the XML file? I don't want to use an XML Schema here...
Yes, it will check that the document follows the rules of "well-formedness" set out by the XML spec - w3.org/TR/xml/#sec-well-formed. This means that all elements must be closed, nested properly, etc. In fact, the spec defines well-formedness because you can't always use a DTD.
But wouldn't SAX be a better choice performance-wise? He's not using the document anyway, so he doesn't need to hold it in memory.
Yes, probably. That is, if he actually doesn't need the document in memory - I don't think he's implied that really. In that case, there is sample code to do exactly the same thing using SAX here: edankert.com/validate.html
No problem. The method I gave you in my answer uses DOM to parse the document, which builds up a tree of the document as it goes, using up potentially a lot of memory. SAX does not build up a tree of your document. You can find a good comparison of the two here: developerlife.com/tutorials/?p=28
5

What you are asking is how to verify that a piece of content is a well-formed XML document. This is easily done by letting an XML parser (try to) parse the content in question: if there are issues, the parser will report an error by throwing an exception. There really isn't anything more to it, so all you need is to figure out how to parse an XML document.

About the only thing to beware of is that some libraries that claim to be XML parsers are not really proper parsers, in that they might not verify things that an XML parser must (as per the XML specification). In Java, Javolution is an example of something that does little to no checking; VTD-XML and XPP3 do some verification (but not all required checks). At the other end of the spectrum, Xerces and Woodstox check everything that the specification mandates. Xerces is bundled with the JDK, and most web service frameworks bundle Woodstox in addition.

Since the accepted answer already shows how to parse content into a DOM document (which starts with parsing), that might be enough. The only caveat is that this requires 3-5x as much memory as the raw size of the input document. To get around this limitation you could use a streaming parser, such as Woodstox (which implements the standard StAX API). If so, you would create an XMLStreamReader and just call reader.next() as long as reader.hasNext() returns true.
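
The hasNext()/next() loop described above could look roughly like the sketch below. The class and method names (StaxWellFormedCheck, isWellFormed) are invented for illustration. It uses only the standard javax.xml.stream API, so it runs against whichever StAX implementation is on the classpath (Woodstox if present, otherwise the one shipped with the JDK).

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (names are illustrative): pulls every event from a
// StAX reader and discards it, so memory use stays flat for large inputs.
final class StaxWellFormedCheck {

    static boolean isWellFormed(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(in);
            // Walk the whole document; any syntax problem makes
            // hasNext() or next() throw an XMLStreamException.
            while (reader.hasNext()) {
                reader.next();
            }
            return true;
        } catch (XMLStreamException e) {
            return false;
        } finally {
            if (reader != null) {
                try {
                    // XMLStreamReader is not AutoCloseable, so close it by hand.
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Nothing useful to do if closing fails.
                }
            }
        }
    }
}
```

Note that reader.close() releases the reader's resources but does not close the underlying InputStream, so the caller still owns that.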

Comments

0

http://www.ibm.com/developerworks/xml/library/x-javaxmlvalidapi/index.html Does this help? It uses XSD, which is pretty robust. Not only can you validate the document's structure, but you can also supply some pretty complex rules about what type of content your nodes and attributes can contain.
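
As a rough illustration of the javax.xml.validation API this answer refers to, here is a small sketch. The inline schema, the class name XsdValidationCheck and the method isValid are all made up for the example and are not taken from the linked article. One thing it shows in passing: the validator has to parse the input first, so a document that is not well-formed fails here too.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical example: validates a string against a tiny inline schema
// that allows a single <note> element containing text.
final class XsdValidationCheck {

    // Assumed schema, for illustration only.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/>"
        + "</xs:schema>";

    static boolean isValid(String xml) {
        Schema schema;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema could not be compiled", e);
        }
        try {
            Validator validator = schema.newValidator();
            // With no custom ErrorHandler set, errors and fatal errors are
            // thrown, covering both invalid and malformed documents.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this schema, a document such as `<note>hi</note>` passes, while both a well-formed document with a different root element and an unclosed `<note>hi` are rejected.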

7 Comments

I don't want to use XSD... I'm taking care of that kind of validation elsewhere. I just want to check syntax at the moment.
Do you mind telling me what the issue with using XSD is? Do you not want to write XSD? How do you know what version of XML your document is to be compliant with?
No issue... there is code in place already to validate against an XSD. But it doesn't check syntax.
If you are validating your XML against an XSD and it's not well-formed, doesn't your validation catch that?
I don't think so... I didn't write it :) It might, but it most likely doesn't handle specific syntax issues that may come up.
