
I'd like to take a heavily structured XML file, about half a gigabyte in size, and create from it another XML file containing only selected elements of the original one.

1) How can I do that?

2) Can it be done with a DOM parser? What is the size limit of the DOM parser?

Thanks!

  • Consider using XSLT, which allows you to write a template (in XML) that acts as a recipe for extracting the elements and/or attributes you want and then writing them out as a new document (as XML if desired). I've used Saxon to do this in the past (using a command line script rather than a Java application). Commented Apr 5, 2015 at 19:26
  • You might prefer to read the file sequentially, only saving the elements you actually need. With this strategy you won't need to allocate memory to store and manipulate your 0.5 GB file. You can do this with a SAX parser. You can also use StAX in Java. Commented Apr 5, 2015 at 19:39

2 Answers


If you have a very large source XML (like your 0.5 GB file) and wish to extract information from it, possibly creating a new XML, you might consider using an event-based parser, which does not require loading the entire XML into memory. The simplest of these implementations is the SAX parser, which requires that you write an event listener that captures events like document-start, element-start, element-end, etc., where you can inspect the data you are reading (the name of the element, the attributes, etc.) and decide whether to ignore it or do something with it.

Search for a SAX tutorial using JAXP and you should find several examples. Another strategy you might want to consider, depending on what you want to do, is StAX.
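To give an idea of what the StAX route looks like, here is a minimal sketch of my own (not part of the original question): it streams through the file and copies only the Title and Director of each Movie into a new document. It assumes the <Movies>/<Movie> structure shown in the example below and a hypothetical output file named extract.xml.

import java.io.FileInputStream;
import java.io.FileOutputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class StaxExtractExample {

    public static void main(String[] args) throws Exception {

        // Streaming reader over the large source file
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("source.xml"));

        // Streaming writer for the reduced output file
        XMLStreamWriter out = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream("extract.xml"), "UTF-8");

        out.writeStartDocument("UTF-8", "1.0");
        out.writeStartElement("Movies");

        while (in.hasNext()) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = in.getLocalName();
                if (name.equals("Movie")) {
                    out.writeStartElement("Movie");
                } else if (name.equals("Title") || name.equals("Director")) {
                    // getElementText() reads the text content and moves past the end tag
                    out.writeStartElement(name);
                    out.writeCharacters(in.getElementText());
                    out.writeEndElement();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && in.getLocalName().equals("Movie")) {
                out.writeEndElement();
            }
        }

        out.writeEndElement();
        out.writeEndDocument();
        out.close();
        in.close();
    }
}

Because both the reader and the writer are streaming, memory use stays small regardless of the size of the source file. The SAX version of the same idea is developed in detail below.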

Here is a simple example using SAX to read data from an XML file and extract some information based on a search criterion. It's a very simple example I use to teach SAX processing, and I think it might help your understanding of how it works. The search criterion is hard-wired: it matches the names of movie directors in a giant XML file with a movie selection generated from IMDB data.

XML Source example ("source.xml" ~300MB file)

<Movies>
    ...
    <Movie>
        <Imdb>tt1527186</Imdb>
        <Title>Melancholia</Title>
        <Director>Lars von Trier</Director>
        <Year>2011</Year>
        <Duration>136</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0060390</Imdb>
        <Title>Fahrenheit 451</Title>
        <Director>François Truffaut</Director>
        <Year>1966</Year>
        <Duration>112</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0062622</Imdb>
        <Title>2001: A Space Odyssey</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1968</Year>
        <Duration>160</Duration>
    </Movie>
    ...
</Movies>

Here is an example of an event handler. It selects the Movie elements by matching strings. I extended DefaultHandler and implemented startElement() (called when an opening tag is found), characters() (called when a block of characters is read), endElement() (called when an end tag is found) and endDocument() (called once, when the document is finished). Since the data that is read is not retained in memory, you have to save the data you are interested in yourself. I used some boolean flags and instance variables to keep track of the current tag, current data, etc.

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ExtractMovieSaxHandler extends DefaultHandler {

    // These are some parameters for the search which will select 
    // the subtrees (they will receive data when we set up the parser)
    private String tagToMatch;
    private String tagContents; // OR match
    private boolean strict = false;  // if strict matches will be exact

    /**
     * Sets criteria to select and copy Movie elements from source XML.
     *
     * @param tagToMatch Must contain text only
     * @param tagContents Text contents of the tag
     * @param strict If true, match must be exact
     */
    public void setSearchCriteria(String tagToMatch, String tagContents, boolean strict) {
        this.tagToMatch = tagToMatch;
        this.tagContents = tagContents;
        this.strict = strict;
    }

    // These are the temporary values we store as we parse the file
    private String currentElement;
    private StringBuilder contents = null; // if not null we are in Movie tag
    private String currentData;
    List<String> result = new ArrayList<String>(); // store resulting nodes here
    private boolean skip = false;

...

These methods are the implementation of the ContentHandler. The first one is called when a start tag is found. We save the name of the tag (a child of Movie) in a variable, because it might be the one we use in the search:

...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {

        // Store the current element that started now
        currentElement = qName;

        // If this is a Movie tag, save the contents because we might need it
        if (qName.equals("Movie")) {
            contents = new StringBuilder();
        }

    }
...    

This one is called every time a block of characters is read. We check whether those characters occur inside the element we are searching on; if they do, we compare the contents against the search criterion and mark the current Movie to be skipped when it does not match.

...
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {

        // if we discovered that we don't need this data, we skip it
        if (skip || currentElement == null) {
            return;
        }

        // If we are inside the tag we want to search, save the contents
        currentData = new String(ch, start, length);

        if (currentElement.equals(tagToMatch)) {
            boolean discard = true;

            if (strict) {
                if (currentData.equals(tagContents)) { // exact match
                    discard = false;
                }

            } else {
                if (currentData.toLowerCase().indexOf(tagContents.toLowerCase()) >= 0) { // matches occurrence of substring
                    discard = false;
                }
            }

            if (discard) {
                skip = true;
            }
        }

    }
...    

This one is called when an end tag is found. If the current Movie has not been skipped, we append the element's data to the fragment we are rebuilding in memory.

...
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {

        // Rebuild the XML if it's a node we didn't skip
        if (qName.equals("Movie")) {
            if (!skip) {
                result.add(contents.insert(0, "<Movie>").append("</Movie>").toString());
            }

            // reset the variables so we can check the next node
            contents = null;
            skip = false;
        } else if (contents != null && !skip) {
            contents.append("<").append(qName).append(">")
                    .append(currentData)
                    .append("</").append(qName).append(">");
        }

        currentElement = null;
    }
...    

Finally, this one is called once, when the document ends. I use it to assemble the final document and print the result.

...
    @Override
    public void endDocument() throws SAXException {
        StringBuilder resultFile = new StringBuilder();
        resultFile.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        resultFile.append("<Movies>");
        for (String childNode : result) {
            resultFile.append(childNode.toString());
        }
        resultFile.append("</Movies>");

        System.out.println("=== Resulting XML containing Movies where " + tagToMatch + " is one of " + tagContents + " ===");
        System.out.println(resultFile.toString());
    }

}

Here is a small Java application which loads the file and uses the handler above to extract the data.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class SAXReaderExample {

    public static final String PATH = "src/main/resources"; // this is where I put the XML file

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {

        // Obtain XML Reader
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();

        // Instantiate SAX handler
        ExtractMovieSaxHandler handler = new ExtractMovieSaxHandler();

        // set search criteria
        handler.setSearchCriteria("Director", "Kubrick", false);

        // Register handler with XML reader
        reader.setContentHandler(handler);

        // Parse the XML
        reader.parse(new InputSource(new FileInputStream(new File(PATH, "source.xml"))));
    }
}

Here is the resulting file, after processing:

<?xml version="1.0" encoding="UTF-8"?>
<Movies>
    <Movie>
        <Imdb>tt0062622</Imdb>
        <Title>2001: A Space Odyssey</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1968</Year>
        <Duration>160</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0066921</Imdb>
        <Title>A Clockwork Orange</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1972</Year>
        <Duration>136</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0081505</Imdb>
        <Title>The Shining</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1980</Year>
        <Duration>144</Duration>
    </Movie>
    ...
</Movies>

Your scenario might be different, but this example shows a general solution which you can probably adapt to your problem. You can find more information in tutorials about SAX and JAXP.
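One final note of my own: the example prints the result to the console. If you want the extracted elements written to a new XML file instead (as in your question), you could replace the System.out.println() call in endDocument() with something like this sketch, where "result.xml" is just an assumed output path:

        // Write the assembled XML to a new file instead of printing it
        // ("result.xml" is an assumed output path)
        try {
            java.nio.file.Files.write(
                    java.nio.file.Paths.get("result.xml"),
                    resultFile.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8));
        } catch (java.io.IOException e) {
            throw new SAXException("Could not write the result file", e);
        }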




500 MB is well within the limits of what can be achieved using XSLT. It depends a little on how much effort you want to expend to develop an optimum solution: i.e., which is more expensive, your time or the machine's time?
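As an illustration (my own sketch, not from the original answer), and assuming the <Movies>/<Movie> structure used in the other answer, a minimal XSLT 1.0 stylesheet that copies only the Movie elements whose Director contains "Kubrick" could look like this; with Saxon you would typically run it from the command line (the exact invocation depends on your Saxon version):

<?xml version="1.0"?>
<!-- extract.xsl: keep only Movie elements whose Director contains "Kubrick" -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/Movies">
        <Movies>
            <xsl:copy-of select="Movie[contains(Director, 'Kubrick')]"/>
        </Movies>
    </xsl:template>
</xsl:stylesheet>

Note that a conventional XSLT processor still builds an in-memory tree of the source document (more compact than DOM, as discussed in the comments below); Saxon-EE with XSLT 3.0 streaming can avoid holding the whole document in memory.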

4 Comments

  • Well, obviously the machine's time is more expensive, since it is going to work according to my solution long after I finish developing it :) Although, my question was not about the limitations of XSLT, but of DOM in the context of size...
  • I can't see why you would want to use DOM. If you use an XSLT processor it will build an in-memory tree, but most XSLT processors have an internal tree representation that is more economical than DOM.
  • I just want to know the limit of DOM, I did not say I want to use it... I didn't know about XSLT beforehand, but am investigating it right now. My question remains: can anyone provide info about the limitation of the DOM parser in the context of file size? (for educational purposes) Thank you!
  • It depends on the DOM implementation and on the detailed structure of your XML, but you're likely to need something between 5*N and 10*N of memory, where N is the raw document size.
