I am parsing through an XML Wikipedia data dump and I'd like to pull out each page and turn it into a new XML document containing a stripped-down version of the page. From each page, I'm only interested in the title, id, timestamp, username, and text.
Here is a full Wikipedia page:
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>381202555</id>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <minor />
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
    <sha1 />
  </revision>
</page>
After the stripping is done, I'd like to end up with something like this:
<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <revision>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
    </contributor>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>
Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?
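To frame the question, here's a rough sketch of the direction I'm considering (in Java): a SAX handler that buffers only the fields I need and writes a stripped <page> with StAX's XMLStreamWriter each time a closing page tag is reached. The file names ("enwiki-dump.xml", "stripped.xml") and the <pages> root element for the output are placeholders I made up, not part of the real dump schema:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StripPagesHandler extends DefaultHandler {

    private final XMLStreamWriter out;
    private final StringBuilder buf = new StringBuilder();

    // Context flags so the page-level <id> isn't confused with the
    // revision or contributor <id>.
    private boolean inRevision, inContributor;
    private String title, pageId, timestamp, username, text;

    public StripPagesHandler(XMLStreamWriter out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        buf.setLength(0); // start collecting fresh character data
        if (qName.equals("page")) {
            title = pageId = timestamp = username = text = null;
        } else if (qName.equals("revision")) {
            inRevision = true;
        } else if (qName.equals("contributor")) {
            inContributor = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buf.append(ch, start, length); // may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        try {
            switch (qName) {
                case "title":       title = buf.toString(); break;
                case "timestamp":   timestamp = buf.toString(); break;
                case "username":    username = buf.toString(); break;
                case "text":        text = buf.toString(); break;
                case "id":
                    if (!inRevision && !inContributor) pageId = buf.toString();
                    break;
                case "contributor": inContributor = false; break;
                case "revision":    inRevision = false; break;
                case "page":        writePage(); break; // flush one stripped page
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    private void writePage() throws XMLStreamException {
        out.writeStartElement("page");
        writeSimple("title", title);
        writeSimple("id", pageId);
        out.writeStartElement("revision");
        writeSimple("timestamp", timestamp);
        out.writeStartElement("contributor");
        writeSimple("username", username);
        out.writeEndElement(); // </contributor>
        out.writeStartElement("text");
        out.writeAttribute("xml", "http://www.w3.org/XML/1998/namespace", "space", "preserve");
        out.writeCharacters(text == null ? "" : text); // writer escapes &, <, > itself
        out.writeEndElement(); // </text>
        out.writeEndElement(); // </revision>
        out.writeEndElement(); // </page>
    }

    private void writeSimple(String name, String value) throws XMLStreamException {
        out.writeStartElement(name);
        out.writeCharacters(value == null ? "" : value);
        out.writeEndElement();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names; the real dump is obviously much larger.
        try (InputStream in = new FileInputStream("enwiki-dump.xml");
             OutputStream os = new FileOutputStream("stripped.xml")) {
            XMLStreamWriter out = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(os, "UTF-8");
            out.writeStartDocument("UTF-8", "1.0");
            out.writeStartElement("pages"); // placeholder root for the output
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new StripPagesHandler(out));
            out.writeEndElement();
            out.writeEndDocument();
            out.close();
        }
    }
}

Is streaming the output like this, never holding more than one page's fields in memory, a reasonable approach, or is there a better-suited way to write the new file as I parse?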
Thanks