Building XML file with SAX parser

Question

I am parsing an XML Wikipedia data dump and I'd like to pull out each page and write it to a new XML document as a stripped-down version of the page. For each page, I'm only interested in the title, id, timestamp, username, and text.

Here is a full Wikipedia page:

<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
  <id>381202555</id>
  <timestamp>2010-08-26T22:38:36Z</timestamp>
  <contributor>
    <username>OlEnglish</username>
    <id>7181920</id>
  </contributor>
  <minor />
  <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
  <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  <sha1 />
</revision>
</page>

What I'd like to end up with after the stripping is done would be something like this:

<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <revision>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
    </contributor>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>

Because of the sheer size of these documents I know I can't use DOM to handle this. I know how to set up a SAX parser but what would be the best way to build a new XML file while parsing the document?

Thanks

If you can't use DOM, maybe you can try VTD-XML or extended VTD-XML?
– vtd-xml-author, Commented Jul 19, 2013 at 19:18

Evgeniy Dorofeev · Accepted Answer · 2013-07-02 20:24:22Z

4

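You can use XMLFilterImpl and forward only the content you need. Here is the idea: both input and output are streams, so it can process XML of any size. The snippet below passes through only the page tags and drops all character data; extend the conditions to cover the other elements you want to keep.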

    XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (qName.equals("page")) {
                super.startElement(uri, localName, qName, atts);
            }
        }

        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (qName.equals("page")) {
                super.endElement(uri, localName, qName);
            }
        }

        public void characters(char[] ch, int start, int length) throws SAXException {
            //super.characters(ch, start, length);
        }
    };
    Source src = new SAXSource(xr, new InputSource("1.xml"));
    Result res = new StreamResult(System.out);
    TransformerFactory.newInstance().newTransformer().transform(src, res);
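To get closer to the output asked for in the question, the same idea can be extended with a whitelist of element paths. The following is only a sketch under some assumptions: the class name `PageStripFilter`, the path strings, and the `strip` helpers are made up for illustration, the parser is switched to namespace-aware mode so that `localName` is populated, and the root element is always kept so the result stays well-formed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Result;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: keeps title, id, timestamp, username and text of each page.
public class PageStripFilter extends XMLFilterImpl {

    // Elements kept only as wrappers, written as paths that start at <page>.
    private static final Set<String> CONTAINERS = new HashSet<>(Arrays.asList(
            "page", "page/revision", "page/revision/contributor"));

    // Elements kept together with their text content.
    private static final Set<String> LEAVES = new HashSet<>(Arrays.asList(
            "page/title", "page/id", "page/revision/timestamp",
            "page/revision/contributor/username", "page/revision/text"));

    // Path of every element that is currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    public PageStripFilter(XMLReader parent) {
        super(parent);
    }

    private static boolean kept(String path) {
        return CONTAINERS.contains(path) || LEAVES.contains(path);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Restart the path at every <page> so the rules work whatever the root element is.
        String parent = open.isEmpty() ? "" : open.peek();
        String path = localName.equals("page") ? "page" : parent + "/" + localName;
        open.push(path);
        // The root element (the only entry on the stack) is always forwarded.
        if (open.size() == 1 || kept(path)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String path = open.pop();
        if (open.isEmpty() || kept(path)) {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Forward text only inside kept leaf elements; this also drops inter-element whitespace.
        if (!open.isEmpty() && LEAVES.contains(open.peek())) {
            super.characters(ch, start, length);
        }
    }

    // Streams the input through the filter into the result using an identity transform.
    public static void strip(InputSource in, Result out) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed so that localName is populated
        XMLReader filter = new PageStripFilter(spf.newSAXParser().getXMLReader());
        TransformerFactory.newInstance().newTransformer().transform(new SAXSource(filter, in), out);
    }

    // Convenience wrapper for small inputs such as tests.
    public static String strip(String xml) throws Exception {
        StringWriter sw = new StringWriter();
        strip(new InputSource(new StringReader(xml)), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("<page><title>A</title><ns>0</ns><id>10</id></page>"));
    }
}
```

For a real dump you would call `strip` with an `InputSource` for the dump file and a `StreamResult` wrapping a file output stream, so nothing is held in memory beyond the current parser buffer. Because whitespace between elements is dropped, the output comes out on one line unless indentation is enabled on the transformer.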

edited Jul 2, 2013 at 20:24

answered Jul 2, 2013 at 19:59

Evgeniy Dorofeev


3 Comments

joshft91 Over a year ago

This looks like it might do what I'm looking for. I haven't used XMLFilterImpl before so I'll give this a shot.

joshft91 Over a year ago

I can look at that as well. Would a StAX XmlEventReader/XmlEventWriter be able to handle a huge XML file?

Evgeniy Dorofeev Over a year ago

It will, and it will be easier to write the filtering logic with it.
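For reference, the StAX route discussed in these comments could look roughly like the sketch below. It is not code from the answer: the class name `StaxPageStripper`, the path strings, and the `strip` helpers are assumptions made for illustration. It copies events from an `XMLEventReader` to an `XMLEventWriter` and skips everything outside a whitelist of element paths. The XML declaration is deliberately not copied, so the output should be written as UTF-8.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical StAX variant: copies only whitelisted elements from reader to writer.
public class StaxPageStripper {

    // Elements kept only as wrappers, written as paths that start at <page>.
    private static final Set<String> CONTAINERS = new HashSet<>(Arrays.asList(
            "page", "page/revision", "page/revision/contributor"));

    // Elements kept together with their text content.
    private static final Set<String> LEAVES = new HashSet<>(Arrays.asList(
            "page/title", "page/id", "page/revision/timestamp",
            "page/revision/contributor/username", "page/revision/text"));

    private static boolean kept(String path) {
        return CONTAINERS.contains(path) || LEAVES.contains(path);
    }

    public static void strip(XMLEventReader reader, XMLEventWriter writer)
            throws XMLStreamException {
        // Path of every element that is currently open, innermost first.
        Deque<String> open = new ArrayDeque<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                String parent = open.isEmpty() ? "" : open.peek();
                String path = name.equals("page") ? "page" : parent + "/" + name;
                open.push(path);
                // The root element is always copied so the output stays well-formed.
                if (open.size() == 1 || kept(path)) {
                    writer.add(event);
                }
            } else if (event.isEndElement()) {
                String path = open.pop();
                if (open.isEmpty() || kept(path)) {
                    writer.add(event);
                }
            } else if (event.isCharacters()) {
                // Copy text only inside kept leaf elements.
                if (!open.isEmpty() && LEAVES.contains(open.peek())) {
                    writer.add(event);
                }
            }
            // All other events (declaration, comments, processing instructions) are skipped.
        }
        writer.flush();
    }

    // Convenience wrapper for small inputs such as tests.
    public static String strip(String xml) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sw);
        strip(reader, writer);
        writer.close();
        reader.close();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(strip("<page><title>A</title><ns>0</ns><id>10</id></page>"));
    }
}
```

Both the reader and the writer work one event at a time, so for a large dump you would create them over file streams instead of strings and memory use stays flat.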

Sams · Accepted Answer · 2014-08-22 18:01:15Z

Here I have implemented parsing with a SAX parser that extracts the title element and the title attribute of the redirect element from a Wikipedia dump file.

package parser;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;



 public class SAXHandler extends DefaultHandler {

List<String> list;
int count=0,counter=0;
int MAX_SIZE=100000;
String temp="";
int counterz=0;
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException{

        long start = System.currentTimeMillis();            

        SAXHandler saxhandler=new SAXHandler();
        saxhandler.assign();
        saxhandler.parseDoc();

        long end = System.currentTimeMillis();
        System.out.println("Time taken to write is " + (end - start) + "msecs");    

}

void assign(){
    list = new ArrayList<String>(); 
}

void parseDoc() throws ParserConfigurationException, SAXException, IOException{

    SAXParserFactory spf = SAXParserFactory.newInstance();
    SAXParser sp = spf.newSAXParser();
    sp.parse("D:\\XMLParsing_Files\\enwiki-20120902-pages-articles-multistream.xml", this);
    writeToFile(list); // write the entries still buffered when parsing ends
}

public void startDocument() throws SAXException {

}

public void endDocument() throws SAXException {

}

// True only while the parser is inside a <title> element.
private boolean inTitle = false;

public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {

    if (qName.equalsIgnoreCase("title")) {
        // Start collecting a fresh title.
        inTitle = true;
        temp = "";
    } else if (qName.equalsIgnoreCase("redirect")) {
        addEntry(attributes.getValue("title"));
    }

}

public void endElement(String uri, String localName, String qName) throws SAXException {

    if (qName.equalsIgnoreCase("title")) {
        inTitle = false;
        addEntry(temp);
    }

}

public void characters(char ch[], int start, int length) throws SAXException {

    // The parser may deliver one text node in several chunks, so append instead of overwriting.
    if (inTitle) {
        temp += new String(ch, start, length);
    }
}

// Buffers one value and flushes the buffer to a file once MAX_SIZE values have been collected.
private void addEntry(String value) {

    list.add(value);
    count++;
    if (count == MAX_SIZE) {
        try {
            writeToFile(list);
        } catch (IOException e) {
            e.printStackTrace();
        }
        list.clear();
        count = 0;
    }
}

void writeToFile(List<String> list) throws IOException{

    Collections.sort(list);
    File file = new File("D:\\XMLParsing_Files\\Extracted_Data\\Extracted_Sorted_Data_" + getSuffix() + ".txt");


    if (!file.exists()) {
        file.createNewFile();
    }

    FileWriter fw = new FileWriter(file.getAbsoluteFile());
    PrintWriter pw = new PrintWriter(fw);

    Iterator<String> it = list.iterator();
    while (it.hasNext()) {
        pw.println(it.next());
    }
    pw.println("zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");   
    pw.close();
    System.out.println(++counterz + " Done");
}

int  getSuffix(){
    counter++;
    return counter;
 }

}
