I am parsing an XML Wikipedia data dump, and I'd like to pull out each page and write it to a new XML document as a stripped-down version of the page. For each page, I'm only interested in the title, id, timestamp, username, and text.

Here is a full Wikipedia page:

<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>381202555</id>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <minor />
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
    <sha1 />
  </revision>
</page>

What I'd like to end up with after the stripping is done would be something like this:

<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <revision>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
    </contributor>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>

Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?

Thanks

  • Is there a reason you wouldn't just use XSLT? Commented Jul 2, 2013 at 19:41
  • If you can't use DOM, maybe you can try VTD-XML or extended VTD-XML? Commented Jul 19, 2013 at 19:18

2 Answers

You can use XMLFilterImpl and forward only the content you need. Here is the idea. Both input and output are streams, so it can process XML of any size:

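    // XMLReaderFactory is deprecated since Java 9 but still works;
    // SAXParserFactory.newInstance().newSAXParser().getXMLReader() is the newer route.
    XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            // Forward only the elements to keep; extend this test for title, id, and so on.
            if (qName.equals("page")) {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            // Must mirror the test in startElement so the output stays balanced.
            if (qName.equals("page")) {
                super.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // All text is dropped here; call super.characters(ch, start, length)
            // for the elements whose text should be kept.
        }
    };
    // Identity transform: the filtered SAX events are serialized straight to the output.
    Source src = new SAXSource(xr, new InputSource("1.xml"));
    Result res = new StreamResult(System.out);
    TransformerFactory.newInstance().newTransformer().transform(src, res);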
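
To flesh out the idea for the exact fields in the question, here is a minimal sketch, not a definitive implementation. It assumes Java 11+, and the names `PageFilter`, `strip`, `CONTAINERS`, and `LEAVES` are made up for this example. It tracks the element path so that `page/id` is kept while `revision/id` and `contributor/id` are dropped, and it keeps the dump's `mediawiki` root so that a multi-page output remains a single well-formed document.

```java
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: forwards only whitelisted element paths to the serializer. */
class PageFilter extends XMLFilterImpl {
    // Kept elements that only wrap other elements (their whitespace text is dropped).
    private static final Set<String> CONTAINERS = Set.of(
            "mediawiki", "page", "page/revision", "page/revision/contributor");
    // Kept elements whose text content is forwarded.
    private static final Set<String> LEAVES = Set.of(
            "page/title", "page/id", "page/revision/timestamp",
            "page/revision/contributor/username", "page/revision/text");

    // Path of every currently open element, innermost first.
    private final Deque<String> paths = new ArrayDeque<>();

    PageFilter(XMLReader parent) {
        super(parent);
    }

    private static boolean kept(String path) {
        return CONTAINERS.contains(path) || LEAVES.contains(path);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String parent = paths.peek();
        // Paths restart at <page>, so they are the same with or without the dump's root.
        String path = (parent == null || qName.equals("page")) ? qName : parent + "/" + qName;
        paths.push(path);
        if (kept(path)) {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (kept(paths.pop())) {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String path = paths.peek();
        if (path != null && LEAVES.contains(path)) {
            super.characters(ch, start, length);
        }
    }

    /** Streams the XML from in to out, keeping only the whitelisted elements. */
    static void strip(Reader in, Writer out)
            throws ParserConfigurationException, SAXException, TransformerException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader filter = new PageFilter(spf.newSAXParser().getXMLReader());
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.INDENT, "yes");
        identity.transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PageFilter <input.xml> <output.xml>");
            return;
        }
        try (Reader in = Files.newBufferedReader(Path.of(args[0]));
             Writer out = Files.newBufferedWriter(Path.of(args[1]))) {
            strip(in, out);
        }
    }
}
```

Matching on the path rather than on the bare element name is the main design choice here, because `id` appears at three different levels of a page.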

3 Comments

This looks like it might do what I'm looking for. I haven't used XMLFilterImpl before so I'll give this a shot.
I can look at that as well. Would a StAX XMLEventReader/XMLEventWriter be able to handle a huge XML file?
It will, and it will be easier to write the filtering logic with it.
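
For the StAX route mentioned in these comments, a rough sketch could look like the following. It is only an illustration under assumptions: `StaxPageStripper`, `strip`, and the two path sets are hypothetical names, and the whitelist is derived from the desired output in the question. Events are pulled one at a time and only the wanted ones are copied to the writer, so memory use should not grow with the size of the file.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical StAX filter: copies only whitelisted element paths to the output. */
class StaxPageStripper {
    // Kept elements that only wrap other elements (their whitespace text is dropped).
    private static final Set<String> CONTAINERS = Set.of(
            "mediawiki", "page", "page/revision", "page/revision/contributor");
    // Kept elements whose text content is copied.
    private static final Set<String> LEAVES = Set.of(
            "page/title", "page/id", "page/revision/timestamp",
            "page/revision/contributor/username", "page/revision/text");

    private static boolean kept(String path) {
        return CONTAINERS.contains(path) || LEAVES.contains(path);
    }

    /** Streams the XML from in to out, keeping only the whitelisted elements. */
    static void strip(Reader in, Writer out) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // The dumps carry no DTD, so DTD processing can be switched off.
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLEventReader reader = inputFactory.createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        // Path of every currently open element, innermost first.
        Deque<String> paths = new ArrayDeque<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                String parent = paths.peek();
                // Paths restart at <page>, so they are the same with or without the root.
                String path = (parent == null || name.equals("page")) ? name : parent + "/" + name;
                paths.push(path);
                if (kept(path)) {
                    writer.add(event);
                }
            } else if (event.isEndElement()) {
                if (kept(paths.pop())) {
                    writer.add(event);
                }
            } else if (event.isCharacters()) {
                String path = paths.peek();
                if (path != null && LEAVES.contains(path)) {
                    writer.add(event);
                }
            }
            // Everything else (XML declaration, comments, ...) is skipped in this sketch.
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

A call such as `StaxPageStripper.strip(reader, writer)` with a buffered file reader and writer would process a dump front to back; the explicit `if`/`else` chain is where StAX makes the filtering rules easy to read.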

Here I have implemented parsing with a SAX parser that extracts the title element and the title attribute of the redirect element from a Wikipedia dump file.

package parser;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXHandler extends DefaultHandler {

    // Number of entries to buffer before a sorted chunk is written to disk.
    private static final int MAX_SIZE = 100000;

    private final List<String> list = new ArrayList<String>();
    // Accumulates text, because SAX may deliver one text node in several characters() calls.
    private final StringBuilder text = new StringBuilder();
    // Number of chunk files written so far; also used as the file name suffix.
    private int fileCounter = 0;

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
        long start = System.currentTimeMillis();

        new SAXHandler().parseDoc();

        long end = System.currentTimeMillis();
        System.out.println("Time taken to write is " + (end - start) + " msecs");
    }

    void parseDoc() throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        sp.parse(new File("D:\\XMLParsing_Files\\enwiki-20120902-pages-articles-multistream.xml"), this);
        writeToFile(list); // write the entries left over after the last full chunk
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Start collecting text afresh for every element.
        text.setLength(0);

        if (qName.equalsIgnoreCase("redirect")) {
            String title = attributes.getValue("title");
            if (title != null) {
                add(title);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equalsIgnoreCase("title")) {
            add(text.toString());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Append instead of overwrite: a single title can arrive in several pieces.
        text.append(ch, start, length);
    }

    // Buffers one entry and writes the buffer out once it holds MAX_SIZE entries.
    private void add(String value) throws SAXException {
        list.add(value);
        if (list.size() == MAX_SIZE) {
            try {
                writeToFile(list);
            } catch (IOException e) {
                // Abort the parse rather than silently lose a chunk.
                throw new SAXException(e);
            }
            list.clear();
        }
    }

    void writeToFile(List<String> list) throws IOException {
        Collections.sort(list);
        fileCounter++;
        File file = new File("D:\\XMLParsing_Files\\Extracted_Data\\Extracted_Sorted_Data_" + fileCounter + ".txt");

        // try-with-resources closes the file even if writing fails.
        try (PrintWriter pw = new PrintWriter(new FileWriter(file))) {
            for (String entry : list) {
                pw.println(entry);
            }
            // Sentinel line written after the last entry of each chunk.
            pw.println("zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
        }
        System.out.println("Chunk " + fileCounter + " done");
    }

}
