6

I have a project where I collect all the Wikipedia articles belonging to a particular category, pull the relevant content out of the Wikipedia dump, and load it into our database.

So I need to parse the Wikipedia dump file to get this done. Is there an efficient parser for the job? I am a Python developer, so I would prefer a parser written in Python. If there isn't one, suggest another and I will try to write a port of it in Python and contribute it back, so others can make use of it or at least try it.

All I want is a Python parser for Wikipedia dump files. I have started writing a manual parser that walks each node and extracts what I need.

4 Answers

3

There is example code for this at http://jjinux.blogspot.com/2009/01/python-parsing-wikipedia-dumps-using.html
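The linked post covers streaming the dump rather than loading it whole. In case the link rots, here is a minimal sketch of that idea using only the standard library's `xml.etree.ElementTree.iterparse`; the element names come from the MediaWiki export schema, and the namespace version (`export-0.10` here) varies between dump generations, so check your file's root element:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a real Wikipedia dump. The export schema wraps
# everything in a namespaced <mediawiki> element.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Python (programming language)</title>
    <revision><text>Python is a programming language.</text></revision>
  </page>
  <page>
    <title>Guido van Rossum</title>
    <revision><text>Guido created Python.</text></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(fileobj):
    """Yield (title, text) pairs without loading the whole dump into memory."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext("{0}revision/{0}text".format(NS))
            yield title, text
            elem.clear()  # free the already-processed subtree

for title, text in iter_pages(io.StringIO(SAMPLE)):
    print(title)
```

For a real multi-gigabyte dump you would pass an open file object instead of the `StringIO`; the `elem.clear()` call is what keeps memory flat.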


2

Another good module is mwlib. It is a pain to install with all its dependencies (at least on Windows), but it works well.


1

I don't know about the licensing, but this one is implemented in Python and includes the source.


0

I would strongly recommend mwxml. It is a utility for parsing Wikimedia dumps written by Aaron Halfaker, a research scientist at the Wikimedia Foundation. It can be installed with

pip install mwxml

Usage is pretty intuitive as demonstrated by this example from the documentation:

>>> import mwxml
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3

It is part of a larger set of data analysis utilities put out by the Wikimedia Foundation and its community.
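One practical note for any of these parsers: the official dumps are distributed bz2-compressed, and with the standard library's `bz2` module you can stream-decompress while reading instead of extracting the multi-gigabyte file to disk first. A small self-contained sketch (it writes its own tiny `.bz2` file to a temp directory to stand in for a downloaded dump):

```python
import bz2
import os
import tempfile

# Write a tiny bz2 file standing in for a downloaded dump.
path = os.path.join(tempfile.mkdtemp(), "dump.xml.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write("<mediawiki><page><title>Example</title></page></mediawiki>")

# Stream-decompress while reading; the returned file object can be
# handed straight to a parser instead of read all at once.
with bz2.open(path, "rt", encoding="utf-8") as f:
    data = f.read()
print(data[:11])
```

The same `bz2.open(...)` file object can be passed wherever a parser expects an open dump file.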

