6

I have a project where I collect all the Wikipedia articles belonging to a particular category, pull the relevant content out of the Wikipedia dump, and load it into our database.

So I need to parse the Wikipedia dump file to get this done. Is there an efficient parser for the job? I am a Python developer, so I would prefer a parser written in Python. If there isn't one, suggest another and I will try to write a port of it in Python and contribute it back, so others can make use of it or at least try it.

All I want is a Python parser for Wikipedia dump files. I have started writing a manual parser that walks each node and extracts what I need.

4 Answers

3

There is example code for this at http://jjinux.blogspot.com/2009/01/python-parsing-wikipedia-dumps-using.html
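The linked post covers streaming the dump rather than loading it whole. In case the link rots, here is a minimal sketch of that idea using only the standard library's `xml.etree.ElementTree.iterparse`; the element names come from the MediaWiki export schema, and the namespace version (`export-0.10` here) varies between dump generations, so check your file's root element:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a real Wikipedia dump. The export schema wraps
# everything in a namespaced <mediawiki> element.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Python (programming language)</title>
    <revision><text>Python is a programming language.</text></revision>
  </page>
  <page>
    <title>Guido van Rossum</title>
    <revision><text>Guido created Python.</text></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(fileobj):
    """Yield (title, text) pairs without loading the whole dump into memory."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext("{0}revision/{0}text".format(NS))
            yield title, text
            elem.clear()  # free the already-processed subtree

for title, text in iter_pages(io.StringIO(SAMPLE)):
    print(title)
```

For a real multi-gigabyte dump you would pass an open file object instead of the `StringIO`; the `elem.clear()` call is what keeps memory flat.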


2

Another good module is mwlib. It is a pain to install with all its dependencies (at least on Windows), but it works well.


1

I don't know about the licensing, but this one is implemented in Python and includes the source.


0

I would strongly recommend mwxml. It is a utility for parsing Wikimedia dumps written by Aaron Halfaker, a research scientist at the Wikimedia Foundation. It can be installed with

pip install mwxml

Usage is pretty intuitive as demonstrated by this example from the documentation:

>>> import mwxml
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3

It is part of a larger set of data analysis utilities put out by the Wikimedia Foundation and its community.
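One practical note for any of these parsers: the official dumps are distributed bz2-compressed, and with the standard library's `bz2` module you can stream-decompress while reading instead of extracting the multi-gigabyte file to disk first. A small self-contained sketch (it writes its own tiny `.bz2` file to a temp directory to stand in for a downloaded dump):

```python
import bz2
import os
import tempfile

# Write a tiny bz2 file standing in for a downloaded dump.
path = os.path.join(tempfile.mkdtemp(), "dump.xml.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write("<mediawiki><page><title>Example</title></page></mediawiki>")

# Stream-decompress while reading; the returned file object can be
# handed straight to a parser instead of read all at once.
with bz2.open(path, "rt", encoding="utf-8") as f:
    data = f.read()
print(data[:11])
```

The same `bz2.open(...)` file object can be passed wherever a parser expects an open dump file.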

