I want to loop through a Wikipedia dump in XML format and for each revision I want to save the Timestamp and the Comment if the revision is made by a certain username. Is this possible? I'm trying to get familiar with lxml.

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
        <sitename>Wikipedia</sitename>
        <dbname>enwiki</dbname>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.27.0-wmf.18</generator>
        <case>first-letter</case>
        <namespaces>...</namespaces>
    </siteinfo>
    <page>
        <title>Zhuangzi</title>
        <ns>0</ns>
        <id>42870472</id>
        <revision>
            <id>610251969</id>
            <timestamp>2014-05-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>

    </page>
    <page>...</page>
</mediawiki>
    What tools have you found to read XML data and what code have you tried to use to do what you ask? Commented Mar 31, 2016 at 12:43

3 Answers

import xmltodict 


xml_input = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.27.0-wmf.18</generator>
    <case>first-letter</case>
    <namespaces>...</namespaces>
</siteinfo>
<page>
    <title>Zhuangzi</title>
    <ns>0</ns>
    <id>42870472</id>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-25T20:08:14Z</timestamp>
        <contributor>
            <username>Patric</username>
            <id>8761551</id>
        </contributor>
    </revision>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-26T20:08:14Z</timestamp>
        <contributor>
            <username>Don</username>
            <id>8761551</id>
        </contributor>
    </revision>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-27T20:08:14Z</timestamp>
        <contributor>
            <username>Patric</username>
            <id>8761551</id>
        </contributor>
    </revision>                
</page>
</mediawiki>
"""


dic_xml = xmltodict.parse(xml_input)

# Each revision is a dict; print the id and timestamp of those made by 'Patric'.
for rev in dic_xml['mediawiki']['page']['revision']:
    if rev['contributor']['username'] == 'Patric':
        print(rev['id'])
        print(rev['timestamp'])

With your file:

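import xmltodict

with open('/home/jurkij/Downloads/testarticles.xml') as xml_file:
    # force_list keeps 'page' and 'revision' as lists even when only one is present;
    # without it a single page or revision comes back as a plain dict.
    dic_xml = xmltodict.parse(xml_file.read(), force_list=('page', 'revision'))

for page in dic_xml['mediawiki']['page']:
    for rev in page['revision']:
        # Some contributor elements have no <username>, so check for the key first.
        if 'username' in rev['contributor'] and rev['contributor']['username'] == 'Aristophanes68':
            print(rev['timestamp'])
            print(rev['id'])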

3 Comments

Looks good but I can't get it to work with dic_xml = xmltodict.parse(open('2articles.xml', encoding='latin-1').read())
Can you upload your XML somewhere and paste the link?
And what exactly is your problem? I can parse it without trouble. There was just one problem: the username key is missing from some contributor elements, which I fixed with the 'username' in rev['contributor'] condition.

Yes, this is possible using lxml.

You know which nodes you are looking for (start with the revision's username), so write code that selects that node and compares its value against the name you are looking for.

Once you have done that part, saving the timestamp and comment should be simple.

You will find what you need in the lxml documentation (http://lxml.de/); look into the sections on "XPath" to figure out how to select the nodes you want (these include snippets that load the XML into your script).

You may also wish to consult the ElementTree tutorial that lxml links (http://effbot.org/zone/element.htm) to get an understanding of how you can use the XML elements you'll find using the XPath or other methods. This will be useful for getting the values from the elements.
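As a rough illustration of the steps above, here is a minimal sketch that selects each revision, compares its username, and keeps the timestamp and comment. It uses only the ElementTree-style calls that lxml shares with the standard library, so it falls back to xml.etree.ElementTree if lxml is not installed. The function name revisions_by_user, the "wiki" prefix, and the sample data are made up for the example; only the namespace URI comes from the dump in the question.

```python
import io

try:
    from lxml import etree  # the library the question asks about
except ImportError:  # API-compatible standard library fallback
    import xml.etree.ElementTree as etree

# Namespace URI copied from the dump's root element; the "wiki" prefix is arbitrary.
NS = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}


def revisions_by_user(root, username):
    """Return (timestamp, comment) pairs for revisions made by `username`."""
    matches = []
    for rev in root.iterfind(".//wiki:revision", NS):
        # A contributor may have no <username>; findtext then returns None.
        name = rev.findtext("wiki:contributor/wiki:username", namespaces=NS)
        if name == username:
            matches.append((
                rev.findtext("wiki:timestamp", namespaces=NS),
                # Assumption: treat a missing <comment> as an empty string.
                rev.findtext("wiki:comment", default="", namespaces=NS),
            ))
    return matches


# Made-up sample in the same shape as the dump in the question.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Zhuangzi</title>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username><id>8761551</id></contributor>
      <comment>first edit</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><username>foobar</username></contributor>
      <comment>other edit</comment>
    </revision>
  </page>
</mediawiki>"""

root = etree.parse(io.BytesIO(SAMPLE)).getroot()
print(revisions_by_user(root, "White whirlwind"))
```

For a real file, pass the path to etree.parse instead of the BytesIO object.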


Continuing on from your last question, you can easily do it with lxml and an xpath expression:

from lxml.etree import parse

tree = parse("test.xml")

ns = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}
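# Select every revision whose contributor's username is 'White whirlwind'.
revs = tree.xpath("//wiki:revision[.//wiki:username='White whirlwind']", namespaces=ns)

# Pull the timestamp and username text out of each matching revision.
print([(rev.xpath(".//wiki:timestamp//text()", namespaces=ns)[0],
        rev.xpath(".//wiki:username//text()", namespaces=ns)[0])
       for rev in revs])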

For the following xml:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
        <sitename>Wikipedia</sitename>
        <dbname>enwiki</dbname>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.27.0-wmf.18</generator>
        <case>first-letter</case>
        <namespaces>...</namespaces>
    </siteinfo>
    <page>
        <title>Zhuangzi</title>
        <ns>0</ns>
        <id>42870472</id>
        <revision>
            <id>610251969</id>
            <timestamp>2014-05-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>
            <id>610251969</id>
            <timestamp>2014-06-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>
            <id>610251969</id>
            <timestamp>2014-07-26T20:08:14Z</timestamp>
            <contributor>
                <username>foobar</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>

    </page>
</mediawiki>

Outputs:

[('2014-05-26T20:08:14Z', 'White whirlwind'), ('2014-06-26T20:08:14Z', 'White whirlwind')]

//wiki:revision[.//wiki:username='White whirlwind'] finds every revision element that contains a username whose value is White whirlwind. It returns two revisions here because foobar does not match. After that, you just need to extract the timestamp and username values from each of the filtered revisions in revs.

For your file in Google Drive it returns:

[('2014-05-26T20:08:14Z', 'White whirlwind'),
 ('2014-05-26T20:12:49Z', 'White whirlwind'),
 ('2014-05-26T20:13:04Z', 'White whirlwind'),
 ('2014-05-31T21:14:15Z', 'White whirlwind'),
 ('2015-10-11T19:24:46Z', 'White whirlwind'),
 ('2015-10-11T19:26:31Z', 'White whirlwind')]

If you check your file, you will see that this is correct.
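If the dump is too large to load as one tree, a streaming variant of the same filter may be more practical. The following is only a sketch under that assumption: it uses iterparse, which exists in both lxml.etree and the standard library, and it yields the timestamp and comment (what the question asks for) rather than the username. The names iter_user_revisions and WIKI and the sample data are hypothetical.

```python
import io

try:
    from lxml import etree
except ImportError:  # API-compatible standard library fallback
    import xml.etree.ElementTree as etree

# Namespace from the dump's root element, in the {uri}tag form used by element tags.
WIKI = "{http://www.mediawiki.org/xml/export-0.10/}"


def iter_user_revisions(source, username):
    """Yield (timestamp, comment) for each revision by `username`, one at a time."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != WIKI + "revision":
            continue
        name = elem.findtext(WIKI + "contributor/" + WIKI + "username")
        if name == username:
            yield (elem.findtext(WIKI + "timestamp"),
                   elem.findtext(WIKI + "comment", default=""))
        # Drop the revision's children (including the large <text>) once handled.
        elem.clear()


# Made-up sample; for a real dump pass the file path or an open binary file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>moved page</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><username>foobar</username></contributor>
    </revision>
  </page>
</mediawiki>"""

for timestamp, comment in iter_user_revisions(io.BytesIO(SAMPLE), "White whirlwind"):
    print(timestamp, comment)
```

Because each revision is cleared after it is examined, memory use should stay roughly proportional to one revision rather than to the whole file, although the emptied elements themselves remain attached to their page.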
