Extract data from XML file if arguments are of certain values

Question

I want to loop through a Wikipedia dump in XML format and for each revision I want to save the Timestamp and the Comment if the revision is made by a certain username. Is this possible? I'm trying to get familiar with lxml.

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
        <sitename>Wikipedia</sitename>
        <dbname>enwiki</dbname>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.27.0-wmf.18</generator>
        <case>first-letter</case>
        <namespaces>...</namespaces>
    </siteinfo>
    <page>
        <title>Zhuangzi</title>
        <ns>0</ns>
        <id>42870472</id>
        <revision>
            <id>610251969</id>
            <timestamp>2014-05-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>

    </page>
    <page>...</page>
</mediawiki>

What tools have you found to read XML data and what code have you tried to use to do what you ask?
– OneCricketeer, Commented Mar 31, 2016 at 12:43

jurkij · Accepted Answer · 2016-03-31 15:27:09Z


import xmltodict 


xml_input = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.27.0-wmf.18</generator>
    <case>first-letter</case>
    <namespaces>...</namespaces>
</siteinfo>
<page>
    <title>Zhuangzi</title>
    <ns>0</ns>
    <id>42870472</id>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-25T20:08:14Z</timestamp>
        <contributor>
            <username>Patric</username>
            <id>8761551</id>
        </contributor>
    </revision>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-26T20:08:14Z</timestamp>
        <contributor>
            <username>Don</username>
            <id>8761551</id>
        </contributor>
    </revision>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-27T20:08:14Z</timestamp>
        <contributor>
            <username>Patric</username>
            <id>8761551</id>
        </contributor>
    </revision>                
</page>
</mediawiki>
"""


dic_xml = xmltodict.parse(xml_input)

for rev in dic_xml['mediawiki']['page']['revision']:
    if rev['contributor']['username'] == 'Patric':
        print(rev['id'])
        print(rev['timestamp'])

With your file:

import xmltodict

with open('/home/jurkij/Downloads/testarticles.xml') as xml_file:
    dic_xml = xmltodict.parse(xml_file.read())

for page in dic_xml['mediawiki']['page']:
    for rev in page['revision']:
        # Some contributors have no username, so check for the key first.
        if 'username' in rev['contributor'] and rev['contributor']['username'] == 'Aristophanes68':
            print(rev['timestamp'])
            print(rev['id'])

edited Mar 31, 2016 at 15:27

answered Mar 31, 2016 at 13:17


3 Comments

Knokkelgeddon Over a year ago

Looks good but I can't get it to work with dic_xml = xmltodict.parse(open('2articles.xml', encoding='latin-1').read())

jurkij Over a year ago

Can you upload your XML somewhere and paste a link?

jurkij Over a year ago

And what exactly is your problem? I can parse it without trouble. There was just one problem, a missing username key under the parent contributor tag, which I fixed with the 'username' in rev['contributor'] condition.
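The missing-username fix mentioned above is one of two shape problems that tend to come up with xmltodict output. The other is that a tag occurring once parses to a single dict while a repeated tag parses to a list, so a dump with one page (or a page with one revision) breaks a plain for loop. Below is a minimal sketch of a tolerant traversal. The helper names as_list and revisions_by are hypothetical, and the parsed dict is a hand-written stand-in for what xmltodict.parse() would return, with a made-up comment value.

```python
def as_list(value):
    """Return value as a list: [] for None, [value] for a single item."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def revisions_by(dic_xml, username):
    """Collect (timestamp, comment) pairs for revisions by one username."""
    hits = []
    for page in as_list(dic_xml["mediawiki"].get("page")):
        for rev in as_list(page.get("revision")):
            # An empty <contributor/> parses to None, so fall back to {}.
            contributor = rev.get("contributor") or {}
            if contributor.get("username") == username:
                hits.append((rev.get("timestamp"), rev.get("comment")))
    return hits


# Hand-written stand-in for xmltodict.parse() output (assumption, not real data).
# Note that "page" is a single dict here, not a list.
parsed = {
    "mediawiki": {
        "page": {
            "title": "Zhuangzi",
            "revision": [
                {"timestamp": "2014-05-25T20:08:14Z",
                 "contributor": {"username": "Patric", "id": "8761551"},
                 "comment": "hypothetical comment"},
                # A contributor entry without a username key.
                {"timestamp": "2014-05-26T20:08:14Z",
                 "contributor": {"ip": "127.0.0.1"}},
            ],
        }
    }
}

print(revisions_by(parsed, "Patric"))
```

Using .get() throughout means a revision without a comment yields None for that field instead of raising KeyError.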

sikrob · Accepted Answer · 2016-03-31 13:15:39Z

Yes, this is possible using lxml.

You know which nodes you are looking for (start with the revision's username), so write code to select that node and compare its value against the name you are looking for.

Once you have done that part, saving the timestamp and comment should be simple.

You will find what you need in the lxml documentation (http://lxml.de/); look into the sections on "XPath" to figure out how to select the nodes you want (these include snippets that load the XML into your script).

You may also wish to consult the ElementTree tutorial that lxml links (http://effbot.org/zone/element.htm) to get an understanding of how you can use the XML elements you'll find using the XPath or other methods. This will be useful for getting the values from the elements.
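The steps described in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree, whose find/findtext API lxml.etree also provides; the function name revisions_by is hypothetical; and the sample XML is a trimmed version of the question's dump with a made-up comment text.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the <mediawiki> root element.
NS = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}

# Trimmed sample; the comment text is invented for illustration.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Zhuangzi</title>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>moved page</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def revisions_by(root, username):
    """Return (timestamp, comment) for each revision made by username."""
    results = []
    for rev in root.iterfind(".//wiki:revision", NS):
        # findtext returns None when the contributor has no <username>.
        name = rev.findtext("wiki:contributor/wiki:username", namespaces=NS)
        if name == username:
            results.append((
                rev.findtext("wiki:timestamp", namespaces=NS),
                rev.findtext("wiki:comment", default="", namespaces=NS),
            ))
    return results


root = ET.fromstring(SAMPLE)
print(revisions_by(root, "White whirlwind"))
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE).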

Community · Accepted Answer · 2017-05-23 12:24:08Z

Continuing on from your last question, you can easily do it with lxml and an xpath expression:

from lxml.etree import parse

tree = parse("test.xml")

ns = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}
revs = tree.xpath("//wiki:revision[.//wiki:username='White whirlwind']", namespaces=ns)

print([(rev.xpath(".//wiki:timestamp//text()", namespaces=ns)[0],
        rev.xpath(".//wiki:username//text()", namespaces=ns)[0])
       for rev in revs])

For the following xml:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
        <sitename>Wikipedia</sitename>
        <dbname>enwiki</dbname>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.27.0-wmf.18</generator>
        <case>first-letter</case>
        <namespaces>...</namespaces>
    </siteinfo>
    <page>
        <title>Zhuangzi</title>
        <ns>0</ns>
        <id>42870472</id>
        <revision>
            <id>610251969</id>
            <timestamp>2014-05-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>
            <id>610251969</id>
            <timestamp>2014-06-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>
            <id>610251969</id>
            <timestamp>2014-07-26T20:08:14Z</timestamp>
            <contributor>
                <username>foobar</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>

    </page>
</mediawiki>

Outputs:

[('2014-05-26T20:08:14Z', 'White whirlwind'), ('2014-06-26T20:08:14Z', 'White whirlwind')]

//wiki:revision[.//wiki:username='White whirlwind'] finds every revision tag that contains a username whose value is White whirlwind. It returns two revisions here because foobar does not match. After that, you just need to extract the timestamp and username values from each of the filtered revisions in revs.

For your file on Google Drive, it returns:

[('2014-05-26T20:08:14Z', 'White whirlwind'),
 ('2014-05-26T20:12:49Z', 'White whirlwind'),
 ('2014-05-26T20:13:04Z', 'White whirlwind'),
 ('2014-05-31T21:14:15Z', 'White whirlwind'),
 ('2015-10-11T19:24:46Z', 'White whirlwind'),
 ('2015-10-11T19:26:31Z', 'White whirlwind')]

If you check your file, you will see that this is correct.
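Since the question also asks for the comment, and parse() loads the whole tree into memory, a streaming variant may be worth considering for a large dump. The following is a rough sketch, not part of the original answer: it assumes the standard library's xml.etree.ElementTree.iterparse (lxml.etree has an iterparse as well), the function name stream_revisions is hypothetical, and the comment texts in the sample are invented.

```python
import io
import xml.etree.ElementTree as ET

# Tags are reported in {namespace}localname form by ElementTree.
WIKI = "{http://www.mediawiki.org/xml/export-0.10/}"

# Small in-memory sample; the comment texts are made up.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Zhuangzi</title>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>first edit</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><username>foobar</username></contributor>
      <comment>other edit</comment>
    </revision>
  </page>
</mediawiki>"""


def stream_revisions(source, username):
    """Yield (timestamp, comment) for revisions by username, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != WIKI + "revision":
            continue
        name = elem.findtext(WIKI + "contributor/" + WIKI + "username")
        if name == username:
            yield (elem.findtext(WIKI + "timestamp"),
                   elem.findtext(WIKI + "comment", default=""))
        # Drop the revision's children (including the large <text> body).
        # The emptied element itself stays attached to its parent page.
        elem.clear()


for timestamp, comment in stream_revisions(io.BytesIO(SAMPLE), "White whirlwind"):
    print(timestamp, comment)
```

The source argument can be a filename or an open binary file object instead of the BytesIO used here.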
