Python xml parsing using minidom

Question

I just started learning how to parse XML using minidom. I tried to get the authors' names (the XML data is below) using the following code:

from xml.dom import minidom

xmldoc = minidom.parse("cora.xml")

author = xmldoc.getElementsByTagName ('author')

for author in author:
    authorID=author.getElementsByTagName('author id')
    print authorID

I got empty brackets ([]) all the way. Can someone please help me out? I will also need the title and venue. Thanks in advance. See the XML data below:

<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
      <author id="64"> H. Stubb</author>
      <author id="103"> P. Dyreklev</author>
      <author id="54"> M. Fahlman</author>
      <title>Inganas</title>
      <title>and</title>
      <title>M.R.</title>
      <venue>
         <venue pubid="ahlskog1994a" id="1">
                  <name>Andersson</name>
                  <name> J Appl. Phys.</name>
                  <vol>76</vol>
                  <date> (1994). </date>
            </venue>

Is that the correct XML data? There’s an extra opening <venue> tag, and the <publication> and <coraRADD> tags aren’t closed.
– Paul D. Waite, Commented May 16, 2013 at 13:25
Hi Paul, that's the correct XML data. I copied it directly from the XML file.
– user2274879, Commented May 16, 2013 at 13:28
Are you married to the minidom library? The ElementTree API is much easier to use, for example.
– Martijn Pieters, Commented May 16, 2013 at 13:29
I just started parsing, hence I do not know much about the other APIs. I'll try ElementTree if it's really that much easier to use. Thanks.
– user2274879, Commented May 16, 2013 at 13:34
When I save that XML to my computer and attempt to parse it using minidom (xmldoc = minidom.parse("cora.xml")), I get an xml.parsers.expat.ExpatError. Maybe I should say “is that the complete XML data”?
– Paul D. Waite, Commented May 16, 2013 at 13:38

Martijn Pieters · Accepted Answer · 2013-05-16 13:51:28Z


You can only find tags with getElementsByTagName(), not attributes. You'll need to access those through the Element.getAttribute() method instead:

for author in author:
    authorID = author.getAttribute('id')
    print authorID
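
For reference, here is a minimal self-contained sketch of the same idea that also reads each author's name. It is written for Python 3 and parses a hypothetical, well-formed excerpt of the question's data from a string (the closing tags and the SAMPLE, authors, author_id and name identifiers are assumptions added for illustration, not part of the original file or code):

```python
from xml.dom import minidom

# Hypothetical, well-formed excerpt of the question's data (closing tags added).
SAMPLE = """<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
   </publication>
</coraRADD>"""

xmldoc = minidom.parseString(SAMPLE)

authors = []
for author in xmldoc.getElementsByTagName('author'):
    author_id = author.getAttribute('id')   # attribute value, returned as a string
    name = author.firstChild.data.strip()   # text node inside the <author> element
    authors.append((author_id, name))

print(authors)
```

In the DOM, the name is not stored on the element itself but in a child text node, which is why the sketch goes through firstChild.data.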

If you are still learning about parsing XML, you are better off staying away from the DOM. The DOM API was designed to be language-neutral, which makes it overly verbose when used from Python.

The ElementTree API would be easier to use:

import xml.etree.ElementTree as ET

tree = ET.parse('cora.xml')
root = tree.getroot()

# loop over all publications
for pub in root.findall('publication'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in pub.findall('author'):
        print 'Author id: {}'.format(author.attrib['id'])
        print 'Author name: {}'.format(author.text)
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print ', '.join([name.text for name in venue.findall('name')])
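
If you want to try this without the original file, the following is a rough Python 3 sketch of the same loop that collects the results instead of printing them. It assumes a well-formed version of the data (the missing closing tags are added here), and SAMPLE and records are names made up for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt of the question's data (closing tags added).
SAMPLE = """<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
      <title>Inganas</title>
      <title>and</title>
      <venue>
         <venue pubid="ahlskog1994a" id="1">
            <name>Andersson</name>
            <name> J Appl. Phys.</name>
            <vol>76</vol>
         </venue>
      </venue>
   </publication>
</coraRADD>"""

root = ET.fromstring(SAMPLE)

records = []
for pub in root.findall('publication'):
    # Join the separate <title> fragments into one string.
    title = ' '.join(t.text.strip() for t in pub.findall('title'))
    # The id is an attribute; the name is the element's text content.
    authors = [(a.get('id'), a.text.strip()) for a in pub.findall('author')]
    # Only nested <venue> elements that carry an id attribute.
    venues = [
        ', '.join(n.text.strip() for n in v.findall('name'))
        for v in pub.findall('.//venue[@id]')
    ]
    records.append({'title': title, 'authors': authors, 'venues': venues})

print(records)
```

The strip() calls are only there because several values in the sample data have leading spaces.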

edited May 16, 2013 at 13:51

answered May 16, 2013 at 13:34

Martijn Pieters


6 Comments

user2274879 Over a year ago

Hi Pieters, it's working now. Thanks a lot, but I am more interested in the names of the authors and the venue. Any ideas?

Martijn Pieters Over a year ago

@user2274879: Loop over the publications instead (for pub in root.findall('publication'):), then find the authors from there (for author in pub.findall('author')) and the venues (perhaps for venue in pub.findall('.//venue[@id]'), to find just those with an id attribute). Author names are the text content of the tag, so author.text will get you that.

user2274879 Over a year ago

I got the following error when I tried using author.text: TypeError: 'str' object is not callable

Martijn Pieters Over a year ago

@user2274879: No (); .text is an attribute (and yes, I made that mistake at first, corrected since).

user2274879 Over a year ago

Hi Pieters, it's working like magic. I was actually frustrated before posting the question.
