
I just started learning how to parse XML using minidom. I tried to get the authors' names (the XML data is below) using the following code:

from xml.dom import minidom

xmldoc = minidom.parse("cora.xml")

author = xmldoc.getElementsByTagName ('author')

for author in author:
    authorID=author.getElementsByTagName('author id')
    print authorID

I got empty brackets ([]) for every author. Can someone please help me out? I will also need the title and venue. Thanks in advance. The XML data is below:

<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
      <author id="64"> H. Stubb</author>
      <author id="103"> P. Dyreklev</author>
      <author id="54"> M. Fahlman</author>
      <title>Inganas</title>
      <title>and</title>
      <title>M.R.</title>
      <venue>
         <venue pubid="ahlskog1994a" id="1">
                  <name>Andersson</name>
                  <name> J Appl. Phys.</name>
                  <vol>76</vol>
                  <date> (1994). </date>
            </venue>
  • Is that the correct XML data? There’s an extra opening <venue> tag, and the <publication> and <coraRADD> tags aren’t closed. Commented May 16, 2013 at 13:25
  • Hi Paul, that's the correct XML data. I copied it directly from the XML file. Commented May 16, 2013 at 13:28
  • Are you married to the minidom library? The ElementTree API is much easier to use, for example. Commented May 16, 2013 at 13:29
  • I just started parsing, hence I do not know much about the other APIs. I'll try ElementTree if it's really that much easier to use. Thanks. Commented May 16, 2013 at 13:34
  • When I save that XML to my computer and attempt to parse it using minidom (xmldoc = minidom.parse("cora.xml")), I get an xml.parsers.expat.ExpatError error. Maybe I should say “is that the complete XML data”? Commented May 16, 2013 at 13:38

1 Answer

You can only find tags with getElementsByTagName(), not attributes. You'll need to access those through the Element.getAttribute() method instead:

for author in author:
    authorID = author.getAttribute('id')
    print authorID
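
As a self-contained check, here is a minimal sketch that reads both the id attribute and the text of each author with minidom. The SAMPLE string and the author_ids_and_names helper are hypothetical names of mine, and the closing tags in the sample are an assumption, since the XML posted in the question is incomplete. It uses print() with a single argument so it should run on both Python 2 and Python 3.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample based on the question's data;
# the closing tags are assumed because the posted XML is truncated.
SAMPLE = """<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
   </publication>
</coraRADD>"""


def author_ids_and_names(xml_text):
    """Return (id, name) pairs for every <author> element."""
    doc = minidom.parseString(xml_text)
    pairs = []
    for author in doc.getElementsByTagName('author'):
        author_id = author.getAttribute('id')    # attribute value
        name = author.firstChild.data.strip()    # text node inside the tag
        pairs.append((author_id, name))
    return pairs


for author_id, name in author_ids_and_names(SAMPLE):
    print('{0}: {1}'.format(author_id, name))
```

Note that the author's name is not an attribute either: it lives in a text node that is a child of the element, which is why the sketch goes through firstChild.data.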

If you are still learning how to parse XML, I'd recommend staying away from the DOM. The DOM API was designed to be language-neutral, which makes it overly verbose in Python.

The ElementTree API would be easier to use:

import xml.etree.ElementTree as ET

tree = ET.parse('cora.xml')
root = tree.getroot()

# loop over all publications
for pub in root.findall('publication'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in pub.findall('author'):
        print 'Author id: {}'.format(author.attrib['id'])
        print 'Author name: {}'.format(author.text)
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print ', '.join([name.text for name in venue.findall('name')])
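
To try the same approach without the original file, here is a rough sketch that runs ElementTree against an inline string. The SAMPLE string and the summarize helper are hypothetical, and the closing tags in the sample are my assumption, because the XML posted in the question is not complete and would not parse as shown. It uses print() with a single argument so it should run on both Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the question's data; the closing
# tags are assumed because the posted XML is truncated.
SAMPLE = """<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
      <title>Inganas</title>
      <title>and</title>
      <title>M.R.</title>
      <venue>
         <venue pubid="ahlskog1994a" id="1">
            <name>Andersson</name>
            <name> J Appl. Phys.</name>
            <vol>76</vol>
            <date> (1994). </date>
         </venue>
      </venue>
   </publication>
</coraRADD>"""


def summarize(xml_text):
    """Collect the title, authors and venue names of each publication."""
    root = ET.fromstring(xml_text)
    results = []
    for pub in root.findall('publication'):
        title = ' '.join(t.text.strip() for t in pub.findall('title'))
        authors = [(a.attrib['id'], a.text.strip())
                   for a in pub.findall('author')]
        # './/venue[@id]' matches only venue elements carrying an id attribute
        venues = [', '.join(n.text.strip() for n in v.findall('name'))
                  for v in pub.findall('.//venue[@id]')]
        results.append({'title': title, 'authors': authors, 'venues': venues})
    return results


for entry in summarize(SAMPLE):
    print(entry['title'])
    for author_id, name in entry['authors']:
        print('Author {0}: {1}'.format(author_id, name))
    for venue in entry['venues']:
        print('Venue: ' + venue)
```

The strip() calls are there because several values in the data carry leading or trailing spaces.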

6 Comments

Hi Pieters, it's working now. Thanks a lot, but I am more interested in the names of the authors and the venue. Any ideas?
@user2274879: Loop over the publications instead (for pub in root.findall('publication'):), then find the authors from there (for author in pub.findall('author'):) and the venues (perhaps for venue in pub.findall('.//venue[@id]'):, to find just those with an id attribute). Author names are the text content of the tag, so author.text will get you that.
I got the following error when I tried using author.text(): TypeError: 'str' object is not callable
@user2274879: No (); .text is an attribute (and yes, I made that mistake at first, corrected since).
Hi Pieters, it's working like magic. I was actually frustrated before posting the question.