Extract XML data with Python

Question

I have a huge list of different authors and their selected works in a <list> in an XML file (named bibliography.xml). Here is an example:

<list type="index">
                <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                        Cat 1843 (<abbr>Cat.</abbr>).</bibl> — <bibl>The Gold-Bug 1843
                            (<abbr>Bug.</abbr>).</bibl> — <bibl>The Raven 1845
                        (<abbr>Rav.</abbr>).</bibl></item>

                <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                            (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                        (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                            (<abbr>PolyL.</abbr>)</bibl></item>
                
                <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                        Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                            (<abbr>Gil.</abbr>)</bibl></item>
            </list>

import xml.etree.ElementTree as ET

tree = ET.parse('bibliography.xml')
root = tree.getroot()

for work in root:
    if(work.tag=='item'):
        print work.get('persName')
            if (attr.tag=='abbr')
                print (attr.text)

Obviously it's not working, but since I'm absolutely new to Python, I can't wrap my mind around what I'm doing wrong. It would be highly appreciated if someone could help me out here.

Okay, that's weird, because Oxygen and some other validators are fine with the XML. Keep in mind that I just posted a snippet of the <list>, not the whole TEI header, body, etc.
– SparrowSilencio, Commented Feb 27, 2021 at 12:33
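One thing worth checking when the snippet is part of a complete TEI file: TEI documents normally declare the namespace http://www.tei-c.org/ns/1.0 on the root element, and in that case an unqualified search for item finds nothing. The sketch below uses the standard library's xml.etree.ElementTree. That the asker's file declares this namespace is an assumption, and the wrapper elements and the tei prefix are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the full file declares the TEI namespace on its root element.
# The prefix "tei" is an arbitrary local label used only in the search paths.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical minimal document wrapping one entry from the question.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <list type="index">
      <item><persName>Barth, John</persName>, <bibl>Giles Goat-Boy 1960 (<abbr>Gil.</abbr>)</bibl></item>
    </list>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Without the namespace the search matches nothing.
plain = root.findall(".//item")

# With the namespace mapping the same search finds the entry.
qualified = root.findall(".//tei:item", TEI_NS)

entries = []
for item in qualified:
    name = item.findtext("tei:persName", default="", namespaces=TEI_NS).strip()
    abbrs = [a.text for a in item.findall("tei:bibl/tei:abbr", TEI_NS) if a.text]
    entries.append((name, abbrs))

print(len(plain), entries)
```

If the file has no namespace declaration, the plain element names work as they are and the mapping is not needed.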

Tanveer · Accepted Answer · 2021-02-27 12:43:41Z

0

I tried the same approach as you and ran into the same problem. I had no option but to convert the whole XML into pretty-printed XML and treat it as a single string, then iterate over each line to look for a specific tag.

import xml.dom.minidom

dom = xml.dom.minidom.parse("bibliography.xml")
pretty_xml = dom.toprettyxml().split("\n")
start, end = [], []  # line indices of the opening and closing "item" tags

for idx, line in enumerate(pretty_xml):
    if "item" in line:
        if "/" not in line:
            start.append(idx)
        else:
            end.append(idx)

Now you know that your first data point lies between start[0] and end[0]. Likewise, iterate over all elements of both lists in sequence with "if" conditions. The structure would be somewhat like this (I am not writing the whole code):

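for idx in range(len(start)):
    for line in pretty_xml[start[idx] + 1 : end[idx]]:
        # Split only lines that contain the tag; any other line would raise IndexError
        if "persName" in line:
            name = line.split("persName")[1].replace("<", "").replace(">", "").replace("/", "")
        ...
        ...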

(If you find a better structured approach, do let me know.)
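One more structured possibility, staying with xml.dom.minidom, is to query the DOM directly instead of splitting the pretty-printed text. This is only a sketch under assumptions: the inline sample is a shortened stand-in for bibliography.xml, and node_text and authors_and_abbreviations are hypothetical helper names, not part of minidom.

```python
import xml.dom.minidom

# Shortened stand-in for the file; a real script would call
# xml.dom.minidom.parse("bibliography.xml") instead of parseString.
SAMPLE = """<list type="index">
<item><persName>Poe, Edgar Allan</persName>, <bibl>The Raven 1845 (<abbr>Rav.</abbr>).</bibl></item>
<item><persName>Barth, John</persName>, <bibl>Giles Goat-Boy 1960 (<abbr>Gil.</abbr>)</bibl></item>
</list>"""


def node_text(node):
    """Concatenate the text of all text nodes below a DOM node (hypothetical helper)."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(node_text(child))
    return "".join(parts)


def authors_and_abbreviations(dom):
    """Return one (author, [abbreviations]) pair per <item> (hypothetical helper)."""
    results = []
    for item in dom.getElementsByTagName("item"):
        names = item.getElementsByTagName("persName")
        author = node_text(names[0]).strip() if names else ""
        abbrs = [node_text(a).strip() for a in item.getElementsByTagName("abbr")]
        results.append((author, abbrs))
    return results


dom = xml.dom.minidom.parseString(SAMPLE)
for author, abbrs in authors_and_abbreviations(dom):
    print(author, " ".join(abbrs))
```

Because getElementsByTagName works on the parsed tree, the result does not depend on how the XML happens to be broken across lines.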



3 Comments

SparrowSilencio Over a year ago

Thanks a lot for your answer. I tried it, but I got "IndexError: list index out of range". There's a possible solution for my problem elsewhere on Stack Overflow (stackoverflow.com/questions/37619848/…), but I still can't make it work.

Tanveer Over a year ago

Are you getting the error in the first part or the second part of the code snippet that I shared? Were you able to populate the "start" and "end" lists?

SparrowSilencio Over a year ago

The first part points to dom = xml.dom.minidom.parse("bibliography.xml"); the second part gives the error already mentioned above.

Greg · Accepted Answer · 2021-02-28 22:26:46Z

0

Consider using XPath to get the data. Simply call tree.xpath("//item") to return all items.

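Below is a working example based on the XML snippet. Note that tree.getroot() only applies when the whole file is parsed with etree.parse(), which returns an ElementTree; etree.fromstring() already returns the root element, so the example does not call getroot().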

Basic working example:

import lxml.etree as etree

xml = '''<list type="index">
            <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                    Cat 1843 <abbr>(Cat.).</abbr></bibl> — <bibl>The Gold-Bug 1843
                        <abbr>(Bug.)</abbr>.</bibl> — <bibl>The Raven 1845
                    <abbr>(Rav.)</abbr>.</bibl></item>

            <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                        (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                    (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                        (<abbr>PolyL.</abbr>)</bibl></item>
            
            <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                    Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                        (<abbr>Gil.</abbr>)</bibl></item>
        </list>
'''
tree = etree.fromstring(xml)
#root = tree.getroot()

for work in tree.xpath("//item"):
    # First <persName> child of this <item>
    persName = work.find('persName').text.strip()
    # Text of every <abbr> nested in a <bibl> of this <item>
    abbr = ' '.join([x.text for x in work.xpath('bibl/abbr')])
    print(f'{persName} {abbr}')

Output:

Poe, Edgar Allan (Cat.). (Bug.) (Rav.)
Melville, Herman Ben. MobD. PolyL.
Barth, John Fac. Gil.
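The same loop can also read from a file rather than an inline string. The sketch below swaps lxml for the standard library's xml.etree.ElementTree, which covers the simple paths used here. To stay self-contained it writes a shortened sample to a temporary file first; the function name read_items is hypothetical, and a real script would just pass the path to the existing bibliography.xml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened sample standing in for the real bibliography.xml.
SAMPLE = """<list type="index">
<item><persName>Poe, Edgar Allan</persName>, <bibl>The Black Cat 1843 (<abbr>Cat.</abbr>).</bibl> <bibl>The Raven 1845 (<abbr>Rav.</abbr>).</bibl></item>
<item><persName>Melville, Herman</persName>, <bibl>Moby-Dick 1851 (<abbr>MobD.</abbr>)</bibl></item>
</list>"""


def read_items(path):
    """Parse the file at `path` and return one (name, [abbreviations]) pair per <item>."""
    tree = ET.parse(path)   # ET.parse returns an ElementTree object
    root = tree.getroot()   # top-level element, <list> in this sample
    rows = []
    for item in root.iter("item"):
        name = item.findtext("persName", default="").strip()
        abbrs = [a.text.strip() for a in item.findall("bibl/abbr") if a.text]
        rows.append((name, abbrs))
    return rows


# Write the sample to a temporary file only so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bibliography.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    rows = read_items(path)

for name, abbrs in rows:
    print(name, " ".join(abbrs))
```

Collecting the abbreviations inside the loop over each item keeps every name paired with its own works.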

edited Feb 28, 2021 at 22:26

answered Feb 27, 2021 at 13:11


7 Comments

SparrowSilencio Over a year ago

Thanks a lot, that worked. But only if there's just one <persName> as an author; if there are more <persName> elements, each in its own <item> as in my case, it prints just one name followed by all the <abbr> values, without printing the related names. I'm also wondering whether I have to put the whole XML data into the script, or whether I can just point to the XML file from within the script.

Greg Over a year ago

You can probably replace work.find('persName') with work.xpath('//persName') or work.findall('persName') and perform a for-each loop on the results. If you supply an XML example, I can update the answer.

Greg Over a year ago

Depending on the XML, you may be able to do for work in tree.xpath("//persName"):

SparrowSilencio Over a year ago

Thanks for the reply. If I replace it with each of your suggestions, I get the response "AttributeError: 'list' object has no attribute 'text'". I'll give more XML data in my post above; I just edited it. (P.S.: the list goes on and on in the same way, about 500 bibliographical entries.)

Greg Over a year ago

Running the XML in your question still gives the correct results. The error "AttributeError: 'list' object has no attribute 'text'" most likely means you're calling .text on a list (and not on an item).
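To illustrate that error: findall (like lxml's xpath) returns a list of elements, so .text has to be read from each element in turn. A small sketch with the standard library follows; the two-author entry and its names are hypothetical, made up only to show the loop. Separately, in lxml a path that starts with // searches the whole document even when called on a single item, whereas .// stays inside the current element.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry with two <persName> elements in one <item>.
SAMPLE = """<list>
<item><persName>Author, First</persName> and <persName>Author, Second</persName>, <bibl>Some Title (<abbr>ST.</abbr>)</bibl></item>
</list>"""

root = ET.fromstring(SAMPLE)
item = root.find("item")

# findall returns a list; calling matches.text would raise
# AttributeError: 'list' object has no attribute 'text'
matches = item.findall(".//persName")

# Read .text from each element of the list instead.
names = [m.text.strip() for m in matches if m.text]

print("; ".join(names))
```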
