0

I have a huge list of different authors and their selected works in a <list> in XML (namend bibliography.xml). Here is an example:

<list type="index">
                <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                        Cat 1843 (<abbr>Cat.</abbr>).</bibl> — <bibl>The Gold-Bug 1843
                            (<abbr>Bug.</abbr>).</bibl> — <bibl>The Raven 1845
                        (<abbr>Rav.</abbr>).</bibl></item>

                <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                            (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                        (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                            (<abbr>PolyL.</abbr>)</bibl></item>
                
                <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                        Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                            (<abbr>Gil.</abbr>)</bibl></item>
            </list>
import xml.etree.ElementTree as ET

tree = ET.parse('bibliography.xml')
root = tree.getroot()

for work in root:
    if(work.tag=='item'):
        print work.get('persName')
            if (attr.tag=='abbr')
                print (attr.text)

obviously it's not working, but since I'm absolutely new to python, I can't wrap my mind around about what I'm doing wrong. Would be highly appreciated if someone could help me out here.

1
  • Okay, that's weird beacuse Oxygen and some other validators are fine with the XML. Keep in mind that I just posted a snippet of the <list>, not the whole TEI-Header, body etc. Commented Feb 27, 2021 at 12:33

2 Answers 2

0

Even I tried the same way as you did and landed up in the same problem. I had no option but to convert the whole xml into pretty-xml, and treat it as a single string. Then iterate each line to for a specific tag.

import xml.dom.minidom

dom = xml.dom.minidom.parse("bibliography.xml")
pretty_xml = dom.toprettyxml()
pretty_xml = pretty_xml.split("\n")
start, end = [], [] # store the beginning and the end of "item" tag

for idx in range(len(pretty_xml)):
        if "item" in pretty_xml[idx]:
            if "/" not in pretty_xml[idx]:
                start.append(idx)
            else:
                end.append(idx)

Now you know that between start[0] and end[0] you have your first data point available. Like wise iterate for all elements of both list sequentially with "if" conditions, the structure would be somewhat like this (I am not writing the whole code):

for idx in range(len(start)):
    for line in pretty_xml[start[idx] + 1 : end[idx]]:
        line.split("persName")[1].replace("<","").replace(">","").replace("/","")
         ...
         ...

(If you find a better structured approach, do let me know.)

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks a lot for your answer. I tried it but I got a response saying: >IndexError: list index out of range There's some kind of solution for my problem also here on stackoverflow (stackoverflow.com/questions/37619848/…) but it still can't make it work.
Are you getting error in the first part or the second part of the code snippet (that I shared) ? Were you able to populate "start" and "end" lists?
first part says "dom = xml.dom.minidom.parse("bibliography.xml")" second part says the already mentioned above
0

Consider using XPath to get the data. Simply call tree.xpath("//item") to return all items.

Below is a working example based on XML snippet. tree.getroot() will only work depending on full xml.

Basic working example:

import lxml.etree as etree

xml = '''<list type="index">
            <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                    Cat 1843 <abbr>(Cat.).</abbr></bibl> — <bibl>The Gold-Bug 1843
                        <abbr>(Bug.)</abbr>.</bibl> — <bibl>The Raven 1845
                    <abbr>(Rav.)</abbr>.</bibl></item>

            <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                        (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                    (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                        (<abbr>PolyL.</abbr>)</bibl></item>
            
            <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                    Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                        (<abbr>Gil.</abbr>)</bibl></item>
        </list>
'''
tree = etree.fromstring(xml)
#root = tree.getroot()

for work in tree.xpath("//item"):
    persName = work.find('persName').text.strip()
    abbr =' '.join([x.text for x in work.xpath('bibl/abbr')])
    print (f'{persName} {abbr}')

Output:

Poe, Edgar Allan (Cat.). (Bug.) (Rav.)
Melville, Herman Ben. MobD. PolyL.
Barth, John Fac. Gil.

7 Comments

Thanks a lot, that worked. But only if there's just one <persName> as an author, if there are more <persName>'s, each in one <item> as in my case, it will just print one name followed by all <abbr>'s, without printing the related name. And I'm wondering if I have to put the whole xml-data into that script as well or if I can just link to the xml-file within the script?
You can probably replace work.find('persName') with work.xpath('//persName') or work.findall('persName') and preform for each loop on results. If you supply XML example, then I can update answer.
Depending on XML, you may be able to do for work in tree.xpath("//persName"):
Thanks for the reply. If I replace it with each of your suggestions, I get the response „AttributeError: 'list' object has no attribute 'text'“ I'll give more xml-data in my post above, I just edited it P.S.: the list goes on and one in the same way, about 500 bibliographical entries)
running the xml in your question still achieved the correct results. The error „AttributeError: 'list' object has no attribute 'text' - it's most like you're calling .text on a list (and not an item).
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.