I recently asked a question that was closed, so I am trying to make it less broad. My issue is that I don't know where to begin with the problem, so I can't really show what I have already tried. I have been unable to find anything online that helps.

I have an open source XML file that follows this format:

<surnames>
    <cluster>
        <surname lang="ga" text="Achaorainn" anchor="Achaorainn"/>
        <surname lang="en" text="Ahern" anchor="Ahern"/>
        <surname lang="en" text="Aherne" anchor="Aherne"/>
        <surname lang="en" text="Ahearne" anchor="Ahearne"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Achison" anchor="Achison"/>
        <surname lang="en" text="Atchison" anchor="Atchison"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Adams" anchor="Adams"/>
        <surname lang="ga" text="Mac Conamha" anchor="Conamha"/>
    </cluster>
    <cluster>
        <surname lang="ga" text="Ághas" anchor="Ághas"/>
        <surname lang="en" text="Ashe" anchor="Ashe"/>
        <surname lang="ga" text="Ás" anchor="Ás"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Young" anchor="Young"/>
        <surname lang="ga" text="Ó Hógáin" anchor="Hógáin"/>
        <surname lang="ga" text="de Siún" anchor="Siún"/>
    </cluster>
</surnames>

Essentially I want this to be converted to a CSV file that looks like this, splitting each cluster into a row:

Achaorainn,Ahern,Aherne,Ahearne
Achison,Atchison
Adams,Mac Conamha

I have never tried anything like this, so even pointing me in the right direction would be a massive help.

I thought about converting it to a DataFrame and then to CSV.

I tried this as a starting point, but I can't even get it to work; I think it fails at the objectify.parse stage:

import csv
import pandas as pd
import xml.etree.ElementTree as ET

#%%

xml = objectify.parse('surnames_reduced.xml')
root = xml.getroot()

data=[]
for i in range(len(root.getchildren())):
    data.append([child.text for child in root.getchildren()[i].getchildren()])

df = pd.DataFrame(data).T

2 Answers

Using lxml.etree and collecting the names as a list of lists (which can be written to CSV directly):

import lxml.etree
import csv

#  xml = lxml.etree.parse('z.xml')
xml = lxml.etree.fromstring(open('z.xml').read())  # in case there is no XML declaration!
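result = []
for cluster in xml.xpath('//cluster'):
    names = []
    for child in cluster:  # iterate directly; getchildren() is deprecated
        names.append(child.get('text'))  # reads the "text" attribute
    result.append(names)

# newline="" is what the csv module expects; utf-8 keeps names like "Ághas" intact
with open("out.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(result)

print(open('out.csv', encoding="utf-8").read())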

Output:

Achaorainn,Ahern,Aherne,Ahearne
Achison,Atchison
Adams,Mac Conamha
Ághas,Ashe,Ás
Young,Ó Hógáin,de Siún

3 Comments

I get a parse error on line 4. My XML file looks exactly as I have pasted it above; do I need an outer object for it?
Your XML file is not valid, the XML header is missing. I assumed it got lost while pasting here. Something like this: <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
Yes, it works perfectly. Thanks a lot; as you can see, my XML experience is non-existent.
Using Python's built-in XML library (no external library is needed):

import xml.etree.ElementTree as ET

xml = '''<surnames>
    <cluster>
        <surname lang="ga" text="Achaorainn" anchor="Achaorainn"/>
        <surname lang="en" text="Ahern" anchor="Ahern"/>
        <surname lang="en" text="Aherne" anchor="Aherne"/>
        <surname lang="en" text="Ahearne" anchor="Ahearne"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Achison" anchor="Achison"/>
        <surname lang="en" text="Atchison" anchor="Atchison"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Adams" anchor="Adams"/>
        <surname lang="ga" text="Mac Conamha" anchor="Conamha"/>
    </cluster>
    <cluster>
        <surname lang="ga" text="Ághas" anchor="Ághas"/>
        <surname lang="en" text="Ashe" anchor="Ashe"/>
        <surname lang="ga" text="Ás" anchor="Ás"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Young" anchor="Young"/>
        <surname lang="ga" text="Ó Hógáin" anchor="Hógáin"/>
        <surname lang="ga" text="de Siún" anchor="Siún"/>
    </cluster>
</surnames>'''

root = ET.fromstring(xml)
data = []
for c in root.findall('.//cluster'):
    data.append([s.attrib['text'] for s in c.findall('./surname')])
for entry in data:
    print(','.join(entry))

Output:

Achaorainn,Ahern,Aherne,Ahearne
Achison,Atchison
Adams,Mac Conamha
Ághas,Ashe,Ás
Young,Ó Hógáin,de Siún
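
The loop above prints comma-joined rows. If an actual CSV file is wanted, a minimal sketch along the same lines could read the XML from disk and hand the rows to csv.writer, which also takes care of quoting should a name ever contain a comma. The function names clusters_to_rows and xml_to_csv are made up for illustration, and the sketch assumes the file is UTF-8 encoded and shaped like the sample in the question:

```python
import csv
import xml.etree.ElementTree as ET


def clusters_to_rows(root):
    """Return one list of 'text' attribute values per <cluster> under root."""
    return [
        [surname.get("text") for surname in cluster.findall("surname")]
        for cluster in root.findall("cluster")
    ]


def xml_to_csv(xml_path, csv_path):
    """Write one CSV row per <cluster> (hypothetical helper, not from the answer)."""
    # ET.parse reads the file as bytes; with no XML declaration the parser
    # assumes UTF-8, which is an assumption about the input file here.
    root = ET.parse(xml_path).getroot()
    # newline="" is what the csv module expects when opening the output file
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(clusters_to_rows(root))
```

It could then be called as xml_to_csv("surnames_reduced.xml", "out.csv"), using the file name from the question.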
