XML file to CSV via python dataframe

Question

I recently asked a question that got closed, so I am trying to make it less broad. My issue is that I don't know where to begin with the problem, so I can't really show what I have 'already tried'. I have been unable to find anything online that has helped.

I have an open source XML file that follows this format:

<surnames>
    <cluster>
        <surname lang="ga" text="Achaorainn" anchor="Achaorainn"/>
        <surname lang="en" text="Ahern" anchor="Ahern"/>
        <surname lang="en" text="Aherne" anchor="Aherne"/>
        <surname lang="en" text="Ahearne" anchor="Ahearne"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Achison" anchor="Achison"/>
        <surname lang="en" text="Atchison" anchor="Atchison"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Adams" anchor="Adams"/>
        <surname lang="ga" text="Mac Conamha" anchor="Conamha"/>
    </cluster>
    <cluster>
        <surname lang="ga" text="Ághas" anchor="Ághas"/>
        <surname lang="en" text="Ashe" anchor="Ashe"/>
        <surname lang="ga" text="Ás" anchor="Ás"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Young" anchor="Young"/>
        <surname lang="ga" text="Ó Hógáin" anchor="Hógáin"/>
        <surname lang="ga" text="de Siún" anchor="Siún"/>
    </cluster>
</surnames>

Essentially I want this to be converted to a CSV file that looks like this, splitting each cluster into a row:

Achaorainn,Ahern,Aherne,Ahearne
Achison,Atchison
Adams,Mac Conamha

I have never tried anything like this so even just pointing me in the right direction would be a massive help.

I thought about converting to dataframe and then to CSV.

I tried this as a starting point but I can't even get it to work as I think it fails at the objectify.parse stage:

import csv
import pandas as pd
import xml.etree.ElementTree as ET

#%%

xml = objectify.parse('surnames_reduced.xml')
root = xml.getroot()

data=[]
for i in range(len(root.getchildren())):
    data.append([child.text for child in root.getchildren()[i].getchildren()])

df = pd.DataFrame(data).T

Maurice Meyer · Accepted Answer · 2020-05-27 12:41:58Z


Using lxml.etree, saving the result as a list of lists (which can be written to CSV directly):

import csv
import lxml.etree

#  xml = lxml.etree.parse('z.xml')
with open('z.xml', encoding='utf-8') as f:
    xml = lxml.etree.fromstring(f.read())  # in case there is no XML declaration!

result = []
for cluster in xml.xpath('//cluster'):
    # one CSV row per cluster: the "text" attribute of each <surname>
    result.append([child.get('text') for child in cluster.findall('surname')])

# newline='' keeps the csv module from emitting blank rows on Windows
with open('out.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(result)

with open('out.csv', encoding='utf-8') as f:
    print(f.read())

Output:

Achaorainn,Ahern,Aherne,Ahearne
Achison,Atchison
Adams,Mac Conamha
Ághas,Ashe,Ás
Young,Ó Hógáin,de Siún

edited May 27, 2020 at 12:41

answered May 27, 2020 at 12:07

Maurice Meyer


3 Comments

nimgwfc Over a year ago

I get a parse error on line 4. My XML file looks exactly as I have pasted above, do I need an outer object for it?

Maurice Meyer Over a year ago

Your XML file is not valid; the XML header is missing. I assumed it got lost while pasting here. It should be something like this: <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

nimgwfc Over a year ago

Yes, it works perfectly. Thanks a lot. As you can see, my XML experience is non-existent.
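
For readers who hit a similar parse error, here is a minimal sketch of one way to find out where the parser gives up. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, and it parses raw bytes so that the parser handles the encoding itself. The helper name parse_surnames and the two sample documents are illustrative assumptions, not part of the original thread. As far as the XML 1.0 specification goes, the declaration is optional, so both samples parse here; a parse error on a real file may therefore have another cause, such as an encoding other than UTF-8.

```python
import xml.etree.ElementTree as ET


def parse_surnames(raw: bytes) -> ET.Element:
    """Parse raw XML bytes and report where parsing failed, if it did.

    Hypothetical helper for illustration only.
    """
    try:
        return ET.fromstring(raw)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple
        line, column = err.position
        raise ValueError(
            f"XML parse error at line {line}, column {column}: {err}"
        ) from err


# Two made-up sample documents: one with an XML declaration, one without.
BODY = '<surnames><cluster><surname text="Ághas"/><surname text="Ashe"/></cluster></surnames>'
with_declaration = ('<?xml version="1.0" encoding="UTF-8"?>\n' + BODY).encode("utf-8")
without_declaration = BODY.encode("utf-8")

for raw in (with_declaration, without_declaration):
    root = parse_surnames(raw)
    print([s.get("text") for s in root.iter("surname")])
```

Passing a deliberately broken document, for example one with a missing closing tag, raises a ValueError whose message carries the line and column, which is usually enough to locate the offending spot in the file.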

balderman · Accepted Answer · 2020-06-30 12:31:05Z

Using Python's built-in XML library (no external library is needed):

import xml.etree.ElementTree as ET

xml = '''<surnames>
    <cluster>
        <surname lang="ga" text="Achaorainn" anchor="Achaorainn"/>
        <surname lang="en" text="Ahern" anchor="Ahern"/>
        <surname lang="en" text="Aherne" anchor="Aherne"/>
        <surname lang="en" text="Ahearne" anchor="Ahearne"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Achison" anchor="Achison"/>
        <surname lang="en" text="Atchison" anchor="Atchison"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Adams" anchor="Adams"/>
        <surname lang="ga" text="Mac Conamha" anchor="Conamha"/>
    </cluster>
    <cluster>
        <surname lang="ga" text="Ághas" anchor="Ághas"/>
        <surname lang="en" text="Ashe" anchor="Ashe"/>
        <surname lang="ga" text="Ás" anchor="Ás"/>
    </cluster>
    <cluster>
        <surname lang="en" text="Young" anchor="Young"/>
        <surname lang="ga" text="Ó Hógáin" anchor="Hógáin"/>
        <surname lang="ga" text="de Siún" anchor="Siún"/>
    </cluster>
</surnames>'''

root = ET.fromstring(xml)
data = []
for c in root.findall('.//cluster'):
    data.append([s.attrib['text'] for s in c.findall('./surname')])
for entry in data:
    print(','.join(entry))

Output:

Achaorainn,Ahern,Aherne,Ahearne
Achison,Atchison
Adams,Mac Conamha
Ághas,Ashe,Ás
Young,Ó Hógáin,de Siún
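
The snippet above prints the rows rather than writing a file. A minimal sketch of the same approach reading from a file and writing a CSV with the standard csv module might look like the following. The function name clusters_to_csv and the file names in the usage note are assumptions for illustration.

```python
import csv
import xml.etree.ElementTree as ET


def clusters_to_csv(xml_path, csv_path):
    """Write one CSV row per <cluster>, using each <surname>'s "text" attribute.

    Hypothetical helper; returns the rows so the caller can inspect them.
    """
    root = ET.parse(xml_path).getroot()
    rows = [
        [surname.attrib["text"] for surname in cluster.findall("./surname")]
        for cluster in root.findall(".//cluster")
    ]
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return rows
```

It could be called as clusters_to_csv('surnames_reduced.xml', 'out.csv'). Because the clusters have different numbers of surnames, the rows have different lengths; the csv module writes them as they are, so no DataFrame is needed for this step.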
