I have an XML file that I need to read into a pandas DataFrame in Python. This is part of the XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data id="root_661191">
        <index id="data_162062">
            <item id="index_829361_1">173915</item>
            <item id="index_829361_2">14712</item>
            <item id="index_829361_3">321255</item>
        </index>
        <property_id id="data_809625">
            <item id="property_id_844926_1">88942.0</item>
            <item id="property_id_844926_2">88162.0</item>
            <item id="property_id_844926_3">80553.0</item>
        </property_id>
        <addr_street id="data_409265">
            <item id="addr_street_959977_1">58 Middleton Street</item>
            <item id="addr_street_959977_2">24 Royena Road</item>
            <item id="addr_street_959977_3">9 Cafardi Boulevard</item>
        </addr_street>
        <price id="data_784942">
           <item id="price_225606_1">7480000.0</item>
           <item id="price_225606_2">7728000.0</item>
           <item id="price_225606_3">7659000.0</item> 
        </price>
    </data>
</root>

I tested my read function on some simpler sample data and it works. But when I run it on this XML file, the output contains only None. I think the column names might be the problem, but I don't know how to fix it. Could anyone help? The function I used is:

import pandas as pd
import xml.etree.ElementTree as et

def parse_xml(xml_file, df_cols): 
    
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    
    for node in xroot: 
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})
    
    out_df = pd.DataFrame(rows, columns=df_cols)
        
    return out_df

df_cols = ['index','property_id','addr_street','price']
parse_xml('myxmlfile.xml', df_cols)

1 Answer

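I think this is what you want. In this file each column is its own element under <data>, and the values sit in its <item> children, so the code collects the items one column at a time and then concatenates the columns. You can wrap this in a function if you need to.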

import pandas as pd
import xml.etree.ElementTree as et

tree = et.parse('myxmlfile.xml')
root = tree.getroot()

df_cols = ['index', 'property_id', 'addr_street', 'price']
mlist = []
for col in df_cols:
    lst = []  # reset for every column so values never leak between columns
    for d in root.findall('data'):
        for c in d.findall(col):
            # each column element holds its values in <item> children
            for itm in c.findall('item'):
                lst.append(itm.text)
    mlist.append(pd.DataFrame({col: lst}))

# place the one-column frames side by side
pd.concat(mlist, axis=1)

Output:

    index property_id          addr_street      price
0  173915     88942.0  58 Middleton Street  7480000.0
1   14712     88162.0       24 Royena Road  7728000.0
2  321255     80553.0  9 Cafardi Boulevard  7659000.0
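
As for why the original function only gave None: it loops over the children of the root, so node is the single <data> element. node.attrib.get('index') looks for an XML attribute called index, which <data> does not have, and node.find(el).text returns the whitespace before the first <item> rather than the values themselves. Below is a minimal sketch of the same approach wrapped in a function. The name parse_column_xml and the small inline sample are made up for illustration, and it assumes there is a single <data> element under the root, as in the file shown in the question.

```python
import io
import xml.etree.ElementTree as et

import pandas as pd


def parse_column_xml(xml_source, df_cols):
    """Build a DataFrame from XML where each column element holds <item> children.

    Hypothetical helper: assumes one <data> element directly under the root.
    """
    root = et.parse(xml_source).getroot()
    data = root.find('data')
    if data is None:
        raise ValueError("no <data> element found under the root")

    columns = {}
    for col in df_cols:
        col_node = data.find(col)
        # A missing column element yields an empty column instead of an error
        items = [] if col_node is None else col_node.findall('item')
        columns[col] = pd.Series([itm.text for itm in items], dtype=object)

    # Series of unequal length are aligned on the index and padded with NaN
    return pd.DataFrame(columns)


# Shortened sample in the same layout as the file from the question
sample = b"""<root>
    <data id="root_661191">
        <index id="data_162062">
            <item id="index_829361_1">173915</item>
            <item id="index_829361_2">14712</item>
        </index>
        <price id="data_784942">
            <item id="price_225606_1">7480000.0</item>
            <item id="price_225606_2">7728000.0</item>
        </price>
    </data>
</root>"""

df = parse_column_xml(io.BytesIO(sample), ['index', 'price'])
print(df)
```

For the real file, pass the path instead of the in-memory sample, e.g. parse_column_xml('myxmlfile.xml', df_cols). All values come back as strings, so numeric columns may still need converting, for example with pd.to_numeric.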