Python : Flatten xml to csv with nested child tags

Question

There are multiple XML files that I would like to flatten, and I am looking for a generic function or logic to convert the XML to a flat file. Most of the answers I found use hard-coded tags. The closest one is "Python : Flatten xml to csv with parent tag repeated in child", but it still relies on a hard-coded solution. For the input XML below:

<root> 
    <child> child-val </child>
    <child2> child2-val2 </child2>
    <anotherchild>
        <childid> another child 45</childid>
        <childname> another child name </childname>
    </anotherchild>
    <group> 
        <groupid> groupid-123</groupid>
        <grouplist>
            <groupzone>
                <groupname>first </groupname>
                <groupsize> 4</groupsize>
            </groupzone>
            <groupzone>
                <groupname>second </groupname>
                <groupsize> 6</groupsize>
            </groupzone>
            <groupzone>
                <groupname> third </groupname>
                <groupsize> 8 </groupsize>
            </groupzone>
        </grouplist>
    </group>
    <secondgroup> 
        <secondgroupid> secondgroupid-42 </secondgroupid>
        <secondgrouptitle> second group title </secondgrouptitle>
        <secondgrouplist>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v1</secondsub>
                    <secondsubid>12</secondsubid>
                </secondgroupsub>
                <secondgroupname> third </secondgroupname>
                <secondgroupsize> 4</secondgroupsize>
            </secondgroupzone>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v2</secondsub>
                    <secondsubid>1</secondsubid>
                </secondgroupsub>
                <secondgroupname>fourth </secondgroupname>
                <secondgroupsize> 6</secondgroupsize>
            </secondgroupzone>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v3</secondsub>
                    <secondsubid>45</secondsubid>
                </secondgroupsub>
                <secondgroupname> tenth </secondgroupname>
                <secondgroupsize> 10 </secondgroupsize>
            </secondgroupzone>
        </secondgrouplist>
    </secondgroup>
    <child3> val3 </child3>
</root>

I tried the pandas-read-xml package and got most of the values, but the anotherchild tag values show up in a single column (anotherchild) instead of anotherchild|childid and anotherchild|childname. If possible, please suggest generic logic to convert an XML file to a flat file.

import pandas_read_xml as pdx

df = pdx.read_xml(xml_content, ['root'])
fully_flatten_df = pdx.fully_flatten(df)
fully_flatten_df.to_csv("stack.csv", index=False)

Output csv

anotherchild,child,child2,child3,group|groupzone|groupname,group|groupzone|groupsize,secondgroup|secondgroupzone|secondgroupname,secondgroup|secondgroupzone|secondgroupsize,secondgroup|secondgroupzone|secondgroupsub|secondsub,secondgroup|secondgroupzone|secondgroupsub|secondsubid
,child-val,child2-val2,val3,,,third,4,v1,12
,child-val,child2-val2,val3,,,fourth,6,v2,1
,child-val,child2-val2,val3,,,tenth,10,v3,45
,child-val,child2-val2,val3,first,4,,,,
,child-val,child2-val2,val3,second,6,,,,
,child-val,child2-val2,val3,third,8,,,,
another child 45,child-val,child2-val2,val3,,,,,,
another child name,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,

Kindly post your expected output dataframe. Also, you could use Python's lxml module or the ET module.
– sammywemmy, Commented Apr 7, 2021 at 13:40

Bogdan Ariton · Accepted Answer · 2021-04-07 21:42:13Z

Normally, the XML nodes that hold a value should become the columns. In your XML example, "child", "child2", "childid", and so on should be columns.
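To make that idea concrete, here is a minimal sketch that only collects the candidate column names, that is, the tag paths of nodes without children. The helper name leaf_paths and the shortened sample string are assumptions made for illustration; they are not part of the question or of any library.

```python
import xml.etree.ElementTree as ET

def leaf_paths(element, prefix=""):
    """Return the '|'-joined tag paths of all childless nodes, in document order, without duplicates."""
    path = f"{prefix}|{element.tag}" if prefix else element.tag
    if len(element) == 0:  # no children: this node holds a value
        return [path]
    paths = []
    for child in element:
        for p in leaf_paths(child, path):
            if p not in paths:  # repeated nodes (e.g. groupzone) map to one column
                paths.append(p)
    return paths

# Shortened version of the XML from the question (hypothetical sample).
sample = """<root>
    <child> child-val </child>
    <anotherchild>
        <childid> another child 45</childid>
        <childname> another child name </childname>
    </anotherchild>
    <group>
        <grouplist>
            <groupzone><groupname>first </groupname></groupzone>
            <groupzone><groupname>second </groupname></groupzone>
        </grouplist>
    </group>
</root>"""

columns = leaf_paths(ET.fromstring(sample))
print(columns)
```

The repeated groupzone nodes collapse into a single column name here; their values are what end up on separate rows.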

Based on the XML in your question, I wrote this piece of code, which should be sufficiently generic to accommodate similar examples.

import pandas as pd
import xml.etree.ElementTree as Xet

def getData(root, rows, columns, rowcount, name=None):
    # Build the column prefix from the path of parent tags, so that nodes with the
    # same name under different parents do not end up in the same column.
    # For example: a node named "name" could be under group and secondgroup.
    if name is not None:
        name = "{0}|{1}".format(name, root.tag)
    else:
        name = root.tag

    for item in root:
        if len(item) == 0: # a node without children holds a value
            colName = "{0}|{1}".format(name, item.tag)
            # colName = item.tag # uncomment this line to use only the tag as the column name; ex: groupsize instead of root|group|grouplist|groupzone|groupsize
            value = (item.text or "").strip() # item.text is None for an empty tag
            if colName not in columns:
                columns.append(colName) # save the column to a list
                rowcount.append(0) # the first value of a column always goes on row 0
                rows[0][colName] = value
            else:
                repeatPosition = columns.index(colName) # get the column position for the repeated item
                rowcount[repeatPosition] += 1 # move this column to its next row
                if len(rows) <= rowcount[repeatPosition]:
                    rows.append({}) # add a new row when the column outgrows the existing rows
                rows[rowcount[repeatPosition]][colName] = value # add the value on the new row

        getData(item, rows, columns, rowcount, name) # recursive call to walk through each list of elements


xmlParse = Xet.parse('example.xml')
root = xmlParse.getroot()

rows = [{}] # start with one row; additional rows are added as we go along
columns = [] # holds the names of the columns
rowcount = [] # for each column, the index of the row that received its latest value
getData(root, rows, columns, rowcount)

df = pd.DataFrame(rows, columns=columns)
print(df)
df.to_csv('parse.csv')

The end result after running this code looks like this (image: csv result).

And this is the plain csv:

,root|child,root|child2,root|anotherchild|childid,root|anotherchild|childname,root|group|groupid,root|group|grouplist|groupzone|groupname,root|group|grouplist|groupzone|groupsize,root|secondgroup|secondgroupid,root|secondgroup|secondgrouptitle,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsub,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsubid,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupname,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsize,root|child3
0,child-val,child2-val2,another child 45,another child name,groupid-123,first,4,secondgroupid-42,second group title,v1,12,third,4,val3
1,,,,,,second,6,,,v2,1,fourth,6,
2,,,,,,third,8,,,v3,45,tenth,10,
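If you would rather have the single-valued columns repeated on every row instead of left blank, one option is to forward-fill the frame. This is only a sketch on an abbreviated copy of the CSV above (four of its columns), and it assumes that a blank cell always means "same as the row above". That assumption does not hold when one repeated group simply has fewer entries than another.

```python
import io
import pandas as pd

# Abbreviated copy of the CSV shown above (only four of its columns).
csv_text = """,root|child,root|group|groupid,root|group|grouplist|groupzone|groupname,root|group|grouplist|groupzone|groupsize
0,child-val,groupid-123,first,4
1,,,second,6
2,,,third,8
"""

# dtype=str keeps every value as text; empty cells are read as missing values.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype=str)

# Forward-fill copies the last seen value of each column down into the blank cells.
filled = df.ffill()
print(filled)
```

The same call can be applied to the df built by the code above before writing it out with to_csv.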

Hopefully this should get you started in the right direction.
