I have multiple XML files that I would like to flatten, and I am looking for a generic function or logic to convert XML to a flat file. Most of the answers I found rely on hard-coded tags. The closest one is "Python : Flatten xml to csv with parent tag repeated in child", but it still uses a hard-coded solution. For the input XML below:

<root> 
    <child> child-val </child>
    <child2> child2-val2 </child2>
    <anotherchild>
        <childid> another child 45</childid>
        <childname> another child name </childname>
    </anotherchild>
    <group> 
        <groupid> groupid-123</groupid>
        <grouplist>
            <groupzone>
                <groupname>first </groupname>
                <groupsize> 4</groupsize>
            </groupzone>
            <groupzone>
                <groupname>second </groupname>
                <groupsize> 6</groupsize>
            </groupzone>
            <groupzone>
                <groupname> third </groupname>
                <groupsize> 8 </groupsize>
            </groupzone>
        </grouplist>
    </group>
    <secondgroup> 
        <secondgroupid> secondgroupid-42 </secondgroupid>
        <secondgrouptitle> second group title </secondgrouptitle>
        <secondgrouplist>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v1</secondsub>
                    <secondsubid>12</secondsubid>
                </secondgroupsub>
                <secondgroupname> third </secondgroupname>
                <secondgroupsize> 4</secondgroupsize>
            </secondgroupzone>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v2</secondsub>
                    <secondsubid>1</secondsubid>
                </secondgroupsub>
                <secondgroupname>fourth </secondgroupname>
                <secondgroupsize> 6</secondgroupsize>
            </secondgroupzone>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v3</secondsub>
                    <secondsubid>45</secondsubid>
                </secondgroupsub>
                <secondgroupname> tenth </secondgroupname>
                <secondgroupsize> 10 </secondgroupsize>
            </secondgroupzone>
        </secondgrouplist>
    </secondgroup>
    <child3> val3 </child3>
</root>

I tried the pandas-read-xml package and got most of the values, but the anotherchild tag values show up in a single column (anotherchild) instead of in anotherchild|childid and anotherchild|childname. If possible, please suggest generic logic to convert an XML file to a flat file.

import pandas_read_xml as pdx

df = pdx.read_xml(xml_content, ['root'])
fully_flattened_df = pdx.fully_flatten(df)
fully_flattened_df.to_csv("stack.csv", index=False)

Output csv

anotherchild,child,child2,child3,group|groupzone|groupname,group|groupzone|groupsize,secondgroup|secondgroupzone|secondgroupname,secondgroup|secondgroupzone|secondgroupsize,secondgroup|secondgroupzone|secondgroupsub|secondsub,secondgroup|secondgroupzone|secondgroupsub|secondsubid
,child-val,child2-val2,val3,,,third,4,v1,12
,child-val,child2-val2,val3,,,fourth,6,v2,1
,child-val,child2-val2,val3,,,tenth,10,v3,45
,child-val,child2-val2,val3,first,4,,,,
,child-val,child2-val2,val3,second,6,,,,
,child-val,child2-val2,val3,third,8,,,,
another child 45,child-val,child2-val2,val3,,,,,,
another child name,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,

  • Kindly post your expected output dataframe. Also, you could use Python's lxml module or the ET module. (Commented Apr 7, 2021 at 13:40)

1 Answer

Normally, the XML nodes that hold a value should become the columns. In your XML example, "child", "child2", "childid", and so on should be columns.
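To make the "value-holding nodes become columns" idea concrete, here is a minimal sketch that only discovers the column names. It assumes a shortened version of the question's XML (the `SAMPLE` string) and a hypothetical helper `leaf_paths`; neither is part of the original code.

```python
import xml.etree.ElementTree as ET

# Shortened sample, assumed for illustration only.
SAMPLE = """<root>
    <child> child-val </child>
    <anotherchild>
        <childid> another child 45</childid>
        <childname> another child name </childname>
    </anotherchild>
    <group>
        <grouplist>
            <groupzone><groupname>first </groupname></groupzone>
            <groupzone><groupname>second </groupname></groupzone>
        </grouplist>
    </group>
</root>"""


def leaf_paths(element, prefix=""):
    """Yield (path, text) for every element that has no child elements."""
    path = f"{prefix}|{element.tag}" if prefix else element.tag
    if len(element) == 0:
        # Empty elements have text None, so fall back to an empty string.
        yield path, (element.text or "").strip()
    for child in element:
        yield from leaf_paths(child, path)


pairs = list(leaf_paths(ET.fromstring(SAMPLE)))
# dict.fromkeys keeps the first occurrence of each path, in document order.
columns = list(dict.fromkeys(path for path, _ in pairs))
for column in columns:
    print(column)
```

A repeated leaf such as groupname yields one column but two values, which is why the values have to be spread over several rows.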

Based on the XML above, I wrote this piece of code, which should be generic enough to handle similar examples.

import pandas as pd
import xml.etree.ElementTree as Xet

def getData(root, rows, columns, rowcount, name=None):
    if name is not None:
        # Prefix the column name with its parent path so that equally named nodes
        # under different parents do not share a column; for example, a node named
        # "name" could appear under both group and secondgroup.
        name = "{0}{1}{2}".format(name, "|", root.tag)
    else:
        name = root.tag

    for item in root:
        if len(item) == 0:
            colName = "{0}{1}{2}".format(name,"|", item.tag)
            # colName = item.tag # uncomment this line to use the short tag name instead of the full column name; ex: groupsize instead of root|group|grouplist|groupzone|groupsize
            if colName not in columns:
                columns.append(colName) # save the column to a list
                rowcount.append(0) # save the row on which we add the value for this column
                rows[rowcount[columns.index(colName)]].update({colName : (item.text or "").strip()}) # add the value to the row - this will always happen on row 0; an empty element has text None, so fall back to ""
            else:
                repeatPosition = columns.index(colName) # get the column position for the repeated item
                rowcount[repeatPosition] = rowcount[repeatPosition] + 1 # increase row count
                if len(rows) <= max(rowcount):
                    rows.append({}) # add a new row based on row count
                rows[rowcount[repeatPosition]].update({colName : (item.text or "").strip()}) # add the value on the new row; an empty element has text None, so fall back to ""

        getData(item, rows, columns, rowcount, name) # recursive call to walk through each list of elements


xmlParse = Xet.parse('example.xml')
root = xmlParse.getroot()

rows = [{}] # adding at least one row from the start and will add additional rows as we go along
columns = [] # holds the names of the columns
rowcount = [] # holds, for each column, the index of the row that received its latest value
getData(root, rows, columns, rowcount)

df = pd.DataFrame(rows, columns=columns)
print(df)
df.to_csv('parse.csv')

The end result after running this code looks like this: [image: csv result]

And this is the plain csv:

,root|child,root|child2,root|anotherchild|childid,root|anotherchild|childname,root|group|groupid,root|group|grouplist|groupzone|groupname,root|group|grouplist|groupzone|groupsize,root|secondgroup|secondgroupid,root|secondgroup|secondgrouptitle,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsub,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsubid,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupname,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsize,root|child3
0,child-val,child2-val2,another child 45,another child name,groupid-123,first,4,secondgroupid-42,second group title,v1,12,third,4,val3
1,,,,,,second,6,,,v2,1,fourth,6,
2,,,,,,third,8,,,v3,45,tenth,10,
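In the CSV above, the non-repeating columns are blank on rows 1 and 2. If you would rather repeat the parent values on every row, one option is to fill blanks downwards. The sketch below does that on a plain list of dicts; `fill_down` is a hypothetical helper, and it assumes that a blank cell always means "same as the row above", which will also pad a repeated group that is shorter than another one.

```python
def fill_down(rows, columns):
    """Copy each column's last seen value into later rows that lack it."""
    last_seen = {}
    filled = []
    for row in rows:
        complete = {}
        for column in columns:
            if row.get(column) not in (None, ""):
                last_seen[column] = row[column]
            # Columns never seen so far stay empty.
            complete[column] = last_seen.get(column, "")
        filled.append(complete)
    return filled


# Two columns taken from the CSV above, in the row layout the script produces.
columns = ["root|child", "root|group|grouplist|groupzone|groupname"]
rows = [
    {"root|child": "child-val", "root|group|grouplist|groupzone|groupname": "first"},
    {"root|group|grouplist|groupzone|groupname": "second"},
    {"root|group|grouplist|groupzone|groupname": "third"},
]
for row in fill_down(rows, columns):
    print(row)
```

Whether repeating values is correct depends on how the flat file will be consumed, so treat this as an optional post-processing step.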

Hopefully this should get you started in the right direction.
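Since the question mentions several XML files, it may help to keep the bookkeeping inside one function so that nothing has to be reset between files. This is a sketch of the same idea rather than the code above: `flatten_xml`, `next_row`, and `sep` are names I made up, a dict replaces the parallel `columns`/`rowcount` lists, and empty elements are assumed to map to an empty string.

```python
import xml.etree.ElementTree as ET


def flatten_xml(xml_text, sep="|"):
    """Return (columns, rows) for one XML document; rows is a list of dicts."""
    columns = []   # column names in first-seen order
    next_row = {}  # column name -> index of the row its next value goes to
    rows = []

    def walk(element, path):
        for child in element:
            child_path = f"{path}{sep}{child.tag}"
            if len(child) == 0:  # leaf element: it holds a value
                if child_path not in next_row:
                    columns.append(child_path)
                    next_row[child_path] = 0
                index = next_row[child_path]
                while len(rows) <= index:
                    rows.append({})  # grow the table when a column repeats
                rows[index][child_path] = (child.text or "").strip()
                next_row[child_path] = index + 1
            else:
                walk(child, child_path)

    root = ET.fromstring(xml_text)
    walk(root, root.tag)
    return columns, rows


columns, rows = flatten_xml(
    "<root><id> 7 </id>"
    "<zone><name>first</name></zone>"
    "<zone><name>second</name></zone></root>"
)
print(columns)
print(rows)
```

The returned values can be passed to pd.DataFrame(rows, columns=columns) in the same way as in the script above.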
