How to Convert parsed XML to pandas dataframe or CSV in python?

Question

My sample XML is:

<RecordContainer RecordNumber = "1">
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
</RecordContainer>

My code to parse the above

import xml.etree.ElementTree as ET
tree = ET.fromstring("<root>"+ sample_data + "</root>")

Now, after parsing, I want to convert it to a pandas DataFrame or a CSV file. To convert it to a DataFrame, I wrote the following code:

def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
        return result

d = f(tree, {})
df = pd.DataFrame(d, index=['values'])

But the above code gives me an empty DataFrame.

How do I convert the parsed XML above to a pandas DataFrame or a CSV file?

The fact that you have to impose a root shows that your original input is not XML but fragmented markup that closely resembles XML (by definition, well-formed XML has a single root). Before automating a solution, consider fixing the source of this fragment.
– Parfait, Commented Sep 23, 2019 at 15:17

Parfait · Accepted Answer · 2019-09-23 15:18:30Z

2

The code below converts the XML into a list of flat dicts, one per RecordContainer, and builds a DataFrame from that list.

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<r><RecordContainer RecordNumber = "1">
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
</RecordContainer></r>'''

root = ET.fromstring(xml)
records = []
# Build one flat dict (one future DataFrame row) per RecordContainer
for container in root.findall('.//RecordContainer'):
    # Copy the attributes so the parsed tree itself is not modified
    entry = dict(container.attrib)
    book = container.find('.//catalog/book')
    entry.update(book.attrib)
    # Each child of <book> becomes a column: tag -> text
    for child in book:
        entry[child.tag] = child.text
    records.append(entry)

for rec in records:
    print(rec)
df = pd.DataFrame(records)
print(df)

Output:

{'RecordNumber': '1', 'id': 'bk101', 'author': 'Gambardella, Matthew', 'title': "XML Developer's Guide", 'genre': 'Computer', 'price': '44.95', 'publish_date': '2000-10-01', 'description': 'An in-depth look at creating applications \n      with XML.'}
{'RecordNumber': '2', 'id': 'bk102', 'author': 'Ralls, Kim', 'title': 'Midnight Rain', 'genre': 'Fantasy', 'price': '5.95', 'publish_date': '2000-12-16', 'description': 'A former architect battles corporate zombies, \n      an evil sorceress, and her own childhood to become queen \n      of the world.'}

  RecordNumber                author  ... publish_date                  title
0            1  Gambardella, Matthew  ...   2000-10-01  XML Developer's Guide
1            2            Ralls, Kim  ...   2000-12-16          Midnight Rain

[2 rows x 8 columns]
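
Since the question also asks for CSV, here is a minimal sketch of the remaining step, assuming a `records` list shaped like the one printed above (shortened here so the block runs on its own). ElementTree returns every value as a string, so the sketch converts `price` explicitly and collapses the line breaks that the `description` text carries over from the XML source. The in-memory buffer is only for illustration; a file path such as "books.csv" (a hypothetical name) can be passed to `to_csv` instead.

```python
import io

import pandas as pd

# Two records in the same shape as the `records` list built above (shortened).
records = [
    {"RecordNumber": "1", "id": "bk101", "author": "Gambardella, Matthew",
     "title": "XML Developer's Guide", "price": "44.95",
     "description": "An in-depth look at creating applications \n      with XML."},
    {"RecordNumber": "2", "id": "bk102", "author": "Ralls, Kim",
     "title": "Midnight Rain", "price": "5.95",
     "description": "A former architect battles corporate zombies."},
]
df = pd.DataFrame(records)

# Every value parsed from XML is a string; convert numeric columns explicitly.
df["price"] = pd.to_numeric(df["price"])

# Collapse the line breaks and indentation inherited from the XML source.
df["description"] = df["description"].str.split().str.join(" ")

# Write to an in-memory buffer; pass a path instead to write a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

`index=False` keeps the DataFrame's row index out of the file, which is usually what you want when the CSV is the final product.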

edited Sep 23, 2019 at 15:18 by Parfait

answered Sep 23, 2019 at 12:20 by balderman

7 Comments

Jahnab Kumar Deka Over a year ago

This works great, but since it is a nested dictionary I am not able to convert it to CSV or a pandas DataFrame. Could you please suggest how I should convert it? @balderman

balderman Over a year ago

Which attributes do you want to have in a row?

balderman Over a year ago

I have updated my answer. The code generates a flat dict now.

Jahnab Kumar Deka Over a year ago

Thanks for the above solution, but you have explicitly hardcoded the child tag ".//catalog/book", which will create a problem when the file has multiple child tags nested inside another child tag. For example, please look at the following XML

balderman Over a year ago

Did you try the code against this XML? Please upload a final version of the XML you use somewhere.
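
On the hardcoded path: the recursive function in the question can be made to work without naming `catalog` or `book`. It fails as posted for two reasons. First, `return result` sits inside the loop, so only the first child is ever visited. Second, a leaf element has no children, so the function falls off the end and returns None, which then propagates back up to `d`. In addition, `getchildren()` was removed in Python 3.9. Below is a minimal sketch of a corrected version. The name `flatten` is hypothetical, the sample data is shortened, and it assumes each tag and attribute name occurs at most once per RecordContainer.

```python
import xml.etree.ElementTree as ET


def flatten(elem, row=None):
    """Collect attributes and leaf text of elem and its descendants into one dict."""
    if row is None:
        row = {}
    row.update(elem.attrib)      # e.g. RecordNumber, id
    children = list(elem)        # replaces the removed getchildren()
    # Only leaf elements carry data; container text is just indentation.
    if not children and elem.text and elem.text.strip():
        row[elem.tag] = elem.text.strip()
    for child in children:
        flatten(child, row)      # visit every child, not just the first
    return row                   # return once, after the loop


sample_data = """
<RecordContainer RecordNumber="1"><catalog><book id="bk101">
  <author>Gambardella, Matthew</author><price>44.95</price>
</book></catalog></RecordContainer>
<RecordContainer RecordNumber="2"><catalog><book id="bk102">
  <author>Ralls, Kim</author><price>5.95</price>
</book></catalog></RecordContainer>
"""

root = ET.fromstring("<root>" + sample_data + "</root>")
# One flat dict per RecordContainer, whatever is nested inside it.
rows = [flatten(container) for container in root.findall("RecordContainer")]
for row in rows:
    print(row)
```

`pd.DataFrame(rows)` then gives one row per RecordContainer. If a container holds several elements with the same tag (for example, more than one `book`), later values overwrite earlier ones in this sketch; in that case you would need one row per repeated element or keys prefixed with the parent tag.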
