My sample XML is:

<RecordContainer RecordNumber = "1">
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
</RecordContainer>

My code to parse the above:

import xml.etree.ElementTree as ET
tree = ET.fromstring("<root>"+ sample_data + "</root>")

After parsing, I want to convert it to a pandas DataFrame or a CSV file. This is my code for the DataFrame conversion:

def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
        return result

d = f(tree, {})
df = pd.DataFrame(d, index=['values'])

But the code above gives me an empty df.

How do I convert the parsed XML above to a pandas DataFrame or a CSV file?

  • The fact that you have to impose a root shows that your original input is not XML but fragmented markup that closely resembles XML (by definition, XML is well-formed and has a single root). Before automating a solution, consider fixing the source of this fragment. Commented Sep 23, 2019 at 15:17

1 Answer

The code below converts the XML to one flat dict per record:

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<r><RecordContainer RecordNumber = "1">
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
</RecordContainer></r>'''

root = ET.fromstring(xml)
records = []
# One output row per <RecordContainer>
for container in root.findall('.//RecordContainer'):
    # Copy the attributes so the parsed tree itself is not modified
    entry = dict(container.attrib)
    book = container.find('.//catalog/book')
    entry.update(book.attrib)
    # Each child element of <book> becomes a column
    for child in book:
        entry[child.tag] = child.text
    records.append(entry)

for rec in records:
    print(rec)
df = pd.DataFrame(records)
print(df)

Output:

{'RecordNumber': '1', 'id': 'bk101', 'author': 'Gambardella, Matthew', 'title': "XML Developer's Guide", 'genre': 'Computer', 'price': '44.95', 'publish_date': '2000-10-01', 'description': 'An in-depth look at creating applications \n      with XML.'}
{'RecordNumber': '2', 'id': 'bk102', 'author': 'Ralls, Kim', 'title': 'Midnight Rain', 'genre': 'Fantasy', 'price': '5.95', 'publish_date': '2000-12-16', 'description': 'A former architect battles corporate zombies, \n      an evil sorceress, and her own childhood to become queen \n      of the world.'}

  RecordNumber                author  ... publish_date                  title
0            1  Gambardella, Matthew  ...   2000-10-01  XML Developer's Guide
1            2            Ralls, Kim  ...   2000-12-16          Midnight Rain

[2 rows x 8 columns]
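
Since the question also asks for a CSV file, here is a minimal sketch of that last step. It assumes records shaped like the ones printed above (the description field is left out to keep it short). The type conversions are optional, and the file name mentioned in the comment is only a placeholder.

```python
import io

import pandas as pd

# Flat records shaped like the dicts printed above (description omitted for brevity)
records = [
    {"RecordNumber": "1", "id": "bk101", "author": "Gambardella, Matthew",
     "title": "XML Developer's Guide", "genre": "Computer",
     "price": "44.95", "publish_date": "2000-10-01"},
    {"RecordNumber": "2", "id": "bk102", "author": "Ralls, Kim",
     "title": "Midnight Rain", "genre": "Fantasy",
     "price": "5.95", "publish_date": "2000-12-16"},
]

df = pd.DataFrame(records)

# ElementTree returns every value as a string; convert where a real type helps
df["price"] = pd.to_numeric(df["price"])
df["publish_date"] = pd.to_datetime(df["publish_date"])

# Written to an in-memory buffer here; passing a path such as "books.csv"
# (placeholder name) to to_csv would write a file instead
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False keeps the 0, 1 row labels out of the file. Values that contain a comma, such as the author names, are quoted by to_csv.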

7 Comments

This works great, but since it is a nested dictionary I am not able to convert it to CSV or a pandas DataFrame. Could you please suggest how I should convert it? @balderman
Which attributes do you want to have in a row?
I have updated my answer. The code generates a flat dict now.
Thanks for the above solution, but you have explicitly hardcoded the child path ".//catalog/book", which will create a problem when the file has multiple child tags nested inside another child tag. For example, please look at the following XML
Did you try the code against this XML? Please upload a final version of the XML you use somewhere.
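
One way to avoid the hardcoded ".//catalog/book" path is to walk each RecordContainer recursively, which is what the helper in the question was trying to do. That helper ends up returning None for two reasons: `return result` sits inside the `for` loop, and an element with no children never reaches a `return` at all. In addition, `getchildren()` was removed in Python 3.9. Below is a rough sketch of a corrected version. The name `flatten` is made up for this example, the sample data is shortened, and it assumes tag and attribute names are unique within one RecordContainer (repeated tags would overwrite each other).

```python
import xml.etree.ElementTree as ET


def flatten(elem, row=None):
    """Collect attributes and leaf-element text under elem into one flat dict."""
    if row is None:
        row = {}
    row.update(elem.attrib)
    children = list(elem)  # replaces getchildren(), removed in Python 3.9
    # Only leaf elements carry data; containers hold just whitespace text
    if not children and elem.text and elem.text.strip():
        row[elem.tag] = elem.text.strip()
    for child in children:
        flatten(child, row)
    return row  # after the loop, so every call returns the dict


# Shortened version of the fragment from the question
sample_data = '''<RecordContainer RecordNumber="1">
<catalog><book id="bk101"><author>Gambardella, Matthew</author><price>44.95</price></book></catalog>
</RecordContainer>
<RecordContainer RecordNumber="2">
<catalog><book id="bk102"><author>Ralls, Kim</author><price>5.95</price></book></catalog>
</RecordContainer>'''

# Wrap the fragment in a root element, as in the question
root = ET.fromstring("<root>" + sample_data + "</root>")
rows = [flatten(container) for container in root.findall("RecordContainer")]
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as records in the answer. If one container can hold several books, this sketch is not enough, and a row per book (or prefixed column names) would be needed.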