Converting XML to pandas DataFrame using read_xml

Question

I am trying to convert the following XML to a pandas DataFrame using the pd.read_xml API.

<?xml version='1.0' encoding='ISO-8859-1'?>
<Request>
    <Employee Name="James">
        <Address>Virginia</Address>
        <Project Name="project1">
            <Description>Description of Project 1</Description>
        </Project>
        <Project Name="project2">
            <Description>Description of Project 2</Description>
        </Project>
    </Employee>
</Request>

I tried the following code

df1 = pd.read_xml(filename, xpath="./*",parser="lxml")
print("\n"+str(df1.to_markdown))

and got a result like this:

<bound method DataFrame.to_markdown of     Name   Address  Project
0  James  Virginia      NaN>

Similarly, when I changed the XPath to read all the nested elements, as below

df2 = pd.read_xml(filename, xpath="./*/*",parser="lxml")
print("\n"+str(df2.to_markdown))

I got a result like this:

<bound method DataFrame.to_markdown of     Address      Name               Description
0  Virginia      None                      None
1      None  project1  Description of Project 1
2      None  project2  Description of Project 2>

What I am expecting is to get the results in the following format

<bound method DataFrame.to_markdown of  
  EmployeName Address      ProjectName              Description
0 James       Virginia      project1           Description of Project 1
1 James       Virginia      project2           Description of Project 2>

Is there a way to do this using the read_xml API or any other library?

Why are you looking for 3 records? James has only 2 projects.
– balderman, Commented Apr 12, 2022 at 12:12
Yes, I think I mistakenly edited it to show 3 records. Actually, 2 records are required. I can edit the output. Thanks.
– Jasprit Singh, Commented Apr 12, 2022 at 12:23
Yes, please fix the post and show exactly what the DataFrame should look like.
– balderman, Commented Apr 12, 2022 at 12:24

balderman · Accepted Answer · 2022-04-12 12:33:27Z

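See below. The code parses the XML with ElementTree, reads the employee's fields once, and emits one row per Project element.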

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version='1.0' encoding='ISO-8859-1'?>
<Request>
    <Employee Name="James">
        <Address>Virginia</Address>
        <Project Name="project1">
            <Description>Description of Project 1</Description>
        </Project>
        <Project Name="project2">
            <Description>Description of Project 2</Description>
        </Project>
    </Employee>
</Request>'''

root = ET.fromstring(xml)
data = []
# Employee-level fields are read once and repeated on every project row
emp = root.find('.//Employee')
name = emp.attrib['Name']
addr = emp.find('Address').text
# One output row per Project element
for proj in emp.findall('.//Project'):
    proj_name = proj.attrib['Name']
    desc = proj.find('Description').text
    data.append({'EmployeName': name, 'Address': addr,
                 'ProjectName': proj_name, 'Description': desc})
df = pd.DataFrame(data)
print(df)

output

  EmployeName   Address ProjectName               Description
0       James  Virginia    project1  Description of Project 1
1       James  Virginia    project2  Description of Project 2
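
Note that the code above indexes attrib['Name'] and calls .text on the result of find(), so it raises KeyError or AttributeError as soon as an attribute or element is missing. A more defensive variant might look like the sketch below. It assumes the same Request/Employee/Project layout, and the second Project is deliberately left empty (my own modification of the sample) to show the missing-field case. Element.get() and Element.findtext() return None instead of raising.

```python
import xml.etree.ElementTree as ET

# Same layout as the question; the second Project has no Description
# (a hypothetical variation, added only to exercise the missing-field path).
xml = '''<Request>
    <Employee Name="James">
        <Address>Virginia</Address>
        <Project Name="project1">
            <Description>Description of Project 1</Description>
        </Project>
        <Project Name="project2"/>
    </Employee>
</Request>'''

root = ET.fromstring(xml)
rows = []
for emp in root.findall('Employee'):
    # get() / findtext() return None when the attribute or child is absent
    name = emp.get('Name')
    addr = emp.findtext('Address')
    for proj in emp.findall('Project'):
        rows.append({
            'EmployeName': name,
            'Address': addr,
            'ProjectName': proj.get('Name'),
            'Description': proj.findtext('Description'),
        })

# rows can be passed to pd.DataFrame(rows) exactly as in the answer
```

Missing values then end up as None in the row dictionaries rather than stopping the loop.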


3 Comments

Jasprit Singh Over a year ago

Thanks for the answer. The problem is that the structure of the XML can change, and fields like Description, Project Name, etc. can be added or removed. Is there a generic approach where we can create the DataFrame without specifying the column names?

balderman Over a year ago

@JaspritSingh The generic approach is to write defensive code that does not assume an element or attribute exists. Does that point you in the right direction?

Jasprit Singh Over a year ago

Can you please help me with an example?
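
One possible way to avoid naming the columns is to walk the tree recursively and derive the keys from the XML itself. This is only a sketch under some assumptions of mine, not something from the answer: scalar_fields and flatten are hypothetical helper names, attributes become columns named tag + attribute (so the column is EmployeeName rather than EmployeName), text-only children become columns named after their tag, and any child that has attributes or children of its own starts a new level of rows.

```python
import xml.etree.ElementTree as ET

def scalar_fields(elem):
    """Attributes of elem plus its text-only children, as a flat dict."""
    fields = {f"{elem.tag}{key}": value for key, value in elem.attrib.items()}
    for child in elem:
        if len(child) == 0 and not child.attrib:  # plain leaf such as <Address>
            fields[child.tag] = (child.text or '').strip() or None
    return fields

def flatten(elem, inherited=None):
    """Yield one dict per innermost nested element, carrying parent fields down."""
    fields = {**(inherited or {}), **scalar_fields(elem)}
    # Children with attributes or sub-elements are treated as nested records
    nested = [c for c in elem if len(c) > 0 or c.attrib]
    if not nested:
        yield fields
        return
    for child in nested:
        yield from flatten(child, fields)

xml = '''<Request>
    <Employee Name="James">
        <Address>Virginia</Address>
        <Project Name="project1">
            <Description>Description of Project 1</Description>
        </Project>
        <Project Name="project2">
            <Description>Description of Project 2</Description>
        </Project>
    </Employee>
</Request>'''

rows = list(flatten(ET.fromstring(xml)))
# rows can then be passed to pd.DataFrame(rows); keys missing from some
# rows show up as NaN in the resulting frame
```

The sketch has known limits: the text of a leaf that also carries attributes is dropped, and two leaves with the same tag at different levels overwrite each other, so real data may need prefixes or extra rules.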
