0

I am trying to convert the following XML to pandas data frame using pd.read_xml api.

<?xml version='1.0' encoding='ISO-8859-1'?>
<Request>
    <Employee Name="James">
        <Address>Virginia</Address>
        <Project Name="project1">
            <Description>Description of Project 1</Description>
        </Project>
        <Project Name="project2">
            <Description>Description of Project 2</Description>
        </Project>
    </Employee>
</Request>

I tried the following code

df1 = pd.read_xml(filename, xpath="./*",parser="lxml")
print("\n"+str(df1.to_markdown))

and got the result something like this

<bound method DataFrame.to_markdown of     Name   Address  Project
0  James  Virginia      NaN>

Similarly when i tried to change the Xpath to read all the elements like below

df2 = pd.read_xml(filename, xpath="./*/*",parser="lxml")
print("\n"+str(df2.to_markdown))

I got result something like this

<bound method DataFrame.to_markdown of     Address      Name               Description
0  Virginia      None                      None
1      None  project1  Description of Project 1
2      None  project2  Description of Project 2>

What I am expecting is to get the results in the following format

<bound method DataFrame.to_markdown of  
  EmployeName Address      ProjectName              Description
0 James       Virginia      project1           Description of Project 1
1 James       Virginia      project2           Description of Project 2>

Is there a way to do this using read_xml api or any other library?

3
  • Why are you looking for 3 records? James got only 2 projects? Commented Apr 12, 2022 at 12:12
  • Yes, I think I mistakenly edited, it with 3 records. Actually 2 records are required. I can edit the output. Thanx. Commented Apr 12, 2022 at 12:23
  • Yes - please fix the post and show exactly how the DF look like. Commented Apr 12, 2022 at 12:24

1 Answer 1

1

See below

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version='1.0' encoding='ISO-8859-1'?>
<Request>
    <Employee Name="James">
        <Address>Virginia</Address>
        <Project Name="project1">
            <Description>Description of Project 1</Description>
        </Project>
        <Project Name="project2">
            <Description>Description of Project 2</Description>
        </Project>
    </Employee>
</Request>'''

root = ET.fromstring(xml)
data = []
emp = root.find('.//Employee')
name = emp.attrib['Name']
addr = emp.find('Address').text
for proj in emp.findall('.//Project'):
  proj_name = proj.attrib['Name']
  desc = proj.find('Description').text
  data.append({'EmployeName':name,'Address':addr,'ProjectName':proj_name,'Description':desc})
df = pd.DataFrame(data)
print(df)

output

  EmployeName   Address ProjectName               Description
0       James  Virginia    project1  Description of Project 1
1       James  Virginia    project2  Description of Project 2
Sign up to request clarification or add additional context in comments.

3 Comments

Thank for the answer. The problem is the structure of the XML can change and the fields like Description, Project Name etc. can be added or removed. Is there a generic approach where we can create the data frame without specifying the column names.
@JaspritSingh The generic approach is to have a defensive code which will not assume the element or attribute exists. Do you get the direction?
Can you please help me with an example?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.