Creating DataFrame from XML document

Question

I understand this question has been asked a few times, but I've tried everything to no avail. I'm not sure whether this is an edge case or I'm missing something. I'm trying to parse an XML file and return it as a DataFrame. Below is my attempt:

import xml.etree.ElementTree as ET
import pandas as pd
from lxml import objectify
tree = ET.parse('file.xml')
root = tree.getroot()

<?xml version="1.0"?>
<document page-count="1">
    <page number="1">
        <table data-table="1" data-page="1" data-filename="Schedule.pdf">
            <tr>
                <td colspan="17">Wednesday 20th Mar</td>
            </tr>
            <tr>
                <td colspan="3" style="text-align: right">1</td>
                <td style="text-align: right">2</td>
                <td style="text-align: right">3</td>
                <td style="text-align: right">4</td>
                <td style="text-align: right">5</td>
                <td style="text-align: right">6</td>
                <td style="text-align: right">7</td>
                <td style="text-align: right">8</td>
                <td style="text-align: right">9</td>
                <td style="text-align: right">10</td>
                <td style="text-align: right">11</td>
                <td style="text-align: right">12</td>
                <td style="text-align: right">13</td>
                <td style="text-align: right">14</td>
                <td style="text-align: right">15</td>
            </tr>
            <tr>
                <td>HOME</td>
                <td>D</td>
                <td/>
                <td/>
                <td>08:00</td>
                <td>09:00</td>
                <td>10:00</td>
                <td>11:00</td>
                <td>12:00</td>
                <td>13:00</td>
                <td/>
                <td/>
                <td/>
                <td colspan="4"/>
            </tr>
        </table>
    </page>
</document>

I can export the data as strings:

print(ET.tostring(root, encoding='utf8').decode('utf8'))

But when I try to export it as a DataFrame, I get an effectively empty frame:

xml = objectify.parse('file.xml')
root = xml.getroot()

data=[]
for i in range(len(root.getchildren())):
    data.append([child.text for child in root.getchildren()[i].getchildren()])

df = pd.DataFrame(data).T

Out:

      0
0  None

With the date row stripped, my intended output is:

         1      2      3      4      5      6      7      8 9 10 11 12 13 14 15
0  HOME  D      08:00  09:00  10:00  11:00  12:00  13:00

The XML posted is not complete. It is missing end tags for table, page and document. See: docs.python.org/2/library/xml.etree.elementtree.html
– jdweng, Commented Aug 17, 2019 at 7:40

crayxt · Accepted Answer · 2019-09-03 04:32:15Z

2

In the example XML, the element for 10 in the first table row is not closed. Once that is fixed, you can simply do the following (provided the contents of file.xml have been read into a string a):

>>> pd.read_html(a, header=1)[0]
      1 1.1  1.2   2      3      4      5      6      7      8   9  10  11  12  13  14  15
0  HOME   D  NaN NaN  08:00  09:00  10:00  11:00  12:00  13:00 NaN NaN NaN NaN NaN NaN NaN

It looks like, in your expected output, you shifted the data row one position to the right.
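The 1, 1.1 and 1.2 columns in the output above most likely come from the colspan="3" on the first header cell: the cell is repeated once per spanned column, and the duplicate names are then made unique. The sketch below imitates that expansion with the standard library only. It is an illustration rather than pandas' actual implementation; the shortened XML string and the helper name expand_row are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the table in the question (assumption).
XML = """<document page-count="1">
  <page number="1">
    <table data-table="1">
      <tr><td colspan="3">1</td><td>2</td><td>3</td></tr>
      <tr><td>HOME</td><td>D</td><td/><td/><td>08:00</td></tr>
    </table>
  </page>
</document>"""


def expand_row(tr):
    """Return the cell texts of one <tr>, repeating each cell colspan times."""
    cells = []
    for td in tr.findall("td"):
        span = int(td.get("colspan", "1"))  # a missing colspan means one column
        cells.extend([td.text] * span)
    return cells


root = ET.fromstring(XML)
rows = [expand_row(tr) for tr in root.iter("tr")]

for row in rows:
    print(row)
```

After expansion both rows have the same width, and the header value 1 appears three times. That is consistent with HOME and D landing under 1 and 1.1 rather than one position to the right.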


lpozo · Accepted Answer · 2019-09-03 19:16:09Z

1

+50

I don't have pandas available right now, but I think you can try this code to get your data:

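import xml.etree.ElementTree as ET

xml = ET.parse('file.xml')
root = xml.getroot()

# Collect the text of every <td> in document order
# (this gives one flat list, not grouped by table row)
data = []
for child in root.iter('td'):
    data.append(child.text)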
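The loop in the question only descends two levels (document, then page, then table), so it never reaches the td cells. Iterating over the tr elements directly avoids that and also keeps the cells grouped by row. The following is a minimal sketch using only the standard library; the shortened XML string stands in for file.xml and is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for file.xml (assumption: the real file is well-formed).
XML = """<document page-count="1">
  <page number="1">
    <table data-table="1">
      <tr><td colspan="17">Wednesday 20th Mar</td></tr>
      <tr><td>1</td><td>2</td><td>3</td></tr>
      <tr><td>HOME</td><td/><td>08:00</td></tr>
    </table>
  </page>
</document>"""

root = ET.fromstring(XML)

# One list per <tr>; an empty cell (<td/>) has text None, so map it to "".
rows = [[td.text or "" for td in tr.findall("td")] for tr in root.iter("tr")]

# First row is the date, second row the column labels, the rest is data.
date_row, header, *data = rows

print(header)  # ['1', '2', '3']
print(data)    # [['HOME', '', '08:00']]
```

With pandas installed, pd.DataFrame(data, columns=header) should then give one DataFrame row per data tr. Note that this sketch ignores colspan, so rows from the full file will have different lengths unless spanned cells are expanded first.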
