2

I'm trying to load a large-ish (53MB) XML file into a pandas dataframe. Here are 3 rows of actual data (from public database of NTSB aviation accident reports), but the actual file has 77257 rows:

<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
    <ROW EventId="20150901X74304" InvestigationType="Accident" AccidentNumber="GAA15CA244" EventDate="09/01/2015" Location="Truckee, CA" Country="United States" Latitude="" Longitude="" AirportCode="" AirportName="" InjurySeverity="" AircraftDamage="" AircraftCategory="" RegistrationNumber="N786AB" Make="JOE SALOMONE" Model="SUPER CUB SQ2" AmateurBuilt="" NumberOfEngines="" EngineType="" FARDescription="" Schedule="" PurposeOfFlight="" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="" ReportStatus="Preliminary" PublicationDate=""/>
    <ROW EventId="20150901X92332" InvestigationType="Accident" AccidentNumber="CEN15LA392" EventDate="08/31/2015" Location="Houston, TX" Country="United States" Latitude="29.809444" Longitude="-95.668889" AirportCode="IWS" AirportName="WEST HOUSTON" InjurySeverity="Non-Fatal" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N452CS" Make="CESSNA" Model="T240" AmateurBuilt="No" NumberOfEngines="" EngineType="" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Instructional" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="2" WeatherCondition="VMC" BroadPhaseOfFlight="LANDING" ReportStatus="Preliminary" PublicationDate="09/04/2015"/>
    <ROW EventId="20150729X33718" InvestigationType="Accident" AccidentNumber="CEN15FA325" EventDate="" Location="Truth or Consequences, NM" Country="United States" Latitude="33.250556" Longitude="-107.293611" AirportCode="TCS" AirportName="TRUTH OR CONSEQUENCES MUNI" InjurySeverity="Fatal(2)" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N32401" Make="PIPER" Model="PA-28-151" AmateurBuilt="No" NumberOfEngines="1" EngineType="Reciprocating" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Personal" AirCarrier="" TotalFatalInjuries="2" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="UNKNOWN" ReportStatus="Preliminary" PublicationDate="08/10/2015"/>
</ROWS>
</DATA>

The following code, which I adapted from here, works, but it is very slow for this data (more than 30 mins on my system). I can't seem to get the solution posted for the original example to work because the structure of my XML is different. Is there a more efficient way to load this data?

path_to_xml_file = mypath

import pandas as pd
import xml.etree.ElementTree as ET

#Load xml file data
tree = ET.parse(path_to_xml_file)
root = tree.getroot()

#Grab list of column names
aviationdata_column_names = root[0][0].attrib.keys()             
#Create empty dataframe   
aviationdata_df = pd.DataFrame(columns=aviationdata_column_names)

#Loop through tree and append to dataframe
for i in range(0,len(root[0])-1):
    new_row = pd.Series(root[0][i].attrib)
    new_row.name = i
    aviationdata_df = aviationdata_df.append(new_row)

There are various solutions to similar questions posted around the internet (here, here, and here), but I haven't had luck implementing them. Version issues might be responsible for some of that (I am using Python 2.7).

5
  • @jezrael When I try the first 3 of those 4 methods, they return a dataframe with either 0 or 1 element. How does one make those methods loop through the xml file and pull the entire structure in? Commented Jan 22, 2017 at 19:54
  • @jezrael alas, this is an exercise to learn to work with various file formats. I am asked to use xml for this particular dataset. Commented Jan 22, 2017 at 20:05
  • @jezrael - I appreciate the link reference to my earlier question. And I doubt xml is too complicated for you! Commented Jan 22, 2017 at 21:28
  • @Parfait - I want give suggestion for OP for add coment to your question, but I forget. Super, I upvote your answer. Commented Jan 22, 2017 at 21:29
  • Does this answer your question? How to convert an XML file to nice pandas dataframe? Commented Apr 21, 2021 at 7:54

1 Answer 1

6

Since your XML is attribute centric (no element values), consider iterating across all attributes which stores in dictionary key/value pairs in xml.etree.ElementTree.

Below binds lists of attribute set dictionaries to the DataFrame() call:

import pandas as pd
import xml.etree.ElementTree as ET

path_to_xml_file = mypath

# Load xml file data
tree = ET.parse(path_to_xml_file)

data = []
for el in tree.iterfind('./*'):
    for i in el.iterfind('*'):
        data.append(dict(i.items()))

df = pd.DataFrame(data)

Output

# FIRST FEW COLUMNS
print(df[list(range(12))])

#   AccidentNumber AirCarrier AircraftCategory AircraftDamage AirportCode                 AirportName AmateurBuilt BroadPhaseOfFlight        Country     EngineType   EventDate         EventId
# 0     GAA15CA244                                                                                                                     United States                 09/01/2015  20150901X74304
# 1     CEN15LA392                    Airplane    Substantial         IWS                WEST HOUSTON           No            LANDING  United States                 08/31/2015  20150901X92332
# 2     CEN15FA325                    Airplane    Substantial         TCS  TRUTH OR CONSEQUENCES MUNI           No            UNKNOWN  United States  Reciprocating              20150729X33718
Sign up to request clarification or add additional context in comments.

7 Comments

Thanks @Parfait. From >30 minutes to less than 1 second. Exactly what I hoped python was capable of.
Great! Turns out, you didn't even need the last for loop as attribute sets are dictionaries! See shorted code.
How would you modify this for a file that has nested items? For example: <Recipient firstname="Joe" lastname = "Smith"><house_description color = "red"/><address>12 Main Street</address></Recipient>
this is good ... any idea on how to use it with an XML file both with elements and also a namespace? The method i.find(.//{ns}key).text does not work if I use the .iterfind() as suggested above ...
@LucasAimaretto, all depends on XML which can vary quite a bit. So solution will need to be tailored. This solution is for a specific XML with mainly attributes. Please make an attempt and ask a new question.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.