Parsing through a deep-nested XML File in Python

Question

I am looking at an xml file similar to the below:

<pinnacle_line_feed>
  <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
  <lastContest>28962804</lastContest>
  <lastGame>162995589</lastGame>
  <events>
    <event>
      <event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
      <gamenumber>422739932</gamenumber>
      <sporttype>Alpine Skiing</sporttype>
      <league>DH 145</league>
      <IsLive>No</IsLive>
      <participants>
        <participant>
          <participant_name>Kjetil Jansrud (NOR)</participant_name>
          <contestantnum>2001</contestantnum>
          <rotnum>2001</rotnum>
          <visiting_home_draw>Visiting</visiting_home_draw>
        </participant>
        <participant>
          <participant_name>The Field</participant_name>
          <contestantnum>2002</contestantnum>
          <rotnum>2002</rotnum>
          <visiting_home_draw>Home</visiting_home_draw>
        </participant>
      </participants>
      <periods>
        <period>
          <period_number>0</period_number>
          <period_description>Matchups</period_description>
          <periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
          <period_status>I</period_status>
          <period_update>open</period_update>
          <spread_maximum>200</spread_maximum>
          <moneyline_maximum>100</moneyline_maximum>
          <total_maximum>200</total_maximum>
          <moneyline>
            <moneyline_visiting>116</moneyline_visiting>
            <moneyline_home>-136</moneyline_home>
          </moneyline>
        </period>
      </periods>
      <PinnacleFeedTime>1418929691920</PinnacleFeedTime>
    </event>
  </events>
</pinnacle_line_feed>

I have parsed the file with the code below:

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'

tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []

for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum  = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text

However, I am hoping to get the nodes separated by event similar to a 2D table (as in a pandas dataframe). However, the full xml file has multiple "event" children, some events that do not share the same nodes as above. I am struggling quite mightily with being able to take each event node and simply create a 2d table with the tag and that value where the tag acts as the column name and the text acts as the value.

Up to this point, I have done the above to gauge how I might put that information into a dictionary and subsequently put a number of dictionaries into a list from which I can create a dataframe using pandas, but that has not worked out, as all attempts have required me to find and replace text to create the dxcictionaries and python has not responded well to that when attempting to subsequently create a dataframe. I have also used a simple:

for elt in tree.iter():
  list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))

which worked quite well in simple pulling out every single tag and the corresponding text, but I was unable to make anything of that because any attempts at finding and replacing the text to create dictionaries was no good.

Any assistance would be greatly appreciated.

Thank you.

This is a pretty long post, but from what I understand, you'd like to turn the events list into a pandas dataframe so you can output something that looks like CSV from the data. Is this close? If not, can you more clearly ask your question again via an edit? — VooDooNOFX
– VooDooNOFX, Commented Jan 5, 2015 at 23:52
Yea, apologies for not being clear. I would like to do exactly as you said, turn the events list into a pandas dataframe. Essentially, I would like each tag in the above xml to act as a column and each "text" to act as that columns value. — amachure
– amachure, Commented Jan 6, 2015 at 18:39
Your data includes nested data about participants and (in some cases) periods of the game. How do you propose that data should look in your csv?. You should probably edit your question to include a full example of at least 2 rows of data. — VooDooNOFX
– VooDooNOFX, Commented Jan 6, 2015 at 20:50

VooDooNOFX · Accepted Answer · 2015-01-06 21:06:13Z

Here's an easy way to get your XML into a pandas dataframe. This utilizes the awesome requests library (which you can switch for urllib if you'd like, as well as the always helpful xmltodict library available in pypi. (NOTE: a reverse library is also available, knows as dicttoxml)

import json
import pandas
import requests
import xmltodict

web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')

# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)

# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
#   visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))

# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)

To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:

>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo)  # Output the df to a CSV file
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball

Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:

# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO()  # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0)  # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball

Awesome stuff. I really appreciate you taking the time to help out. I will work with this tonight, see what I can do with the participants and periods columns, though that should be significantly easier considering they are now dictionaries. Thanks!
As for your question above, the goal is to take those as "participant1", "contestantnum1", etc. and treat them as column headers similar to the other fields. In other words, I would anticipate from the 1st row of your example, instead of the column headers participants and periods, the column headers would become "participant_name1", "contestantnum1", etc. and the values would be Obras Sanitarias and 1071, respectively for the first row of data.

Collectives™ on Stack Overflow

Parsing through a deep-nested XML File in Python

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related