2

I am stuck on a problem. I want to parse multiple XML files that share the same structure. I have already collected the paths of all the files and sorted them into three lists, since there are three different types of XML structure (Player, Team and Match). Now I want to write three functions, one per list, each of which loops through its list and parses the information I need. Somehow I am not able to do it. Could anybody give me a hint on how to do this?

import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys


#### Get the location of each XML file and save them into a list ####

all_xml_list =[]                                                                                                                                       

def locate(pattern,root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files,pattern):
            yield os.path.join(path,filename)

for files in locate('*.xml',r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)


#### Create lists by GameDay Events ####


xml_GameDay_Player   = [x for x in all_xml_list if 'Player' in x]                                                             
xml_GameDay_Team     = [x for x in all_xml_list if 'Team' in x]                                                             
xml_GameDay_Match    = [x for x in all_xml_list if 'Match' in x]  

The XML file looks like this:

<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>

I want to extract everything inside the player-metadata, player-stats and player-stats-soccer tags.

6
  • Please tell us what exactly you are unable to do: create the lists or create the functions? Commented Jun 8, 2017 at 15:15
  • Hey! First of all, thanks for the reply! I have a problem with creating the functions! Shall I post the XML structure that I want to parse? Would that be helpful? Commented Jun 8, 2017 at 15:19
  • Yes, please do. It would be best to also include the kind of info you need from the files. Commented Jun 8, 2017 at 15:21
  • Also, you say that you have the same structure in each file, and then you say you have 3 different types of XML structures? Commented Jun 8, 2017 at 15:26
  • I just added it. Commented Jun 8, 2017 at 15:26

3 Answers

4

Improving on @Gnudiff's answer, here is a more resilient approach:

import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):        
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
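
The question also asks for everything inside player-stats and player-stats-soccer, including the imp: attributes. Here is a minimal sketch of one way to collect those, using the standard library's xml.etree.ElementTree (already imported in the question) so that it runs without extra installs. SAMPLE is a cut-down copy of the posted file, and strip_ns, element_attributes and parse_players are hypothetical helper names, not part of any library. Prefixed attributes such as imp:duels-won come back under the key "{url}duels-won", where "url" is the namespace URI declared in the sample, so the sketch strips that part.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the file from the question (assumption: real files look the same).
SAMPLE = """<sports-content xmlns:imp="url">
  <sports-event>
    <team>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" />
        </player-metadata>
        <player-stats minutes-played="90" score="0">
          <player-stats-soccer imp:duels-won="1" imp:passes="32" dfl-max-speed="23.49" />
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>"""


def strip_ns(name):
    """Drop the '{namespace-uri}' part the parser puts in front of prefixed names."""
    return name.rsplit('}', 1)[-1]


def element_attributes(element):
    """Return an element's attributes as a plain dict with namespace-free keys.

    Note: a prefixed and an unprefixed attribute with the same local name
    would collide here; keep the full key if that matters for your data.
    """
    if element is None:
        return {}
    return {strip_ns(key): value for key, value in element.attrib.items()}


def parse_players(root):
    """Collect metadata and stats attributes for every <player> below root."""
    players = []
    for player in root.iter('player'):
        record = {}
        record.update(element_attributes(player.find('player-metadata')))
        record.update(element_attributes(player.find('player-metadata/name')))
        record.update(element_attributes(player.find('player-stats')))
        record.update(element_attributes(player.find('player-stats/player-stats-soccer')))
        players.append(record)
    return players


players = parse_players(ET.fromstring(SAMPLE))
print(players[0]['last'], players[0]['duels-won'])  # Mustermann 1
```

For real files you would pass ET.parse(filename).getroot() to parse_players inside the loop over xml_GameDay['Player'].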

XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.

<?xml version="1.0" encoding="UTF-8"?>

UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.

XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.

This is a good example of doing it wrong:

# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))

What happens here is this:

  1. Python opens filename as a text file f
  2. f.read() returns a string
  3. etree.XML() parses that string and creates a DOM object tree

Doesn't sound so wrong, does it? But if the XML is like this:

<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>

then the DOM you will end up with will be:

Player
    @nickname="MÃ¤xchen"

You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.
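
To see where this kind of damage comes from, here is a small sketch of what happens when UTF-8 bytes are decoded with the wrong codec. It assumes the platform default is cp1252, which is common on Windows but is only an assumption; your default may differ.

```python
# 'M\u00e4xchen' is "Mäxchen", written with an escape to keep this source ASCII-only.
original = 'M\u00e4xchen'

# What is actually stored on disk in a UTF-8 encoded XML file.
utf8_bytes = original.encode('utf-8')

# What a text-mode read produces if the default codec is cp1252 (assumed here).
mangled = utf8_bytes.decode('cp1252')

print(ascii(mangled))       # 'M\xc3\xa4xchen', i.e. two wrong characters instead of one
print(mangled == original)  # False
```

The single character ä is stored as two bytes in UTF-8, and cp1252 maps each of those bytes to its own character, which is why the name gets longer and unreadable.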

There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.

tree = etree.parse('some_filename.xml')

This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
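
As a rough check of that claim, the sketch below writes a small file in ISO-8859-1 (not UTF-8) with a matching XML declaration and then hands only the file name to the parser. It uses the standard library's xml.etree.ElementTree so that it runs without extra installs; the file name player.xml and the sample content are made up for the illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Sample document that declares a non-UTF-8 encoding ('\u00e4' is "ä").
xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<Player nickname="M\u00e4xchen"/>'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'player.xml')  # hypothetical file name
    # Write the bytes exactly as the declaration promises.
    with open(path, 'wb') as f:
        f.write(xml_text.encode('iso-8859-1'))
    # Give the parser the file name; it reads the declaration and decodes accordingly.
    root = ET.parse(path).getroot()

nickname = root.get('nickname')
print(nickname == 'M\u00e4xchen')  # True
```

No encoding is mentioned anywhere in the reading code; the parser picks it up from the declaration inside the file.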


9 Comments

Thank you very much for the answer! I just came back home and I am done for today. I'll try the code you provided tomorrow and give you feedback on whether it worked! I really appreciate the fast reply and the long answer!!!!!
That's really awesome!
@Lars Depending on what you want to do, a look at lxml.objectify might be worthwhile. lxml.de/objectify.html
Your answer is golden!! This is actually working perfectly!!!! I really appreciate your help!!
Hey @Tomalak! I am sorry for bothering you again. Your code is working really well, but I am facing a new issue. I now have problems extracting the data from each "imp:" attribute in the XML, since "imp" is defined as a prefix at the beginning of the document. Any idea how I can handle this? I tried several things, such as defining the namespace, but it doesn't work. Somehow I can't post the code here in the comments, and I don't want to create a new topic.
0

This won't be a complete solution for your particular case, because it is a fair bit of work and I don't have a keyboard (I'm working from a tablet).

In general, you can do it in several ways, depending on whether you really need all the data or only a specific subset, and on whether you know all the possible structures in advance.

For example, one way:

from lxml import etree

Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}  # one dict per player
        # xpath() returns a list of matching attribute values
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)

This is adapted from my existing script, which is tailored to extracting only a specific subset of the XML. If you need all the data, it would probably be better to use some XML tree walker.

file_get_contents is a small helper function:

def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

XPath is a powerful language for finding nodes within XML. Note that depending on the XPath expression you use, the result may be a list of XML nodes, as in the "for player in ..." statement, or a list of strings, as in the "row['player'] =" statement.

7 Comments

Please don't read XML files that way. They will get mangled, since f.read() turns them into a string without any knowledge of what encoding they are in. Use etree.parse(f) instead, which actually honors the XML declaration and the declared file encoding.
@Tomalak Thank you, this will come useful.
It's a common mistake and a lingering bug. It's 2017 and people still think in ASCII-only by default...
The funny thing is that my input is actually UTF-8. I just had to quickly Google for a solution when under time pressure.
Thank you, I will try it and give you feedback once I have! Thanks in advance tho :)
0

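You can use the xml.etree.ElementTree library. It is part of Python's standard library, so there is nothing to install (pip install lxml is only needed if you want to use lxml instead). Then follow the code structure below: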

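import xml.etree.ElementTree as ET
import os

my_dir = "your_directory"
new_value = "new text"  # placeholder: whatever you want to put into the tag
for fn in os.listdir(my_dir):
    if not fn.endswith('.xml'):
        continue  # skip anything that is not an XML file
    tree = ET.parse(os.path.join(my_dir, fn))
    root = tree.getroot()
    btf = root.find('tag_name')
    if btf is not None:  # find() returns None when the tag is missing
        btf.text = new_value  # modify the value of the tag to new_value
        tree.write(os.path.join(my_dir, fn))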

If you still need a detailed explanation, go through this link: https://www.datacamp.com/community/tutorials/python-xml-elementtree
