Parse multiple xml files in Python

Question

I am stuck on a problem. I want to parse multiple XML files that share the same structure. I have already collected the paths of all the files and sorted them into three lists, since there are three different types of XML structure. Now I want to create three functions (one per list) that loop through the lists and parse the information I need, but I have not managed to do it. Can anybody give me a hint on how to do this?

import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys


#### Get the location of each XML file and save them into a list ####

all_xml_list =[]                                                                                                                                       

def locate(pattern,root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files,pattern):
            yield os.path.join(path,filename)

for files in locate('*.xml',r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)


#### Create lists by GameDay Events ####


xml_GameDay_Player   = [x for x in all_xml_list if 'Player' in x]                                                             
xml_GameDay_Team     = [x for x in all_xml_list if 'Team' in x]                                                             
xml_GameDay_Match    = [x for x in all_xml_list if 'Match' in x]

The XML file looks like this:

<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>

I want to extract everything inside the player-metadata, player-stats and player-stats-soccer elements.

Please tell us what exactly you are unable to do: create the lists, or create the functions?
– Gnudiff, Commented Jun 8, 2017 at 15:15
Hey! First of all, thanks for the reply! I have a problem with creating the functions. Shall I post the XML structure that I want to parse? Would that be helpful?
– lazer, Commented Jun 8, 2017 at 15:19
Yes, please do. It would also be best to include the kind of info you need from the files.
– Gnudiff, Commented Jun 8, 2017 at 15:21
Also, you say that you have the same structure in each file, and then you say you have 3 different types of XML structures?
– Gnudiff, Commented Jun 8, 2017 at 15:26

Tomalak · Accepted Answer · 2017-06-08 18:50:09Z

4

Improving on @Gnudiff's answer, here is a more resilient approach:

import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):        
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
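
The question also asks for the attributes of player-stats and player-stats-soccer, wrapped in a function. Here is a minimal sketch of that, assuming the file layout shown in the question. It uses the standard library's xml.etree.ElementTree so that it runs without lxml; the function name parse_player_file and the dictionary keys are made up for illustration, and the inline sample is a cut-down copy of the question's XML.

```python
import io
import xml.etree.ElementTree as ET

def parse_player_file(source):
    """Return one dict per <player> element.

    `source` is a file name or an open binary file, so the parser can
    read the encoding from the XML declaration itself.
    """
    players = []
    for player in ET.parse(source).iter('player'):
        meta = player.find('player-metadata')
        name = player.find('player-metadata/name')
        stats = player.find('player-stats')
        soccer = player.find('player-stats/player-stats-soccer')
        players.append({
            # dict(el.attrib) copies all attributes; a missing element gives {}
            'metadata': dict(meta.attrib) if meta is not None else {},
            'name': dict(name.attrib) if name is not None else {},
            'stats': dict(stats.attrib) if stats is not None else {},
            'soccer': dict(soccer.attrib) if soccer is not None else {},
        })
    return players

# Cut-down sample; 'url' is the placeholder namespace URI from the question.
sample = b'''<sports-content xmlns:imp="url">
  <team>
    <player>
      <player-metadata player-key="33201" uniform-number="1">
        <name first="Max" last="Mustermann" />
      </player-metadata>
      <player-stats minutes-played="90" score="0">
        <player-stats-soccer imp:duels-won="1" dfl-max-speed="23.49" />
      </player-stats>
    </player>
  </team>
</sports-content>'''

players = parse_player_file(io.BytesIO(sample))
print(players[0])
```

To process a whole bucket, call the function once per file name, for example inside the loop over xml_GameDay['Player']. Attributes written with the imp: prefix come back under keys of the form '{url}duels-won'.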

XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.

<?xml version="1.0" encoding="UTF-8"?>

UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.

XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.

This is a good example of doing it wrong:

# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))

What happens here is this:

1. Python opens filename as a text file f.
2. f.read() returns a string.
3. etree.XML() parses that string and creates a DOM object tree.

Doesn't sound so wrong, does it? But if the XML is like this:

<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>

then the DOM you will end up with will be:

Player
    @nickname="MÃ¤xchen"

You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.
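
The mangling can be reproduced without any XML library. This sketch assumes Python 3 and a platform whose default text encoding is cp1252 (common on Western European Windows installs). It decodes the same UTF-8 bytes once with the right codec and once with the wrong one; the variable names are only for illustration.

```python
# The bytes as they would sit on disk in a UTF-8 encoded file.
raw = '<Player nickname="M\u00e4xchen"/>'.encode('utf-8')

right = raw.decode('utf-8')    # codec matches the file
wrong = raw.decode('cp1252')   # what a text-mode open() may pick by default

print(right)
print(wrong)
```

The second line printed shows the two-character garbage in place of the ä. As far as I know, current lxml versions refuse a str that still carries an encoding declaration and raise a ValueError, so with lxml the failure may be loud instead of silent; either way, the text-mode read is the mistake.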

There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.

tree = etree.parse('some_filename.xml')

This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
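
As a small check of that claim, here is a sketch that writes a file in ISO-8859-1, declares that encoding in the XML declaration, and lets the parser open it by name. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs anywhere; the temporary file and the variable names are assumptions for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# A document that is NOT UTF-8 but says so in its declaration.
xml_text = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
            '<Player nickname="M\u00e4xchen"/>')

with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as f:
    f.write(xml_text.encode('iso-8859-1'))
    path = f.name

try:
    # The parser opens the file itself and honors the declared encoding.
    nickname = ET.parse(path).getroot().get('nickname')
finally:
    os.remove(path)

print(nickname)
```

No encoding is mentioned anywhere in the parsing code; the declaration in the file is the only source of truth.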

edited Jun 8, 2017 at 18:50

answered Jun 8, 2017 at 17:49

Tomalak


9 Comments

lazer Over a year ago

Thank you very much for the answer! I just came back home and I am done for today. I'll try the code you provided tomorrow and give you feedback on whether it worked! I really appreciate the fast reply and the long answer!

lazer Over a year ago

That's really awesome!

Tomalak Over a year ago

@Lars Depending on what you want to do, a look at lxml.objectify might be worthwhile. lxml.de/objectify.html

lazer Over a year ago

Your answer is golden!! This is actually working perfectly! I really appreciate your help!!

lazer Over a year ago

Hey @Tomalak! I am sorry for bothering you again. Your code works really well, but I am facing a new issue. I am having trouble extracting the data from each "imp:" attribute in the XML, since imp is declared as a namespace prefix at the beginning of the document. Any idea how I can handle this? I tried several things, such as defining the namespace, but it doesn't work. Somehow I can't post the code here in the comments, and I don't want to create a new topic.
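
For the imp: attributes asked about in the comment above, a possible approach is sketched here. It assumes the sample from the question, where the namespace URI is the placeholder "url"; real files will declare a different URI in xmlns:imp. The sketch uses the standard library's xml.etree.ElementTree, which stores a namespaced attribute under the key '{namespace-uri}local-name'.

```python
import io
import xml.etree.ElementTree as ET

# 'url' is the placeholder namespace URI from the sample in the question;
# replace it with the real value of xmlns:imp in your files.
IMP = 'url'

sample = b'''<sports-content xmlns:imp="url">
  <player>
    <player-stats minutes-played="90">
      <player-stats-soccer imp:duels-won="1" imp:passes="32" dfl-max-speed="23.49"/>
    </player-stats>
  </player>
</sports-content>'''

root = ET.parse(io.BytesIO(sample)).getroot()
soccer = root.find('.//player-stats-soccer')

# Namespaced attributes are looked up as '{uri}name', not as 'imp:name'.
duels_won = soccer.get('{%s}duels-won' % IMP)

# Strip the '{uri}' part to get plain names for every attribute.
flat = {key.split('}')[-1]: value for key, value in soccer.attrib.items()}

print(duels_won)
print(flat)
```

With lxml's xpath() the equivalent is to pass a prefix mapping, for example player.xpath('.//player-stats-soccer/@imp:duels-won', namespaces={'imp': IMP}).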


Gnudiff · Accepted Answer · 2017-06-08 16:11:41Z

0

This won't be a complete solution for your particular case, because it is a fair bit of work and I don't have a keyboard; I am working from a tablet.

In general, you can do it in several ways, depending on whether you really need all the data or only a specific subset, and whether you know all the possible structures in advance.

For example, one way:

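from lxml import etree

Playerdata = []
for F in xml_GameDay_Player:
    # note: reading the file as text ignores the declared encoding (see comments)
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}
        # an attribute XPath returns a list of strings (empty if nothing matches)
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)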

This is adapted from my existing script; however, it is more tailored to extracting only a specific subset of the XML. If you need all the data, it would probably be better to use some XML tree walker.

file_get_contents is a small helper function:
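
A tree walker of that kind could look roughly like this. It is only a sketch, using the standard library's xml.etree.ElementTree instead of lxml; the function name walk and the tiny inline sample are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def walk(source):
    """Yield (tag, attributes) for every element, in document order."""
    for element in ET.parse(source).iter():
        yield element.tag, dict(element.attrib)

sample = b'<team><player><player-metadata player-key="33201"/></player></team>'
rows = list(walk(io.BytesIO(sample)))

for tag, attributes in rows:
    print(tag, attributes)
```

This collects everything without knowing the structure in advance; filtering for the tags you care about can then happen on the resulting rows.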

def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

XPath is a powerful language for finding nodes within XML. Note that, depending on the XPath you use, the result may be either XML nodes, as in the "for player in..." statement, or strings, as in the "row['player']=" statement.

edited Jun 8, 2017 at 16:11

answered Jun 8, 2017 at 16:02

Gnudiff


7 Comments

Tomalak Over a year ago

Please don't read XML files that way. They will get mangled since f.read() turns them into string without any knowledge of what encoding they are in. Use etree.parse(f) instead, which actually honors the XML declaration and the declared file encoding.

Gnudiff Over a year ago

@Tomalak Thank you, this will come in useful.

Tomalak Over a year ago

It's a common mistake and a lingering bug. It's 2017 and people still think in ASCII-only by default...

Gnudiff Over a year ago

The funny thing is that my input is actually UTF-8. I just had to quickly Google for a solution when under time pressure.

lazer Over a year ago

Thank you I will try it and give you feedback once I did it! Thanks in advance tho :)


Shekhar · Accepted Answer · 2020-08-24 20:01:34Z

0

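You can use the xml.etree.ElementTree library. It is part of the Python standard library, so nothing needs to be installed (pip install lxml installs lxml, which is a separate library with a similar API). Then follow the code structure below: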

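import xml.etree.ElementTree as ET
import os

my_dir = "your_directory"
new_value = "new text"  # placeholder: whatever you want to put into the tag

for fn in os.listdir(my_dir):
    if not fn.endswith('.xml'):
        continue  # skip anything that is not an XML file
    path = os.path.join(my_dir, fn)
    tree = ET.parse(path)
    root = tree.getroot()
    btf = root.find('tag_name')  # first direct child named 'tag_name', or None
    if btf is not None:
        btf.text = new_value  # modify the value of the tag
        tree.write(path)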

If you still need a detailed explanation, go through this link: https://www.datacamp.com/community/tutorials/python-xml-elementtree

answered Aug 24, 2020 at 20:01

Shekhar

