I have a Python script that downloads, unzips, and parses an XML file published by a Canadian institution. Only some very specific tags are extracted and then all put into a pandas DataFrame for later processing.
Everything works well. I just wonder if there is room for improvement here, especially in the parsing part. I am not sure whether the nested for loops I use are a good idea, or whether there is a better and cleaner way to parse.
import requests
import zipfile
import os
import glob
from lxml import etree
from io import StringIO, BytesIO
import pandas as pd
import xml.etree.ElementTree as ET
def download_file(url, filename):
    """Download *url* and save the response body to *filename*.

    Args:
        url: HTTP(S) URL to fetch (redirects are followed).
        filename: Local path the raw response bytes are written to.

    Raises:
        requests.HTTPError: if the server responds with a 4xx/5xx status,
            instead of silently saving the error page to disk.
    """
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # fail loudly rather than writing an error page
    # Context manager guarantees the file handle is closed (the original
    # open(...).write(...) leaked the handle until GC).
    with open(filename, 'wb') as f:
        f.write(r.content)
def unzip_and_delete(filename):
    """Extract every member of the zip archive *filename* into the current
    working directory, then delete the archive.

    Args:
        filename: Path to a zip archive.

    Raises:
        zipfile.BadZipFile: if *filename* is not a valid zip archive
            (the archive is NOT deleted in that case).
    """
    # Context manager closes the archive even if extractall() raises;
    # the original left the handle open on an extraction error.
    with zipfile.ZipFile(filename, 'r') as zf:
        zf.extractall()
    os.remove(filename)
def parse_xml_fields(file, base_tag, tag_list, final_list):
    """Extract selected tags from every *base_tag* element of an XML file.

    For each ``<base_tag>`` element found anywhere in the document, build a
    dict mapping each tag in *tag_list* to the stripped text of its first
    matching descendant, and append that dict to *final_list* (mutated in
    place). Tags that are absent — or present but empty — are omitted from
    the dict, so the resulting DataFrame gets NaN there.

    Args:
        file: Path or file-like object parseable by ``etree.parse``.
        base_tag: Tag name of the record elements to iterate over.
        tag_list: Tag names to extract from each record.
        final_list: List that accumulates one dict per record.
    """
    tree = etree.parse(file)
    # ".//" (relative descendant search) instead of the deprecated "//"
    # absolute form; iterfind avoids materialising the full node list.
    for node in tree.iterfind(".//{}".format(base_tag)):
        item = {}
        for tag in tag_list:
            child = node.find(".//{}".format(tag))  # single lookup, not two
            # Guard text is None: an empty element (<Tag/>) would otherwise
            # raise AttributeError on .strip().
            if child is not None and child.text is not None:
                item[tag] = child.text.strip()
        final_list.append(item)
# --- Driver script ---------------------------------------------------------
# Tags to pull out of each MsbInformation record; parse_xml_fields fills
# entities_list in place.
field_list = [
    "MsbRegistrationNumber",
    "StatusDescriptionEnglish",
    "Surname",
    "GivenName",
    "MiddleName",
    "Name",
    "StreetAddress",
]
entities_list = []

# Fetch the FINTRAC MSB registry dump, unpack it, and pull out the fields.
download_file('http://www10.fintrac-canafe.gc.ca/msb-esm/public/msb-search/zipdownload-eng/', 'fintrac.zip')
unzip_and_delete('fintrac.zip')
parse_xml_fields("MsbRegistryPublicDataFile.xml", "MsbInformation", field_list, entities_list)

# Fixed column order so the spreadsheet layout is stable across runs.
df = pd.DataFrame(entities_list, columns=field_list)
df.to_excel("Canada_MSB_List.xlsx")