How to read XML file from URL in python?

Question

I want to access the information in a sub-node of this XML file, but searching for the title element returns nothing. Is this because of the structure of the file?

I tried extracting the author sub-node into a separate file and running the Python code against it, and that works fine.

import urllib
import xml.etree.ElementTree as ET

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

print 'Retrieving', url

document = urllib.urlopen (url).read()
print 'Retrieved', len(document), 'characters.'

print document[:50]

tree = ET.fromstring(document)

lst = tree.findall('title')
print lst[:100]

Any luck yet with the provided answers ?

– iLuvLogix, Commented Feb 19, 2019 at 12:41

manuel_b · Accepted Answer · 2019-02-22 10:13:41Z

5

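You couldn't find the title elements because of the XML namespace: the elements in this document are in the urn:hl7-org:v3 namespace, so the tag name has to be qualified with it.

Below is sample code that finds:

The title directly under the root "document" tag
Titles at any depth, including those inside the inner "component" tags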

    import xml.etree.ElementTree as ET
    import urllib.request

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    response = urllib.request.urlopen(url).read()
    tree = ET.fromstring(response)


    for docTitle in tree.findall('{urn:hl7-org:v3}title'):
        print(docTitle.text)

    for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
        print(compTitle.text)
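
As an alternative to writing the `{urn:hl7-org:v3}` prefix in every path, ElementTree's `find`/`findall` accept a prefix-to-URI mapping. The following is a minimal sketch of that approach, not part of the original answer: the inline `SAMPLE` document is a simplified stand-in for the real file, and the `hl7` prefix is an arbitrary alias chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the real SPL document (assumed structure, illustration only).
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <title>Document title</title>
  <component>
    <section>
      <title>Section title</title>
    </section>
  </component>
</document>"""

# "hl7" is an arbitrary local alias; only the URI has to match the file.
NAMESPACES = {"hl7": "urn:hl7-org:v3"}

root = ET.fromstring(SAMPLE)

# Without the namespace nothing matches, as in the question.
unqualified = root.findall("title")

# Direct children of the root only.
direct_titles = [t.text for t in root.findall("hl7:title", NAMESPACES)]

# Titles at any depth below the root.
all_titles = [t.text for t in root.findall(".//hl7:title", NAMESPACES)]

print(unqualified)
print(direct_titles)
print(all_titles)
```

The mapping keeps the path strings short, which helps once the paths get several levels deep.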

UPDATE

If you need to search for specific XML nodes, use XPath expressions (ElementTree supports a limited subset of XPath).

Example:

NS = '{urn:hl7-org:v3}'
ID = '829076996'    # ID TO BE FOUND

# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "id[@extension='", ID,
    "']/../../.."
    ])

# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
    "./",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "name"
    ])

# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
    name = author.find(xPathAuthorName)
    print(name.text)

This example prints the author name for the ID 829076996.
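
If climbing back up the tree with `../../..` feels hard to follow, the same lookup can be done by iterating over the author elements and filtering in Python. This is a rough sketch under assumptions: the inline `SAMPLE` document, the IDs in it, and the helper name `author_names_by_id` are all hypothetical and only mirror the nesting used in the paths above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document that mirrors the author nesting used above.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="111"/>
        <name>First Org</name>
      </representedOrganization>
    </assignedEntity>
  </author>
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="222"/>
        <name>Second Org</name>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

NS = {"hl7": "urn:hl7-org:v3"}
ORG_PATH = "hl7:assignedEntity/hl7:representedOrganization"


def author_names_by_id(root, wanted_id):
    """Return the organization names of authors whose id extension matches (illustrative helper)."""
    names = []
    for author in root.findall(".//hl7:author", NS):
        org = author.find(ORG_PATH, NS)
        if org is None:
            continue  # author without a represented organization
        id_node = org.find("hl7:id", NS)
        if id_node is not None and id_node.get("extension") == wanted_id:
            names.append(org.findtext("hl7:name", default="", namespaces=NS))
    return names


root = ET.fromstring(SAMPLE)
print(author_names_by_id(root, "222"))
```

The loop is longer than a single XPath string, but each step can be inspected or guarded separately.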

UPDATE 2

You can easily process all assignedEntity tags with a findall method. For each of them you can have multiple products, so another findall method is needed (see example below).

xPathAssignedEntities = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "assignedEntity/", 
    NS, "assignedOrganization/", 
    NS, "assignedEntity"
    ])

xPathProdCode = ''.join([
    NS, "actDefinition/",
    NS, "product/",
    NS, "manufacturedProduct/",
    NS, "manufacturedMaterialKind/",
    NS, "code"
    ])


# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):

    # GET ID AND NAME OF assignedEntity (orgId avoids shadowing the built-in id())
    orgId = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
    name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text

    # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
    for performance in assignedEntity.findall(NS + 'performance'):
        actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
        prodCode = performance.find(xPathProdCode).get('code')
        print(orgId, '\t', name, '\t', actCode, '\t', prodCode)

This is the result:

829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-0050 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4900 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4910 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4940 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4960 
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-0050
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4900
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4910
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4940
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4960
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4900 
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4910 
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4960 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4900 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4910 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4960 
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-0050
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-4940
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4900 
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4910 
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4960
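
One caveat with chained calls such as `find(...).get(...)`: `find` returns `None` when nothing matches, so a missing element raises `AttributeError`. The sketch below shows one possible way to guard against that. The inline `SAMPLE` fragment and the helper `attr_or_none` are assumptions made for illustration and do not come from the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the second <performance> deliberately lacks a <code> element.
SAMPLE = """<assignedEntity xmlns="urn:hl7-org:v3">
  <assignedOrganization>
    <id extension="123"/>
    <name>Example Org</name>
  </assignedOrganization>
  <performance>
    <actDefinition>
      <code displayName="PACK"/>
    </actDefinition>
  </performance>
  <performance>
    <actDefinition/>
  </performance>
</assignedEntity>"""

NS = {"hl7": "urn:hl7-org:v3"}


def attr_or_none(parent, path, attribute):
    """Return the attribute of the first match, or None if the element is missing (illustrative helper)."""
    node = parent.find(path, NS)
    return node.get(attribute) if node is not None else None


entity = ET.fromstring(SAMPLE)
org_id = attr_or_none(entity, "hl7:assignedOrganization/hl7:id", "extension")
name = entity.findtext("hl7:assignedOrganization/hl7:name", default=None, namespaces=NS)

rows = []
for performance in entity.findall("hl7:performance", NS):
    act_code = attr_or_none(performance, "hl7:actDefinition/hl7:code", "displayName")
    if act_code is None:
        continue  # skip incomplete <performance> entries instead of crashing
    rows.append((org_id, name, act_code))

for row in rows:
    print("\t".join(row))
```

Whether to skip, log, or fail on an incomplete entry depends on how strict the downstream use needs to be; skipping is just the simplest choice for a sketch.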

edited Feb 22, 2019 at 10:13

answered Feb 19, 2019 at 12:58

manuel_b


5 Comments

PANKAJ KUMAR Over a year ago

Hey, that's awesome. Is there any way to look up attributes such as <id ="089153071", which is under the sub-node <author>? Also, I want to access other information specific to a node, like <name> only from the author/assignedEntity sub-node.

manuel_b Over a year ago

Hi, I've just updated the code with a simple example of how to search XML with XPath. I hope you'll find it useful.

PANKAJ KUMAR Over a year ago

Thanks Klaud. I tried modifying the XPath for more sub-nodes, but I was unable to retrieve the name for that specific sub-node (assignedOrganization). What I am trying to accomplish is output in this format: first row of output (618054084, Pharmacia and Upjohn Company LLC, ANALYSIS, 0049-4940), second row of output (829084552, Pfizer Pharmaceuticals LLC, PACK, 0049-4900). Thanks for helping me out.

manuel_b Over a year ago

Hi PANKAJ KUMAR, I tried to explain further the use of xPath and find/findall methods, providing an example. I hope it will help. Best regards!

PANKAJ KUMAR Over a year ago

Hey Klaud.. This is just wonderful. Thank you so much for all your efforts to help resolve this problem for me. Appreciate it. Thanks Buddy

iLuvLogix · Accepted Answer · 2019-02-19 11:52:57Z

4

You could use xmltodict to generate a Python dictionary from the requested XML data.

Here's a basic example:

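import urllib.request  # Python 3 replacement for urllib2
import xmltodict

def foobar(request):
    # `request` is not used in this example
    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    # The with-block closes the connection once the body has been read
    with urllib.request.urlopen(url) as file:
        data = file.read()

    # Convert the raw XML into nested dictionaries
    data = xmltodict.parse(data)
    return {'xmldata': data}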

edited Feb 19, 2019 at 11:52

answered Feb 19, 2019 at 11:23

iLuvLogix


vineethgn · Accepted Answer · 2019-02-19 12:13:06Z

I normally prefer BeautifulSoup with the lxml parser for parsing XML. Sample code below:

import requests
from bs4 import BeautifulSoup

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

document = requests.get(url)

soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find("title"))

Output

<title>These highlights do not include all the information needed to use ZOLOFT safely and effectively. See full prescribing information for ZOLOFT. <br/>
<br/>ZOLOFT (sertraline hydrochloride) tablets, for oral use <br/>ZOLOFT (sertraline hydrochloride) oral solution <br/>Initial U.S. Approval: 1991</title>

You can then use BeautifulSoup methods such as find and find_all to locate the corresponding node or sub-nodes.
