
I want to access the information in a subnode of this XML file, but the code below finds nothing. Is this because of the structure of the file?

I tried extracting the author subnode into a separate file and running the Python code on it, and that works fine.

import urllib
import xml.etree.ElementTree as ET

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

print 'Retrieving', url

document = urllib.urlopen (url).read()
print 'Retrieved', len(document), 'characters.'

print document[:50]

tree = ET.fromstring(document)

lst = tree.findall('title')
print lst[:100]
  • Any luck yet with the provided answers? Commented Feb 19, 2019 at 12:41

3 Answers


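You couldn't find the title elements because of the namespace: the document declares the default namespace urn:hl7-org:v3, so every tag name in a search must be qualified with it.

Below is sample code that finds: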

  • Title from "document" tag
  • Title from inner "component" tag
    import xml.etree.ElementTree as ET
    import urllib.request

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    response = urllib.request.urlopen(url).read()
    tree = ET.fromstring(response)


    for docTitle in tree.findall('{urn:hl7-org:v3}title'):
        print(docTitle.text)

    for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
        print(compTitle.text)
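
As an alternative to writing the full `{urn:hl7-org:v3}` prefix in every path, `findall` also accepts a prefix-to-URI mapping. The following is a minimal sketch of that form; the inline `SAMPLE` document and the `hl7` prefix are assumptions made up for illustration, not part of the DailyMed file.

```python
import xml.etree.ElementTree as ET

# Made-up sample that only mimics the namespace layout of the real document.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <title>Document title</title>
  <component>
    <section>
      <title>Section title</title>
    </section>
  </component>
</document>"""

# The prefix name is arbitrary; only the URI has to match the document.
NS = {"hl7": "urn:hl7-org:v3"}

root = ET.fromstring(SAMPLE)

# An unqualified tag name matches nothing, as in the question.
unqualified = root.findall("title")

# Direct children of the root only.
direct_titles = [t.text for t in root.findall("hl7:title", NS)]

# Every title at any depth below the root.
all_titles = [t.text for t in root.findall(".//hl7:title", NS)]

print(len(unqualified))
print(direct_titles)
print(all_titles)
```

The mapping keeps long paths readable, which matters once the paths get as deep as the author lookups in this thread.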

UPDATE

If you need to search XML nodes, you should use XPath expressions (ElementTree supports a limited subset of XPath).

Example:

NS = '{urn:hl7-org:v3}'
ID = '829076996'    # ID TO BE FOUND

# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "id[@extension='", ID,
    "']/../../.."
    ])

# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
    "./",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "name"
    ])

# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
    name = author.find(xPathAuthorName)
    print(name.text)

This example prints the author name for the ID 829076996.
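
Note that `find` returns `None` when a path does not match, in which case `name.text` raises an `AttributeError`. Below is a minimal sketch of a more defensive lookup. It assumes a made-up inline sample instead of the DailyMed file, and `author_name_by_id` is a hypothetical helper name, not something from the answer above.

```python
import xml.etree.ElementTree as ET

NS = {"h": "urn:hl7-org:v3"}

# Made-up sample data; the IDs and names are placeholders.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="111111111"/>
        <name>Example Labs</name>
      </representedOrganization>
    </assignedEntity>
  </author>
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="222222222"/>
        <name>Other Org</name>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

ORG_PATH = ".//h:author/h:assignedEntity/h:representedOrganization"


def author_name_by_id(root, org_id):
    """Return the organization name for org_id, or None if there is no match."""
    for org in root.iterfind(ORG_PATH, NS):
        id_elem = org.find("h:id", NS)
        # Skip organizations without an <id> instead of failing on None.
        if id_elem is not None and id_elem.get("extension") == org_id:
            return org.findtext("h:name", default=None, namespaces=NS)
    return None


root = ET.fromstring(SAMPLE)
print(author_name_by_id(root, "222222222"))
print(author_name_by_id(root, "999999999"))
```

Comparing the attribute in Python rather than inside the path also avoids building the expression by string concatenation.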

UPDATE 2

You can process all assignedEntity tags with findall. Each of them can contain multiple performance tags (one per activity and product), so a second findall is needed (see the example below).

xPathAssignedEntities = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "assignedEntity/", 
    NS, "assignedOrganization/", 
    NS, "assignedEntity"
    ])

xPathProdCode = ''.join([
    NS, "actDefinition/",
    NS, "product/",
    NS, "manufacturedProduct/",
    NS, "manufacturedMaterialKind/",
    NS, "code"
    ])


# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):

    # GET ID AND NAME OF assignedEntity (org_id AVOIDS SHADOWING THE BUILT-IN id)
    org_id = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
    name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text

    # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
    for performance in assignedEntity.findall(NS + 'performance'):
        actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
        prodCode = performance.find(xPathProdCode).get('code')
        print(org_id, '\t', name, '\t', actCode, '\t', prodCode)

This is the result:

829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-0050 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4900 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4910 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4940 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4960 
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-0050
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4900
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4910
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4940
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4960
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4900 
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4910 
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4960 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4900 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4910 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4960 
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-0050
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-4940
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4900 
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4910 
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4960
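
If the rows are needed as data rather than printed text, they can be collected into tuples and written out as CSV. This is only a sketch under stated assumptions: the inline `SAMPLE` is made up to mimic the nesting used above, and `collect_rows` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"h": "urn:hl7-org:v3"}

# Made-up sample that mimics the nesting of the real file; values are placeholders.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
 <author><assignedEntity><representedOrganization>
  <assignedEntity><assignedOrganization>
   <assignedEntity>
    <assignedOrganization>
     <id extension="000000001"/><name>Example Org</name>
    </assignedOrganization>
    <performance><actDefinition>
     <code displayName="PACK"/>
     <product><manufacturedProduct><manufacturedMaterialKind>
      <code code="0000-0001"/>
     </manufacturedMaterialKind></manufacturedProduct></product>
    </actDefinition></performance>
    <performance><actDefinition>
     <code displayName="ANALYSIS"/>
     <product><manufacturedProduct><manufacturedMaterialKind>
      <code code="0000-0002"/>
     </manufacturedMaterialKind></manufacturedProduct></product>
    </actDefinition></performance>
   </assignedEntity>
  </assignedOrganization></assignedEntity>
 </representedOrganization></assignedEntity></author>
</document>"""

ENTITY_PATH = (".//h:author/h:assignedEntity/h:representedOrganization"
               "/h:assignedEntity/h:assignedOrganization/h:assignedEntity")
PRODUCT_CODE_PATH = ("h:actDefinition/h:product/h:manufacturedProduct"
                     "/h:manufacturedMaterialKind/h:code")


def attr(elem, name):
    """Return an attribute value, or '' when the element or attribute is missing."""
    return elem.get(name, "") if elem is not None else ""


def collect_rows(root):
    """Build one (id, name, activity, product code) tuple per <performance>."""
    rows = []
    for entity in root.iterfind(ENTITY_PATH, NS):
        org_id = attr(entity.find("h:assignedOrganization/h:id", NS), "extension")
        org_name = entity.findtext("h:assignedOrganization/h:name",
                                   default="", namespaces=NS)
        for performance in entity.iterfind("h:performance", NS):
            act = attr(performance.find("h:actDefinition/h:code", NS), "displayName")
            prod = attr(performance.find(PRODUCT_CODE_PATH, NS), "code")
            rows.append((org_id, org_name, act, prod))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "name", "act", "product_code"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = collect_rows(ET.fromstring(SAMPLE))
print(rows_to_csv(rows))
```

Like the code above, this reads only the first product code inside each performance tag.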

5 Comments

Hey, that's awesome. Is there any way to look for attributes such as <id ="089153071", which is under the sub node <author>? Also, I want to access other information specific to a node, like <name> only from the author/assignedEntity subnode.
Hi, I've just updated the code with a simple example of how to search XML with xPath. I hope you'll find it useful.
Thanks Klaud. I tried modifying the xPath for more sub nodes, but I was unable to retrieve the name for that specific sub node (assignedOrganization). What I am trying to accomplish is an output in this format: First row of output (618054084, Pharmacia and Upjohn Company LLC, ANALYSIS, 0049-4940)..... Second row of output (829084552, Pfizer Pharmaceuticals LLC, PACK, 0049-4900). Thanks for helping me out.
Hi PANKAJ KUMAR, I tried to explain further the use of xPath and find/findall methods, providing an example. I hope it will help. Best regards!
Hey Klaud.. This is just wonderful. Thank you so much for all your efforts to help resolve this problem for me. Appreciate it. Thanks Buddy

You could use xmltodict to generate a Python dictionary from the requested XML data.

Here's a basic example:

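import urllib.request  # urllib2 exists only in Python 2
import xmltodict

def foobar(request):
    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

    # Download the raw XML; the with block closes the connection
    with urllib.request.urlopen(url) as response:
        data = response.read()

    # Convert the XML into nested dictionaries
    data = xmltodict.parse(data)
    return {'xmldata': data}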
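
Once parsed, the data is navigated with ordinary dictionary lookups. The sketch below assumes a made-up inline sample in place of the downloaded file. It relies on two xmltodict conventions: attributes are stored under keys prefixed with `@`, and a tag that occurs only once becomes a dict rather than a list unless it is named in `force_list`.

```python
import xmltodict

# Made-up sample; the real file is much larger and nests more deeply.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <title>Example title</title>
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="111111111"/>
        <name>Example Labs</name>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

# force_list makes "author" a list even when only one <author> is present,
# so the loop below works for one or many authors.
data = xmltodict.parse(SAMPLE, force_list=("author",))

title = data["document"]["title"]

orgs = [
    author["assignedEntity"]["representedOrganization"]
    for author in data["document"]["author"]
]
ids = [org["id"]["@extension"] for org in orgs]    # attributes use the "@" prefix
names = [org["name"] for org in orgs]

print(title, ids, names)
```

No namespace qualification is needed here because xmltodict keeps the tag names as written unless `process_namespaces=True` is passed.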



I normally prefer BeautifulSoup with the lxml parser for parsing XML. Sample code is below:

import requests
from bs4 import BeautifulSoup

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

document = requests.get(url)

soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find("title"))

Output

<title>These highlights do not include all the information needed to use ZOLOFT safely and effectively. See full prescribing information for ZOLOFT. <br/>
<br/>ZOLOFT (sertraline hydrochloride) tablets, for oral use <br/>ZOLOFT (sertraline hydrochloride) oral solution <br/>Initial U.S. Approval: 1991</title>

You can then use the methods provided by BeautifulSoup, such as find and find_all, to locate the corresponding node or subnodes.
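
As a rough illustration of those methods, the sketch below reads an attribute and a child element from an author subnode. It assumes a made-up inline sample instead of the downloaded document, and it needs both the `beautifulsoup4` and `lxml` packages installed.

```python
from bs4 import BeautifulSoup

# Made-up sample; the ID and name are placeholders.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <author>
    <assignedEntity>
      <representedOrganization>
        <id extension="111111111"/>
        <name>Example Labs</name>
      </representedOrganization>
    </assignedEntity>
  </author>
</document>"""

soup = BeautifulSoup(SAMPLE, "lxml-xml")

# find returns the first match; searches can be chained from any tag.
org = soup.find("representedOrganization")
org_id = org.find("id").get("extension")

# Use find("name") here: org.name is the tag's own name, not the <name> child.
org_name = org.find("name").get_text()

# find_all returns every match in document order.
all_names = [tag.get_text() for tag in soup.find_all("name")]

print(org_id, org_name, all_names)
```

Tags are matched by their local names here, so the default namespace does not have to be spelled out.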
