0

I'm trying to build a script to read an xml file. This is my first time parsing an xml and i'm doing it using python with xml.etree.ElementTree. The section of the file that i would like to process looks like:

    <component>
        <section>
                <id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
                <code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
                <title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
                <text>
                        <paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
                        <paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
                </text>
                <effectiveTime value="20051214" />
        </section>
</component>    
<component>
        <section>
               <id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
               <code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
               <title mediaType="text/x-hl7-title+xml">ACTION</title>
               <text>
                        <paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
                </text>
                <effectiveTime value="20051214" />
                </section>
</component>

The full file can be downloaded from:

https://dailymed.nlm.nih.gov/dailymed/getFile.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535&type=zip&name=Renese

Here my code:

import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

tree = ElementTree.fromstring(xmlstring)

for title in tree.iter('title'):
     print(title.text)

So far I'm able to print the titles but I would like to print also the corresponding text that is captured in the tags.

I have tried this:

for title in tree.iter('title'):
     print(title.text)
     for paragraph in title.iter('paragraph'):
         print(paragraph.text)

But I have no output from the paragraph.text

Doing

for title in tree.iter('title'):
         print(title.text)
         for paragraph in tree.iter('paragraph'):
             print(paragraph.text)

I print the text of the paragraphs but (obviously) it is printed all together for each title found in the xml structure.

I would like to find a way to 1) identify the title; 2) print the corresponding paragraph(s). How can I do it?

1 Answer 1

1

If you are willing to use lxml, then the following is a solution which uses XPath:

import re
from lxml.etree import fromstring


with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

doc = fromstring(xmlstring.encode())  # lxml only accepts bytes input, hence we encode

for title in doc.xpath('//title'):  # for all title nodes
     title_text = title.xpath('./text()')  # get text value of the node

     # get all text values of the paragraph nodes that appear lower (//paragraph)
     # in the hierarchy than the parent (..) of <title>
     paragraphs_for_title = title.xpath('..//paragraph/text()')

     print(title_text[0] if title_text else '') 
     for paragraph in paragraphs_for_title: 
         print(paragraph)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.