I have an XML file, area.xml:

<area>
<controls>
    <internal>yes</internal>
</controls>
<schools>
    <school id="001"/>
    <time>2020-05-18T14:21:00Z</time>
    <venture index="5">
        <venture>
            <basicData type="class">
                <wage numberOfDollars="13" Correction="4.61">
                    <tax>70</tax>
                </wage>
            </basicData>
        </venture>
    </venture>
    <venture index="9">
        <venture>
            <basicData type="class">
                <wage numberOfDollars="13" Correction="5.61">
                    <tax>70</tax>
                </wage>
            </basicData>
        </venture>
    </venture>
    <school id="056"/>
    <time>2020-05-18T14:21:00Z</time>
    <venture index="5">
        <venture>
            <basicData type="class">
                <wage numberOfDollars="13">
                    <tax>70</tax>
                </wage>
            </basicData>
        </venture>
    </venture>
    <venture index="9">
        <venture>
            <basicData type="class">
                <wage numberOfDollars="13">
                    <tax>70</tax>
                </wage>
            </basicData>
        </venture>
    </venture>
</schools>

What I am trying to achieve with Python: each school node has multiple wage nodes (leaves). If one or more of those wage nodes has an attribute called Correction, I want the value of the school node's id attribute.

So the output of my script should be 001, because this school has a wage node with the Correction attribute.

First I tried it using ElementTree:

import xml.etree.ElementTree as ET
data_file = 'area.xml'
tree = ET.parse(data_file)
root = tree.getroot()


t1 = "school"
t2 = "wage"

for e1, e2 in zip(root.iter(t1), root.iter(t2)):
    if hasattr(e2,'Correction'):
        e2.Correction
        print (e1.attrib['id'])

But that didn't work. Now I am trying to reach my goal using minidom, but I find it quite hard.

This is my code so far:

from xml.dom import minidom

doc = minidom.parse("area.xml")

staffs = doc.getElementsByTagName("wage")
for wage in staffs:
        sid = wage.getAttribute("Correction")

        print("wage:%s" %
              (sid))

The output lists the value of the Correction attribute for every wage node, including the ones that don't have it:

wage:4.61
wage:5.61
wage:
wage:

That is obviously far from what I want.

I could use some help getting pointed in the right direction.

I am using Python 3.

Thank you in advance.

  • The end tag </area> is missing from my XML. Commented May 18, 2020 at 21:13

2 Answers

"in a school node there are multiple wage nodes"

Not really. The school elements are empty. The venture siblings have the wage descendants. Since wage is not a descendant of school, this makes it a little tricky to select the corresponding school.
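Because the relationship is positional rather than hierarchical, one standard-library option is to walk the children of schools in document order and remember the most recent school seen. The following is only a minimal sketch under some assumptions: the XML is well-formed (the closing </area> tag has been added), a trimmed inline string stands in for area.xml, and the function name schools_with_correction is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for area.xml, with the closing </area> tag added (assumption).
SAMPLE = """<area>
<schools>
    <school id="001"/>
    <venture index="5"><venture><basicData type="class">
        <wage numberOfDollars="13" Correction="4.61"><tax>70</tax></wage>
    </basicData></venture></venture>
    <school id="056"/>
    <venture index="5"><venture><basicData type="class">
        <wage numberOfDollars="13"><tax>70</tax></wage>
    </basicData></venture></venture>
</schools>
</area>"""


def schools_with_correction(root):
    """Return ids of schools followed by a wage that has a Correction attribute."""
    found = []
    for schools in root.iter("schools"):
        current_id = None
        for child in schools:  # direct children, in document order
            if child.tag == "school":
                # Every sibling after this point belongs to this school.
                current_id = child.get("id")
            elif current_id is not None and current_id not in found:
                # Search this sibling and its descendants for a corrected wage.
                if any("Correction" in w.attrib for w in child.iter("wage")):
                    found.append(current_id)
    return found


print(schools_with_correction(ET.fromstring(SAMPLE)))
```

With the real file, ET.parse("area.xml").getroot() could be passed in place of ET.fromstring(SAMPLE). The check uses "Correction" in w.attrib because XML attributes live in the element's attrib mapping and are not Python attributes of the element object.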

If you can use lxml you could use XPath to select the wage elements that have a Correction attribute and then select the first preceding school element and get its id attribute...

from lxml import etree

tree = etree.parse("area.xml")

schools_with_corrected_wages = set()

for corrected_wage in tree.xpath(".//wage[@Correction]"):
    schools_with_corrected_wages.add(corrected_wage.xpath("preceding::school[1]/@id")[0])

print(schools_with_corrected_wages)

This prints:

{'001'}

You could also use lxml to process the XML with XSLT...

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:key name="corrected_wage_by_school" match="wage[@Correction]" use="preceding::school[1]/@id"/>

  <xsl:template match="/">
    <xsl:for-each select="//school[key('corrected_wage_by_school',@id)]">
      <xsl:value-of select="concat(@id,'&#xA;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("area.xml")        
xslt = etree.parse("test.xsl")
result = tree.xslt(xslt)

print(result)

This prints...

001

1 Comment

I've studied your solution; I understand it and find it very useful. Thank you.

Here's a less clever way.

from simplified_scrapy import SimplifiedDoc, req, utils
html = utils.getFileContent("area.xml")
doc = SimplifiedDoc(html)
schools = doc.selects('school') # Get all schools
n = len(schools)
i = 0
while i < n - 1:
    school = schools[i]
    school1 = schools[i + 1]
    h = doc.html[school._end:school1._start] # Get data between two schools
    staffs = doc.getElementsByReg(' Correction="', tag='wage', html=h)
    if staffs:
        print(school.id, staffs.Correction)
    i += 1

last = schools[n - 1]
h = doc.html[last._end:]
staffs = doc.getElementsByReg(' Correction="', tag='wage', html=h)
if staffs:
    print(last.id, staffs.Correction)

Result:

001 ['4.61', '5.61']
