Python: Build the different paths/trees from an XML file

Question

Here is an example of an XML file:

<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative>
              </designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

I would like to store in a list all the different paths that contain text in my XML file. So I would like something like this:

['Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status', 'Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN', ...]

I tried a little code, but it does not work. I don't see how to handle the last elements of a branch separately, or how to rebuild the whole path from the beginning when I switch nodes in the middle (e.g. Envelope/Body/ADD_LandIndex_001/DATAAREA...).

import xml.etree.ElementTree as et
import os
import pandas as pd
from re import search

filename = 'file_try.xml'
element_tree = et.parse(filename)
root = element_tree.getroot()
namespace = "{http://schemas.xmlsoap.org/soap/envelope/}"


def remove_namespace(string,namespace) :
    
    if search(namespace, string) :
        new_string = string.replace(namespace,'')
    else : 
        new_string= string
    return new_string

dico = {}
title = root.tag
print(title)

for element in root.findall('.//') :
    #print(element)
    if len(list(element)) > 0 :
        #print('True ') 
        title= remove_namespace(title + '/' + element.tag, namespace)
        print(title+ '\n')

    else :
        
        title = root.tag

Can anyone help me?

Thank you

Jack Fleeting · Accepted Answer · 2020-10-16 00:50:20Z

1

You can adapt this to your actual code, but basically it should look like this:

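from lxml import etree

soap = """[your xml above]"""
# Encode to bytes: lxml rejects str input that carries an encoding declaration
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)
# '//text()' selects every text node in the document
for target in root.xpath('//text()'):
    if len(target.strip()) > 0:  # skip whitespace-only nodes
        # getpath() gives the XPath of the parent element; drop the namespace prefix
        print(tree.getpath(target.getparent()).replace('SOAP-ENV:', ''))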

Output:

/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate
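
If lxml is not available, the same idea can be sketched with the standard library's xml.etree.ElementTree, which the question already imports. This is only a minimal sketch, not part of the original answer: the helper names local_name and text_paths are made up for illustration, and SAMPLE is a shortened copy of the question's file. It carries the path down the recursion, so each branch starts again from the path of its parent, and it collects the results in a list without a leading slash, as the question asks.

```python
import xml.etree.ElementTree as ET

# Shortened sample (an assumption for illustration), same shape as the question's file.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def text_paths(element, prefix=""):
    """Return the slash-joined paths of all elements with non-blank text."""
    name = local_name(element.tag)
    path = f"{prefix}/{name}" if prefix else name
    paths = []
    if element.text and element.text.strip():
        paths.append(path)
    # Each child restarts from the path of its parent, never from a sibling.
    for child in element:
        paths.extend(text_paths(child, path))
    return paths


root = ET.fromstring(SAMPLE.encode("utf-8"))
paths = text_paths(root)
print(paths)
```

Unlike getpath(), this sketch does not add positional indexes such as [2], so repeated sibling elements would produce identical paths.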



6 Comments

Maikiii Over a year ago

Thank you @Jack Fleeting for your answer, it helps me a lot. I wish I had your skills! How do you load the data from an XML file in a directory into your variable 'soap'?

Maikiii Over a year ago

Sorry, I just found how to do it: with open(filename, 'r') as f: soap = f.read()
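
Reading the file into a string works. As a sketch of an alternative, the parser can also be given the path directly. The example below uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml's etree.parse accepts a file name in the same way. The temporary directory and the tiny file contents are assumptions made up for illustration; only the name 'file_try.xml' comes from the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the file on disk; in practice `filename` would
# point at the real XML file (e.g. 'file_try.xml' from the question).
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body><status>ADD</status></SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "file_try.xml")
    with open(filename, "wb") as f:
        f.write(xml_bytes)

    # parse() opens and reads the file itself, using the encoding declared
    # in the XML prolog, so no manual open()/read() is needed.
    tree = ET.parse(filename)
    root = tree.getroot()

print(root.tag)
```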

Maikiii Over a year ago

As you saw, I made a previous post on that. Now I would like another list that holds the text/content of these paths (elements). I can't find a way to do it with the lxml library: I tried .text() and text_content(), but I get an error. The goal is then to put that in a DataFrame to export to Excel. What is the function or the line to get the content? Code:

for target in root.xpath('//text()'):
    if len(target.strip())>0:
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        data = target.text()
        mylist_path.append(path)

Jack Fleeting Over a year ago

@Maikiii Glad it worked for you! As to the other thing, Stack Overflow policy says you should post it as a separate question.

Maikiii Over a year ago

@Jack, thank you for the information. I will post a new question.
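
For readers with the same follow-up question, here is a hedged sketch rather than an answer from the thread. In the lxml loop, each target returned by '//text()' is already a string, so it has no .text() method and target.strip() is the content itself. The example below shows the same pairing of path and text with the standard library's xml.etree.ElementTree; the function name iter_path_text and the small SAMPLE document are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Small made-up document (an assumption), using the namespace from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <BSR>
      <status>ADD</status>
      <NOUN>LandIndex</NOUN>
    </BSR>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def iter_path_text(element, prefix=""):
    """Yield (path, text) for every element whose text is not blank."""
    name = element.tag.rsplit("}", 1)[-1]  # strip '{namespace}' if present
    path = f"{prefix}/{name}"
    text = (element.text or "").strip()
    if text:
        yield path, text
    for child in element:
        yield from iter_path_text(child, path)


root = ET.fromstring(SAMPLE)
rows = list(iter_path_text(root))
for path, text in rows:
    print(path, "->", text)
```

A list of (path, text) tuples like rows can then be passed to pandas.DataFrame(rows, columns=["path", "text"]) for the export step.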
