Here is an example of an XML file:

<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative>
              </designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

I would like to store in a list all the different paths in my XML file that lead to an element containing text. So I would like something like this:

['Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status', 'Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN', ...]

I tried a small piece of code, but it does not work. I don't see how to handle the last elements of a branch separately, or how to rebuild the whole path from the root when I switch to another node in the middle of the tree (e.g. Envelope/Body/ADD_LandIndex_001/DATAAREA...).

import xml.etree.ElementTree as et
import os
import pandas as pd
from re import search

filename = 'file_try.xml'
element_tree = et.parse(filename)
root = element_tree.getroot()
namespace = "{http://schemas.xmlsoap.org/soap/envelope/}"


def remove_namespace(string, namespace):
    if search(namespace, string):
        new_string = string.replace(namespace, '')
    else:
        new_string = string
    return new_string

dico = {}
title = root.tag
print(title)

for element in root.findall('.//'):
    #print(element)
    if len(list(element)) > 0:
        #print('True ')
        title = remove_namespace(title + '/' + element.tag, namespace)
        print(title + '\n')
    else:
        title = root.tag

Can anyone help me?

Thank you

1 Answer

You can adapt this to your actual code, but basically it should look like this:

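from lxml import etree

soap = """[your xml above]"""
# Encode to bytes: lxml rejects a str that carries an encoding declaration
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)

# Visit every text node and skip the whitespace-only ones
for target in root.xpath('//text()'):
    if target.strip():
        # getpath() gives the XPath of the parent element; drop the namespace prefix
        print(tree.getpath(target.getparent()).replace('SOAP-ENV:', ''))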

Output:

/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate
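
Since the question starts from xml.etree.ElementTree and asks for a list rather than printed output, the same idea can also be sketched with the standard library only. This is a minimal sketch, not the method used above: the helper names `local_name` and `text_paths` and the shortened `SAMPLE` document are illustrative assumptions. The paths have no leading slash, as in the list the question asks for.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the XML file in the question
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <CNTROLAREA>
      <BSR>
        <status>ADD</status>
        <NOUN>LandIndex</NOUN>
      </BSR>
    </CNTROLAREA>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.split('}', 1)[-1]


def text_paths(element, prefix=''):
    """Return the paths of all elements at or below `element` with non-blank text."""
    name = local_name(element.tag)
    path = f"{prefix}/{name}" if prefix else name
    paths = []
    if element.text and element.text.strip():
        paths.append(path)
    # Recurse with the current path as prefix, so each branch is rebuilt from the root
    for child in element:
        paths.extend(text_paths(child, path))
    return paths


root = ET.fromstring(SAMPLE.encode())
paths = text_paths(root)
print(paths)
```

Passing the parent's path down the recursion avoids the problem described in the question, where a single running `title` variable has to be reset whenever the loop moves to a different branch.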

6 Comments

Thank you @Jack Fleeting for your answer, it helps me a lot. I wish I had your skills! How do you load the data from an XML file in a directory into your variable 'soap'?
Sorry, I just found how to do it: with open(filename, 'r') as f: soap = f.read()
As you saw, I made a previous post on this. I would now like another list that holds the text content of these paths (elements). I can't find a way to do it with the lxml library: I tried .text() and text_content(), but I get an error. The goal is then to put that in a dataframe to export to Excel. What is the function or the line to get the content? Code: for target in root.xpath('//text()'): if len(target.strip())>0: path = tree.getpath(target.getparent()).replace('SOAP-ENV:','') data = target.text() mylist_path.append(path)
@Maikiii Glad it worked for you! As to the other thing, Stack Overflow policy says you should post it as a separate question.
@Jack, thank you for the information, I will post a new question.
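
On the follow-up raised in the comments: each item returned by `root.xpath('//text()')` in lxml is already a string (a `str` subclass with a `getparent()` method), so it has no `.text()` method and can be used directly. A minimal sketch under that observation, collecting (path, text) pairs. The shortened `SAMPLE` document and the name `rows` are illustrative assumptions, not part of the original thread.

```python
from lxml import etree

# Shortened, hypothetical stand-in for the XML file in the question
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <BSR>
      <status>ADD</status>
      <REVISION>001</REVISION>
    </BSR>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# For a file on disk, etree.parse(filename).getroot() would replace this line
root = etree.fromstring(SAMPLE)
tree = etree.ElementTree(root)

rows = []
for target in root.xpath('//text()'):
    value = target.strip()  # the text node itself is the content
    if value:
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:', '')
        rows.append((path, value))

print(rows)
```

A list of pairs like this can be handed to `pandas.DataFrame(rows, columns=['path', 'text'])` if a dataframe is the end goal.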