1

I want to list the path of every element in an XML document relative to the root. For example:

<A>
   <B>
     <C>Name</C>
     <D>Name</D>
   </B>
</A>

I want to list them as:

A/B/C
A/B/D

I can parse the XML using Python's "Element" object, but I am not able to extract the XPath from it. Any help?

2
  • Does that mean A/B should also be listed? Commented Apr 27, 2018 at 6:52
  • No, only the absolute paths from the root. Commented Apr 27, 2018 at 7:05

3 Answers

1

ElementTree elements do not keep a reference to their parent, so one can build a parent map of the parsed tree and then use it to construct the needed XPath:

import xml.etree.ElementTree as parser

def get_parent_map(root):
    # Map every child element to its parent.
    return {c: p for p in root.iter() for c in p}

def extract_text_info(root, original_root, parent_map=None):
    # Build the parent map once and reuse it in the recursive calls.
    if parent_map is None:
        parent_map = get_parent_map(original_root)

    for child in root:
        if child.text is not None and len(child.text.strip()) > 0:
            # Walk up from the element to the root, collecting tags.
            c = child
            arr = []
            while c is not original_root:
                arr.append(c.tag)
                c = parent_map[c]
            arr.append(original_root.tag)

            print('/'.join(arr[::-1]))
            print(child.text)

        extract_text_info(child, original_root, parent_map)

Then we have

xml = """<A>
       <B>
         <C>Name</C>
         <D>Name</D>
       </B>
     </A> """

root = parser.fromstring(xml)
extract_text_info(root, root)

> A/B/C
> Name
> A/B/D
> Name

1

One way I figured out is to build the path recursively in code:

import xml.etree.ElementTree as ET


def parseXML(root, sm):
    # Append this element's tag, dropping any '{namespace}' prefix.
    sm = sm + "/" + root.tag[root.tag.rfind('}')+1:]
    for child in root:
        parseXML(child, sm)
    # An element with no children is a leaf: print its full path.
    if len(root) == 0:
        print(sm)

tree = ET.parse('test.xml')
root = tree.getroot()
parseXML(root, "")

I don't know whether there is a built-in function for this.

2 Comments

Are you able to use the lxml library? My XPath is rusty and I'm sure there's a better way, but you could try something like: ['/'.join(a.tag for a in el.xpath('.//ancestor::*')) for el in tree.xpath('//*[not(child::*)]')]. That is, find all the leaf nodes (those with no children), query each one again for its complete list of ancestors, then join the node names. That'll give you ['A/B/C', 'A/B/D'] on your sample data.
Is there a way to get the value within the element as well while iterating? This code definitely helped me iterate over all the elements and get their paths, though. Beautiful.
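
On the question in the last comment, here is a minimal sketch of one way to get each leaf's text together with its path. It assumes the standard-library ElementTree and the sample XML from the question. The helper name iter_leaf_paths is hypothetical and does not come from the answer above. Unlike parseXML, it yields values instead of printing them and omits the leading slash.

```python
import xml.etree.ElementTree as ET

def iter_leaf_paths(element, prefix=""):
    """Yield (path, text) for every leaf element at or below `element`."""
    # Drop a '{namespace}' prefix, if present, as parseXML does.
    tag = element.tag[element.tag.rfind('}') + 1:]
    path = f"{prefix}/{tag}" if prefix else tag
    children = list(element)
    if not children:
        # A leaf: report its path and its stripped text ('' if empty).
        yield path, (element.text or "").strip()
    for child in children:
        yield from iter_leaf_paths(child, path)

xml = """<A>
   <B>
     <C>Name</C>
     <D>Name</D>
   </B>
</A>"""

root = ET.fromstring(xml)
for path, text in iter_leaf_paths(root):
    print(path, text)
```

Because it is a generator, the pairs can also be collected with list(iter_leaf_paths(root)) or passed to dict() when the paths are unique.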
0

sample.html

<A>
   <B>
     <C>Name1</C>
     <D>Name2</D>
   </B>
</A>

parse.py

from bs4 import BeautifulSoup

def get_root_elements(path_to_file):
    # Close the file once it has been parsed.
    with open(path_to_file) as f:
        soup = BeautifulSoup(f, 'lxml')
    all_elements = soup.find_all()

    count_element_indices = [len(list(a.parents)) for a in all_elements]

    absolute_roots_index = min(
        (index for index, element in enumerate(count_element_indices)
            if element == max(count_element_indices)
        )
    )

    return all_elements[absolute_roots_index:]

def get_path(element):
    to_remove = ['[document]', 'body', 'html']
    path = [element.name] + [e.name for e in element.parents if e.name not in to_remove]

    return ' / '.join(path[::-1])

Python Shell

In [1]: file = 'path/to/sample.html'

In [2]: run parse.py

In [3]: roots = get_root_elements(file)

In [4]: print(roots)
[<c>Name1</c>, <d>Name2</d>]

In [5]: for root in roots:
   ...:    print(get_path(root))
a / b / c
a / b / d
