parse nested html lists using lxml in python

Question

I am trying to parse the elements of an html list which looks like this:

<ol>
    <li>r1</li>
    <li>r2
        <ul>
            <li>n1</li>
            <li>n2</li>
        </ul>
    </li>
    <li>r3
        <ul>
            <li>d1
                <ol>
                    <li>e1</li>
                    <li>e2</li>
                </ol>
            </li>
            <li>d2</li>
        </ul>
    </li>
    <li>r4</li>
</ol>

I am fine with parsing this for the most part, but the biggest problem for me is in getting the dom text back. Unfortunately lxml's node.text_content() returns the text form of the complete tree under it. Can I obtain the text content of just that element using lxml, or would I need to use string manipulation or regex for that?

For eg: the node with d1 returns "d1e1e2", whereas, I want it to return just d1.

Don Question · Accepted Answer · 2012-11-08 00:57:07Z

2

Each node has an attribute called text. That's what you are looking for.

e.g.:

for node in root.iter("*"):
    print node.text
    # print node.tail # e.g.: <div> <span> abc </span> def </div> => abc def

answered Nov 8, 2012 at 0:57

Don Question

11.7k5 gold badges39 silver badges56 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

parse nested html lists using lxml in python

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related