I am trying to parse the elements of an HTML list that looks like this:
<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>
I can parse most of this without trouble, but my main problem is getting the text of an individual element back out of the DOM. lxml's node.text_content() returns the text of the entire subtree under the node. Can I obtain the text content of just that element using lxml, or would I need string manipulation or a regex for that?
For example, the node with d1 returns "d1e1e2", whereas I want it to return just "d1".
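To make the difference concrete, here is a minimal sketch of one possible approach, assuming the list is parsed with lxml.html.fromstring. It relies on the XPath expression text(), which selects only the text nodes that are direct children of an element, and on node.text, which holds the text before the first child element. The helper name own_text is hypothetical and not part of lxml.

```python
from lxml import html

# The list from the question, parsed as an HTML fragment.
doc = html.fromstring("""<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>""")


def own_text(node):
    """Return only the text directly inside node, ignoring descendants.

    Hypothetical helper: text() matches the node's direct text children,
    including text that follows a nested child element.
    """
    return "".join(node.xpath("text()")).strip()


# Collect the direct text of every <li>, in document order.
labels = [own_text(li) for li in doc.iter("li")]

# Pick out the <li> whose own text is "d1" and compare the two approaches.
d1 = next(li for li in doc.iter("li") if own_text(li) == "d1")
whole_subtree = "".join(d1.text_content().split())  # whitespace removed
print(own_text(d1))
print(whole_subtree)
```

If only the text before the first nested element matters, d1.text.strip() would be enough here; the text() form also picks up any text that appears after a nested list inside the same element.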