I am parsing an HTML file with nested unordered lists. Here is an example:
<ul>
<li class="category_x">xyz abc
<ul>
<li>foo 123 bar</li>
<li>456 bar foo</li>
</ul>
</li>
<li class="category_x">aaa bbb ccc
<ul>
<li>xxx yyy zzz</li>
<li>123 abc 456</li>
</ul>
</li>
</ul>
I am interested in the relationship li > ul > li (think of it as Jsoup objects of type Element: grandParentNode > parentNode > eNode). However, grandParentNode.text() also returns the text of the whole nested <ul> list (including eNode.text()).
// Select every <li> that sits in a li > ul > li chain
Elements triplets = doc.select("li > ul > li");
// Print each selected node together with its grandparent
for (Element eNode : triplets) {
    Element parentNode = eNode.parent();
    Element grandParentNode = parentNode.parent();
    System.out.println("Current node: " + eNode.text());
    System.out.println("Grand parent: " + grandParentNode.text());
}
The output is:
Current node: foo 123 bar
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: 456 bar foo
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456
Current node: 123 abc 456
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456
I would like it to be:
Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc
Looking at the Jsoup documentation, it seems I would need to modify the HTML so that those strings sit in something like a value="" attribute, but I cannot modify the HTML.
On top of this, <li class="category_x"> is repeated with the same class value on every node that is not a leaf <li> of the tree, so the class is not helpful for filtering the data.
I have already tried selectors such as doc.select("li:lt(1) > ul > li"); but that does not work. The problem is the structure of the HTML combined with how I am using the text() method of Jsoup's Element class, and I have no idea how to avoid text().
Any ideas?
Thanks
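
One possible direction, offered as a sketch rather than a definitive answer: what the desired output asks for is each element's "own" text, meaning only the text nodes that are its direct children, without anything from the nested <ul>. Jsoup's Element class documents an ownText() method for this, so replacing grandParentNode.text() with grandParentNode.ownText() in the loop above may be all that is needed. The block below illustrates the same idea with the JDK's built-in DOM parser, purely so it runs without extra dependencies. The class name OwnTextSketch, the helper names ownText and pairs, and the SAMPLE constant are made up for this illustration, and the sketch assumes the markup is well-formed XML, as the sample above is.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Jsoup.
class OwnTextSketch {

    // The sample list from the question, written as well-formed XML.
    static final String SAMPLE =
          "<ul>"
        + "<li class=\"category_x\">xyz abc"
        + "<ul><li>foo 123 bar</li><li>456 bar foo</li></ul>"
        + "</li>"
        + "<li class=\"category_x\">aaa bbb ccc"
        + "<ul><li>xxx yyy zzz</li><li>123 abc 456</li></ul>"
        + "</li>"
        + "</ul>";

    // Concatenates only the text nodes that are direct children of e,
    // so text inside nested elements (the inner <ul>) is left out.
    static String ownText(Element e) {
        StringBuilder sb = new StringBuilder();
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        // Collapse runs of whitespace and trim the ends.
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Returns {leaf text, grandparent own text} for every li > ul > li.
    static List<String[]> pairs(String xml) {
        List<String[]> out = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("li");
            for (int i = 0; i < items.getLength(); i++) {
                Element li = (Element) items.item(i);
                Node parent = li.getParentNode();
                Node grand = parent.getParentNode();
                // Keep only <li> elements whose parent is <ul> and whose
                // grandparent is another <li>; top-level items are skipped.
                if (grand instanceof Element
                        && "ul".equals(parent.getNodeName())
                        && "li".equals(grand.getNodeName())) {
                    out.add(new String[] { ownText(li), ownText((Element) grand) });
                }
            }
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse sample markup", ex);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] p : pairs(SAMPLE)) {
            System.out.println("Current node: " + p[0]);
            System.out.println("Grand parent: " + p[1]);
        }
    }
}
```

For real-world HTML that is not well-formed XML, the JDK parser used here would likely fail, which is why staying with Jsoup and its ownText() is probably the more practical route. The sketch is only meant to show which text nodes end up in the result.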