Parsing nested HTML unordered lists with Jsoup

Question

I am parsing an HTML file with nested unordered lists. Here is an example:

<ul>
    <li class="category_x">xyz abc
        <ul>
            <li>foo 123 bar</li>
            <li>456 bar foo</li>
        </ul>
    </li>
    <li class="category_x">aaa bbb ccc
        <ul>
            <li>xxx yyy zzz</li>
            <li>123 abc 456</li>
        </ul>
    </li>
</ul>

I am interested in the relationship li > ul > li (think of it as Jsoup objects of type Element: grandParentNode > parentNode > eNode), but grandParentNode.text() also returns the text of the whole nested <ul> list (including eNode.text()).

    // getting the triplets
    Elements triplets = doc.select("li > ul > li");

    // print the triplet
    for (Element eNode : triplets)
    {
        Element parentNode = eNode.parent();
        Element grandParentNode = parentNode.parent();

        System.out.println("Current node: " + eNode.text());
        System.out.println("Grand parent: " + grandParentNode.text());
    }

The output is:

Current node: foo 123 bar
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: 456 bar foo
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456
Current node: 123 abc 456
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456

I would like it to be:

Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc

From the Jsoup documentation, it seems I would need to modify the HTML so that those strings end up in something like a value="" attribute, but I cannot modify the HTML. On top of this, every <li> that is not a leaf of the tree carries the same class="category_x", so the class is no help in filtering the data.

I have already tried selectors such as doc.select("li:lt(1) > ul > li"); but they do not work. The problem lies in the structure of the HTML and in how I am using the text() method of Jsoup's Element class, and I have no idea how to avoid text().

Any ideas?

Thanks

ashatte · Accepted Answer · 2014-01-03 22:24:29Z

Use the ownText() method to select only the text owned directly by an Element, ignoring the text of any child Elements.

So change this line:

    System.out.println("Grand parent: " + grandParentNode.text());

to

    System.out.println("Grand parent: " + grandParentNode.ownText());

The output will now show:

Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc
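
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It assumes Jsoup is on the classpath. The class name OwnTextDemo, the helper pairs(), and the " | " separator are made up for illustration, and the question's HTML is inlined as a string without the original indentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class OwnTextDemo {

    // The markup from the question, inlined so the example needs no file.
    static final String HTML =
          "<ul>"
        + "<li class=\"category_x\">xyz abc"
        +   "<ul><li>foo 123 bar</li><li>456 bar foo</li></ul>"
        + "</li>"
        + "<li class=\"category_x\">aaa bbb ccc"
        +   "<ul><li>xxx yyy zzz</li><li>123 abc 456</li></ul>"
        + "</li>"
        + "</ul>";

    // Hypothetical helper: returns one "leaf text | grandparent own text"
    // string for every <li> matched by the selector from the question.
    static List<String> pairs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element eNode : doc.select("li > ul > li")) {
            Element parentNode = eNode.parent();            // the nested <ul>
            Element grandParentNode = parentNode.parent();  // the outer <li>
            // ownText() covers only the text nodes that are direct children
            // of the outer <li>, so the nested list items are left out.
            result.add(eNode.text() + " | " + grandParentNode.ownText());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : pairs(HTML)) {
            System.out.println(line);
        }
    }
}
```

Swapping ownText() back to text() in pairs() reproduces the behaviour from the question, where the text of the nested items is appended to the grandparent's text.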
