I am parsing an HTML file with nested unordered lists, here is an example:

<ul>
    <li class="category_x">xyz abc
        <ul>
            <li>foo 123 bar</li>
            <li>456 bar foo</li>
        </ul>
    </li>
    <li class="category_x">aaa bbb ccc
        <ul>
            <li>xxx yyy zzz</li>
            <li>123 abc 456</li>
        </ul>
    </li>
</ul>

I am interested in the relationship li > ul > li (think of it as Jsoup objects of type Element: grandParentNode > parentNode > eNode), but grandParentNode.text() also returns the text of the whole nested <ul> list, including eNode.text().

    // getting the triplets
    Elements triplets = doc.select("li > ul > li");

    // print the triplet
    for (Element eNode : triplets)
    {
        Element parentNode = eNode.parent();
        Element grandParentNode = parentNode.parent();

        System.out.println("Current node: " + eNode.text());
        System.out.println("Grand parent: " + grandParentNode.text());
    }

The output is:

Current node: foo 123 bar
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: 456 bar foo
Grand parent: xyz abc foo 123 bar 456 bar foo
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456
Current node: 123 abc 456
Grand parent: aaa bbb ccc xxx yyy zzz 123 abc 456

I would like it to be:

Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc

From the Jsoup documentation, it seems I would need to modify the HTML so that those strings sit in something like a value="" attribute, but I cannot modify the HTML. On top of this, <li class="category_x"> is repeated with the same value on every node that is not a leaf li of the tree, so the class is not really helpful for filtering data.

I have already tried selectors like doc.select("li:lt(1) > ul > li"); but that does not work. The problem is the structure of the HTML combined with how I am using the text() method of Jsoup's Element class, and I have no idea how to avoid text().

Any idea?

Thanks

1 Answer

Use the ownText() method to select only the text owned directly by an Element, ignoring the text of any child Elements.

So change this line:

System.out.println("Grand parent: " + grandParentNode.text());

to

System.out.println("Grand parent: " + grandParentNode.ownText());

The output will now show:

Current node: foo 123 bar
Grand parent: xyz abc
Current node: 456 bar foo
Grand parent: xyz abc
Current node: xxx yyy zzz
Grand parent: aaa bbb ccc
Current node: 123 abc 456
Grand parent: aaa bbb ccc
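
For completeness, here is a minimal self-contained sketch of the same change applied to the HTML from the question. The class name OwnTextExample and the helper pairs() are hypothetical names introduced only for illustration; the Jsoup calls used (Jsoup.parse, select, parent, text, ownText) are the same ones that appear in the question and in this answer. It assumes the Jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class OwnTextExample {

    // Hypothetical helper: for every li > ul > li match, pair the leaf's text
    // with the text owned directly by its grandparent <li>.
    static List<String> pairs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element eNode : doc.select("li > ul > li")) {
            Element parentNode = eNode.parent();            // the nested <ul>
            Element grandParentNode = parentNode.parent();  // the outer <li>
            // ownText() skips the text of child elements, unlike text()
            out.add(eNode.text() + " -> " + grandParentNode.ownText());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same structure as the HTML in the question
        String html =
              "<ul>"
            + "  <li class=\"category_x\">xyz abc"
            + "    <ul><li>foo 123 bar</li><li>456 bar foo</li></ul>"
            + "  </li>"
            + "  <li class=\"category_x\">aaa bbb ccc"
            + "    <ul><li>xxx yyy zzz</li><li>123 abc 456</li></ul>"
            + "  </li>"
            + "</ul>";

        for (String line : pairs(html)) {
            System.out.println(line);
        }
    }
}
```

Returning the pairs as a list instead of printing them inside the loop is only a design choice to make the result easy to check; the loop body is otherwise the one from the question with text() swapped for ownText() on the grandparent.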