
I have HTML lists with the exact same structure that I need to parse using JSoup (my language is Java). Here's an example:

<div class="ulist">
  <ul>
    <li><p>Healthy Food</p></li>
    <div class="ulist">
      <ul>
        <li><p>Vegetables</p>
        <div class="ulist">
          <ul>
            <li> <p>Carrots</p> </li>
            <li> <p>Lettuce</p> </li>
            <li> <p>Cucumbers</p> </li>
          </ul>
        </div> </li>
        <li> <p>Fruits</p>
          <div class="ulist">
            <ul>
              <li> <p>Apples</p> </li>
              <li> <p>Bananas</p> </li>
              <li> <p>Canned Fruits</p></li>
              <div class="ulist">
                <ul>
                  <li> <p>Peaches</p> </li>
                  <li> <p>Pears</p> </li>
                </ul>
              </div>
            </ul>
          </div>
        </li>
      </ul>
    </div>
  </ul>
</div>

Since this data is basically just a Tree data structure, I want to be able to parse it and create a Tree from the data. I'm having difficulty doing this with JSoup, as it appears you can't really traverse the DOM as expected.

For example, code like the following:

Elements elList = doc.select("ul");
for (Element el: elList){
  Elements subList = el.select("ul");
  for (Element subEl : subList){
    //do whatever you need to do
  }
}

Produces the following results, where it appears it's not "walking" or "traversing" down, but rather keeps selecting the same thing from the doc:

[screenshot not included]

What is some code that will traverse this list and put it in a tree structure?

  • What is the problem traversing the DOM? Commented Jan 20, 2015 at 18:59
  • I think the problem is me: I don't know how to use JSoup properly to traverse the DOM. If I call Element list = doc.select("ul").first(); and then run the same code on the result, Element subList = list.select("ul").first();, I get the same element back as from the first call. I guess I was expecting the library to "consume" only the selected section. Not sure if this makes sense. Commented Jan 20, 2015 at 19:10

2 Answers


In JSoup, both select() and getElementsByTag() return the current element as part of the results if it matches the tag.

So when you do doc.select("ul"), and do a select() on the result, you'll get the same result, as you have already noticed.

The key to doing this properly is to take that first element, and then search its children.
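A small check may make this concrete. The sketch below uses made-up markup (the ids outer and inner are assumptions added only to tell the two lists apart) and compares selecting from an element itself with selecting from its children:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class SelectIncludesSelf {
    // Made-up sample markup; the ids exist only to tell the two lists apart.
    static final String HTML =
        "<ul id='outer'><li><p>A</p><ul id='inner'><li><p>B</p></li></ul></li></ul>";

    /** Returns the first ul in the sample document (the outer list). */
    static Element outerList() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("ul").first();
    }

    /** Id of the first "ul" found when selecting from the outer list itself. */
    static String firstMatchFromSelf() {
        return outerList().select("ul").first().id();
    }

    /** Id of the first "ul" found when selecting from the outer list's children. */
    static String firstMatchFromChildren() {
        Elements children = outerList().children();
        return children.select("ul").first().id();
    }

    public static void main(String[] args) {
        System.out.println(firstMatchFromSelf());     // the outer list matches itself
        System.out.println(firstMatchFromChildren()); // the nested list is reached
    }
}
```

Selecting from the element returns the element again, while going through children() first skips it and reaches the nested list.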

Something along the lines of:

public static Node processTree( Element elem ) {

     // getElementsByTag() also matches elem itself, so the first hit is
     // either elem (if it is a ul) or the first ul below it in document order.
     Elements elList = elem.getElementsByTag("ul");

     if ( elList.isEmpty() ) {
         return null;   // no list at or below this element
     }

     Node result = new Node();
     Element current = elList.first();
     Elements children = current.children();

     // Process LI elements and add them as content to the
     // result Node
     ...

     // Now go down the tree: recurse into each direct child

     for ( Element el : children ) {
         Node elTree = processTree( el );
         if ( elTree != null ) {
             result.addChild( elTree );
         }
     }

     return result;
}

(This is, of course, just a sketch. Node will be your tree structure node. The point is to show that you have to traverse the children. You may process the li elements in the same loop if you like.)
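For reference, here is one way the idea could be filled in end to end. It is only a sketch under some assumptions: TreeNode is a hypothetical stand-in for Node (renamed to avoid clashing with org.jsoup.nodes.Node), each item's label is taken from the p directly inside its li, and a synthetic "root" node is added to hold the top-level items. A wrapper div that sits next to an li rather than inside it is attached to the preceding item.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

/** Hypothetical tree node; stands in for the Node type in the sketch above. */
final class TreeNode {
    final String label;
    final List<TreeNode> children = new ArrayList<>();
    TreeNode(String label) { this.label = label; }
    void addChild(TreeNode child) { children.add(child); }
}

final class ListTreeBuilder {
    /** Parses the HTML and builds a tree from its first ul; null if there is none. */
    static TreeNode build(String html) {
        Element rootList = Jsoup.parse(html).select("ul").first();
        if (rootList == null) return null;
        TreeNode root = new TreeNode("root"); // synthetic root, not present in the HTML
        addItems(rootList, root);
        return root;
    }

    /** Adds one node per direct li child of ul, recursing into nested lists. */
    private static void addItems(Element ul, TreeNode parent) {
        TreeNode last = parent;
        for (Element child : ul.children()) {
            if (child.tagName().equals("li")) {
                last = new TreeNode(labelOf(child));
                parent.addChild(last);
                for (Element inner : child.children()) descend(inner, last);
            } else {
                // A wrapper placed next to an li is attached to the item before it.
                descend(child, last);
            }
        }
    }

    /** Walks down through wrappers until a ul is found, then adds its items. */
    private static void descend(Element el, TreeNode owner) {
        if (el.tagName().equals("ul")) {
            addItems(el, owner);
        } else {
            for (Element child : el.children()) descend(child, owner);
        }
    }

    /** Text of the li's direct p child, falling back to the li's own text. */
    private static String labelOf(Element li) {
        for (Element child : li.children()) {
            if (child.tagName().equals("p")) return child.text();
        }
        return li.ownText().trim();
    }

    /** Prints the tree with two spaces of indentation per level. */
    static void print(TreeNode node, String indent) {
        System.out.println(indent + node.label);
        for (TreeNode child : node.children) print(child, indent + "  ");
    }

    public static void main(String[] args) {
        String html = "<div class='ulist'><ul><li><p>Fruits</p><div class='ulist'><ul>"
                    + "<li><p>Apples</p></li><li><p>Bananas</p></li></ul></div></li></ul></div>";
        print(build(html), "");
    }
}
```

The recursion only ever moves through children(), so a list is never matched against itself the way it is with select("ul") on the same element.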


1 Comment

Thank you for this answer, simple and helpful.

JSoup builds the DOM as an in-memory data structure that you can access in a very powerful "random access" way, for example using its excellent CSS selector implementation. To solve your problem with JSoup you can loop over the results like this:

Elements elList = doc.select("ul");
for (Element el: elList){
    Elements subList = el.select("ul");
    for (Element subEl : subList){
       //do whatever you need to do
    }
}

However, if you need to traverse very big HTML files and the files are well formed, you may want to use a streaming parser such as SAX. This avoids holding the whole DOM in memory.

1 Comment

Thanks for the answer, but this doesn't work as expected (which is the source of my problem). I've added a screenshot and example above that you can use to clarify.
