Parse HTML list using JSoup to create Tree structure

Question

I have HTML lists with the exact same structure that I need to parse using JSoup (my language is Java). Here's an example:

<div class="ulist">
  <ul>
    <li><p>Healthy Food</p></li>
    <div class="ulist">
      <ul>
        <li><p>Vegetables</p></li>
        <div class="ulist">
          <ul>
            <li> <p>Carrots</p> </li>
            <li> <p>Lettuce</p> </li>
            <li> <p>Cucumbers</p> </li>
          </ul>
        </div> </li>
        <li> <p>Fruits</p>
          <div class="ulist">
            <ul>
              <li> <p>Apples</p> </li>
              <li> <p>Bananas</p> </li>
              <li> <p>Canned Fruits</p></li>
              <div class="ulist">
                <ul>
                  <li> <p>Peaches</p> </li>
                  <li> <p>Pears</p> </li>
                </ul>
              </div>
            </ul>
          </div>
        </li>
      </ul>
    </div>
  </ul>
</div>

Since this data is basically just a Tree data structure, I want to be able to parse it and create a Tree from the data. I'm having difficulty doing this with JSoup, as it appears you can't really traverse the DOM as expected.

For example, code like the following:

Elements elList = doc.select("ul");
for (Element el: elList){
  Elements subList = el.select("ul");
  for (Element subEl : subList){
    //do whatever you need to do
  }
}

Produces the following results, where it appears it's not "walking" or "traversing" down, but rather keeps selecting the same thing from the doc:

[Screenshot of the results]

What is some code that will traverse this list and put it in a tree structure?

I think the problem is me: I don't know how to use JSoup properly to traverse the DOM. If I call Element list = doc.select("ul").first(); and then run the same code on the result, Element subList = list.select("ul").first();, I get the same result as from the first call. I guess I was expecting the library to "consume" only the selected section. Not sure if this makes sense.
– Monochrome, Commented Jan 20, 2015 at 19:10

RealSkeptic · Accepted Answer · 2015-01-20 19:59:08Z

1

In JSoup, both select() and getElementsByTag() include the current element in their results if it matches the tag.

So when you call doc.select("ul") and then call select("ul") again on the result, you get the same element back, as you have already noticed.

The key to doing this properly is to take that first element, and then search its children.

Something along the lines of:

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static Node processTree( Element elem ) {

     // getElementsByTag() also returns elem itself if elem is a ul
     Elements elList = elem.getElementsByTag("ul");

     if ( elList.isEmpty() ) {
         return null;
     }

     Node result = new Node();
     Element current = elList.first();

     // Direct children of the first ul only, not all of its descendants
     elList = current.children();

     // Process LI elements and add them as content to the
     // result Node
     ...

     // Now go down the tree: recurse into each child and attach
     // whatever subtree it yields

     for ( Element el : elList ) {
         Node elTree = processTree( el );
         if ( elTree != null ) {
             result.addChild( elTree );
         }
     }

     return result;
}

(This is, of course, just a sketch. Node will be your tree structure node. The point is to show that you have to traverse the children. You may process the li elements in the same loop if you like.)
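
For illustration, here is one way the Node type mentioned above might look. This is only a minimal sketch: the class name Node and addChild() come from the answer's sketch, while the label field, the second constructor, setLabel(), getLabel(), getChildren() and render() are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Small demo that builds part of the question's list by hand and prints it.
class NodeDemo {
    public static void main(String[] args) {
        Node root = new Node("Healthy Food");
        Node fruits = new Node("Fruits");
        fruits.addChild(new Node("Apples"));
        fruits.addChild(new Node("Bananas"));
        root.addChild(fruits);
        System.out.print(root.render());
    }
}

// Hypothetical tree node: a text label plus an ordered list of children.
class Node {
    private String label;
    private final List<Node> children = new ArrayList<>();

    // No-arg constructor, matching the "new Node()" call in the sketch above.
    Node() { this(""); }

    Node(String label) { this.label = label; }

    void setLabel(String label) { this.label = label; }

    String getLabel() { return label; }

    void addChild(Node child) { children.add(child); }

    // Read-only view so callers cannot bypass addChild().
    List<Node> getChildren() { return Collections.unmodifiableList(children); }

    // Renders the subtree as text, one node per line, two spaces per level.
    String render() {
        StringBuilder out = new StringBuilder();
        render(out, 0);
        return out.toString();
    }

    private void render(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(label).append('\n');
        for (Node child : children) {
            child.render(out, depth + 1);
        }
    }
}
```

With a class like this, the "process LI elements" step could call setLabel() with the text of the li's p element, and render() gives a quick way to check that the nesting came out as expected.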


1 Comment

Monochrome Over a year ago

Thank you for this answer, simple and helpful.

luksch · Accepted Answer · 2015-01-20 19:30:26Z

0

JSoup builds the DOM as an in-memory data structure that you can access in a very powerful "random access" way, for example using its excellent CSS selector implementation. To solve your problem with JSoup you can cycle over the results like this:

Elements elList = doc.select("ul");
for (Element el: elList){
    Elements subList = el.select("ul");
    for (Element subEl : subList){
       //do whatever you need to do
    }
}

However, if you need to traverse very large HTML files and the files are well formed, you may want to use a streaming parser such as SAX. This avoids holding the whole DOM in memory.
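
To give a rough idea of the SAX approach, the sketch below uses the SAX parser that ships with the JDK and keeps a stack of the currently open li elements, so each new item is attached to the one that encloses it. It rests on two assumptions: the input is well-formed XML (the JDK's SAX parser rejects anything else, so typical real-world HTML would need cleaning first), and nested lists sit inside their parent li. The names Item, ListHandler and SaxListDemo are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxListDemo {
    // Parses well-formed markup and returns a synthetic root holding the top-level items.
    static Item parse(String xhtml) {
        ListHandler handler = new ListHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.root;
    }

    public static void main(String[] args) {
        String xhtml = "<ul><li><p>Fruits</p><ul>"
                + "<li><p>Apples</p></li><li><p>Bananas</p></li>"
                + "</ul></li></ul>";
        Item fruits = parse(xhtml).children.get(0);
        System.out.println(fruits.label + " has " + fruits.children.size() + " children");
    }
}

// Hypothetical tree node: the text of the item's <p> plus its nested items.
class Item {
    final StringBuilder label = new StringBuilder();
    final List<Item> children = new ArrayList<>();
}

// Builds the tree while the parser streams through the markup.
class ListHandler extends DefaultHandler {
    final Item root = new Item();
    private final Deque<Item> open = new ArrayDeque<>();
    private boolean inParagraph = false;

    ListHandler() {
        open.push(root);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("li")) {
            Item item = new Item();
            open.peek().children.add(item); // attach to the enclosing <li>, or to root
            open.push(item);
        } else if (qName.equals("p")) {
            inParagraph = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("li")) {
            open.pop();
        } else if (qName.equals("p")) {
            inParagraph = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (inParagraph) {
            open.peek().label.append(ch, start, length);
        }
    }
}
```

Only the items on the path from the root to the current element are tracked while parsing, although the resulting Item tree itself still lives in memory once it is built.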


1 Comment

Monochrome Over a year ago

Thanks for the answer, but this doesn't work as expected (which is the source of my problem). I've added a screenshot and example above that you can use to clarify.
