How to extract text from specific rows in nested tables with Jsoup

Question

I'm using Jsoup to extract text from a website, and I can't figure out how to properly get specific rows of data in nested tables. I need to get the plain text after the parts that say Property Address: and Mailing Address:, so I can store the data.

Here is the HTML source I am parsing:

<table width="730" border="0" cellspacing="0" cellpadding="2">
  <tr> 
    <td><table width="730" border="0" cellspacing="0" cellpadding="2">
      <tr> 
        <td><h1>Property Information</h1>
          <table width="758">
            <tr>[IRRELEVANT]</tr>
            <tr>[IRRELEVANT]</tr>
            <tr>
              <td colspan="3"><strong>Property Address:</strong>&nbsp;!!THIS PLAIN TEXT HERE IS WHAT I NEED!! DATA1</td>
              <td>&nbsp;</td>
              </tr>
            <tr>
              <td colspan="3"><strong>Mailing Address:</strong>!!NEED THIS TOO!! DATA2</td>
              <td>&nbsp;</td>
              </tr>
            <tr>[IRRELEVANT]</tr>...................

I was using this as a template, but it doesn't work, and I have no idea how to make it work.

Document documentSerialNumberPageData = Jsoup.connect(stringURLOfSerialNumberPage).get();   //connect to serial number page
Elements elementsSerialNumberPageData = documentSerialNumberPageData.select("#tabletext tbody > tr > td > tbody > tr > td > tbody > tr > td");  //this is not even remotely correct... :(
Element elementAddress = elementsSerialNumberPageData.get(0);
System.out.println(elementAddress.text());

My knowledge of HTML/CSS is very limited, but I'm proficient in Java. Any suggestions? Thanks! Full Source Here: https://github.com/PhotonPhighter/NODScraper/blob/master/src/nodscraper/Main.java

fonkap · Accepted Answer · 2014-10-28 21:01:36Z


You can try this:

Elements innerTable = documentSerialNumberPageData.select("body > table:nth-child(2) > tbody > tr > td > table > tbody > tr > td > table:nth-child(2)");
String propertyAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(3) > td > strong").first().nextSibling()).text();
String mailingAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(4) > td > strong").first().nextSibling()).text();

First, you select the inner table. Then you select the strong tag inside the td of the third tr, take its next sibling (the text node that follows the label), and call text() on it. The same steps on the fourth tr give the mailing address.
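If the row positions on the page ever shift, the same next-sibling idea can be keyed on the label text instead of on nth-child. The following is only a sketch under assumptions, not the method from the answer above: AddressExtractor and textAfterLabel are hypothetical names, the HTML is a trimmed stand-in for the page in the question, and it parses a string rather than connecting to the site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class AddressExtractor {

    /**
     * Returns the text that directly follows <strong>label</strong>,
     * or null if the label or the following text node is missing.
     * Assumption: the label contains no parentheses, since it is pasted into the selector.
     */
    static String textAfterLabel(Document doc, String label) {
        // :containsOwn matches on the element's own text, ignoring case
        Element strong = doc.select("strong:containsOwn(" + label + ")").first();
        if (strong == null) {
            return null;
        }
        Node next = strong.nextSibling();
        if (!(next instanceof TextNode)) {
            return null;
        }
        // Swap non-breaking spaces (U+00A0) for plain spaces so trim() can remove them
        return ((TextNode) next).getWholeText().replace('\u00a0', ' ').trim();
    }

    public static void main(String[] args) {
        // Trimmed stand-in for the nested tables in the question
        String html = "<table><tr><td><table>"
                + "<tr><td colspan=\"3\"><strong>Property Address:</strong>&nbsp;DATA1</td><td>&nbsp;</td></tr>"
                + "<tr><td colspan=\"3\"><strong>Mailing Address:</strong>DATA2</td><td>&nbsp;</td></tr>"
                + "</table></td></tr></table>";
        Document doc = Jsoup.parse(html);
        System.out.println(textAfterLabel(doc, "Property Address:")); // prints DATA1
        System.out.println(textAfterLabel(doc, "Mailing Address:"));  // prints DATA2
    }
}
```

Returning null instead of casting blindly avoids a NullPointerException or ClassCastException when a page is missing one of the labels.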

With text(), Jsoup translates the &nbsp; into a space. If you would rather keep the entity, call toString() instead.
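To check how the &nbsp; comes through on your own Jsoup version, it may help to print the different views of the text node side by side. This is a small sketch with hypothetical names (NbspDemo, nodeAfterStrong, clean) and a made-up one-row table. It deliberately shows no expected output for text(), because how the non-breaking space is normalised can differ between Jsoup releases.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class NbspDemo {

    /** Returns the text node that directly follows the first <strong> in the HTML. */
    static TextNode nodeAfterStrong(String html) {
        Element strong = Jsoup.parse(html).select("strong").first();
        return (TextNode) strong.nextSibling();
    }

    /** Replaces non-breaking spaces (U+00A0) with plain spaces and trims the result. */
    static String clean(String raw) {
        return raw.replace('\u00a0', ' ').trim();
    }

    public static void main(String[] args) {
        TextNode node = nodeAfterStrong(
                "<table><tr><td><strong>Property Address:</strong>&nbsp;DATA1</td></tr></table>");
        System.out.println("text():         [" + node.text() + "]");          // whitespace-normalised text
        System.out.println("getWholeText(): [" + node.getWholeText() + "]");  // raw text, U+00A0 kept
        System.out.println("toString():     [" + node + "]");                 // HTML form, entity re-escaped
        System.out.println("cleaned:        [" + clean(node.getWholeText()) + "]");
    }
}
```

Cleaning the raw getWholeText() value yourself keeps the stored result the same regardless of how text() treats the non-breaking space.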

Hope that it helps.

PS: Can I suggest a trick? You can use the developer tools in Chrome or Firefox to find a tag in an HTML page, then right-click it and choose Copy CSS Path. This gives you a selector you can use in Jsoup!

edited Oct 28, 2014 at 21:01

answered Oct 28, 2014 at 20:55


1 Comment

Ali Over a year ago

Thank you! That is super helpful! I had no idea you could do that in browser to get those tags.
