
I'm using Jsoup to extract text from a website, and I can't figure out how to properly get specific rows of data in nested tables. I need to get the plain text after the parts that say Property Address: and Mailing Address:, so I can store the data.

Here is the HTML source I am parsing:

<table width="730" border="0" cellspacing="0" cellpadding="2">
  <tr> 
    <td><table width="730" border="0" cellspacing="0" cellpadding="2">
      <tr> 
        <td><h1>Property Information</h1>
          <table width="758">
            <tr>[IRRELEVANT]</tr>
            <tr>[IRRELEVANT]</tr>
            <tr>
              <td colspan="3"><strong>Property Address:</strong>&nbsp;!!THIS PLAIN TEXT HERE IS WHAT I NEED!! DATA1</td>
              <td>&nbsp;</td>
              </tr>
            <tr>
              <td colspan="3"><strong>Mailing Address:</strong>!!NEED THIS TOO!! DATA2</td>
              <td>&nbsp;</td>
              </tr>
            <tr>[IRRELEVANT]</tr>...................

I was using this as a template, but it doesn't work, and I have no idea how to make it work.

Document documentSerialNumberPageData = Jsoup.connect(stringURLOfSerialNumberPage).get();   //connect to serial number page
Elements elementsSerialNumberPageData = documentSerialNumberPageData.select("#tabletext tbody > tr > td > tbody > tr > td > tbody > tr > td");  //this is not even remotely correct... :(
Element elementAddress = elementsSerialNumberPageData.get(0);
System.out.println(elementAddress.text());

My knowledge of HTML/CSS is very limited, but I'm proficient in Java. Any suggestions? Thanks! Full Source Here: https://github.com/PhotonPhighter/NODScraper/blob/master/src/nodscraper/Main.java

1 Answer

You can try this:

Elements innerTable = documentSerialNumberPageData.select("body > table:nth-child(2) > tbody > tr > td > table > tbody > tr > td > table:nth-child(2)");
String propertyAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(3) > td > strong").first().nextSibling()).text();
String mailingAddress = ((org.jsoup.nodes.TextNode)innerTable.select("tr:nth-child(4) > td > strong").first().nextSibling()).text();

First, you select the inner table. Then you select the strong tag in the first td of the third tr, take its next sibling (the text node that holds the address), and call text() on it. The same steps on the fourth tr give you the mailing address.

With text(), Jsoup will translate the &nbsp; into a space. If you would rather keep it as it appears in the source, call toString() instead.
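If the value you get back still carries a non-breaking space (U+00A0, the character that &nbsp; decodes to) or stray padding, you may want to normalize it before storing it. Here is a minimal sketch that uses only the standard library. The class and method names (AddressCleaner, clean) are made up for illustration and are not part of Jsoup, and "DATA1" is just the placeholder from the question.

```java
public class AddressCleaner {

    // Hypothetical helper (not part of Jsoup): tidies a scraped value.
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        // &nbsp; decodes to U+00A0. String.trim() does not remove it and
        // \s does not match it by default, so convert it to a plain space first.
        String spaced = raw.replace('\u00A0', ' ');
        // Collapse runs of whitespace, then drop leading/trailing padding.
        return spaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Simulates the text node that follows <strong>Property Address:</strong>
        String raw = "\u00A0" + "DATA1 ";
        System.out.println(clean(raw)); // prints "DATA1"
    }
}
```

You would pass the propertyAddress and mailingAddress strings through clean() right after extracting them.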

Hope that helps.

PS: Can I suggest a trick? You can use the developer tools in Chrome or Firefox to find a tag in an HTML page, then right-click it and choose Copy CSS Path. This will give you a selector you can use in Jsoup!


1 Comment

Thank you! That is super helpful! I had no idea you could do that in browser to get those tags.
