1

I'm parsing this page segment:

<tr valign="middle">
   <td class="inner"><span style=""><span class="" title=""></span> 2  <span class="icon ok" title="Verified"></span> </span><span class="icon cat_tv" title="Video » TV" style="bottom:-2;"></span> <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> </td>
   <td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
   <td width="1%" align="right" nowrap="nowrap" class="small inner" >VALUE</td>
   <td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
</tr>

I have this segment in variable tv: HtmlElement tv = tr.get(i);

I read tag <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> in this way:

HtmlElement a = tv.getElementsByTagName("a").get(0);        
object.name.value(a.getTextContent());

url = a.getAttribute("href");
object.url_detail.value(myBase + url);

How can I read only the VALUE text of the other <td>...</td> cells?

2
  • What framework are you using for the parsing? Commented Mar 12, 2013 at 13:01
  • Maybe use tv.getElementsByTagName("td"), loop over the result, and read each cell's text with getTextContent()? Did you try that? Commented Mar 12, 2013 at 13:02
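
The loop suggested in the comment above can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's built-in org.w3c.dom parser instead of the asker's framework, it assumes the row is well-formed XML, and the class name TdTextLoop and method cellTexts are hypothetical. The same getElementsByTagName and getTextContent calls appear on the HtmlElement used in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdTextLoop {

    // Returns the trimmed text of every <td> in the row except the first one.
    public static List<String> cellTexts(String rowXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rowXml)));
        NodeList cells = doc.getElementsByTagName("td");
        List<String> values = new ArrayList<>();
        // Index 0 is the cell that holds the <a> link, so start at 1.
        for (int i = 1; i < cells.getLength(); i++) {
            values.add(cells.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Simplified copy of the row from the question (attributes trimmed).
        String row = "<tr>"
                + "<td class=\"inner\"><a href=\"/VALUE.html\">VALUE</a></td>"
                + "<td class=\"small inner\">VALUE</td>"
                + "<td class=\"small inner\">VALUE</td>"
                + "<td class=\"small inner\">VALUE</td>"
                + "</tr>";
        System.out.println(cellTexts(row));
    }
}
```

Skipping index 0 relies on the link cell always coming first; if the column order can vary, filtering on the class attribute is likely more robust.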

2 Answers

5

I would suggest using XPath, a query language designed for selecting nodes in XML/HTML documents.

Reference: How to read XML using XPath in Java

Also take a look at this question: RegEx match open tags except XHTML self-contained tags

Update

If I understood correctly, you need the "VALUE" from each td, right? If so, your XPath would be something like this:

//td[@class="small inner"]/text()
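
As a rough sketch of how that expression could be evaluated, the example below uses the javax.xml.xpath API that ships with the JDK. It assumes the row can be parsed as well-formed XML, which real-world HTML often is not; the class name TdTextXPath and method smallInnerValues are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TdTextXPath {

    // Evaluates the XPath from the answer and returns each matched text node, trimmed.
    public static List<String> smallInnerValues(String rowXml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//td[@class=\"small inner\"]/text()",
                new InputSource(new StringReader(rowXml)),
                XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Simplified copy of the row from the question (attributes trimmed).
        String row = "<tr>"
                + "<td class=\"inner\"><a href=\"/VALUE.html\">VALUE</a></td>"
                + "<td class=\"small inner\">VALUE</td>"
                + "<td class=\"small inner\">VALUE</td>"
                + "<td class=\"small inner\">VALUE</td>"
                + "</tr>";
        System.out.println(smallInnerValues(row));
    }
}
```

Note that @class="small inner" is an exact string comparison, so a cell whose class attribute lists the same names in a different order would not match.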

1

You could try jsoup, a Java library for parsing HTML.

UPDATE: using the package, you can solve the problem like this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    import org.jsoup.select.Elements;

    String html = "<tr valign=\"middle\">"
            + "   <td class=\"inner\">"
            + "   <span style=\"\"><span class=\"\" title=\"\"></span> 2  <span class=\"icon ok\" title=\"Verified\"></span> </span><span class=\"icon cat_tv\" title=\"Video » TV\" style=\"bottom:-2;\"></span>"
            + "   <a href=\"/VALUE.html\" style=\"line-height:1.4em;\">VALUE</a> "
            + "   </td>"
            + "   <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
            + "   <td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
            + "   <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
            + "</tr>";
    // The XML parser keeps the bare <tr>/<td> fragment as written (there is no enclosing <table>).
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    // First value: the text of the link.
    Elements link = doc.select("a[href]");
    System.out.println("value 1:" + link.text());

    // Remaining values: every cell with width="1%" (the original selector was missing its closing bracket).
    Elements cells = doc.select("td[width=1%]");
    int n = 2;
    for (Element cell : cells) {
        System.out.println("value " + (n++) + ":" + cell.text());
    }

The result would be:

value 1:VALUE
value 2:VALUE
value 3:VALUE
value 4:VALUE

1 Comment

You should say how you could solve the problem using jsoup. Otherwise this is a non-answer and should just have been a comment.
