2

I have the following piece of HTML which I need to parse to retrieve the player's name and the runs he has scored. In this case that's 'Ross Taylor' and 9. What's the best way to parse this info? I don't want to use an HTML parser. Is regex the best way? (I know people are dead against this, but I only need these two bits of info, hence my reluctance to use a parser.) I've been racking my brains over how to locate the player name in the HTML file and the following row that holds the runs scored. The HTML comment below is hard-coded, so I can reliably reach that spot and then retrieve the name between the tags. Is this a good way to do it? Also, how do I retrieve the runs from the immediately following row?


<!-- <a href="javascript:void(0);" onClick="return showHwkTooltip(this, 'lvpyrbat1');" class="livePlayerCurrent">*Luke Woodcock</a>-->

<a href="/icc_cricket_worldcup2011/content/current/player/38920.html" target="_blank" class="livePlayerCurrent" title="view the player profile for Ross Taylor">
*Ross Taylor
</a>    <span style="margin-left:5px;" title="left-hand bat">(lhb)</span >

   </td >
   <td><b>9</b></td>
   <td>9</td>
   <td>1</td>
   <td>0</td>
   <td>100.00</td>
   <td></td>
   <td colspan="3" align="left"><span class="batStyl">striker</style></td>
   <td></td>
   <td colspan="8"></td>
  </tr>

Please let me know if you need more info.

Regards, Sam

3
  • Please read message formatting rules in editor help. Commented Feb 16, 2011 at 18:24
  • Use a parser. Even for two pieces of information. Don't fall into the regex rabbithole for parsing HTML. Commented Feb 16, 2011 at 18:43
  • @CanSpice Also could you please suggest an HTML/XML parser? How would it be different than using REGEX for the above example? Commented Feb 16, 2011 at 20:01

3 Answers

9

What's the best way to parse this info?

Use an HTML parser.

Don't want to use an HTML parser.

I disagree.

Is REGEX the best way

No.


3 Comments

Could you please suggest an HTML/XML parser? How would it be different than using REGEX for the above example?
@sammydude: java-source.net/open-source/html-parsers is the third link in a Google search for java html parser.
Agree with the answers provided by CommonsWare. Since I had a very minor requirement, went ahead with REGEX.
1

Please consider using the proper tool for the job, i.e. an HTML/XML parser, not regex.

If you really want to do it using regex you can try the following out:

Extract score

  (?<=\\<b\\>)\\d+(?=\\</b\\>)

Extract player name

  (?<=\\>)[^\\<]+(?=\\</a\\>)

The second regex assumes you have sanitized the HTML by removing the anchor tag that sits between the comment tags:

 <!-- ... -->

What it does is extract the text inside any anchor tag. This is one of the fundamental restrictions of regex: it isn't context-aware.
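For what it's worth, here is a rough sketch of how the two patterns above could be combined into one that hooks onto the `livePlayerCurrent` class mentioned in the question. The class name `LiveScoreRegex`, the method `firstBatsman`, and the shortened `SAMPLE` markup are made up for illustration; only `java.util.regex` is used. It assumes the attribute is always written exactly as `class="livePlayerCurrent"` and that the runs are in the first `<b>…</b>` after the link, which may not hold for the whole page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiveScoreRegex {

    // Shortened, hypothetical version of the markup from the question.
    static final String SAMPLE =
            "<!-- <a href=\"javascript:void(0);\" class=\"livePlayerCurrent\">*Luke Woodcock</a>-->\n"
          + "<a href=\"/player/38920.html\" class=\"livePlayerCurrent\" title=\"profile\">\n"
          + "*Ross Taylor\n"
          + "</a> <span title=\"left-hand bat\">(lhb)</span>\n"
          + "</td>\n"
          + "<td><b>9</b></td>\n"
          + "<td>9</td>\n";

    // Matches HTML comments so the commented-out anchor can be dropped first.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Group 1: link text of an anchor carrying class="livePlayerCurrent"
    //          (an optional leading '*' and surrounding whitespace are skipped).
    // Group 2: digits inside the first <b>...</b> that follows the anchor.
    private static final Pattern BATSMAN = Pattern.compile(
            "<a\\b[^>]*class=\"livePlayerCurrent\"[^>]*>\\s*\\*?([^<]+?)\\s*</a>.*?<b>(\\d+)</b>",
            Pattern.DOTALL);

    /** Returns {name, runs} for the first match, or null if nothing matches. */
    static String[] firstBatsman(String html) {
        // Without this step the commented-out anchor would be matched first.
        String cleaned = COMMENT.matcher(html).replaceAll("");
        Matcher m = BATSMAN.matcher(cleaned);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] result = firstBatsman(SAMPLE);
        if (result != null) {
            System.out.println(result[0] + " scored " + result[1]);
        }
    }
}
```

To pick up more than one batsman, the same `Matcher` could be looped with `while (m.find())`. Note that the lazy `.*?` will happily run past the end of the row if that row has no `<b>` cell, so this is still not context-aware in the way a parser would be.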

6 Comments

@johan-sjoberg Thanks for your response! I just pasted a part of the HTML file. There are many more occurrences of rows similar to the one showing runs. So, the above regex string wouldn't work out right? Also my hook for getting the batsman's name is 'livePlayerCurrent' as there are many other anchor tags present in the file. Could you please give the previous 'livePlayerCurrent' regex string which you updated before the current update? :-)
@johan-sjoberg For the runs scored I would like to fetch the row just after the batsman tag. Is it possible?
@johan-sjoberg Also could you please suggest an HTML/XML parser? How would it be different than using REGEX for the above example?
I would recommend nekohtml, or alternatively htmlparser, although its API is quite different. If your HTML is well-formed, you could get away with Java's own SAX parser. These will save you a lot of trouble compared with covering all the special cases with regex.
@johan-sjober Thanks Johan. Looks like nekohtml has some setup issues with Android. Will check it out. Saxparser is more akin to parsing XML it seems. Hard pressed for time, so trying out REGEX too. Is it possible to use the 'livePlayerCurrent' tag to zero in on the REGEX pattern? Any pattern to retrieve the row after 'livePlayerCurrent' anchor tag? Sorry about the repeated bugging. :-)
0

For what it is worth, you can also have a look at Jsoup. I used it in my projects, and it handles malformed HTML very well. I believe that might be the only reason I'm using it ;)

Regards, EZFrag
