Extracting text from specific tags in HTML using Mathematica

Question

For a page whose HTML has a structure like this:

          <tr class="">
            <td class="number">1</td>
            <td class="name"><a href="..." >Jack Green</a></td>
            <td class="score-cell ">
              <span class="display">98
                <span class="tooltip column1"></span>
              </span>
            </td>
            <td class="score-cell ">
              ...
            </td>
          ...
          <tr class="">
            <td class="number">2</td>
            <td class="name"><a href="..." target="_top">Nicole Smith</a></td>
            <td class="score-cell ">
             ...
            </td>

How do I extract ONLY the text from the name tags, ending up with the list {Jack Green, Nicole Smith}? I am hoping for an elegant method.

Chris Degnen · Accepted Answer · 2015-07-19 11:37:17Z


input =
  "          <tr class=\"\">
              <td class=\"number\">1</td>
              <td class=\"name\"><a href=\"...\" >Jack Green</a></td>
              <td class=\"score-cell \">
                <span class=\"display\">98
                  <span class=\"tooltip column1\"></span>
                </span>
              </td>
              <td class=\"score-cell \">
                ...
              </td>
            ...
            <tr class=\"\">
              <td class=\"number\">2</td>
              <td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>
              <td class=\"score-cell \">
               ...
              </td>";

(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["X", StringReplace[StringTrim[input],
   {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

(* Find the tags and positions of tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

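(* Split on the tags and extract on the name tag positions.
   With the start character, splits[[i + 1]] is the text that follows tag i.
   The name follows the <a> tag, which is the tag after the name tag,
   hence the offset of 2. *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]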

{Jack Green, Nicole Smith}
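
The steps above could be wrapped in a function, roughly as follows. This is only a sketch: the name textAfterTag, its arguments, and the range guard in the last line are hypothetical additions, not part of the code above. The key is matched against the lowercased tags, so it should be given in lowercase.

```mathematica
(* Hypothetical helper (name and arguments are illustrative assumptions).
   source: HTML string; key: lowercase substring identifying the tag;
   offset: how many splits after the matching tag the wanted text sits. *)
textAfterTag[source_String, key_String, offset_Integer : 2] :=
 Module[{tagPattern, html, tags, positions, splits},
  tagPattern = "<" ~~ Except[">"] .. ~~ ">";
  (* Eliminate unnecessary whitespace and add a start character *)
  html = StringJoin["X", StringReplace[StringTrim[source],
     {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];
  (* Positions of the tags that contain the key *)
  tags = StringCases[html, tagPattern];
  positions = Position[
    StringMatchQ[ToLowerCase /@ tags, "*" <> key <> "*"], True];
  (* Split on the tags; skip positions past the end of the splits *)
  splits = StringSplit[html, tagPattern];
  Extract[splits,
   Select[positions + offset, First[#] <= Length[splits] &]]
 ]

sample = "<tr><td class=\"name\"><a href=\"...\">Jack Green</a></td></tr>";
textAfterTag[sample, "class=\"name\""]
(* {Jack Green} *)
```

Passing a more specific key such as class="name" rather than just name reduces the chance of matching unrelated tags, for example a link whose href happens to contain the word name.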

Note

The start character is required to guarantee the correct split. As the demonstration below shows, StringSplit by default drops the empty substrings at the start (and end) of the string, so the splits before the first non-empty substring are not counted. With a start character, the positions of the required items can be used reliably.

html = "aa1aaa2aa";
splits = StringSplit[html, "a"]

{1, , ,2}

html = "aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]

{1, , ,2}

html = "0aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]

{0, , , , , , ,1, , ,2}
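
For comparison, StringSplit also accepts All as a third argument, which, as I understand the documentation, keeps the empty substrings at the beginning and end. The sketch below only illustrates that variant as a possible alternative to the start character; the code above does not use it.

```mathematica
(* Default: empty substrings at the start and end are dropped *)
default = StringSplit["aa1aaa2aa", "a"];

(* With All, every split is kept, including the leading and trailing
   empty strings, so item positions no longer depend on what comes first *)
withAll = StringSplit["aa1aaa2aa", "a", All];

{Length[default], Length[withAll]}
```

With All the leading empty string takes the place of the start character, so the same offset of 2 should apply to the name tag positions.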

edited Jul 19, 2015 at 11:37

answered Jul 16, 2015 at 8:56

Chris Degnen


3 Comments

fika_fika Over a year ago

This is fantastic. I wonder whether it would be better to encapsulate it in a function. Let me go ahead and test this on the source HTML of a few web pages. Thanks for the string manipulation virtuosity.

fika_fika Over a year ago

So I went ahead and tested it on a few pages, one of which was:

input=Import["http://games.crossfit.com/scores/leaderboard.php?stage=5&sort=0&division=1&region=0&regional=6&numberperpage=60&userid=0&competition=0&frontpage=0&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&athletename=&scaled=0","Source"];

and the code doesn't seem to produce the correct output. The tag in question is <td class="name">

Chris Degnen Over a year ago

I have fixed the code. It now works on your CrossFit page. There were two changes: adding the start character and incrementing the Extract positions (in the last line).
