
For a page whose HTML has this structure:

          <tr class="">
            <td class="number">1</td>
            <td class="name"><a href="..." >Jack Green</a></td>
            <td class="score-cell ">
              <span class="display">98
                <span class="tooltip column1"></span>
              </span>
            </td>
            <td class="score-cell ">
              ...
            </td>
          ...
          <tr class="">
            <td class="number">2</td>
            <td class="name"><a href="..." target="_top">Nicole Smith</a></td>
            <td class="score-cell ">
             ...
            </td>

How do I extract ONLY the text inside the name cells, to end up with the list {Jack Green, Nicole Smith}? I'm hoping for an elegant method.

1 Answer

input =
  "          <tr class=\"\">
              <td class=\"number\">1</td>
              <td class=\"name\"><a href=\"...\" >Jack Green</a></td>
              <td class=\"score-cell \">
                <span class=\"display\">98
                  <span class=\"tooltip column1\"></span>
                </span>
              </td>
              <td class=\"score-cell \">
                ...
              </td>
            ...
            <tr class=\"\">
              <td class=\"number\">2</td>
              <td class=\"name\"><a href=\"...\" target=\"_top\">Nicole Smith</a></td>
              <td class=\"score-cell \">
               ...
              </td>";

(* Eliminate unnecessary whitespace and add a start character *)
html = StringJoin["X", StringReplace[StringTrim[input],
   {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];

(* Find the tags and positions of tags containing 'name' *)
tags = StringCases[html, "<" ~~ Except[">"] .. ~~ ">"];
nametagpositions = Position[StringMatchQ[ToLowerCase /@ tags, "*name*"], True];

(* Split on the tags and extract on the name tag positions *)
splits = StringSplit[html, "<" ~~ Except[">"] .. ~~ ">"];
Extract[splits, nametagpositions + 2]

{Jack Green, Nicole Smith}
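
The same steps can be wrapped in a single function. The following is only a sketch: the name extractAfterTag and its optional tagGlob argument are hypothetical, and it keeps the assumptions of the code above, namely that each matching tag is immediately followed by one more tag (here the <a ...>) before the text, and that any tag whose source contains "name" counts as a match.

```mathematica
(* Hypothetical helper; the name and arguments are illustrative only *)
extractAfterTag[input_String, tagGlob_String : "*name*"] :=
  Module[{tagPattern, html, tags, positions, splits},
    tagPattern = "<" ~~ Except[">"] .. ~~ ">";
    (* Same clean-up as above, with "X" as the start character *)
    html = StringJoin["X", StringReplace[StringTrim[input],
       {"\n" ~~ " " .. -> "", ">" ~~ " " .. ~~ "<" -> "><"}]];
    tags = StringCases[html, tagPattern];
    positions = Position[StringMatchQ[ToLowerCase /@ tags, tagGlob], True];
    splits = StringSplit[html, tagPattern];
    (* Extract[list, {}] would return the whole list, so guard the no-match case *)
    If[positions === {}, {}, Extract[splits, positions + 2]]
  ]

extractAfterTag[input]
```

Called on the input string above, this should reproduce the same list, since it performs the same operations in the same order.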

Note

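The start character is needed to keep the split positions aligned with the tag positions. By default, StringSplit drops empty substrings at the beginning and end of its result, so when the string starts with a delimiter the leading pieces are lost and every later index shifts. The demonstration below shows this with "a" as the delimiter: leading runs of a contribute nothing to the result until some other character precedes them. With a start character, the positions of the required items can be used reliably.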

html = "aa1aaa2aa";
splits = StringSplit[html, "a"]

{1, , ,2}

html = "aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]

{1, , ,2}

html = "0aaaaaaa1aaa2aaaaaaa";
splits = StringSplit[html, "a"]

{0, , , , , , ,1, , ,2}
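
As a possible alternative to the start character, StringSplit accepts All as a third argument, which keeps the zero-length substrings at the beginning and end. This is a sketch rather than a tested replacement for the code above; the sample string is made up for illustration.

```mathematica
(* Made-up sample that starts and ends with a tag *)
sample = "<b>1</b><b>2</b>";
tagPattern = "<" ~~ Except[">"] .. ~~ ">";

(* Default: empty strings at the start and end are dropped *)
StringSplit[sample, tagPattern]

(* All: empty strings at the start and end are kept, so the piece
   following the i-th tag sits at index i + 1 *)
StringSplit[sample, tagPattern, All]
```

With All, the leading empty string plays the same role as the "X" start character, so the same position offset would apply.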


3 Comments

This is fantastic. I wonder if it would be better to encapsulate it in a function. Let me go ahead and test this on the source HTML of a few webpages. Thanks for the string manipulation virtuosity.
So I went ahead and tested it. One of the pages was: input=Import["http://games.crossfit.com/scores/leaderboard.php?stage=5&sort=0&division=1&region=0&regional=6&numberperpage=60&userid=0&competition=0&frontpage=0&expanded=1&year=15&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&athletename=&scaled=0","Source"]; and the result doesn't seem to be correct. The tag in question is <td class="name">
I have fixed the code. It now works on your CrossFit page. There were two changes: adding the start character and incrementing the Extract positions (in the last line).
