I have gone through the post on why not to use regular expressions for HTML. As part of the task given to me, I have no choice but to use regular expressions for HTML.

I have some HTML and first tried matching its pieces separately. For example, from

 <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

I was able to get the 13 using the following regular expression:

<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>

And similarly, from

<td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

I got 5 star using the regular expression:

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(.*)</a>\s*</td>

But when both pieces of HTML are combined, like this:

<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">

  <tr class="a-histogram-row">



        <td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

        <td class="a-span10">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>

        </td>

        <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

  </tr>
  <td class="a-nowrap">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>          

    </td>

    <td class="a-span10">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>

    </td>

    <td class="a-nowrap">

      <span class="a-letter-space"></span><span>2</span>

    </td>


</table>

How can I extract 5 star and 13 together using a regular expression?

  • Updated my answer with a new, shorter regex that works for the modified input you provided. Commented Nov 11, 2013 at 14:58

1 Answer

If you don't want to use an HTML parser, either apply one regex after another or join the two patterns with .* between them. I have modified your star regex slightly, as it didn't work properly:

First enable the dotall flag (s), then use this:

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star).*<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>

Output:

Group 1: 5 star

Group 2: 13
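
As a minimal sketch of how this could be run in Python (assuming the `re` module with `re.DOTALL` as the dotall flag, and using the first histogram row from the question as sample input): note that this sketch swaps the greedy `.*` for a lazy `.*?`, which is my change, not the pattern above. With several rows in the input, a greedy `.*` would stretch to the last count in the table rather than the nearest one.

```python
import re

# Sample input: the first histogram row from the question.
html = '''
<tr class="a-histogram-row">
        <td class="a-nowrap">
          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
        </td>
        <td class="a-span10">
          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
        </td>
        <td class="a-nowrap">
          <span class="a-letter-space"></span><span>13</span>
        </td>
</tr>
'''

pattern = re.compile(
    r'<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star)'
    r'.*?'  # lazy variant (an assumption of this sketch); needs DOTALL to cross newlines
    r'<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>',
    re.DOTALL,
)

match = pattern.search(html)
label, count = match.groups()
print("Group 1:", label)
print("Group 2:", count)
```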

EDIT:

I have made a shorter regex:

REGEX:

>(\d star)<.+?>(\d+?)<

Used on pythonregex.com with the edited input you provided, it gives:

OUTPUT:

>>> regex.findall(string)
[(u'5 star', u'13'), (u'1 star', u'2')]
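
For reference, here is a sketch of the same call in Python 3. It assumes the dotall flag is still enabled (`re.DOTALL`), since each label and its count sit on different lines, and it uses a trimmed copy of the two rows from the question as input. The variable names are only illustrative.

```python
import re

# Trimmed copy of the two histogram rows from the question.
html = '''
<td class="a-nowrap">
  <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
  <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
</td>
<td class="a-nowrap">
  <span class="a-letter-space"></span><span>13</span>
</td>
<td class="a-nowrap">
  <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
  <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
</td>
<td class="a-nowrap">
  <span class="a-letter-space"></span><span>2</span>
</td>
'''

# ">(\d star)<" captures the label; ".+?" lazily skips ahead to the
# first ">digits<" that follows, which is the review count for that row.
regex = re.compile(r'>(\d star)<.+?>(\d+?)<', re.DOTALL)

pairs = regex.findall(html)
print(pairs)
```

The pattern relies on the percentages in the `title` and `style` attributes never appearing directly between `>` and `<`, so it is tied to this exact markup and would need revisiting if the page layout changes.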

8 Comments

Using the above expression, the result looks like [('5 star', ''), ('', '13')], but I want something like [('5 star', '13')]. The '|' (OR) in the expression is causing this trouble. Any idea on that?
@naveenyadav That's strange, as I used the patterns you provided and just added OR between them, so the pattern will catch either 5 stars and/or 13. Do these patterns work for you when you use them separately?
@naveenyadav Well, so you almost get what you want : ) OK, let me think a bit.
@naveenyadav Well, you get that output because it matches both cases, but you have both results you wanted, so you could use them as you wish, right? Unfortunately I'm not able to check whether this regex works properly, as I have never used regex for HTML : (
The code is working fine. I appreciate your effort to help me out. Thanks.
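
On the alternation issue raised in the comments above, a small sketch may help show why '|' yields separate tuples with empty strings. The input string and both patterns here are simplified stand-ins, not the exact expressions from the earlier revision of the answer: with `findall`, each match fills only the group belonging to the branch that matched, whereas a single pattern that matches the label and then the count fills both groups in one tuple.

```python
import re

# Simplified stand-in for one histogram row (hypothetical input).
text = '<a href="">5 star</a>\n<span></span><span>13</span>'

# Alternation: every match comes from one branch only, so the group
# belonging to the other branch is left as an empty string.
alternation = re.findall(r'>(\d star)<|<span>(\d+)</span>', text)
print(alternation)

# Sequence: one match spans label and count, so both groups are filled.
sequence = re.findall(r'>(\d star)<.*?<span>(\d+)</span>', text, re.DOTALL)
print(sequence)
```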