Regex within html tags

Question

I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.

<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>

Basically, the goal is to find the price before the words "HD Version" (case insensitive). Here is what I have so far:

re.match(r'^(\d|.){1,6}...HD\sVersion', string)

How would I extract the value "19.99" from the above string?

Please insert here the usual admonishment about parsing HTML with regex.
– Adam Smith, Commented Sep 11, 2014 at 23:06
I cannot do that; I don't have the full HTML. I only have the above snippet as a string.
– David542, Commented Sep 11, 2014 at 23:07

alecxe · Accepted Answer · 2014-09-11 23:25:44Z

4

BeautifulSoup is very lenient about the HTML it parses, so you can use it on chunks of HTML too:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(data)
print soup.find('span', class_='price').text[1:]

Prints:

19.99
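The [1:] slice assumes the price text starts with exactly one currency character. As a hedged alternative, assuming the text contains a single decimal number, the number can be pulled out with a small regex instead. The helper name price_number is hypothetical and not part of the answer above:

```python
import re


def price_number(text):
    """Return the first decimal number in text as a string, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else None


print(price_number("£19.99"))     # 19.99
print(price_number("US$ 19.99"))  # 19.99
```

This only cleans up the text of an element that a parser has already located; it does not search the HTML itself.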

edited Sep 11, 2014 at 23:25

answered Sep 11, 2014 at 23:13


hwnd · Accepted Answer · 2014-09-11 23:27:21Z

4

You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.

>>> from bs4 import BeautifulSoup
>>> html = '''
<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>'''
>>> soup = BeautifulSoup(html)
>>> val  = soup.find('span', {'class':'price'}).text
>>> print val[1:]
19.99

edited Sep 11, 2014 at 23:27

answered Sep 11, 2014 at 23:14


Padraic Cunningham · Accepted Answer · 2014-09-11 23:22:13Z

2

You can still parse this with BeautifulSoup; you don't need the full HTML:

from bs4 import BeautifulSoup
html="""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(html)
sp = soup.find(attrs={"class":"price"}) 
print sp.text[1:]
19.99

edited Sep 11, 2014 at 23:22

answered Sep 11, 2014 at 23:12


Adam Smith · Accepted Answer · 2014-09-11 23:33:42Z

2

The current BeautifulSoup answers only show how to grab all <span class="price"> tags. This is better:

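from bs4 import BeautifulSoup

html = """<div id="left-stack">
 <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

# Parse the fragment before querying it.
soup = BeautifulSoup(html, 'html.parser')

for HD_Version in (tag for tag in soup('li') if tag.text.strip().lower() == 'hd version'):
    # The <li> sits inside the <ul>; the price span is an earlier sibling of that <ul>.
    price = HD_Version.parent.find_previous_sibling('span', attrs={'class': 'price'}).text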

In general, using regular expressions to parse an irregular language like HTML is asking for trouble. Stick with an established parser.
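If BeautifulSoup is not available, the same "price that precedes HD Version" idea can be sketched with the standard library's html.parser, which also tolerates fragments and stray closing tags. This is a swapped-in technique, not the code from this answer. The class name HDPriceParser and the function hd_price are hypothetical, and the sketch assumes the price span has no nested tags:

```python
from html.parser import HTMLParser


class HDPriceParser(HTMLParser):
    """Remember the latest <span class="price"> text; capture it at 'HD Version'."""

    def __init__(self):
        super().__init__()
        self._in_price = False   # currently inside <span class="price">
        self._in_li = False      # currently inside <li>
        self._last_price = None  # most recent price text seen so far
        self.hd_price = None     # price preceding the first "HD Version" item

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True
        elif tag == "li":
            self._in_li = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            self._last_price = text
        elif self._in_li and text.lower() == "hd version" and self.hd_price is None:
            self.hd_price = self._last_price


def hd_price(fragment):
    """Return the price text that precedes the first 'HD Version' item, or None."""
    parser = HDPriceParser()
    parser.feed(fragment)
    parser.close()
    return parser.hd_price


fragment = """<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

print(hd_price(fragment))  # £19.99
```

Unlike the BeautifulSoup version, this tracks document order rather than sibling relationships, so it returns whichever price span appeared most recently before the "HD Version" item.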

edited Sep 11, 2014 at 23:33

answered Sep 11, 2014 at 23:24


3 Comments

Padraic Cunningham Over a year ago

If the OP only has the snippet, then there is only one <span class="price">.

Adam Smith Over a year ago

@PadraicCunningham Right, but I don't necessarily trust the OP's input. This should grab the first <span class="price"> before the <ul> that contains <li>HD Version</li>. After all, if the OP only has that exact snippet, why is he looking for a programmatic solution?

Adam Smith Over a year ago

@Unihedron fair enough. I added a clarification that I was referring to the beautifulsoup answers, and also the usual warning not to do what OP is trying to do

Unihedron · Accepted Answer · 2014-09-11 23:12:56Z

0

You can use this regex:

\d+(?:\.\d+)?(?=\D+HD Version)

\D+ in the lookahead consumes only non-digits, which asserts that our match (19.99) is the last number before HD Version.

Use the i modifier (re.IGNORECASE in Python) to make the matching case-insensitive, and change + to * if the number can come directly before HD Version.
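A minimal sketch of applying this pattern from Python, assuming the fragment is held in a string. The names PRICE_BEFORE_HD and price_before_hd are hypothetical; re.search is used rather than re.match because the price is not at the start of the string:

```python
import re

# The pattern from this answer, compiled case-insensitively so "hd version" also matches.
PRICE_BEFORE_HD = re.compile(r"\d+(?:\.\d+)?(?=\D+HD Version)", re.IGNORECASE)


def price_before_hd(fragment):
    """Return the number separated from "HD Version" only by non-digits, or None."""
    match = PRICE_BEFORE_HD.search(fragment)
    return match.group(0) if match else None


fragment = """<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

print(price_before_hd(fragment))  # 19.99
```

Note that any digit between the price and "HD Version", for example inside an attribute value, makes the lookahead fail for that price.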

edited Sep 11, 2014 at 23:12

answered Sep 11, 2014 at 23:07
