1

I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.

<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>

Basically, the rule would be: find the price before the words "HD Version" (case-insensitive). Here is what I have so far:

re.match(r'^(\d|.){1,6}...HD\sVersion', string)

How would I extract the value "19.99" from the above string?

3
  • 1
    Please insert here the usual admonishment about parsing HTML with regex Commented Sep 11, 2014 at 23:06
  • I cannot do that, I don't have the full html. I only have the above snippet as a string. Commented Sep 11, 2014 at 23:07
  • 2
    you can still use BeautifulSoup Commented Sep 11, 2014 at 23:07

5 Answers 5

4

BeautifulSoup is very lenient about the HTML it parses, so you can use it on chunks of HTML too:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
print(soup.find('span', class_='price').text[1:])  # [1:] drops the leading £ sign

Prints:

19.99

4

You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.

>>> from bs4 import BeautifulSoup
>>> html = '''
... <div id="left-stack">
...   <span>View In iTunes</span></a>
...  <span class="price">£19.99</span>
...  <ul class="list">
...     <li>HD Version</li>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> val = soup.find('span', {'class': 'price'}).text
>>> print(val[1:])
19.99


2

You can still parse it with BeautifulSoup; you don't need the full HTML:

from bs4 import BeautifulSoup
html="""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(html, "html.parser")
sp = soup.find(attrs={"class": "price"})
print(sp.text[1:])  # prints 19.99


2

The current BeautifulSoup answers only show how to grab all <span class="price"> tags. This is better:

from bs4 import BeautifulSoup

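# Parse the fragment so that soup('li') can search it.
soup = BeautifulSoup("""<div id="left-stack">
 <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>""", "html.parser")

# For each <li> reading "HD Version", take the closest preceding price <span>.
for HD_Version in (tag for tag in soup('li') if tag.text.lower() == 'hd version'):
    price = HD_Version.parent.find_previous_sibling('span', attrs={'class': 'price'}).text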

In general, using regular expressions to parse an irregular language like HTML is asking for trouble. Stick with an established parser.

3 Comments

If the OP only has the snippet, then there is only one <span class="price">.
@PadraicCunningham Right, but I don't necessarily trust the OP's input. This should grab the first <span class="price"> before the <ul> that contains <li>HD Version</li>. After all if OP only has that exact snippet, then why is he looking for a programmatic solution?
@Unihedron fair enough. I added a clarification that I was referring to the beautifulsoup answers, and also the usual warning not to do what OP is trying to do
0

You can use this regex:

\d+(?:\.\d+)?(?=\D+HD Version)
  • \D+ in the lookahead skips over non-digits, effectively asserting that our match (19.99) is the last number before HD Version.

Here is a regex demo.

Use the i modifier to make the match case-insensitive, and change + to * if the number can appear directly before HD Version.
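
A minimal sketch of how this pattern might be applied from Python, assuming the fragment is available as a string. The names HD_PRICE and extract_hd_price are illustrative and not part of the original answer; re.IGNORECASE stands in for the i modifier.

```python
import re

# The pattern from the answer; re.IGNORECASE plays the role of the "i" modifier.
HD_PRICE = re.compile(r"\d+(?:\.\d+)?(?=\D+HD Version)", re.IGNORECASE)


def extract_hd_price(fragment):
    """Return the number closest before "HD Version" as a string, or None."""
    # search() scans the whole string; match() would only try position 0.
    found = HD_PRICE.search(fragment)
    return found.group(0) if found else None


snippet = """<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

print(extract_hd_price(snippet))
```

The function returns the price as a string rather than a float, so the caller can decide how to handle currency values (for example with decimal.Decimal).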
