regex regular expression python

Question

I'm having trouble with Python's re.findall. I'm fetching a web page's HTML and trying to extract the name of a product (in this case 'bread') and print it to the console.

Don't parse HTML with regular expressions. Many people will tell you this.
– squiguy, Commented Apr 15, 2013 at 3:09
Looks to me like you're getting the number of spaces wrong. Try \s+ instead to be less dependent on the count, like "Item:\s+is in\s+lane 12\s+(\w*)". (Disclaimer: not really tested.) And while the advice not to use regex to parse HTML is good, and something like BeautifulSoup will make it easier to get at the text, if you want to extract bread from the text, you'll probably wind up using regexes at that point anyway.
– DSM, Commented Apr 15, 2013 at 3:16
Wow, DSM, that did the trick. I can't believe it was just a matter of using \s+. I don't know how the spaces were incorrect. I tried over a hundred times and even copied and pasted the HTML. Thanks a lot.
– Calvin Jones, Commented Apr 15, 2013 at 3:23

nad2000 · Accepted Answer · 2013-04-15 03:35:39Z

3

Don't use regex for HTML parsing. There are several alternatives; I suggest BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

Having said that, in this particular case a regex will suffice. Just relax it a notch. There might be more or fewer spaces, or they might be tabs. So instead of literal spaces, use the whitespace class \s:

product = re.findall(r'Item:\s*is\s*in\s*lane\s*12\s*(\w*)', content)
print(product[0])

Since the '*', '+', and '?' quantifiers are all greedy (they match as much text as possible), (\w*) already captures the whole word, so you don't need to close the pattern with a trailing [^<]*<br>.
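To see why relaxing the whitespace matters, here is a minimal sketch comparing literal spaces with \s*. The content string is a made-up fragment (the question's actual HTML isn't shown), so the mix of tabs, double spaces, and a newline is an assumption for illustration only.

```python
import re

# Hypothetical page fragment: the real HTML is not shown in the question,
# so the tab, the double space, and the newline here are assumptions.
content = "<br>\nItem:\t is in  lane 12\n   bread<br>\n"

# Literal single spaces only match if the page uses exactly one space each time.
strict = re.findall(r'Item: is in lane 12 (\w*)', content)

# \s* accepts any run of spaces, tabs, or newlines (including none at all).
relaxed = re.findall(r'Item:\s*is\s*in\s*lane\s*12\s*(\w*)', content)

print(strict)   # []
print(relaxed)  # ['bread']
```

Because the pattern has exactly one capturing group, findall returns a list of the captured strings rather than the full matches.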

edited Apr 15, 2013 at 3:35

answered Apr 15, 2013 at 3:14

nad2000


Aleksei Zyrianov · Answer · 2013-04-15 03:31:13Z

1

In case you still want to use regexps, here's a working one for your case:

product = re.findall(r'<br>\s*Item:\s+is\s+in\s+lane 12\s+(\w*)[^<]*<br>', content)

It takes into account DSM's space flexibility suggestion and non-letters after (\w*) that might appear before <br>.
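A short sketch of what the trailing [^<]*<br> buys you when the match has to run up to the closing <br>. The sample string, including the "(2 loaves)" text after the product name, is hypothetical; the real page markup isn't shown in the question.

```python
import re

# Hypothetical fragment: extra non-letter text follows the product name
# before the closing <br>. This layout is an assumption for illustration.
content = "<br>  Item:  is  in  lane 12   bread (2 loaves)<br>"

# Pattern from the answer: [^<]* skips anything that is not '<' up to the <br>.
with_tail = re.findall(
    r'<br>\s*Item:\s+is\s+in\s+lane 12\s+(\w*)[^<]*<br>', content)

# Same pattern without [^<]*: the word must be followed directly by <br>.
without_tail = re.findall(
    r'<br>\s*Item:\s+is\s+in\s+lane 12\s+(\w*)<br>', content)

print(with_tail)     # ['bread']
print(without_tail)  # []
```

Note that "lane 12" still uses a literal single space in this pattern, so it would not match if the page put a tab or several spaces there.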

answered Apr 15, 2013 at 3:31

Aleksei Zyrianov
