Regex in Python for html

Question

I want to write a regular expression that matches:

<td class="prodSpecAtribute" rowspan="2">[words]</td>

or

<td class="prodSpecAtribute">[words]</td>

For the second case I have:

find2 = re.compile('<td class="prodSpecAtribute">(.*)</td>')

But how can I create a single regex that matches either of the two expressions?

Are you limited to regex in this situation? Sometimes it's safer not to use regex for HTML parsing... (see Beautiful Soup, or something similar...)
– summea, Commented May 21, 2013 at 19:27
@MikeSamuel: Well, before 2.7.3 and 3.2.something it's actually kind of slow and finicky… but yeah, still better than trying to solve an HTML parsing problem with regex.
– abarnert, Commented May 21, 2013 at 19:29

Andrew Clark · Accepted Answer · 2013-05-21 20:24:44Z

4

Don't use regular expressions for this; use an HTML parser like BeautifulSoup. For example:

>>> from bs4 import BeautifulSoup
>>> soup1 = BeautifulSoup('<td class="prodSpecAtribute" rowspan="2">[words]</td>')
>>> soup1.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'
>>> soup2 = BeautifulSoup('<td class="prodSpecAtribute">[words]</td>')
>>> soup2.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'

Or to find all matches:

from bs4 import BeautifulSoup

# `page` is the HTML source as a string
soup = BeautifulSoup(page, 'html.parser')
for td in soup.find_all('td', class_='prodSpecAtribute'):
    print(td.contents[0])

With BeautifulSoup 3:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 package name
soup = BeautifulSoup(page)
for td in soup.findAll('td', {'class': 'prodSpecAtribute'}):
    print(td.contents[0])

edited May 21, 2013 at 20:24

answered May 21, 2013 at 19:30

Andrew Clark


8 Comments

DSM Over a year ago

I might use find_all here instead to handle the multiple-tag case.

Josh Over a year ago

@DSM could you please elaborate, I didn't understand your point. Thanks

Josh Over a year ago

Would this be correct: 'soup = BeautifulSoup(page) info = soup.findAll('td', class_= prodSpecAtribute)'

Josh Over a year ago

@F.J I get an error: for elements in soup.findall('td', class_= 'prodSpecAtribtue'): TypeError: 'NoneType' object is not callable

Andrew Clark Over a year ago

This indicates that soup is None, did you import BeautifulSoup and use soup = BeautifulSoup(page) before this?
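
For readers hitting the same TypeError: a likely cause, assuming BeautifulSoup 4, is the lowercase `findall` in the call above rather than `soup` being None. BeautifulSoup treats an unknown attribute name as a lookup for a tag with that name, so `soup.findall` searches for a `<findall>` tag and evaluates to None, and calling None raises "'NoneType' object is not callable". The method is spelled `find_all` in BeautifulSoup 4 and `findAll` in BeautifulSoup 3. Note also that the class name in that call is misspelled ('prodSpecAtribtue'). A minimal sketch, in which the sample `page` string is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample input; the real `page` would be the fetched HTML.
page = '<table><tr><td class="prodSpecAtribute">[words]</td></tr></table>'
soup = BeautifulSoup(page, 'html.parser')

# Unknown attribute names are treated as tag lookups, so this searches
# for a <findall> tag, finds none, and evaluates to None.
lookup = soup.findall
print(lookup)

# The BeautifulSoup 4 method is find_all (with an underscore).
texts = [td.get_text() for td in soup.find_all('td', class_='prodSpecAtribute')]
print(texts)
```

Calling `lookup(...)` here would reproduce the TypeError from the comment above.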

guettli · Answer · 2013-05-21 19:29:13Z

3

If you want a regex:

find2 = re.compile('<td class="prodSpecAtribute"( rowspan="2")?>(.*)</td>')

But I would use BeautifulSoup.
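
One detail worth noting about the pattern above: the optional ` rowspan="2"` part is itself a capturing group, so the cell text lands in group 2 instead of group 1. A minimal sketch, assuming the two sample cells from the question, that compares it with a non-capturing `(?:...)` variant (the names `original`, `cell_re`, and `samples` are only for illustration):

```python
import re

samples = [
    '<td class="prodSpecAtribute" rowspan="2">[words]</td>',
    '<td class="prodSpecAtribute">[words]</td>',
]

# Pattern from the answer: the optional attribute is group 1,
# so the cell text is group 2.
original = re.compile(r'<td class="prodSpecAtribute"( rowspan="2")?>(.*)</td>')
print([original.search(s).group(2) for s in samples])

# Non-capturing variant: (?:...) groups without capturing,
# so the cell text stays in group 1.
cell_re = re.compile(r'<td class="prodSpecAtribute"(?: rowspan="2")?>(.*)</td>')
texts = [cell_re.search(s).group(1) for s in samples]
print(texts)
```

Both patterns only cover the two exact forms shown in the question; any other attribute or spacing will not match.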

answered May 21, 2013 at 19:29

guettli

1 Comment

abarnert Over a year ago

Great answer. But you might want to show how simple and readable the BeautifulSoup one-liner solution is.

Zsolt Botykai · Answer · 2013-05-21 19:30:24Z

0

find2 = re.compile('<td class="prodSpecAtribute"[^>]*>(.*)</td>')

This will work, but there are better solutions for HTML parsing...

answered May 21, 2013 at 19:30

Zsolt Botykai

2 Comments

eyquem Over a year ago

You must limit the greedy nature of .*

Zsolt Botykai Over a year ago

@eyquem No I must not. The one who asked a question must. And for the sample data he had provided my solution works. But you are right of course.
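
The greediness point can be illustrated with a short sketch. It assumes an input the question's sample data does not include, namely two matching cells on the same line; the `row` string is made up for illustration:

```python
import re

# Two cells on one line: a case the question's sample data does not cover.
row = ('<td class="prodSpecAtribute">first</td>'
       '<td class="prodSpecAtribute" rowspan="2">second</td>')

# Greedy .* runs to the last </td> on the line.
greedy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*)</td>')
# Lazy .*? stops at the first </td> after each opening tag.
lazy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*?)</td>')

greedy_hits = greedy.findall(row)
lazy_hits = lazy.findall(row)
print(greedy_hits)  # one match that swallows both cells
print(lazy_hits)    # ['first', 'second']
```

For a single cell per line, as in the question, both patterns give the same result.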

Visgean Skeloru · Answer · 2013-05-21 21:23:07Z

0

I would recommend neither regex nor BeautifulSoup. There is a project, pyquery (http://pythonhosted.org/pyquery/), that is much faster because it uses the lxml.html library; a speed comparison can be found here: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/. In my own experience, BeautifulSoup is really slow.

So in your situation it is as easy as this:

>>> from pyquery import PyQuery as pq
>>> page = pq('<td class="prodSpecAtribute">[words]</td>')
>>> page('.prodSpecAtribute').text()
'[words]'

Once again BS is really slow.

answered May 21, 2013 at 21:23

Visgean Skeloru
