0

I want to write a regular expression that matches either:

<td class="prodSpecAtribute" rowspan="2">[words]</td>

or

<td class="prodSpecAtribute">[words]</td>

For the second case I have:

find2 = re.compile('<td class="prodSpecAtribute">(.*)</td>')

But how can I write a single regex that matches either of the two forms?

  • Are you limited to regex in this situation? Sometimes it's safer not to use regex for HTML parsing (see Beautiful Soup, or something similar). Commented May 21, 2013 at 19:27
  • Python has a good HTML parser Commented May 21, 2013 at 19:27
  • <td class="prodSpecAtribute"[^>]*?>(.*?)</td> Commented May 21, 2013 at 19:28
  • @MikeSamuel: Well, before 2.7.3 and 3.2.something it's actually kind of slow and finicky… but yeah, still better than trying to solve an HTML parsing problem with regex. Commented May 21, 2013 at 19:29
  • It's pretty simple to use XPath for this kind of task. Commented May 21, 2013 at 19:30
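
The standard-library parser mentioned in the comments can be sketched roughly as follows. This is only an illustration, assuming Python 3 and a made-up sample table; `CellCollector` and `extract_cells` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose class is prodSpecAtribute."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while inside a matching <td>
        self.cells = []        # text collected from each matching cell

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; other attributes such as
        # rowspan are simply ignored.
        if tag == 'td' and dict(attrs).get('class') == 'prodSpecAtribute':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the text of all prodSpecAtribute cells in the given markup."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up sample covering both forms from the question plus an unrelated cell.
sample = ('<table><tr>'
          '<td class="prodSpecAtribute" rowspan="2">Color</td>'
          '<td class="prodSpecAtribute">Red</td>'
          '<td class="other">x</td>'
          '</tr></table>')
print(extract_cells(sample))
```

Because the parser inspects attributes by name, the presence or absence of `rowspan` makes no difference to the match.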

4 Answers 4

4

Don't use regular expressions for this; use an HTML parser like BeautifulSoup. For example:

>>> from bs4 import BeautifulSoup
>>> soup1 = BeautifulSoup('<td class="prodSpecAtribute" rowspan="2">[words]</td>')
>>> soup1.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'
>>> soup2 = BeautifulSoup('<td class="prodSpecAtribute">[words]</td>')
>>> soup2.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'

Or to find all matches:

soup = BeautifulSoup(page)
for td in soup.find_all('td', class_='prodSpecAtribute'):
    print(td.contents[0])

With BeautifulSoup 3:

soup = BeautifulSoup(page)
for td in soup.findAll('td', {'class': 'prodSpecAtribute'}):
    print td.contents[0]

8 Comments

I might use find_all here instead to handle the multiple-tag case.
@DSM could you please elaborate, I didn't understand your point. Thanks
Would this be correct: 'soup = BeautifulSoup(page) info = soup.findAll('td', class_= prodSpecAtribute)'
@F.J I get an error: for elements in soup.findall('td', class_= 'prodSpecAtribtue'): TypeError: 'NoneType' object is not callable
This indicates that soup is None, did you import BeautifulSoup and use soup = BeautifulSoup(page) before this?
3

If you specifically want a regex:

find2 = re.compile(r'<td class="prodSpecAtribute"(?: rowspan="2")?>(.*?)</td>')

But I would use BeautifulSoup.

1 Comment

Great answer. But you might want to show how simple and readable the BeautifulSoup one-liner solution is.
0
find2 = re.compile('<td class="prodSpecAtribute"[^>]*>(.*)</td>')

This will work, but there are better solutions for HTML parsing.

2 Comments

You must limit the greedy nature of .*
@eyquem No I must not. The one who asked a question must. And for the sample data he had provided my solution works. But you are right of course.
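
The greediness point raised above can be illustrated with a small sketch. The sample string is made up for illustration and puts both cell forms on one line, which is the case where greedy and non-greedy matching differ.

```python
import re

# Made-up sample: both cell forms from the question on a single line.
html = ('<td class="prodSpecAtribute" rowspan="2">Color</td>'
        '<td class="prodSpecAtribute">Red</td>')

# [^>]* skips any extra attributes such as rowspan; (.*) is greedy and
# runs to the last </td> on the line.
greedy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*)</td>')

# (.*?) is non-greedy and stops at the first </td> it can.
lazy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*?)</td>')

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

print(greedy_matches)  # one match that swallows both cells
print(lazy_matches)    # one match per cell
```

When each cell sits on its own line, both patterns give the same result, because `.` does not match a newline by default.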
0

I would recommend neither regex nor BeautifulSoup. The pyquery project (http://pythonhosted.org/pyquery/) is much faster because it uses the lxml.html library; a speed comparison can be found here: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/. In my own experience, BeautifulSoup is really slow.

In your situation it is as easy as this:

>>> from pyquery import PyQuery as pq
>>> page = pq('<td class="prodSpecAtribute">[words]</td>')
>>> page('.prodSpecAtribute').text()
'[words]'

Once again, BeautifulSoup is really slow.
