Python regular expression for multiple tags

Question

I would like to know how to retrieve all results from each <p> tag.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match(r'<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())

result:

('item1', )

what I need:

('item1', 'item2', 'item3')

-1 for trying to parse non-regular languages with regular expressions.
– Svante, Commented Jun 10, 2009 at 0:39
Agreed. Isn't there a Python library that's famous for parsing HTML? BeautifulSoup? htmllib?
– DevelopingChris, Commented Jun 10, 2009 at 13:56
Thanks for your response. I needed a Python way to print out all the values of the p tags from a small HTML document without installing anything new on the server.
– Felipe Andrade, Commented Jun 16, 2009 at 16:24

Peter Boughton · Accepted Answer · 2009-06-09 22:14:02Z

11

For this type of problem, it is recommended to use a DOM parser, not regex.

I've seen Beautiful Soup frequently recommended for Python.
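If installing a third-party library is not an option, the same "use a parser, not a regex" idea can be sketched with the standard library alone. Note that this is an event-based parser rather than a DOM parser, it assumes Python 3 (where the module is named html.parser), and the class name ParagraphText is made up for this illustration.

```python
from html.parser import HTMLParser  # Python 3 standard library


class ParagraphText(HTMLParser):
    """Hypothetical helper: collect the text inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.items = []     # finished paragraph texts
        self._in_p = False  # True while between <p ...> and </p>
        self._chunks = []   # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.items.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is also reported here.
        if self._in_p:
            self._chunks.append(data)


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
parser = ParagraphText()
parser.feed(html_text)
parser.close()
print(tuple(parser.items))
```

Because the parser looks at tag names rather than at the exact spelling of the attributes, attribute order, extra whitespace, or a two-digit size value should not affect the result.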

answered Jun 9, 2009 at 22:14

Peter Boughton


Kenan Banks · Accepted Answer · 2009-06-10 12:50:23Z

5

The regex answer is extremely fragile. Here's proof (and a working BeautifulSoup example).

from BeautifulSoup import BeautifulSoup

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted 
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html).findAll('p')
items = [''.join(p.findAll(text=True)) for p in paragraphs]

Use BeautifulSoup.
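The fragility claim can be checked without BeautifulSoup. The sketch below runs the regex from the question over the four sample strings above, using only the re module (Python 3 syntax assumed; the names pattern, samples and results are just for this illustration).

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    "html":  '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html2": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html3": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html4": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

# Collect every captured group for each sample string.
results = {name: pattern.findall(text) for name, text in samples.items()}
for name, found in results.items():
    print(name, found)
```

Only the original string yields all three items. In html2, html3 and html4 the first <p> tag is skipped, because the pattern requires a single-digit size attribute immediately followed by the closing bracket.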

edited Jun 10, 2009 at 12:50

answered Jun 10, 2009 at 3:19

Kenan Banks


4 Comments

Brett Bim Over a year ago

I don't think you need to import re. Also, I'm curious what your example provides that mine doesn't other than the list comprehension.

Kenan Banks Over a year ago

Brett - mine will correctly handle cases like item1, whereas yours will fail. Also, the items array here will convert to a list of strings, whereas your example will return tag.contents, which is actually a (very memory hungry) BeautifulSoup object.

Brett Bim Over a year ago

Cool! I didn't know about the object being memory intensive, I've only used it on small parsing projects and never run into issues. Thanks for the update. I voted yours up based on your explanation.

Kenan Banks Over a year ago

I've used BeautifulSoup for some very large (500KB+) HTML files, and you run into a pretty hard wall if you don't learn to conserve memory. BeautifulSoup is extremely convenient but NOT very efficient.

Brett Bim · Accepted Answer · 2009-06-09 23:00:36Z

5

Beautiful Soup is definitely the way to go with a problem like this. The code is cleaner and easier to read. Once you have it installed, getting all the tags looks something like this:

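from BeautifulSoup import BeautifulSoup
import urllib2

def getTags(tag):
    # Download the page and return every element with the given tag name.
    f = urllib2.urlopen("http://cnn.com")
    soup = BeautifulSoup(f.read())
    return soup.findAll(tag)


if __name__ == '__main__':
    tags = getTags('p')
    for tag in tags:
        # tag.contents is the list of child nodes of the element.
        print(tag.contents)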

This will print out all the values of the p tags.

answered Jun 9, 2009 at 23:00

Brett Bim


1 Comment

Felipe Andrade Over a year ago

Thanks for your response. I just needed a Python way to print out all the values of the p tags without installing anything new on the server.

Stephan202 · Accepted Answer · 2009-06-09 22:38:25Z

2

Alternatively, xml.dom.minidom will parse your HTML if:

...it is well-formed
...you embed it in a single root element.

E.g.,

>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')
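The well-formedness condition matters in practice. As a rough sketch (Python 3 assumed; the function name p_texts and the sample strings are made up here), the same approach wrapped in a function succeeds on XML-clean input and raises ExpatError on ordinary HTML such as an unclosed <br>:

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def p_texts(fragment):
    """Wrap the fragment in a dummy root and return each child's leading text."""
    doc = xml.dom.minidom.parseString('<not_p>%s</not_p>' % fragment)
    return tuple(
        node.firstChild.wholeText
        for node in doc.firstChild.childNodes
        # Skip anything that is not an element starting with a text node.
        if node.nodeType == node.ELEMENT_NODE
        and node.firstChild is not None
        and node.firstChild.nodeType == node.TEXT_NODE
    )


well_formed = '<p data="5" size="4">item1</p><p size="4">item2</p>'
print(p_texts(well_formed))

# An unclosed <br> is common in HTML but is not well-formed XML.
not_well_formed = '<p>item1<br></p>'
try:
    p_texts(not_well_formed)
    parse_failed = False
except ExpatError as exc:
    parse_failed = True
    print("minidom rejected the input:", exc)
```

So this route is only safe when you control the markup and know it is valid XML.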

answered Jun 9, 2009 at 22:38

Stephan202


RichieHindle · Accepted Answer · 2009-06-10 13:53:20Z

2

You can use re.findall like this:

import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.findall(r'<p[^>]*size="[0-9]">(.*?)</p>', html))
# This prints: ['item1', 'item2', 'item3']

Edit: ...but as the many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.
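For reference, the reason the original call returned a single item: re.match tries the pattern only once, at the start of the string, and .groups() reports the capture groups of that one match. re.findall and re.finditer keep scanning for further non-overlapping matches. A small comparison, assuming Python 3 (the variable names are illustrative only):

```python
import re

html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match(): one attempt at position 0, so only the first tag is captured.
first_only = pattern.match(html).groups()

# findall(): every non-overlapping match; with one group it returns the group text.
all_items = tuple(pattern.findall(html))

# finditer(): same matches, yielded one at a time as match objects.
lazy_items = tuple(m.group(1) for m in pattern.finditer(html))

print(first_only)
print(all_items)
print(lazy_items)
```

This only explains the difference between the two calls; it does not make the pattern any more robust against variations in the markup.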

edited Jun 10, 2009 at 13:53

answered Jun 9, 2009 at 22:12

RichieHindle


5 Comments

Felipe Andrade Over a year ago

Thanks! I just found it on Python docs! docs.python.org/dev/howto/regex.html

Kenan Banks Over a year ago

I'm sorry but this is an awful answer. What if there's a space between the size attribute and the closing bracket?

RichieHindle Over a year ago

@Triptych: There isn't. Have you considered the possibility that the OP knows what he's doing? 8-) Had the question been "How do I parse this HTML?" then I wouldn't have suggested a regular expression. But it was "How do I make my regular expression work?", and this is an answer to that question.

nosklo Over a year ago

-1: gave an example of regex to parse html, without even saying that this is really bad, and lots of newbies will read. Evil comes from acts like that.

Brett Bim Over a year ago

@RichieHindle: The original poster didn't say anything about making a regular expression work. He said he wanted to retrieve the results from each p tag. Regular expressions aren't suited to do that.
