Python regexp for parsing an HTML page

Question

Good day. I have a small problem with a regexp.

I have a regexp that looks like this:

rexp2 = re.findall(r'<p>(.*?)</p>', data)

And I need to grab everything inside the <p> element of this markup:

<div id="header">
<h1></h1>
<p>
localhost OpenWrt Backfire<br />
Load: 0.00 0.00 0.00<br />
Hostname: localhost
</p>
</div>

But my code doesn't work :( What am I doing wrong?

What does "doesn't work" look like?
– duffymo, Commented Sep 6, 2010 at 12:54

stackoverflow.com/questions/1732348/…
– wRAR, Commented Sep 6, 2010 at 13:00

Community · Accepted Answer · 2017-05-23 12:07:01Z

4

Statutory Warning: It is a Bad Idea to parse (X)HTML using regular expressions.

Fortunately there is a better way. To get going, first install the BeautifulSoup module. Next, read up on the documentation. Third, code!

Here is one way to do what you are trying to do:

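# Requires the beautifulsoup4 package (imported as bs4); written for Python 3.
from bs4 import BeautifulSoup

html = """<div id="header">
<h1></h1>
<p>
localhost OpenWrt Backfire<br />
Load: 0.00 0.00 0.00<br />
Hostname: localhost
</p>
</div>"""

# Parse with the standard-library backend so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")

# Print every <p> element found in the document.
for paragraph in soup.find_all("p"):
    print(paragraph)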

edited May 23, 2017 at 12:07

CommunityBot


answered Sep 6, 2010 at 13:22

Manoj Govindan


duffymo · Accepted Answer · 2010-09-06 12:53:38Z

1

I wouldn't recommend using regular expressions this way. Try parsing HTML with Beautiful Soup instead and walk the DOM tree.
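To illustrate the parser-instead-of-regex idea without installing anything, here is a minimal sketch. Note the substitution: it uses `html.parser` from Python's standard library rather than Beautiful Soup, and it is event-based rather than a true DOM walk. The class name `ParagraphText` and its attributes are made up for this example, and the markup is copied from the question.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: collect the text lines found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <p> elements we are currently inside
        self.lines = []  # non-empty, stripped text lines seen inside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while we are inside a <p>.
        if self.depth:
            for line in data.splitlines():
                if line.strip():
                    self.lines.append(line.strip())


html = """<div id="header">
<h1></h1>
<p>
localhost OpenWrt Backfire<br />
Load: 0.00 0.00 0.00<br />
Hostname: localhost
</p>
</div>"""

parser = ParagraphText()
parser.feed(html)
parser.close()
print(parser.lines)
# ['localhost OpenWrt Backfire', 'Load: 0.00 0.00 0.00', 'Hostname: localhost']
```

Because the parser tracks tags itself, the line breaks and `<br />` elements inside the paragraph need no special handling.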

answered Sep 6, 2010 at 12:53

duffymo


1 Comment

Alexander Over a year ago

OK. How can I do it with Beautiful Soup?

jcubic · Accepted Answer · 2010-09-06 12:58:09Z

0

By default the dot does not match newlines; use re.DOTALL:

re.findall(r'<p>(.*?)</p>', data, re.DOTALL)
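A quick way to see the difference is to run the pattern from the question with and without the flag. This is only a sketch against the sample markup from the question; the variable names `without_flag` and `with_dotall` are made up for illustration.

```python
import re

# Sample markup copied from the question.
data = """<div id="header">
<h1></h1>
<p>
localhost OpenWrt Backfire<br />
Load: 0.00 0.00 0.00<br />
Hostname: localhost
</p>
</div>"""

pattern = r'<p>(.*?)</p>'

# Without a flag, "." stops at newlines, so the multi-line paragraph is missed.
without_flag = re.findall(pattern, data)

# With re.DOTALL, "." also matches newlines, so the paragraph body is captured.
with_dotall = re.findall(pattern, data, re.DOTALL)

print(without_flag)  # []
print(with_dotall)   # one string holding the three lines, <br /> tags included
```

The captured text still contains the `<br />` tags and the surrounding newlines, so some cleanup is needed afterwards.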

answered Sep 6, 2010 at 12:58

jcubic


rkhayrov · Accepted Answer · 2010-09-06 12:59:15Z

0

You need to specify the re.S (re.DOTALL) flag so that the dot matches newlines and the pattern can span several lines; re.M (multiline) only changes how ^ and $ behave. But parsing HTML with regexps isn't a particularly good idea.

It looks like you want some stats from an OpenWrt-powered router. Why don't you write a simple CGI script that outputs the required information in a machine-readable format?
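As a rough sketch of what such a script might look like, assuming a Python 3 interpreter is available on the router (it may well not be) and that JSON is an acceptable format: the function names `build_status` and `render_cgi_response` are made up for this example, and `os.getloadavg()` only works on Unix-like systems.

```python
#!/usr/bin/env python3
import json
import os
import socket


def build_status():
    """Return the hostname and load averages as a JSON-serializable dict."""
    load1, load5, load15 = os.getloadavg()  # Unix-only
    return {"hostname": socket.gethostname(), "load": [load1, load5, load15]}


def render_cgi_response(status):
    """Format a minimal CGI response: header, blank line, then the JSON body."""
    return "Content-Type: application/json\n\n" + json.dumps(status)


if __name__ == "__main__":
    # A CGI-capable web server runs the script and relays its standard output.
    print(render_cgi_response(build_status()))
```

The client can then read the values with `json.loads` instead of scraping them out of the HTML page.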

answered Sep 6, 2010 at 12:59

rkhayrov
