Python Regex Help

Question

I am trying to sort through HTML tags and I can't seem to get it right.

What I have done so far

import urllib
import re

s = raw_input('Enter URL: ')
f = urllib.urlopen(s) 
s = f.read() 
f.close 
r = re.compile('<TAG\b[^>]*>(.*?)</TAG>',)
result = re.findall(r, s)
print(result)

Where I replace "TAG" with tag I want to see.

Thanks in advance.

Use an XML parser to parse HTML. Mandatory link: stackoverflow.com/questions/1732348/… — Sven Marnach
– Sven Marnach, Commented Jan 31, 2011 at 22:05
Don't parse HTML with regex. Regex is an insufficiently complex tool to parse HTML. If someone is asking you to do this, beat them over the head with a stick and then use BeautifulSoup instead. It'll be less painful for the both of you. — Chinmay Kanchi
– Chinmay Kanchi, Commented Jan 31, 2011 at 22:27
It is not a good idea to use a xml parser if you are scanning html from the web some html page are very far from a xml compliant file. — VGE
– VGE, Commented Feb 1, 2011 at 8:27

Miguel · Accepted Answer · 2011-01-31 22:00:08Z

5

You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.

answered Jan 31, 2011 at 22:00

Miguel

511 bronze badge

Sign up to request clarification or add additional context in comments.

2 Comments

rxmnnxfpvg Over a year ago

BeautifulSoup is perfect for this.

Krayons Over a year ago

This is a regex learning experience and it was the only real example I could come up with. I.E. if dog.avicat.avipig.jpg is there anyway I could sort it out to be dog.avi cat.avi pig.jpg

gerry · Accepted Answer · 2011-01-31 23:38:16Z

1

An example from BS is this

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')
[<b>one</b>, <b>two</b>]

As for a regular expression, you can use

aa = doc[0]
aa
'<html><head><title>Page title</title></head>'
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt,aa)
['Page title']

answered Jan 31, 2011 at 23:38

gerry

1,5691 gold badge12 silver badges23 bronze badges

Comments

Matti Lyra · Accepted Answer · 2011-01-31 22:22:41Z

1

I'm not entirely clear on what you are trying to achieve with the regex. Capturing the contents between two div tags for instance works with

re.compile("<div.*?>.*?</div>")

Although you will run into some problems with nested divs with the above one.

answered Jan 31, 2011 at 22:22

Matti Lyra

13.1k8 gold badges53 silver badges67 bronze badges

2 Comments

Eli Over a year ago

I think you forgot the parentheses around the capture group. Did you mean: re.compile("<div.*?>(.*?)</div>")?

Matti Lyra Over a year ago

He's not using the capture group in his code for anything so I thought it wasn't needed. I'm sure an able programmer can add it should the need for one arise.

Collectives™ on Stack Overflow

Python Regex Help

3 Answers 3

2 Comments

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related