1

I am trying to sort through HTML tags and I can't seem to get it right.

What I have done so far

import urllib
import re

s = raw_input('Enter URL: ')
f = urllib.urlopen(s) 
s = f.read() 
f.close 
r = re.compile('<TAG\b[^>]*>(.*?)</TAG>',)
result = re.findall(r, s)
print(result)

Where I replace "TAG" with tag I want to see.

Thanks in advance.

4
  • 3
    Use an XML parser to parse HTML. Mandatory link: stackoverflow.com/questions/1732348/… Commented Jan 31, 2011 at 22:05
  • 1
    Don't parse HTML with regex. Regex is an insufficiently complex tool to parse HTML. If someone is asking you to do this, beat them over the head with a stick and then use BeautifulSoup instead. It'll be less painful for the both of you. Commented Jan 31, 2011 at 22:27
  • What sort of results are you currently getting? Commented Jan 31, 2011 at 22:27
  • It is not a good idea to use a xml parser if you are scanning html from the web some html page are very far from a xml compliant file. Commented Feb 1, 2011 at 8:27

3 Answers 3

5

You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.

Sign up to request clarification or add additional context in comments.

2 Comments

BeautifulSoup is perfect for this.
This is a regex learning experience and it was the only real example I could come up with. I.E. if dog.avicat.avipig.jpg is there anyway I could sort it out to be dog.avi cat.avi pig.jpg
1

An example from BS is this

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')
[<b>one</b>, <b>two</b>]

As for a regular expression, you can use

aa = doc[0]
aa
'<html><head><title>Page title</title></head>'
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt,aa)
['Page title']

Comments

1

I'm not entirely clear on what you are trying to achieve with the regex. Capturing the contents between two div tags for instance works with

re.compile("<div.*?>.*?</div>")

Although you will run into some problems with nested divs with the above one.

2 Comments

I think you forgot the parentheses around the capture group. Did you mean: re.compile("<div.*?>(.*?)</div>")?
He's not using the capture group in his code for anything so I thought it wasn't needed. I'm sure an able programmer can add it should the need for one arise.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.