1

I want to extract a string from a url (link). That string is in a <h3></h3> tag.

 link = http://www.test.com/page.html

 Content of link: <h3>Text here</h3>

What would be an elegant way to first get the content/source code of page.html and then extract the string? Thanks!

4 Answers

2

I'd recommend Beautiful Soup. It's a nice parser for botched HTML pages (in most cases you don't have to worry about the page not being well-formed).


1
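import urllib2

url = "http://www.test.com/page.html"
page = urllib2.urlopen(url)
data = page.read()
# Each chunk that ended with </h3> may contain an opening <h3>.
for item in data.split("</h3>"):
    if "<h3>" in item:
        # Everything after the opening tag is the heading text.
        print item.split("<h3>")[1]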


1

You can use urllib2 to retrieve the content of the URL:

http://docs.python.org/library/urllib2.html

You could then use the HTML parser in the Python libraries to find the right content:

http://docs.python.org/library/htmlparser.html
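
As a rough sketch of the parser approach, here is how it might look in Python 3, where urllib2 became urllib.request and the HTMLParser module became html.parser. The names H3Collector and extract_h3 are made up for illustration, and the sample HTML is the one from the question; fetching is left as a comment so the snippet runs without network access.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect the text found inside every <h3> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty heading when an <h3> opens.
        if tag == "h3":
            self.in_h3 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        # Text may arrive in several pieces (e.g. around nested tags).
        if self.in_h3:
            self.headings[-1] += data


def extract_h3(html):
    """Return the stripped text of each <h3> in the given HTML string."""
    parser = H3Collector()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


# To use a live page instead of the sample string, one could fetch it first:
#   from urllib.request import urlopen
#   html = urlopen("http://www.test.com/page.html").read().decode("utf-8", "replace")
if __name__ == "__main__":
    print(extract_h3("<h3>Text here</h3>"))
```

Unlike plain string splitting, this also copes with attributes on the tag and with markup nested inside the heading.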


-1

Provided the text you want is the only <h3>-wrapped text on the page, try:

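from urllib2 import urlopen
from re import search

link = "http://www.test.com/page.html"  # the URL from the question
# Lookbehind/lookahead keep the tags out of the match; .+? is non-greedy.
text = search(r'(?<=<h3>).+?(?=</h3>)', urlopen(link).read()).group(0)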

If there are multiple <h3>-wrapped strings, you can either add more detail to the pattern or use re.finditer()/re.findall().
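
For the multiple-heading case, a minimal sketch with re.findall() might look like the following (Python 3). It swaps the lookaround pattern above for a capture group so that tags with attributes also match; find_h3_texts and H3_PATTERN are illustrative names, and the approach assumes headings are not nested or otherwise unusual.

```python
import re

# Non-greedy capture between an opening <h3 ...> and the next </h3>.
# DOTALL lets a heading span several lines; IGNORECASE accepts <H3> too.
H3_PATTERN = re.compile(r"<h3[^>]*>(.*?)</h3>", re.IGNORECASE | re.DOTALL)


def find_h3_texts(html):
    """Return the stripped contents of every <h3> element in html."""
    return [match.strip() for match in H3_PATTERN.findall(html)]


if __name__ == "__main__":
    sample = "<h3>One</h3><p>x</p><h3 class='a'>Two</h3>"
    print(find_h3_texts(sample))
```

Any markup nested inside a heading is returned as-is, so a real parser is the safer choice when the page is more complex.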

2 Comments

You should use a non-greedy qualifier as otherwise it may match something like 'Heading</h3>........<h3>Other Heading'
OP's task is just to get <h3> tags, using regex is perfectly ok.
