I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?

4 Answers

3
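# Python 3: urllib2's functionality now lives in urllib.request
from urllib.request import urlopen
html = urlopen('http://stackoverflow.com').read()  # raw bytes of the page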

That's the simple answer, but for the parsing itself you should really look at BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/
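
A fetch like the one above hands back the raw response body, which in Python 3 is bytes rather than text. The sketch below, assuming Python 3 and only the standard library, decodes it using the charset the server declares and falls back to UTF-8 otherwise. The helper name `fetch_html` and its parameters are illustrative and do not come from the answer.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10, default_charset="utf-8"):
    """Return the page at `url` as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not str
        # Prefer the charset from the Content-Type header when one is sent.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" keeps going if a few bytes do not match the charset.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_html('http://stackoverflow.com')` would return the page source as a string that can then be passed to a parser. Network failures surface as `urllib.error.URLError`, which the caller may want to catch.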

2

Some options are:

All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
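
As one illustration of the standard-library route for parsing tags yourself, here is a minimal sketch using `html.parser.HTMLParser` from Python 3 (the module was called `HTMLParser` in Python 2). It may or may not be among the options this answer had in mind. The class name `TagCollector` and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


parser = TagCollector()
parser.feed('<html><body><a href="/about">About</a><p class="x">Hi</p></body></html>')

# Pull the href out of every <a> tag that has one.
links = [attrs["href"] for tag, attrs in parser.tags if tag == "a" and "href" in attrs]
print(links)  # ['/about']
```

The parser is event-driven: it calls `handle_starttag` as it meets each opening tag, so nothing is built into a tree unless you build it. That keeps it lightweight, but a library such as Beautiful Soup is usually more convenient for navigating nested markup.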

1

I would suggest you use BeautifulSoup:

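# For HTML parsing (Python 3; needs the third-party beautifulsoup4 package)
from bs4 import BeautifulSoup
from urllib.request import urlopen

doc = urlopen('http://google.com').read()

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(doc, 'html.parser')

soup.title.string  # text of the page's <title> tag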

After this you can parse pretty much anything out of the document. See the documentation, which has detailed examples of how to do it.

1

All the answers here are correct, and BeautifulSoup is great. However, when the source HTML is generated dynamically by JavaScript, which is common these days, you'll need an engine that first builds the final HTML and only then fetch it; otherwise most of the content will be missing.

As far as I know, the easiest way is to use a browser's engine for this. In my experience, Python + Selenium + Firefox is the path of least resistance.
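
A rough sketch of that approach, assuming the third-party `selenium` package plus Firefox and its driver are installed. The function name `fetch_rendered_html` and the `make_driver` parameter are hypothetical, added here so the logic can be tried with a stand-in driver; `webdriver.Firefox`, `get`, `page_source`, and `quit` are Selenium's own names.

```python
def fetch_rendered_html(url, make_driver=None):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    if make_driver is None:
        # Imported lazily so the function can be exercised without Selenium.
        from selenium import webdriver
        make_driver = webdriver.Firefox
    driver = make_driver()
    try:
        driver.get(url)  # blocks until the page's load event fires
        return driver.page_source  # the DOM serialized as it stands now
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can be handed to BeautifulSoup or any other parser exactly like HTML fetched with urllib. Pages that keep loading content after the initial load event may need an explicit wait before `page_source` is read.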
