I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?
4 Answers
import urllib2
urllib2.urlopen('http://stackoverflow.com').read()
That's the simple answer, but you should really look at BeautifulSoup
Comments
Some options are:
All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
Comments
I would suggest you use BeautifulSoup
#for HTML parsing
from BeautifulSoup import BeautifulSoup
import urllib2
doc = urllib2.urlopen('http://google.com').read()
soup = BeautifulSoup(''.join(doc))
soup.contents[0].name
After this you can pretty much parse anything out of this document. See documentation which has detailed examples of how to do it.
Comments
All the answers here are true, and BeautifulSoup is great, however when the source HTML is dynamically created by javascript, and that's usually the case these days, you'll need to use some engine that first creates the final HTML and only then fetch it, or else you'll have most of the content missing.
As far as I know, the easiest way is simply using the browser's engine for this. In my experience, Python+Selenium+Firefox is the least resistant path