
I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and walks the DOM to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content, it does print out the entire HTML page, which is close to what I want... although I would ideally like to navigate the DOM rather than treating it as a giant string.
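One small gotcha with the snippet above: in Python 3, `urlopen(...).read()` returns `bytes`, not `str`, which is why the printed output starts with `b'...'`. A minimal sketch of decoding it first, assuming the server declares a charset (falling back to UTF-8 when it doesn't):

```python
import urllib.request

def get_site(url):
    # urlopen(...).read() returns bytes in Python 3
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message; it can report the charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Decoding matters once you start slicing or searching the page as text, though the DOM parsers discussed below will happily accept raw bytes as well.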

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

3 Comments

  • You should use something like BeautifulSoup for this. Commented Mar 12, 2015 at 3:21
  • Close to a duplicate of Parsing HTML Python. Commented Mar 12, 2015 at 3:25
  • You could also use lxml. Commented Mar 12, 2015 at 3:26

2 Answers


There are many different modules you could use. For example, lxml or BeautifulSoup.

Here's an lxml example:

import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

from bs4 import BeautifulSoup
import urllib.request

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, "html.parser")  # name a parser explicitly to avoid a warning

description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

Notice how, under Python 2, BeautifulSoup returns a unicode string while lxml does not; under Python 3 both return str. This can be useful or hurtful depending on what is needed.


5 Comments

It seems that trying to use BeautifulSoup gives me an error as I'm using Python 3.4.3: File "find.py", line 3, in <module> from bs4 import BeautifulSoup File "C:\Users\Jake\Desktop\bs4\__init__.py", line 175 except Exception, e: ^ SyntaxError: invalid syntax. I looked it up and it seems to be because that copy of the library is Python 2.x only?
Can someone please tell me why people suggest BeautifulSoup or lxml over the native HTML parser?
@Shatu: Modules like BeautifulSoup and lxml are better in performance, generally speaking.
@Shatu: Speed, memory usage, etc. I'm unsure how either performs with malformed data.
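To ground the comparison the comments are making, here is a minimal sketch of the standard library's `html.parser`, which ships with Python 3 and needs no third-party install; the class name and the sample HTML are illustrative, not from the question:

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content attribute of a <meta name="description"> tag."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

parser = MetaDescriptionParser()
parser.feed('<html><head><meta name="description" content="An example page."></head></html>')
print(parser.description)  # An example page.
```

The trade-off is visible here: the stdlib parser is event-driven, so you write callbacks and track state yourself, whereas BeautifulSoup and lxml build a tree you can query directly with `find`/XPath.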

Check out the BeautifulSoup module.

from bs4 import BeautifulSoup
import urllib.request

soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))

1 Comment

Hiya, this may well solve the problem... but it'd be good if you could edit your answer and provide a little explanation about how and why it works :) Don't forget - there are heaps of newbies on Stack overflow, and they could learn a thing or two from your expertise - what's obvious to you might not be so to them.
