
I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and walks the DOM to find a specific element.

I currently have this:

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

When I print content, it does print out the entire HTML page, which is close to what I want... although I would ideally like to navigate the DOM rather than treating it as a giant string.
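One small gotcha with the snippet above: in Python 3, `urlopen(...).read()` returns `bytes`, not `str`, which is why the printed output starts with `b'...'`. A minimal sketch of decoding it first, assuming the server declares a charset (falling back to UTF-8 when it doesn't):

```python
import urllib.request

def get_site(url):
    # urlopen(...).read() returns bytes in Python 3
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message; it can report the charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Decoding matters once you start slicing or searching the page as text, though the DOM parsers discussed below will happily accept raw bytes as well.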

I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.

3 Comments

  • You should use something like BeautifulSoup for this. Commented Mar 12, 2015 at 3:21
  • Close to a duplicate of Parsing HTML Python. Commented Mar 12, 2015 at 3:25
  • You could also use lxml. Commented Mar 12, 2015 at 3:26

2 Answers


There are many different modules you could use. For example, lxml or BeautifulSoup.

Here's an lxml example:

import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

And a BeautifulSoup example:

from bs4 import BeautifulSoup
import urllib.request

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, "html.parser")  # name a parser explicitly to avoid a warning

description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

Notice how, under Python 2, BeautifulSoup returns a unicode string while lxml does not; under Python 3 both return str. This can be useful or hurtful depending on what is needed.


5 Comments

It seems that trying to use BeautifulSoup gives me an error as I'm using Python 3.4.3: File "find.py", line 3, in <module> from bs4 import BeautifulSoup File "C:\Users\Jake\Desktop\bs4\__init__.py", line 175 except Exception, e: ^ SyntaxError: invalid syntax. I looked it up and it seems to be because that copy of the library is Python 2.x only?
Can someone please tell me why people suggest BeautifulSoup or lxml over the native HTML parser?
@Shatu: Modules like BeautifulSoup and lxml are better in performance, generally speaking.
@Shatu: Speed, memory usage, etc. I'm unsure how either performs with malformed data.
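To ground the comparison the comments are making, here is a minimal sketch of the standard library's `html.parser`, which ships with Python 3 and needs no third-party install; the class name and the sample HTML are illustrative, not from the question:

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content attribute of a <meta name="description"> tag."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

parser = MetaDescriptionParser()
parser.feed('<html><head><meta name="description" content="An example page."></head></html>')
print(parser.description)  # An example page.
```

The trade-off is visible here: the stdlib parser is event-driven, so you write callbacks and track state yourself, whereas BeautifulSoup and lxml build a tree you can query directly with `find`/XPath.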

Check out the BeautifulSoup module.

from bs4 import BeautifulSoup
import urllib.request

soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))

1 Comment

Hiya, this may well solve the problem... but it'd be good if you could edit your answer and provide a little explanation about how and why it works :) Don't forget - there are heaps of newbies on Stack overflow, and they could learn a thing or two from your expertise - what's obvious to you might not be so to them.
