
I want to write a Python function that gets a website's content, for example the organization field shown on the page.

In this case, the organization is University of Tokyo:

<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>

How can I get the website content directly, without installing anything new, e.g. GET http://www.ip-adress.com/ip_tracer/157.123.22.11?
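For reference, the row above can be parsed with nothing beyond the standard library. A sketch in Python 3 (the `OrgParser` class name is illustrative, not from the question):

```python
from html.parser import HTMLParser

class OrgParser(HTMLParser):
    """Collect the text of the first <td> inside a <tr class="odd">."""
    def __init__(self):
        super().__init__()
        self.in_odd_row = False
        self.in_td = False
        self.org = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "tr" and ("class", "odd") in attrs:
            self.in_odd_row = True
        elif tag == "td" and self.in_odd_row:
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_odd_row = False
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.org is None:
            self.org = data.strip()

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>"""
parser = OrgParser()
parser.feed(html)
print(parser.org)  # University of Tokyo
```

The same approach generalises: track where you are in `handle_starttag`/`handle_endtag` and collect text in `handle_data`.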

3 Comments
  • @jesseslu Do you need to download the file? Or only parse and access it? Commented Oct 11, 2012 at 6:46
  • Well, you need to get the html file :) Commented Oct 11, 2012 at 6:59
  • I think you will have a problem opening this website, as suggested by others; I've added an answer showing how to do this. Commented Oct 11, 2012 at 7:26

3 Answers


I like BeautifulSoup; it makes it easy to access data in HTML strings. The actual complexity depends on how the HTML is formed. If the HTML uses ids and classes, it is easy. If not, you have to rely on something positional, like "take the first div, then the second list item, ...", which breaks whenever the structure of the HTML changes.

To download the HTML, I quote the example from the BeautifulSoup docs:

# Python 2 / BeautifulSoup 3 syntax, as in the original docs
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
# soup('td', width="90%") finds every <td width="90%"> in the page
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

1 Comment

How can I get the website content directly, without installing anything new, e.g. GET ip-adress.com/ip_tracer/157.123.22.11?

Use BeautifulSoup:

import bs4

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
univ = soup.tr.td.getText()
assert univ == u"University of Tokyo"

Edit:

If you need to read the HTML first, use urllib2:

import urllib2

html = urllib2.urlopen("http://example.com/").read()
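On Python 3, urllib2 was merged into urllib.request; a hedged equivalent of the line above (the `fetch` helper is illustrative, not from the answer):

```python
import urllib.request

def fetch(url):
    """Python 3 counterpart of urllib2.urlopen: return the body of url as bytes."""
    return urllib.request.urlopen(url).read()

# html = fetch("http://example.com/")
```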

4 Comments

How can I get the website content directly, without installing anything new, e.g. GET ip-adress.com/ip_tracer/157.123.22.11?
See my edit for how to read the contents.
Don't use urllib2! Use requests instead.
@egasimus Requests is nice but it's not part of the Python Standard Library.

You will get a 403 Forbidden error from urllib2.urlopen, because this website filters requests by User-Agent and rejects ones it does not recognise. So here's the full thing:

import urllib2
import lxml.html as lh

# Pretend to be a browser, otherwise the site answers 403
req = urllib2.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11",
                      headers={'User-Agent': "Magic Browser"})
html = urllib2.urlopen(req).read()
doc = lh.fromstring(html)
# Take the last element with class "odd" and collapse its text
print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())

which prints:

Organization:ZenithDataSystems
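On Python 3, the same User-Agent trick uses urllib.request.Request; a sketch (the header value is arbitrary, exactly as in the answer above):

```python
import urllib.request

# Send a browser-like User-Agent so the server does not answer 403
req = urllib.request.Request(
    "http://www.ip-adress.com/ip_tracer/157.123.22.11",
    headers={"User-Agent": "Magic Browser"},
)
# urllib.request stores header names capitalised, hence "User-agent"
print(req.get_header("User-agent"))  # Magic Browser
```

Pass `req` to `urllib.request.urlopen(req)` to perform the actual request.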

12 Comments

hi, when I run it, it shows: import lxml.html as lh ImportError: No module named lxml.html?
what does lxml.html stand for?
Thanks, after installing lxml it still has an error: Traceback (most recent call last): File "ext.py", line 2, in ? import lxml.html as lh File "/usr/lib64/python2.4/site-packages/lxml/html/__init__.py", line 42, in ? from lxml import etree ImportError: /usr/lib64/python2.4/site-packages/lxml/etree.so: undefined symbol: xmlMemDisplayLast
yes, I'm using Python 2.4.3 on CentOS 5.5