
I want to write a Python function that gets a website's content, for example the organization field shown on the page.

In this case, the organization is University of Tokyo:

<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>

How can I get the website content directly, without installing anything new, e.g. GET http://www.ip-adress.com/ip_tracer/157.123.22.11?
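For reference, the row above can be parsed with nothing beyond the standard library. A sketch in Python 3 (the `OrgParser` class name is illustrative, not from the question):

```python
from html.parser import HTMLParser

class OrgParser(HTMLParser):
    """Collect the text of the first <td> inside a <tr class="odd">."""
    def __init__(self):
        super().__init__()
        self.in_odd_row = False
        self.in_td = False
        self.org = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "tr" and ("class", "odd") in attrs:
            self.in_odd_row = True
        elif tag == "td" and self.in_odd_row:
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_odd_row = False
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.org is None:
            self.org = data.strip()

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>"""
parser = OrgParser()
parser.feed(html)
print(parser.org)  # University of Tokyo
```

The same approach generalises: track where you are in `handle_starttag`/`handle_endtag` and collect text in `handle_data`.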

3 Comments
  • @jesseslu Do you need to download the file? Or only parse and access it? Commented Oct 11, 2012 at 6:46
  • Well, you need to get the html file :) Commented Oct 11, 2012 at 6:59
  • I think you will have a problem opening this website, as suggested by others; I've added an answer showing how to do this. Commented Oct 11, 2012 at 7:26

3 Answers


I like BeautifulSoup; it makes it easy to access data in HTML strings. The actual complexity depends on how the HTML is formed. If the HTML uses ids and classes, it is easy. If not, you have to rely on something positional, like "take the first div, then the second list item, ...", which breaks whenever the structure of the HTML changes.

To download the HTML, I quote the example from the BeautifulSoup docs:

# Python 2 / BeautifulSoup 3 syntax, as in the original docs
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
# soup('td', width="90%") finds every <td width="90%"> in the page
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

1 Comment

How can I get the website content directly, without installing anything new, e.g. GET ip-adress.com/ip_tracer/157.123.22.11?

Use BeautifulSoup:

import bs4

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
univ = soup.tr.td.getText()
assert univ == u"University of Tokyo"

Edit:

If you need to read the HTML first, use urllib2:

import urllib2

html = urllib2.urlopen("http://example.com/").read()
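On Python 3, urllib2 was merged into urllib.request; a hedged equivalent of the line above (the `fetch` helper is illustrative, not from the answer):

```python
import urllib.request

def fetch(url):
    """Python 3 counterpart of urllib2.urlopen: return the body of url as bytes."""
    return urllib.request.urlopen(url).read()

# html = fetch("http://example.com/")
```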

4 Comments

How can I get the website content directly, without installing anything new, e.g. GET ip-adress.com/ip_tracer/157.123.22.11?
See my edit for how to read the contents.
Don't use urllib2! Use requests instead.
@egasimus Requests is nice but it's not part of the Python Standard Library.

You will get a 403 Forbidden error from urllib2.urlopen, because this website filters requests by User-Agent and rejects ones it does not recognise. So here's the full thing:

import urllib2
import lxml.html as lh

# Pretend to be a browser, otherwise the site answers 403
req = urllib2.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11",
                      headers={'User-Agent': "Magic Browser"})
html = urllib2.urlopen(req).read()
doc = lh.fromstring(html)
# Take the last element with class "odd" and collapse its text
print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())

which prints:

Organization:ZenithDataSystems
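On Python 3, the same User-Agent trick uses urllib.request.Request; a sketch (the header value is arbitrary, exactly as in the answer above):

```python
import urllib.request

# Send a browser-like User-Agent so the server does not answer 403
req = urllib.request.Request(
    "http://www.ip-adress.com/ip_tracer/157.123.22.11",
    headers={"User-Agent": "Magic Browser"},
)
# urllib.request stores header names capitalised, hence "User-agent"
print(req.get_header("User-agent"))  # Magic Browser
```

Pass `req` to `urllib.request.urlopen(req)` to perform the actual request.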

12 Comments

hi, when I run it, it shows: import lxml.html as lh ImportError: No module named lxml.html?
what does lxml.html stand for?
Thanks, after installing lxml it still has an error: Traceback (most recent call last): File "ext.py", line 2, in ? import lxml.html as lh File "/usr/lib64/python2.4/site-packages/lxml/html/__init__.py", line 42, in ? from lxml import etree ImportError: /usr/lib64/python2.4/site-packages/lxml/etree.so: undefined symbol: xmlMemDisplayLast
yes, I'm using Python 2.4.3 on CentOS 5.5