10

I am new to lxml. I want to download a web page and extract the data I am interested in. My code is:

import urllib2
from lxml import etree

url = "http://www.example.com/"

html = urllib2.urlopen(url)

root = etree.parse(html) # the problem is here

Can anyone explain why this is wrong?

The error is:

Traceback (most recent call last):
  File "yatego.py", line 10, in <module>
    root = etree.parse(html)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79703)
  File "parser.pxi", line 1580, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:80012)
  File "parser.pxi", line 1463, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:78908)
  File "parser.pxi", line 1019, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:75905)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Entity 'mdash' not defined, line 4, column 21

This code:

import requests
import lxml.html

url = "http://www.example.com/"

res = requests.get(url)
doc = lxml.html.parse(res.content)

gives this error:

File "yatego.py", line 11, in <module>
    doc = lxml.html.parse(res.content)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>IANA &mdash; Example domains</title>

This code:

doc = lxml.html.parse(url)

works fine.

So where is the problem?

2
  • What is the problem? What did you expect and what did you get? Commented Mar 20, 2012 at 9:08
  • What error message do you get? What do you expect to happen? Commented Mar 20, 2012 at 9:08

3 Answers

11

The key here is the exception:

IOError: Error reading file '<!DOCTYPE html PUBLIC  ...

You're passing the content of a file to a function that expects a path to a file. That is also why doc = lxml.html.parse(url) works: a URL is treated like a file path.

Does the following work better?

doc = lxml.html.fromstring(res.content)
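
To illustrate the difference, here is a minimal sketch that runs without a network connection. The `content` bytes are a made-up stand-in for `res.content`, and wrapping them in `io.BytesIO` is just one way to hand `parse()` a file-like object instead of a path.

```python
import io
import lxml.html

# Hypothetical stand-in for res.content: the raw bytes of a downloaded page.
content = (
    b"<html><head><title>IANA &mdash; Example domains</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# fromstring() takes the markup itself and returns the root element.
root = lxml.html.fromstring(content)
title_from_string = root.findtext(".//title")

# parse() takes a filename, URL, or file-like object and returns an ElementTree.
tree = lxml.html.parse(io.BytesIO(content))
title_from_parse = tree.getroot().findtext(".//title")
```

Both calls should see the same document; the only difference is whether you pass the markup or something to read it from.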

2 Comments

You mean this is wrong: res = requests.get(url) doc = lxml.html.parse(res.content)
Yes, the line where you assign to doc is wrong (unless I'm mistaken) and should look like what I posted.
6

You should use lxml.html to parse HTML instead of lxml.etree.
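
As a rough sketch of why this matters for the first traceback in the question: the XML parser behind etree.parse rejects HTML entities such as &mdash; unless a DTD defines them, while the HTML parser accepts them. The `page` bytes below are an invented sample, not the real page.

```python
import io
from lxml import etree
import lxml.html

# Invented sample page containing an HTML-only entity.
page = (
    b"<html><head><title>IANA &mdash; Example domains</title></head>"
    b"<body></body></html>"
)

# The default XML parser does not know the mdash entity and raises.
try:
    etree.parse(io.BytesIO(page))
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = str(exc)

# The HTML parser in lxml.html handles the same bytes.
html_tree = lxml.html.parse(io.BytesIO(page))
title = html_tree.getroot().findtext(".//title")
```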

You can also open the url directly with lxml:

doc = lxml.html.parse(url)

Sometimes lxml will have trouble dealing with HTTP's quirks, in which case you'd need to use a more robust solution to fetch pages, like requests:

res = requests.get(url)
doc = lxml.html.fromstring(res.content)  # fromstring, not parse: res.content is the markup itself, not a path

2 Comments

This will work a little bit better. Not entirely sure if it works with lxml.html.parse, but I know it works with lxml.etree.parse. res = requests.get(url); doc = lxml.html.parse(res.raw)
Using lxml.html.parse(url) doesn't execute an HTTP request and results in an error "Error reading file".
0

You should call html.read() to begin with: html is a file-like response object, not a string. Also, you should really check that the URL downloaded properly, as this is by no means assured.

Update: use lxml.html.parse(filename_or_url).
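
One possible way to combine the download check with parsing is sketched below. `parse_page` is a hypothetical helper, not part of lxml, and treating only status 200 as success is a simplifying assumption.

```python
import lxml.html


def parse_page(status_code, body):
    """Parse body as HTML only if the download succeeded (hypothetical helper)."""
    if status_code != 200:
        raise IOError("download failed with HTTP status %d" % status_code)
    # body is the markup itself, so fromstring() is the matching call.
    return lxml.html.fromstring(body)


# Example with made-up values standing in for a real response.
root = parse_page(200, b"<html><body><p>ok</p></body></html>")
```

With urllib2 the arguments would be html.getcode() and html.read(); with requests, res.status_code and res.content.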
