lxml in python, parse from url

Question

I am new to lxml. I want to download a web page and extract the data I am interested in. My code is:

import urllib2
from lxml import etree

url = "http://www.example.com/"

html = urllib2.urlopen(url)

root = etree.parse(html) # the problem is here

Can anyone explain why this is wrong?

The error is:

Traceback (most recent call last):
  File "yatego.py", line 10, in <module>
    root = etree.parse(html)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79703)
  File "parser.pxi", line 1580, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:80012)
  File "parser.pxi", line 1463, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:78908)
  File "parser.pxi", line 1019, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:75905)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Entity 'mdash' not defined, line 4, column 21

This code:

url = "http://www.example.com/"

res = requests.get(url)
doc = lxml.html.parse(res.content)

gives this error:

File "yatego.py", line 11, in <module>
    doc = lxml.html.parse(res.content)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>IANA &mdash; Example domains</title>

This code:

doc = lxml.html.parse(url)

works fine.

So where is the problem?

What is the problem? What did you expect and what did you get?
– Kien Truong, Commented Mar 20, 2012 at 9:08
What error message do you get? What do you expect to happen?
– Mizipzor, Commented Mar 20, 2012 at 9:08

Mizipzor · Accepted Answer · 2012-03-20 09:40:44Z

11

The key here is the exception:

IOError: Error reading file '<!DOCTYPE html PUBLIC  ...

You're passing the content of a file to a function that expects a path to a file (or a URL, or a file-like object). That is also why doc = lxml.html.parse(url) works: a URL is treated like a file path.

Does the following work better?

doc = lxml.html.fromstring(res.content)
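
To make the difference concrete, here is a minimal sketch of parsing markup that is already in memory. SAMPLE_CONTENT and title_from_content are hypothetical names used only for illustration; the sample bytes stand in for res.content so the snippet runs without a network connection.

```python
import lxml.html

# Hypothetical stand-in for res.content: the raw bytes of a response body.
# With requests, a real page would come from requests.get(url).content.
SAMPLE_CONTENT = b"""<!DOCTYPE html>
<html>
<head><title>IANA &mdash; Example domains</title></head>
<body><p>Hello</p></body>
</html>"""


def title_from_content(content):
    """Parse markup held in memory and return the text of its <title>."""
    # fromstring() takes the markup itself, not a path or URL.
    doc = lxml.html.fromstring(content)
    return doc.findtext(".//title")


title = title_from_content(SAMPLE_CONTENT)
```

Note that fromstring() returns an element rather than an ElementTree, so there is no getroot() step afterwards.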

edited Mar 20, 2012 at 9:40

answered Mar 20, 2012 at 9:35

Mizipzor


2 Comments

user873286 Over a year ago

You mean this is wrong? res = requests.get(url); doc = lxml.html.parse(res.content)

Mizipzor Over a year ago

Yes, the line where you assign to doc is wrong (unless I'm wrong) and should look as I posted.

Zach Kelling · Accepted Answer · 2012-03-20 09:19:43Z

6

You should use lxml.html to parse HTML instead of lxml.etree.
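
The reason shows up in the first traceback of the question: the default XML parser in lxml.etree rejects HTML-only entities such as &mdash;, while the HTML parser knows them. A small sketch of the contrast, assuming a hypothetical MARKUP sample and two hypothetical helper functions:

```python
from lxml import etree
import lxml.html

# Hypothetical sample page containing an HTML entity that plain XML does not define.
MARKUP = (b"<html><head><title>IANA &mdash; Example domains</title></head>"
          b"<body></body></html>")


def parses_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
        return True
    except etree.XMLSyntaxError:
        # Raised for the undefined entity, as in the question's traceback.
        return False


def html_title(markup):
    """Parse with the lenient HTML parser and return the <title> text."""
    return lxml.html.fromstring(markup).findtext(".//title")


print(parses_as_xml(MARKUP))  # the XML parser rejects &mdash;
```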

You can also open the url directly with lxml:

doc = lxml.html.parse(url)

Sometimes lxml will have trouble dealing with HTTP's quirks, in which case you'd need to use a more robust solution to fetch pages, like requests:

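res = requests.get(url)
# res.content is the markup itself, so use fromstring(); parse() expects a path, URL, or file object
doc = lxml.html.fromstring(res.content)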

answered Mar 20, 2012 at 9:19

Zach Kelling


2 Comments

hostingutilities.com Over a year ago

This will work a little bit better. Not entirely sure if it works with lxml.html.parse, but I know it works with lxml.etree.parse. res = requests.get(url); doc = lxml.html.parse(res.raw)

Suncat2000 Over a year ago

Using lxml.html.parse(url) doesn't execute an HTML request and results in an error "Error reading file".

orome · Accepted Answer · 2015-09-14 11:57:44Z

0

You should call html.read() to begin with: html is a file-like response object, not a string. Also, you should really check that the URL downloaded properly, as this is by no means assured.

Update: use lxml.html.parse(filename_or_url), which accepts a filename, a URL, or a file-like object.
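
As a rough illustration of the file-like case, the sketch below feeds lxml.html.parse an in-memory stream. The fake_response object is a hypothetical stand-in for an open HTTP response (such as the object returned by urlopen), used so the snippet runs offline.

```python
import io
import lxml.html

# Hypothetical stand-in for an open HTTP response: any object with read() works.
fake_response = io.BytesIO(
    b"<html><head><title>Example</title></head><body><p>Hi</p></body></html>"
)

tree = lxml.html.parse(fake_response)  # returns an ElementTree, not an element
root = tree.getroot()                  # the <html> element
page_title = root.findtext(".//title")
print(page_title)
```

Unlike fromstring(), parse() returns a whole tree, so getroot() is needed before searching for elements.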

edited Sep 14, 2015 at 11:57

orome


answered Mar 20, 2012 at 9:09

WeaselFox

