
Why does html.parse(url) fail when fetching with requests then html.fromstring works, and html.parse(url2) works? (lxml 3.4.2)

    Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> import requests
    >>> from lxml import html
    >>> url = 'http://www.oddschecker.com'
    >>> page = requests.get(url).content
    >>> tree = html.fromstring(page)
    >>> html.parse(url)

    Traceback (most recent call last):
      File "<pyshell#5>", line 1, in <module>
        html.parse(url)
      File "C:\program files\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse
        return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
      File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
      File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
      File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
      File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
      File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
      File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
      File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
      File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754)
    IOError: Error reading file 'http://www.oddschecker.com': failed to load HTTP resource
    >>> url2 = 'http://www.google.com'
    >>> html.parse(url2)
    <lxml.etree._ElementTree object at 0x00000000033BAF88>
  • Maybe oddschecker.com is rejecting the User Agent from lxml. Commented Mar 2, 2015 at 2:05
  • @mattm Could be. Any idea what user agent lxml claims to be? Commented Mar 2, 2015 at 2:26
  • @mattm alecxe's answer confirms your suggestion that it's rejecting lxml's user agent (none). Commented Mar 2, 2015 at 19:12

2 Answers


Adding some clarification to @michael_stackof's answer: this particular URL returns a 403 Forbidden status code if no User-Agent header is supplied.

According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header, which produces the 403 and, in turn, the failed to load HTTP resource error.
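Since html.parse() also accepts file-like objects, one workaround is to open the URL yourself with an explicit header and hand the response to lxml. A minimal sketch, assuming Python 2's urllib2; the browser-style User-Agent string here is an arbitrary stand-in:

    import urllib2
    from lxml import html

    url = 'http://www.oddschecker.com'
    # Build a request that carries an explicit User-Agent header
    # (the exact string is an arbitrary example, not a requirement).
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    # html.parse() accepts file-like objects, so pass the opened response.
    tree = html.parse(urllib2.urlopen(request))

urllib2.urlopen() returns a file-like response object that lxml reads directly, so no intermediate string is needed.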

On the other hand, requests supplies a default User-Agent header if one is not explicitly passed:

    >>> requests.get(url).request.headers['User-Agent']
    'python-requests/2.3.0 CPython/2.7.6 Darwin/14.1.0'

To prove the point, set the User-Agent header to None and see:

    >>> requests.get(url).status_code
    200
    >>> requests.get(url, headers={'User-Agent': None}).status_code
    403

1 Comment

That does clarify things. Thanks!

When the HTTP status is not 200, html.parse will fail.

The return status of http://www.oddschecker.com: [screenshot of the HTTP response status]
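To make that check explicit in code rather than a screenshot, you can fetch with requests first and only parse a successful response. A minimal sketch: raise_for_status() raises an HTTPError for any 4xx/5xx status, so a blocked request fails loudly instead of reaching the parser.

    import requests
    from lxml import html

    url = 'http://www.oddschecker.com'
    response = requests.get(url)
    # Fail loudly on a 4xx/5xx status instead of handing an error
    # page to the parser.
    response.raise_for_status()
    tree = html.fromstring(response.content)
    print tree.findtext('.//title')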

4 Comments

  • I tried nytimes.com with html.parse() and it worked fine. nytimes.com generates some non-200 responses.
  • google.com works without a trailing /, as does nytimes.com. oddschecker.com fails with or without a trailing /. Therefore, I don't understand the link to that question.
  • @michael_stackof check out my answer - should clarify things here.
