
Why does html.parse(url) fail when fetching with requests then html.fromstring works, and html.parse(url2) works? (lxml 3.4.2)

    Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> import requests
    >>> from lxml import html
    >>> url = 'http://www.oddschecker.com'
    >>> page = requests.get(url).content
    >>> tree = html.fromstring(page)
    >>> html.parse(url)

    Traceback (most recent call last):
      File "<pyshell#5>", line 1, in <module>
        html.parse(url)
      File "C:\program files\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse
        return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
      File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
      File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
      File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
      File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
      File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
      File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
      File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
      File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754)
    IOError: Error reading file 'http://www.oddschecker.com': failed to load HTTP resource
    >>> url2 = 'http://www.google.com'
    >>> html.parse(url2)
    <lxml.etree._ElementTree object at 0x00000000033BAF88>
  • Maybe oddschecker.com is rejecting the User Agent from lxml. Commented Mar 2, 2015 at 2:05
  • @mattm Could be. Any idea what user agent lxml claims to be? Commented Mar 2, 2015 at 2:26
  • @mattm alecxe's answer confirms your suggestion that it's rejecting lxml's user agent (none). Commented Mar 2, 2015 at 19:12

2 Answers


Adding some clarification to @michael_stackof's answer: this particular URL returns a 403 Forbidden status code if no User-Agent header is supplied.

According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header, which produces the 403 and, in turn, the failed to load HTTP resource error.
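Since html.parse() also accepts file-like objects, one workaround is to open the URL yourself with an explicit header and hand the response to lxml. A minimal sketch, assuming Python 2's urllib2; the browser-style User-Agent string here is an arbitrary stand-in:

    import urllib2
    from lxml import html

    url = 'http://www.oddschecker.com'
    # Build a request that carries an explicit User-Agent header
    # (the exact string is an arbitrary example, not a requirement).
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    # html.parse() accepts file-like objects, so pass the opened response.
    tree = html.parse(urllib2.urlopen(request))

urllib2.urlopen() returns a file-like response object that lxml reads directly, so no intermediate string is needed.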

On the other hand, requests supplies a default User-Agent header if one is not explicitly passed:

    >>> requests.get(url).request.headers['User-Agent']
    'python-requests/2.3.0 CPython/2.7.6 Darwin/14.1.0'

To prove the point, set the User-Agent header to None and see:

    >>> requests.get(url).status_code
    200
    >>> requests.get(url, headers={'User-Agent': None}).status_code
    403

1 Comment

That does clarify things. Thanks!

When the HTTP status is not 200, html.parse will fail.

The return status of http://www.oddschecker.com: [screenshot of the HTTP response status]
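To make that check explicit in code rather than a screenshot, you can fetch with requests first and only parse a successful response. A minimal sketch: raise_for_status() raises an HTTPError for any 4xx/5xx status, so a blocked request fails loudly instead of reaching the parser.

    import requests
    from lxml import html

    url = 'http://www.oddschecker.com'
    response = requests.get(url)
    # Fail loudly on a 4xx/5xx status instead of handing an error
    # page to the parser.
    response.raise_for_status()
    tree = html.fromstring(response.content)
    print tree.findtext('.//title')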

4 Comments

  • I tried nytimes.com with html.parse() and it worked fine. nytimes.com generates some non-200 responses.
  • google.com works without a trailing /, as does nytimes.com. oddschecker.com fails with or without a trailing /. Therefore, I don't understand the link to that question.
  • @michael_stackof check out my answer - should clarify things here.
