I am trying to parse an HTML page in Python using urllib2 and ElementTree, and I am having trouble parsing the HTML. The page contains "&" inside quoted strings, and ElementTree throws a ParseError for lines containing &.

Script:

import urllib2

url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
r = urllib2.urlopen(req).read()

import xml.etree.ElementTree as ET
htmlpage=ET.fromstring(r)

This throws the following error in Python 2.7:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />

It looks like when the HTML page is read, the & sign is not converted to &amp; in the variable r.

I tried parsing the same page with htmlTreeParse in R, and there "&" is converted to &amp; properly.

Let me know if I am missing anything in urllib2.

EDIT: I replaced "&" with &amp;, but line 904 contains a < sign inside JavaScript, which throws the same error. There should be a better option than replacing characters one by one.

LINE:904    for (i = 0; i < strac.length - 1; i++) {

1 Answer

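First of all, xml.etree.ElementTree is an XML parser, and it only accepts well-formed XML. It does not handle HTML out of the box. A bare & (one that does not begin an entity reference such as &amp;) is illegal in XML, and this is why it is failing. The unescaped < inside the inline JavaScript fails for the same reason.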
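
To see the failure in isolation, here is a minimal sketch that feeds the offending line from the question to ElementTree, once as-is and once with the ampersand escaped. The helper name parses_as_xml is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

# The line from the question, with a bare "&" inside the attribute value.
BAD = ('<input type="hidden" id="HdnFldAndamanNicobar" '
       'value="1,Andaman & Nicobar Islands;" />')

# The same line with the ampersand written as an XML entity reference.
GOOD = BAD.replace('&', '&amp;')


def parses_as_xml(markup):
    """Return True if ElementTree accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(BAD))   # bare "&" is rejected
print(parses_as_xml(GOOD))  # "&amp;" is accepted
```

Escaping by hand only works for this one line; real pages usually contain other constructs that are fine in HTML but not in XML.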

Use a dedicated HTML parser instead, such as BeautifulSoup:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
>>> soup = BeautifulSoup(urlopen(url))
>>> soup.find('td').text.strip()
u'ELECTION COMMISSION OF INDIA'
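
If installing BeautifulSoup is not possible, the HTML parser that ships with Python is similarly tolerant of bare & and of < inside script blocks. The following is only a sketch under assumptions: the class name HiddenInputCollector and the SAMPLE snippet are invented for illustration (SAMPLE just combines the two problem lines from the question), and the real page may need more handling than this.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class HiddenInputCollector(HTMLParser):
    """Collect id -> value for every <input type="hidden"> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != 'input':
            return
        attrs = dict(attrs)
        if attrs.get('type') == 'hidden' and 'id' in attrs:
            self.values[attrs['id']] = attrs.get('value')


# Hypothetical stand-in for the downloaded page: the two lines that
# broke ElementTree in the question, wrapped in a tiny document.
SAMPLE = '''<html><body>
<input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />
<script>for (i = 0; i < strac.length - 1; i++) { }</script>
</body></html>'''

collector = HiddenInputCollector()
collector.feed(SAMPLE)
print(collector.values)
```

With the real page, the string returned by urlopen(url).read() would be decoded and passed to feed() in place of SAMPLE.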

2 Comments

Thanks. Is there any option that can be specified in the urllib2 request so that these conversions happen?
@Manuel I don't think there is anything like this in urllib2. urllib2 does the job of getting the page. Handling HTML entities is a job for a parser.
