Can't parse html using xml.etree.ElementTree

Question

I am trying to parse the XML of google.com, but I am getting a 'not well-formed' error. Why is this? Thanks.

➜  testing cat code.py
from urllib.request import urlopen; from xml.etree.ElementTree import fromstring
fromstring(urlopen('https://www.google.com').read().replace(b'<!doctype html>',b'<!DOCTYPE html>'))
➜  testing python3 code.py
Traceback (most recent call last):
  File "code.py", line 2, in <module>
    fromstring(urlopen('https://www.google.com').read().replace(b'<!doctype html>',b'<!DOCTYPE html>'))
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 1826
➜  testing

Jack Fleeting · Accepted Answer · 2020-06-10 17:24:42Z

3

You are probably getting the error because you are trying to parse HTML with an XML parser, which won't work: real-world HTML is usually not well-formed XML. Use a library with an HTML parser instead. I would also recommend fetching the page with requests. Together:

import requests
import lxml.html as lh

# Fetch the page, then parse it with lxml's lenient HTML parser
req = requests.get('https://www.google.com')
doc = lh.fromstring(req.text)

and it should work.
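
To see the difference without a network call or third-party packages, here is a minimal sketch using only the standard library. The sample markup and the names SAMPLE_HTML, TagCollector, xml_parse_error and html_start_tags are made up for illustration; the stdlib html.parser stands in for lxml here and only reports events, it does not build a tree.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import ParseError, fromstring

# Hypothetical sample markup: acceptable HTML, but not well-formed XML
# (unquoted attribute value, <br> and <p> never closed).
SAMPLE_HTML = "<!DOCTYPE html><html><body><p class=intro>Hello<br>world</body></html>"


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_parse_error(markup):
    """Return the ElementTree error message, or None if the markup parses."""
    try:
        fromstring(markup)
    except ParseError as exc:
        return str(exc)
    return None


def html_start_tags(markup):
    """Feed markup to the lenient stdlib HTML parser and list its start tags."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    # The strict XML parser rejects the markup; the HTML parser accepts it.
    print("XML parser error:", xml_parse_error(SAMPLE_HTML))
    print("HTML parser start tags:", html_start_tags(SAMPLE_HTML))
```

The XML parser stops at the first construct that violates XML syntax, while the HTML parser tolerates it, which is why an HTML-aware library is the better fit for pages like google.com.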
