
I want to parse both XML and HTML files in the same project.

I tried this:

from xml.etree import ElementTree as ET

tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)

and got this error:

Traceback (most recent call last):
  File "C:.py", line 55, in <module>
    html_file = ET.parse("htmlpath")
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &nbsp;: line 690, column 78

  • The document referenced by htmlpath is not well-formed XML, so ElementTree cannot parse it (ElementTree works with XML, not arbitrary HTML). The problem is that the document contains the &nbsp; entity reference without a corresponding entity declaration. See stackoverflow.com/q/14744945/407651. Commented Aug 15, 2019 at 9:56
  • I suggest that you try the BeautifulSoup library: pypi.org/project/beautifulsoup4. You can use it for both XML and HTML. Commented Aug 15, 2019 at 13:47
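
The point made in the first comment can be reproduced with the standard library alone. This is a minimal sketch; the two sample strings and the `try_parse` helper are made up for illustration and are not taken from the question's files. It shows ElementTree rejecting an undeclared named entity while accepting the equivalent numeric character reference.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample documents, not the files from the question.
with_named_entity = "<p>a&nbsp;b</p>"
with_numeric_ref = "<p>a&#160;b</p>"


def try_parse(source):
    """Return the root element's text, or the ParseError message on failure."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError as exc:
        return "ParseError: {}".format(exc)


named_result = try_parse(with_named_entity)   # fails: &nbsp; is not declared
numeric_result = try_parse(with_numeric_ref)  # succeeds: numeric references need no declaration

print(named_result)
print(repr(numeric_result))
```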

1 Answer


&nbsp; is a standard HTML5 named entity, but it is not one of the entities predefined in XML. It may help to convert such entities to their Unicode characters before running the XML parser. In Python 3.4+ you can use html.unescape for that.

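import re
from html import escape, unescape
# text is the HTML source as a str. Named entities become characters; &, < and > are re-escaped for XML.
textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)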
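
As a rough end-to-end sketch of this approach, the snippet below converts the named entities in an in-memory string and then hands the result to ElementTree. The `named_entities_to_text` helper and the sample markup are hypothetical and only for illustration; in the question's setting the string would presumably come from reading the file at htmlpath.

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET


def named_entities_to_text(markup):
    """Replace named entities with literal characters, keeping XML specials escaped."""
    # unescape() turns e.g. &nbsp; into U+00A0; escape() restores &amp;, &lt; and &gt;
    # so the result is still valid XML character data.
    return re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), markup)


# Hypothetical sample; it is well-formed apart from the undeclared &nbsp; entity.
sample = "<html><body><p>Tom&nbsp;&amp;&nbsp;Jerry &lt;3</p></body></html>"

converted = named_entities_to_text(sample)
root = ET.fromstring(converted)
paragraph_text = root.find("./body/p").text

print(repr(paragraph_text))
```

This only helps when the HTML is otherwise well-formed XML (all tags closed, attribute values quoted). For arbitrary HTML, a real HTML parser is still needed.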