Parsing XML and HTML in the same project

Question

I want to parse both XML and HTML in the same project.

I tried this:

from xml.etree import ElementTree as ET

tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)

and got this error:

Traceback (most recent call last):
  File "C:.py", line 55, in <module>
    html_file = ET.parse("htmlpath")
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &nbsp;: line 690, column 78

The document referenced by htmlpath is not well-formed, and therefore it cannot be parsed as XML (ElementTree works with XML, not arbitrary HTML). The problem is that the document contains the &nbsp; entity reference without the corresponding declaration for the entity. See stackoverflow.com/q/14744945/407651.
– mzjn, Commented Aug 15, 2019 at 9:56
I suggest that you try the BeautifulSoup library: pypi.org/project/beautifulsoup4. You can use it for both XML and HTML.
– mzjn, Commented Aug 15, 2019 at 13:47
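
The point made in the first comment can be reproduced in a few lines. This is only a minimal sketch: the `snippet` string is a made-up stand-in for the HTML file in the question, not its real content.

```python
from xml.etree import ElementTree as ET

# Hypothetical snippet standing in for the HTML file from the question.
snippet = "<p>a&nbsp;b</p>"

message = None
try:
    ET.fromstring(snippet)
except ET.ParseError as err:
    # &nbsp; is an HTML entity; XML does not predefine it, so the
    # parser rejects the reference unless a DTD declares it.
    message = str(err)
    print(message)

# The five predefined XML entities (&amp; &lt; &gt; &quot; &apos;) parse fine.
root = ET.fromstring("<p>a &amp; b</p>")
print(root.text)
```

So the XML parser is not broken here; the input is simply not valid XML as far as entities are concerned.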

Guido U. Draheim · Accepted Answer · 2023-05-15 22:00:11Z

&nbsp; is a standard HTML5 named entity, but XML predefines only five entities (&amp;, &lt;, &gt;, &quot; and &apos;). It may help to convert the HTML entities to their Unicode characters before running the XML parser. In Python 3.4+ you can use html.unescape for that.

import re
from html import escape, unescape
# text holds the HTML source; entities become characters, XML specials are re-escaped
textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
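
For completeness, here is the same substitution as a small end-to-end sketch that feeds the result to ElementTree. The helper name `html_entities_to_xml` and the `source` string are made up for illustration, and the sketch assumes the markup is otherwise well-formed (all tags closed and properly nested).

```python
import re
from html import escape, unescape
from xml.etree import ElementTree as ET


def html_entities_to_xml(text):
    """Replace named entity references with XML-safe text.

    unescape() turns e.g. &nbsp; into U+00A0; escape() then re-encodes
    the characters that must stay escaped in XML (&, <, > and quotes).
    """
    return re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)


# Hypothetical input: well-formed apart from the HTML-only entities.
source = "<p>caf&eacute;&nbsp;&amp; bar &lt;3</p>"

root = ET.fromstring(html_entities_to_xml(source))
print(repr(root.text))
```

This only deals with entities. If the HTML also has unclosed tags or unquoted attributes, an XML parser will still reject it, and an HTML parser such as BeautifulSoup is the safer choice.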

