Converting XML illegal &char to utf8 - python

Question

There is a list of XML and HTML character references at: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references.

However, some references that are not defined in that list were used in older HTML documents. When I process the Senseval-2 format (with fixes) dataset from http://www.d.umn.edu/~tpederse/data.html, the following words break my script, which uses xml.etree.ElementTree to parse the data.

What are the Unicode equivalents of these words?

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

my script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

gives this traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in <module>
    tree = et.parse(s1)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113

Those aren't XML entities; they should terminate with a ;, not a .. Entity reference: w3.org/TR/xml-entity-names
– mata, Commented Sep 26, 2013 at 15:25
Not really. dash could be an HTML5 character entity, but ellip on the other hand isn't a valid entity anywhere I could find, and neither is degree...
– mata, Commented Sep 26, 2013 at 15:57
There is a list of entities in the DTD file linked from that page, but sans actual character definitions. As for the error, etree is right: without the trailing ; this is just not XML.
– bobince, Commented Sep 27, 2013 at 10:46

mzjn · Accepted Answer · 2013-09-28 18:27:31Z

4

+25

The "words" look like malformed entity references. A valid entity reference has a semicolon at the end. I looked at test-fix.xml (in Sval1to2.fix.tar.gz) and it seems very likely that &dash (or &dash.) is meant to represent some kind of dash or hyphen. The file has the .xml extension and it would be fairly close to being well-formed XML if the bad entity references were fixed.

On the page that you link to (http://www.d.umn.edu/~tpederse/data.html), it says:

Please note that our converted data will not "parse" as true xml text. This is due to the fact that in the original sense-tagged text, characters that require special handling in xml are not escaped, and so forth. We are considering ways to make this data "true" xml, and would be most grateful for any feedback on how to best do this.

So even though the document looks very much like XML, it is not XML and the people who published it are well aware of that.
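To see how far the file is from well-formed XML, it can help to list the malformed references first. The following is only a sketch: the regular expression and the `count_bad_refs` name are my own assumptions, not part of the dataset's tooling. It treats `&name.` and a bare `&name` as malformed, and leaves properly terminated references such as `&amp;` alone.

```python
import re
from collections import Counter

# Assumption: a malformed reference is "&" plus letters, followed either by
# "." or by anything that is not a letter or ";". A properly terminated
# reference such as "&amp;" does not match.
BAD_REF = re.compile(r"&([A-Za-z]+)(?:\.|(?![A-Za-z;]))")


def count_bad_refs(text):
    """Count each malformed entity name found in text."""
    return Counter(m.group(1) for m in BAD_REF.finditer(text))
```

Running it over the contents of train-fix.xml would give the set of names that need a decision, along with how often each one occurs.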

edited Sep 28, 2013 at 18:27

answered Sep 28, 2013 at 17:13

mzjn


Community · Accepted Answer · 2017-05-23 12:16:58Z

3

I found this answer, which made it possible to parse your XML with the Python lxml package:

Fetching data using Python & lxml

Install the lxml package from here: http://lxml.de/

And use this code:

import lxml.html
root = lxml.html.parse('train-fix.xml').getroot()

Hope it'll work for you

edited May 23, 2017 at 12:16


answered Sep 26, 2013 at 15:39

wilfo


alvas Over a year ago

+1 for the lxml parses but it doesn't resolve the problem of what are those characters? =(

jhermann · Accepted Answer · 2013-09-28 17:03:32Z

3

The basic but disappointing answer is: they're typos (using . instead of ;).

Here are most of them:

times → http://www.fileformat.info/info/unicode/char/d7/index.htm
degree → http://www.fileformat.info/info/unicode/char/b0/index.htm
dash → http://www.fileformat.info/info/unicode/char/search.htm?q=dash&preview=entity
ellip → http://www.fileformat.info/info/unicode/char/2026/index.htm

… and so on, you have to look at the context for some of these, to judge whether the original text author meant something specific, or simply typo'ed even worse (dashq‽).

Your most appropriate course of action is to fix the mess with a simple chain of str.replace() calls before parsing.
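A minimal sketch of that clean-up step, written for Python 3. It uses one regular expression rather than a literal chain of replace calls, so that `&dash` and `&dash.` are handled together. The `REPLACEMENTS` table is an assumption on my part: the dataset does not define these names, `dash` is guessed to be an em dash and `and` an ampersand. Names that are not in the table (`dashq`, `backquote`, and so on) only get their `&` escaped, so the file parses and the odd token stays visible in the text.

```python
import re
import xml.etree.ElementTree as et

# Assumed meanings; adjust after checking the context in the corpus.
REPLACEMENTS = {
    "times": "\u00d7",   # multiplication sign
    "degree": "\u00b0",  # degree sign
    "ellip": "\u2026",   # horizontal ellipsis
    "dash": "\u2014",    # assumption: em dash
    "and": "&amp;",      # assumption: ampersand, kept escaped for XML
}

# "&name." or a bare "&name"; properly terminated "&name;" is left alone.
BAD_REF = re.compile(r"&([A-Za-z]+)(?:\.|(?![A-Za-z;]))")


def fix_entities(text):
    """Replace known malformed references and escape the unknown ones."""
    def repl(match):
        name = match.group(1)
        if name in REPLACEMENTS:
            return REPLACEMENTS[name]
        # Unknown name: escape the ampersand, keep the rest untouched.
        return "&amp;" + match.group(0)[1:]
    return BAD_REF.sub(repl, text)


def parse_fixed(xml_text):
    """Fix the references, then parse with ElementTree."""
    return et.fromstring(fix_entities(xml_text))
```

For example, `parse_fixed("<s>20&degree.C</s>").text` gives the string `20°C`. As the page quoted in another answer says, the files also contain other unescaped characters, so this step alone may not make them parse.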

answered Sep 28, 2013 at 17:03

jhermann


LMC · Accepted Answer · 2013-10-04 18:16:17Z

If you have Linux available, use xmllint to find the errors and fix them:

xmllint --recover ~/tmp/test-fix.xml --output ~/tmp/test-fix-fixed.xml 
/home/luis/tmp/test-fix.xml:179: parser error : EntityRef: expecting ';'
inate, Hesse and the Saarland; North Rhine-Westphalia, Baden-Wu&umlaut.rttemberg
                                                                           ^
/home/luis/tmp/test-fix.xml:179: parser error : EntityRef: expecting ';'
Bavaria would remain untouched, and the planned five East German La&umlaut.nder
...
/home/luis/tmp/test-fix.xml:3832: parser error : EntityRef: expecting ';'
Charlie Watts today) we should be ready to hit the road together as Lyndon &and.
                                                                           ^
/home/luis/tmp/test-fix.xml:3841: parser error : Opening and ending tag mismatch: corpus line 1 and lexelt
</lexelt>
     ^
/home/luis/tmp/test-fix.xml:3842: parser error : Extra content at the end of the document
<lexelt item="behaviour-n">


                                                                           ^
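If xmllint is not available, the standard library can at least report where parsing stops. This is a sketch only, and `first_parse_error` is a hypothetical helper name. Unlike `xmllint --recover`, ElementTree gives up at the first error, so you would fix that spot and run it again.

```python
import xml.etree.ElementTree as et


def first_parse_error(xml_text):
    """Return ((line, column), message) for the first error, or None."""
    try:
        et.fromstring(xml_text)
    except et.ParseError as err:
        # err.position is a (line, column) tuple set by ElementTree.
        return err.position, str(err)
    return None
```

Calling it on the file contents returns the same kind of location that appears in the traceback in the question.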
