Parsing xml with "not well-formed" characters in python

Question

I am getting xml data from an application, which I want to parse in python:

#!/usr/bin/python

import re
import xml.etree.ElementTree as ET

xml_file = 'tickets_prod.xml'

# Read the whole file into memory; the with-block closes the handle.
with open(xml_file, 'r') as xml_file_handle:
    xml_as_string = xml_file_handle.read()

# Remove every character outside the range \x01-\x7f, then parse.
xml_cleaned = re.sub(u'[^\x01-\x7f]+', u'', xml_as_string)
root = ET.fromstring(xml_cleaned)

It works for smaller datasets with example data, but when I go to real live data, I get

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72

Looking at the xml file, I see this line 364658:

WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>

I guess it is the ^[ which makes python choke - it is also highlighted blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.

The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.

You could try Beautiful Soup, which did a very good job of escaping invalid characters for me.
– Jakob, Commented Oct 29, 2013 at 9:18
Looks like some overly clever guy wanted bold or colored output on his tty for this warning. You should remove everything from the escape character up to the following 'm'.
– Ingo, Commented Oct 29, 2013 at 9:51
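
Ingo's suggestion can be sketched roughly as follows. This assumes the ^[ in the file is a real ESC byte (\x1b) and that the sequences are terminal color codes of the form ESC [ ... m. The names ANSI_SGR and strip_ansi_colors are made up for illustration, and the sample line is adapted from the question.

```python
import re

# ESC, '[', any run of digits / ';' / ':', then the final 'm'.
ANSI_SGR = re.compile(r'\x1b\[[0-9;:]*m')


def strip_ansi_colors(text):
    """Remove terminal color sequences, keeping the surrounding text."""
    return ANSI_SGR.sub('', text)


line = '\x1b[0:36mnotice: Scope(Class[Hwsw]): Not required on \x1b[0m'
print(strip_ansi_colors(line))
```

Unlike dropping only the ESC character, this also removes the leftover "[0:36m" and "[0m" fragments from the text.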

Armin Rigo · Accepted Answer · 2013-10-29 09:44:34Z

3

You already do:

xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)

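but the character ^[ is most likely the escape character, which Python writes as \x1b. It falls inside the range \x01-\x7f, so your regex keeps it. If xml.parsers.expat chokes on it, you simply need to clean up more, by accepting only some of the characters below 0x20 (space). For example: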

xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
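
As a self-contained sketch of the same substitution, assuming Python 3 and a document already read into a str: the helper name strip_invalid_xml_chars and the sample string are illustrative only, not from the question.

```python
import re
import xml.etree.ElementTree as ET

# Keep tab, newline, carriage return and \x20-\x7f; drop everything else.
_DISALLOWED = re.compile(r'[^\n\r\t\x20-\x7f]+')


def strip_invalid_xml_chars(text):
    """Return text without the control and non-ASCII characters."""
    return _DISALLOWED.sub('', text)


sample = '<description>\x1b[0;36mnotice: Not required\x1b[0m</description>'

try:
    ET.fromstring(sample)            # the raw ESC character is rejected
except ET.ParseError as exc:
    print('raw input fails:', exc)

root = ET.fromstring(strip_invalid_xml_chars(sample))
print(root.text)
```

Note that this only drops the ESC character itself; the rest of the color code (such as "[0;36m") stays in the text, and all non-ASCII characters are discarded too.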

answered Oct 29, 2013 at 9:44


2 Comments

Isaac Over a year ago

Just wondering: is there a good resource where I can look up how special characters like ^[ are sometimes represented? I have stumbled upon problems like this more than once, and would like to know how to handle them in the future.

Armin Rigo Over a year ago

In short, Python uses \xNN systematically except for \t, \n and \r. On Unix, the characters 0 to 31 are generally written as ^@, ^A, ..., ^Z, ^[, ^\, ^], ^^, ^_, i.e. using the characters 64 to 95 after the caret. There are still other representations, but I can't point you to a guide...
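
The mapping between a control character's code and the symbol shown for it can be sketched like this, assuming the caret form (^[) that vim displays in the question. The function name caret_notation is invented for the example.

```python
def caret_notation(ch):
    """Show a control character the way vim or a terminal usually does."""
    code = ord(ch)
    if code < 32:
        # Characters 0..31 map onto characters 64..95: '@', 'A', ..., '_'.
        return '^' + chr(code + 64)
    if code == 127:
        # DEL is conventionally shown as ^?.
        return '^?'
    return ch


for ch in ('\x00', '\t', '\x1b'):
    print(repr(ch), '->', caret_notation(ch))
```

So the ^[ from the question corresponds to code 27, which Python writes as \x1b.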

captam3rica · Answer · 2022-06-25 10:35:33Z

0

I know this is pretty old, but I stumbled upon the following URL, which lists the primary characters and their encodings.

https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e

edited Jun 25, 2022 at 10:35

answered Apr 1, 2019 at 20:41

