I am getting XML data from an application, which I want to parse in Python:

#!/usr/bin/python

import xml.etree.ElementTree as ET
import re

xml_file = 'tickets_prod.xml'
# Read the whole file; the with-block closes the handle automatically
with open(xml_file, 'r') as xml_file_handle:
    xml_as_string = xml_file_handle.read()

# Drop every character outside the range \x01-\x7f
xml_cleaned = re.sub(u'[^\x01-\x7f]+', u'', xml_as_string)
root = ET.fromstring(xml_cleaned)

It works for smaller datasets with example data, but when I go to real live data, I get

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72

Looking at the XML file, line 364658 contains:

WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>

I guess it is the ^[ that makes Python choke; vim also highlights it in blue. I was hoping my regex substitution would clean the data, but it did not.

The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.

  • You could try Beautiful Soup, which did a very good job of escaping invalid characters for me. Commented Oct 29, 2013 at 9:18
  • Looks like some overly clever guy wanted bold or colored output on his tty for this warning. You should remove everything from the escape up to the following 'm'. Commented Oct 29, 2013 at 9:51
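
The suggestion in the last comment (drop everything from the escape character up to the following 'm') could be sketched roughly as follows. This assumes Python 3 and that the file only contains color/style sequences of the form ESC [ ... m; the names ANSI_SGR and strip_ansi_sgr are made up for this illustration, and the sample string is a shortened stand-in for the real line.

```python
import re
import xml.etree.ElementTree as ET

# ESC, then '[', then any mix of digits, ';' or ':', then the final 'm'.
# Assumption: only color/style sequences occur; other escape sequences
# would need a broader pattern.
ANSI_SGR = re.compile(r'\x1b\[[0-9;:]*m')

def strip_ansi_sgr(text):
    """Remove terminal color/style sequences from text."""
    return ANSI_SGR.sub('', text)

# Shortened stand-in for the offending line
sample = '<description>\x1b[0;36mnotice: Not required on \x1b[0m</description>'
cleaned = strip_ansi_sgr(sample)
root = ET.fromstring(cleaned)
print(repr(root.text))
```

Unlike deleting only the escape byte, this also removes the leftover "[0;36m" fragments, so the description text comes out clean.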

2 Answers

You already do:

xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)

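but ^[ is how vim displays the ESC control character, which is \x1b in Python and therefore falls inside the \x01-\x7f range your pattern keeps. XML 1.0 allows only tab, newline and carriage return below 0x20 (space), so xml.parsers.expat chokes on it. You simply need to clean up more, by accepting only those characters below 0x20. For example: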

xml_cleaned = re.sub(u'[^\n\r\t\x20-\x7f]+',u'',xml_as_string)
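
If the data may contain legitimate non-ASCII text, a variant that removes only what XML 1.0 forbids might look like the sketch below. It assumes Python 3 (where the input is already a str); INVALID_XML_CHARS and remove_invalid_xml_chars are hypothetical names, and the character ranges are the complement of the "Char" production in the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Everything NOT allowed by XML 1.0: allowed are tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF.
INVALID_XML_CHARS = re.compile(
    '[^\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]'
)

def remove_invalid_xml_chars(text):
    """Drop characters that an XML 1.0 parser would reject."""
    return INVALID_XML_CHARS.sub('', text)

# Shortened stand-in for the offending line, with one non-ASCII letter
sample = '<description>\x1b[0;36mnotice: caf\xe9\x1b[0m</description>'
root = ET.fromstring(remove_invalid_xml_chars(sample))
print(repr(root.text))
```

Note that this only deletes the escape byte itself, so fragments such as "[0;36m" stay in the text; the parser accepts them, but they may need separate cleanup.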

2 Comments

Just wondering: is there a good resource where I can look up how special characters like ^[ are represented? I have stumbled upon problems like this more than once, and would like to know how to handle them in the future.
In short, Python uses \xNN systematically except for \t, \n and \r. On Unix the characters 0 to 31 are generally written as ^@, ^A, ..., ^Z, ^[, ^\, ^], ^^, ^_, i.e. a caret followed by the corresponding character from the range 64 to 95. There are still other representations, but I can't point you to a guide...
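
To make the mapping between the ^X display form and the actual character concrete, here is a small sketch (Python 3; the helper names caret_to_char and char_to_caret are made up for this illustration). It relies on the fact that the control character and the letter shown after the caret differ by 64, i.e. by bit 0x40.

```python
def caret_to_char(notation):
    """Convert caret notation such as '^[' to the control character it denotes."""
    if len(notation) != 2 or notation[0] != '^':
        raise ValueError('expected a caret followed by one character')
    code = ord(notation[1].upper()) ^ 0x40  # flip bit 6: '[' (0x5b) -> 0x1b
    if code > 31 and code != 127:
        raise ValueError('not a control character: %r' % notation)
    return chr(code)

def char_to_caret(ch):
    """Convert a control character to its caret notation."""
    code = ord(ch)
    if code > 31 and code != 127:
        raise ValueError('not a control character: %r' % ch)
    return '^' + chr(code ^ 0x40)

print(repr(caret_to_char('^[')))  # the ESC character from the question
print(char_to_caret('\t'))
```

So the ^[ that vim shows is character 27, which Python writes as '\x1b'.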

I know this is pretty old, but I stumbled upon the following URL, which has a list of the primary characters and their encodings.

https://medium.com/interview-buddy/handling-ascii-character-in-python-58993859c38e
