I am getting xml data from an application, which I want to parse in python:
#!/usr/bin/python
import xml.etree.ElementTree as ET
import re
xml_file = 'tickets_prod.xml'
xml_file_handle = open(xml_file,'r')
xml_as_string = xml_file_handle.read()
xml_file_handle.close()
xml_cleaned = re.sub(u'[^\x01-\x7f]+',u'',xml_as_string)
root = ET.fromstring(xml_cleaned)
It works for smaller datasets with example data, but when I go to real live data, I get
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 364658, column 72
Looking at the xml file, I see this line 364658:
WARNING - (1 warnings in check_logfiles.protocol-2013-05-28-12-53-46) - ^[[0:36mnotice: Scope(Class[Hwsw]): Not required on ^[[0m</description>
I guess it is the ^[ which makes python choke - it is also highlighted blue in vim. Now I was hoping that I could clean the data with my regex substitution, but that did not work.
The best thing would be fixing the application which generated the xml, but that is out of scope. So I need to deal with the data as it is. How can I work around this? I could live with just throwing away "illegal" characters.