Python removing invalid ascii characters

Question

I recently wrote a script to extract all bookmarks from a PDF and save them in a docx file. It works for 90% of the files, but unfortunately some seem to have problems with Unicode.

I get the bookmarks in a list like this:

[[u'3. Mechatronik f\xfcr Doppelkupplungsgetriebe, Sicherungshalter B, Sicherung 14 auf Sicherungshalter C', 2],
[u'4. Geber f\xfcr Getriebeeingangsdrehzahl, Hydraulikdruckgeber 1 f\xfcr automatisches Getriebe, Magnetventil 2, Magnetventil \x04, Magnetventil 5', 2],
[u'5. W\xe4hlhebel, Schalter f\xfcr W\xe4hlhebel in P gesperrt, Magnet f\xfcr W\xe4hlhebelsperre', 2], 
[u'6. W\xe4hlhebel, Geber 2 f\xfcr Antriebswellendrehzahl, W\xe4hlhebel-Positionsanzeige', 2]]

When I try to run the function, I get the error:

ValueError('All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters',)

Code:

from docx import Document

list1 = [[u'3. Mechatronik f\xfcr Doppelkupplungsgetriebe, Sicherungshalter B, Sicherung 14 auf Sicherungshalter C', 2],
    [u'4. Geber f\xfcr Getriebeeingangsdrehzahl, Hydraulikdruckgeber 1 f\xfcr automatisches Getriebe, Magnetventil 2, Magnetventil \x04, Magnetventil 5', 2],
    [u'5. W\xe4hlhebel, Schalter f\xfcr W\xe4hlhebel in P gesperrt, Magnet f\xfcr W\xe4hlhebelsperre', 2],
    [u'6. W\xe4hlhebel, Geber 2 f\xfcr Antriebswellendrehzahl, W\xe4hlhebel-Positionsanzeige', 2]]

def save_docx(list1):
    document = Document('default.docx')
    file = open("Error_Log.txt", 'w')
    for i in list1:
        try:
            p = document.add_paragraph()
            p.add_run(i[0]).bold = True
        except Exception as e:
            # Log the failing bookmark's error instead of aborting the run.
            file.write(repr(e) + '\n')
    file.close()
    document.save('Bookmarks.docx')

save_docx(list1)

I'm guessing the problem is the \x04, but I cannot figure out how to remove parts like this without ruining the whole document. I have tried different encodings and anything else I could find online, but nothing has worked so far.

Any help would be much appreciated!

Did you try this? i[0].encode('utf-8'), based on the discussion in stackoverflow.com/questions/5760936/…
– Gerrit Verhaar, Commented Dec 7, 2016 at 10:54
Yes, I tried de- and encoding in various ways, e.g. i[0].encode('ascii', 'ignore') etc. Nothing worked. I also looked at libraries that might help, but no luck so far.
– TacashiX, Commented Dec 7, 2016 at 11:03
Nice answer from @jackmorris. Could it be that after the encode the control character was still in the string? Then the end result would be the same (error 'no control characters').
– Gerrit Verhaar, Commented Dec 7, 2016 at 11:23

JPEG_ · Accepted Answer · 2016-12-07 13:10:12Z

1

Your assumption seems correct: \x04 is a control character, and your error message explicitly states that control characters aren't allowed.

You can filter out control characters from your strings before adding them to the document, which should fix your issue. This can be done with Python's unicodedata module, specifically unicodedata.category. The categories you want to exclude are those whose code starts with 'C' (see http://www.unicode.org/reports/tr44/#GC_Values_Table). This group covers control characters (Cc) as well as format (Cf), surrogate (Cs), private-use (Co) and unassigned (Cn) code points.
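To see which characters in a string fall into one of the 'C' categories, a small diagnostic along the following lines may help. It is only a sketch: the sample string is a shortened version of the bookmark from the question, and the zero width space (\u200b) is added here purely to illustrate a second 'C' category.

```python
import unicodedata

# Shortened sample from the question; u'\u200b' (zero width space) is an
# assumption added only to show a category other than Cc.
sample = u'Magnetventil \x04, f\xfcr\u200b'

# Pair every character with its Unicode general category.
categories = [(ch, unicodedata.category(ch)) for ch in sample]

# Keep only the characters whose category code starts with 'C'.
flagged = [(repr(ch), cat) for ch, cat in categories if cat.startswith('C')]
print(flagged)
```

Here \x04 is reported as Cc and \u200b as Cf, while the umlaut \xfc is an ordinary lowercase letter (Ll) and is not flagged.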

The following should work, in place of your current add_run line:

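import unicodedata  # at the top of the script

# join() returns a string on both Python 2 and Python 3
line = u''.join(c for c in i[0] if unicodedata.category(c)[0] != 'C')
p.add_run(line).bold = True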
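One thing to be aware of: the filtering step above drops every character in a 'C' category, and that includes tab, line feed and carriage return, which XML 1.0 does permit. If those should survive, a variant like the following could be used. This is a sketch under assumptions: strip_control_chars and KEEP are hypothetical names chosen for illustration, not part of python-docx or unicodedata.

```python
import unicodedata

# Hypothetical helper; the name and the KEEP default are illustrative only.
KEEP = u'\t\n\r'  # control characters that XML 1.0 still permits


def strip_control_chars(text, keep=KEEP):
    """Return text without characters whose Unicode category starts with 'C',
    except for the characters listed in keep."""
    return u''.join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch)[0] != 'C'
    )


# Shortened bookmark from the question, with a tab added for illustration.
bookmark = u'Magnetventil 2, Magnetventil \x04, f\xfcr\tGetriebe'
cleaned = strip_control_chars(bookmark)
print(repr(cleaned))
```

In the question's loop this would be called as p.add_run(strip_control_chars(i[0])).bold = True. Note that the character removed here cannot be recovered: whatever \x04 originally stood for in the PDF bookmark is simply lost, so the result reads "Magnetventil ," with a gap.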

As an aside, the typical way of including unicode characters in a unicode string is with \uXXXX, rather than \xXX (where XXXX is the hex of the unicode code point).
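For code points below 256 the choice of escape form is purely notational, which a quick check can confirm. The variable names below are arbitrary and used only for illustration.

```python
# Three escape forms for U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
a = u'\xfc'
b = u'\u00fc'
c = u'\U000000fc'

# All three literals produce the same one-character string.
print(a == b == c)
print(len(a), hex(ord(a)))
```

The \uXXXX and \UXXXXXXXX forms only become necessary for code points that do not fit in two hex digits.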

edited Dec 7, 2016 at 13:10

answered Dec 7, 2016 at 11:04


6 Comments

mata Over a year ago

The category returned by unicodedata for \x04 is Cc, not C. And I wouldn't say that the \uXXXX notation is the "typical" way: there is no difference between \xXX, \u00XX and \U000000XX for a code point below 256, and Python itself always uses the shortest possible form, e.g. ascii("\U000000FF") (or repr(u"\U000000FF") in Python 2) gives \xff.

JPEG_ Over a year ago

The category 'C' includes 'Cc', as well as 'Cf', which is a format control character.

JPEG_ Over a year ago

To the other point, 'typical' is probably the wrong word to use, however I think it makes more sense to specify unicode characters as code points rather than byte values, particularly when you exceed 256. You're right in saying that it makes no difference for low-valued code points.

TacashiX Over a year ago

Amazing answer! Thank you very much! I'm quite new to Python and this would have taken me ages to figure out.

mata Over a year ago

Yes, but you're comparing unicodedata.category(c) != 'C', which will fail if the returned category is Cc and therefore filter nothing, you'd need to only compare the first character. And as the OP probably didn't type that string but copy its representation from somewhere, suggesting to change escape sequences seems a bit excessive. I prefer python's way of using the shortest possible form to escape a code point, it's just a different way of expressing numeric values. That the same escape form can be used to represent a byte value in a different context has nothing to do with unicode.
