I recently wrote a script that extracts all bookmarks from a PDF and saves them in a docx file. It works for 90% of the files, but unfortunately some of them seem to have problems with Unicode.

I get the bookmarks in a list like this:

[[u'3. Mechatronik f\xfcr Doppelkupplungsgetriebe, Sicherungshalter B, Sicherung 14 auf Sicherungshalter C', 2],
[u'4. Geber f\xfcr Getriebeeingangsdrehzahl, Hydraulikdruckgeber 1 f\xfcr automatisches Getriebe, Magnetventil 2, Magnetventil \x04, Magnetventil 5', 2],
[u'5. W\xe4hlhebel, Schalter f\xfcr W\xe4hlhebel in P gesperrt, Magnet f\xfcr W\xe4hlhebelsperre', 2], 
[u'6. W\xe4hlhebel, Geber 2 f\xfcr Antriebswellendrehzahl, W\xe4hlhebel-Positionsanzeige', 2]]

When I try to run the function, I get this error:

ValueError('All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters',)

Code:

from docx import Document

list1 = [[u'3. Mechatronik f\xfcr Doppelkupplungsgetriebe, Sicherungshalter B, Sicherung 14 auf Sicherungshalter C', 2],
    [u'4. Geber f\xfcr Getriebeeingangsdrehzahl, Hydraulikdruckgeber 1 f\xfcr automatisches Getriebe, Magnetventil 2, Magnetventil \x04, Magnetventil 5', 2],
    [u'5. W\xe4hlhebel, Schalter f\xfcr W\xe4hlhebel in P gesperrt, Magnet f\xfcr W\xe4hlhebelsperre', 2],
    [u'6. W\xe4hlhebel, Geber 2 f\xfcr Antriebswellendrehzahl, W\xe4hlhebel-Positionsanzeige', 2]]

def save_docx(list1):
    document = Document('default.docx')
    # Log every bookmark that python-docx rejects instead of aborting the run.
    log_file = open("Error_Log.txt", 'w')
    for i in list1:
        try:
            p = document.add_paragraph()
            p.add_run(i[0]).bold = True
        except Exception as e:
            log_file.write(repr(e) + '\n')
    log_file.close()
    document.save('Bookmarks.docx')

save_docx(list1)

I'm guessing the problem is the \x04, but I cannot figure out how to remove characters like this without ruining the whole document. I have tried different encodings and everything else I could find online, but nothing has worked so far.

Any help would be much appreciated!

  • Did you try i[0].encode('utf-8'), based on the discussion in stackoverflow.com/questions/5760936/…? Commented Dec 7, 2016 at 10:54
  • Yes, I tried decoding and encoding in various ways, e.g. i[0].encode('ascii', 'ignore'), but nothing worked. I also looked at libraries that might help, but no luck so far. Commented Dec 7, 2016 at 11:03
  • Nice answer from @jackmorris. Could it be that the control character was still in the string after the encode? Then the end result would be the same (error 'no control characters'). Commented Dec 7, 2016 at 11:23

1 Answer

Your assumption seems correct: \x04 is a control character, and your error message explicitly states that control characters aren't allowed.
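To confirm which characters are responsible before changing anything, a small diagnostic along these lines could help. This is only a sketch: find_control_chars is a hypothetical helper name, not part of python-docx, and the sample string is a shortened version of the second bookmark from the question.

```python
import unicodedata


def find_control_chars(text):
    """Return (index, code point, category) for each character whose
    Unicode general category starts with 'C' (hypothetical helper)."""
    return [(index, ord(char), unicodedata.category(char))
            for index, char in enumerate(text)
            if unicodedata.category(char).startswith('C')]


# Shortened sample taken from the second bookmark in the question.
sample = u'Magnetventil 2, Magnetventil \x04, Magnetventil 5'
hits = find_control_chars(sample)
for index, code_point, category in hits:
    print('index %d: U+%04X (%s)' % (index, code_point, category))
```

Running this over every bookmark title should show whether \x04 is the only offender or whether other files contain different control characters.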

You can filter out control characters from your strings before adding them to the document, which should fix your issue. This can be done with Python's unicodedata module, specifically unicodedata.category. The categories you want to exclude start with 'C' (from http://www.unicode.org/reports/tr44/#GC_Values_Table), which covers the control characters (Cc) as well as format, surrogate, private-use and unassigned code points.

The following should work, in place of your current add_run line:

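import unicodedata  # place this at the top of the script

# Join explicitly so the result is a string on both Python 2 and Python 3.
line = u''.join(c for c in i[0] if unicodedata.category(c)[0] != 'C')
p.add_run(line).bold = True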
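The same idea can be wrapped in a reusable function. The sketch below is an assumption-laden variant rather than a definitive fix: strip_control_chars and its keep parameter are hypothetical names, and keeping tab, newline and carriage return is a choice made here because XML 1.0 permits those three characters even though they fall in category Cc.

```python
import unicodedata


def strip_control_chars(text, keep=u'\t\n\r'):
    """Drop every character whose general category starts with 'C',
    except the ones listed in `keep` (hypothetical helper and parameter)."""
    return u''.join(
        char for char in text
        if char in keep or unicodedata.category(char)[0] != 'C'
    )


# The umlaut (category 'Ll') survives; only \x04 (category 'Cc') is removed.
cleaned = strip_control_chars(u'Magnetventil 2, Magnetventil \x04, f\xfcr')
print(repr(cleaned))
```

In the question's loop this would be called as p.add_run(strip_control_chars(i[0])).bold = True. Pass keep=u'' if tabs and line breaks should be removed as well.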

As an aside, the typical way of including unicode characters in a unicode string is with \uXXXX, rather than \xXX (where XXXX is the hex of the unicode code point).

6 Comments

The category returned by unicodedata for \x04 is Cc, not C. And I wouldn't say that the \uXXXX notation is the "typical" way, there is no difference between \xXX, \u00XX and \U000000XX for a code point below 256, and python itself always uses the shortest possible form, e.g ascii("\U000000FF") (or repr(u"\U000000FF") in python2) gives \xff.
The category 'C' includes 'Cc', as well as 'Cf', which is a format control character.
To the other point, 'typical' is probably the wrong word to use, however I think it makes more sense to specify unicode characters as code points rather than byte values, particularly when you exceed 256. You're right in saying that it makes no difference for low-valued code points.
Amazing answer! Thank you very much! I'm quite new to Python and this would have taken me ages to figure out.
Yes, but you're comparing unicodedata.category(c) != 'C', which will fail if the returned category is Cc and therefore filter nothing; you'd need to compare only the first character. And as the OP probably didn't type that string but copied its representation from somewhere, suggesting to change escape sequences seems a bit excessive. I prefer Python's way of using the shortest possible form to escape a code point; it's just a different way of expressing numeric values. That the same escape form can be used to represent a byte value in a different context has nothing to do with Unicode.
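The two points debated in these comments can be checked directly in an interpreter. A minimal sketch, using the soft hyphen U+00AD as an example of a format character (that example is not from the thread):

```python
import unicodedata

# For code points below 256 the three escape forms denote the same character.
same_character = (u'\xfc' == u'\u00fc' == u'\U000000fc')

# unicodedata.category returns a two-letter code, so comparing the whole
# value against 'C' never matches; only the first letter identifies the class.
category = unicodedata.category(u'\x04')
kept_by_full_comparison = (category != 'C')       # character would be kept
kept_by_first_letter = (category[0] != 'C')       # character would be dropped

# A format character also falls under the 'C' class.
soft_hyphen_category = unicodedata.category(u'\xad')

print(same_character, category, soft_hyphen_category)
```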