Python codecs module

Question

I am trying to load a file saved as UTF-8 into Python (version 2.6.6). The file contains text in 14 different languages. I am using the Python codecs module to decode the text file.

import codecs

f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    # filter_str is my own helper (not shown here)
    line = filter_str(lines.decode("utf-8"))

This all works well. I parse the entire file and then want to export 14 separate language files. The problem I can't understand shows up in the output.

I use the following code for output:

# Note: 'surrogateescape' is a Python 3 error handler; Python 2.6 does not register it.
malangout = codecs.open("C:/temp/polish.txt", 'w', 'utf-8', 'surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()

Example:

Language: Polish
Expected output: Dziennik zakłóceń
Actual output: Dziennik zak‚óceƒ

The string is stored as is:

u'Dziennik zak\u201a\xf3ce\u0192'

I have tried many encodings from the Python docs (section 7.8, codecs). Any information would help at this point.

You say in a comment: "I save an ascii file as an UTF-8 in notepad". ASCII is a subset of UTF-8, so that wouldn't cause a problem. Do you mean "ANSI" instead of "ascii"? What is the result of import locale; print(locale.getpreferredencoding()) on your system?
– John Machin, commented Jan 22, 2012 at 21:32

unutbu · Accepted Answer · 2012-01-23 10:51:46Z

The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'

Well, that's a problem since

In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ

in contrast to

In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
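To see exactly which characters differ between the two strings, one option is to look up their Unicode names. The following is only a sketch; the helper name describe_differences is made up for illustration and is not part of the original answer.

```python
# -*- coding: utf-8 -*-
import unicodedata

stored = u'Dziennik zak\u201a\xf3ce\u0192'    # what the question reports
expected = u'Dziennik zak\u0142\xf3ce\u0144'  # the intended Polish text


def describe_differences(a, b):
    """Return (index, name_in_a, name_in_b) for each position where a and b differ.

    Hypothetical helper; it assumes both strings have the same length.
    """
    return [(i, unicodedata.name(x), unicodedata.name(y))
            for i, (x, y) in enumerate(zip(a, b)) if x != y]


differences = describe_differences(stored, expected)
for index, got, wanted in differences:
    print("%d: %s (expected %s)" % (index, got, wanted))
```

Only two positions differ: the stored string holds a low quotation mark and an f with hook where the Polish l with stroke and n with acute should be.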

So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain

In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'

or

In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'

?
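One way to answer that question is to read the file in binary mode and search for each byte sequence. This is a rough sketch, not a full diagnosis: classify_file is a hypothetical helper, and the demo writes a temporary file instead of touching C:/temp/list_test.txt.

```python
import os
import tempfile

# The two candidate byte sequences from the answer above.
GOOD = u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
BAD = u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')


def classify_file(path):
    """Report which of the two byte sequences the file contains (hypothetical helper)."""
    with open(path, 'rb') as f:  # binary mode: no decoding happens here
        raw = f.read()
    if GOOD in raw:
        return 'good'
    if BAD in raw:
        return 'bad'
    return 'neither'


# Demo on a temporary file that holds the damaged bytes.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(BAD + b'\n')
result = classify_file(demo_path)
os.remove(demo_path)
print(result)
```

If the real file reports 'bad', the damage happened before Python ever read it, and no choice of codec on the output side will repair it.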

PS. You may want to change

temp + '\n'

to

temp + u'\n'

to make it clear that you are adding two unicode objects together to form a unicode. The two lines above have the same result in Python 2, but in Python 3, adding text (str) and bytes together raises a TypeError. Even though '\n' is already unicode text in Python 3, I think the challenge in transitioning to Python 3 will be changing one's mental attitude toward mixing unicode and str. In Python 2 the conversion is silently attempted for you; in Python 3 it is disallowed.
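The difference can be seen in a few lines. This is a small illustration, assuming only the standard library; the variable names are invented for the example.

```python
import sys

text = u'Dziennik zak\u0142\xf3ce\u0144'

# Text plus text works the same way on Python 2 and Python 3.
line = text + u'\n'

try:
    # b'\n' is a byte string. Python 2 silently decodes it as ASCII and
    # concatenates; Python 3 refuses and raises TypeError.
    mixed = text + b'\n'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print("Python %d allows mixing: %s" % (sys.version_info[0], mixing_allowed))
```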

edited Jan 23, 2012 at 10:51

answered Jan 22, 2012 at 15:46

unutbu

6 Comments

user1163567 Over a year ago

You bring up a great point. It is stored that way. I save an ascii file as an UTF-8 in notepad and I don't think you can do that.

user1163567 Over a year ago

Do I need to do a special remapping before saving as UTF-8?

unutbu Over a year ago

Your Python code looks fine overall (see PS above). There may be an indentation problem in the post, but besides that, I don't see a problem.

user1163567 Over a year ago

OK, so the problem is my data file and the way it is stored, correct?

unutbu Over a year ago

Yes, stay away from notepad. :)
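If the data file can be regenerated from a trustworthy source, writing it from Python removes the editor from the picture. The sketch below assumes io.open (available from Python 2.6 onward) and uses a temporary file; write_lines and read_lines are hypothetical helpers, not something from the thread.

```python
import io
import os
import tempfile


def write_lines(path, lines):
    """Write unicode strings to path as UTF-8, one per line (hypothetical helper)."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for line in lines:
            out.write(line + u'\n')


def read_lines(path):
    """Read a UTF-8 file back into a list of unicode strings (hypothetical helper)."""
    with io.open(path, 'r', encoding='utf-8') as src:
        return [line.rstrip(u'\n') for line in src]


# Round trip through a temporary file to confirm nothing is altered.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
original = [u'Dziennik zak\u0142\xf3ce\u0144']
write_lines(demo_path, original)
round_tripped = read_lines(demo_path)
os.remove(demo_path)
print(round_tripped == original)
```

A round trip like this only shows that Python reads back what it wrote; it cannot repair text that was already damaged before it reached Python.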
