I am trying to load a file saved as UTF-8, containing text in 14 different languages, into Python (version 2.6.6). I am using the Python codecs module to decode the txt file.

import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    line = filter_str(lines.decode("utf-8"))

This all works well. I parse the entire file and then want to export 14 different language files. The problem that I can't understand is the following.

I use the following code for output:

malangout = codecs.open("C:/temp/polish.txt", 'w', 'utf-8', 'surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()

Example:

  • Language: Polish
  • Expected output: Dziennik zakłóceń
  • Actual output: Dziennik zak‚óceƒ

The string is stored as is:

u'Dziennik zak\u201a\xf3ce\u0192'

I have tried many encodings from the Python docs (7.8 codecs). Any information would help at this point.

  • You say in a comment: "I save an ascii file as an UTF-8 in notepad": ascii is a subset of utf8, that wouldn't cause a problem. Do you mean "ANSI" instead of "ascii"? What is the result of import locale; print(locale.getpreferredencoding()) on your system? Commented Jan 22, 2012 at 21:32

1 Answer

The string is stored as is:

u'Dziennik zak\u201a\xf3ce\u0192'

Well, that's a problem since

In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ

in contrast to

In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
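To make the difference between the two strings explicit, here is a minimal sketch that names the code points at each position where they differ. The helper name diff_chars is hypothetical; unicodedata.name is from the standard library.

```python
import unicodedata

stored = u'Dziennik zak\u201a\xf3ce\u0192'    # what the question reports
expected = u'Dziennik zak\u0142\xf3ce\u0144'  # the intended Polish text

def diff_chars(a, b):
    """Return (index, name_in_a, name_in_b) for each position where a and b differ."""
    return [(i, unicodedata.name(x), unicodedata.name(y))
            for i, (x, y) in enumerate(zip(a, b)) if x != y]

differences = diff_chars(stored, expected)
for index, got, wanted in differences:
    print('%d: got %s, wanted %s' % (index, got, wanted))
```

Only two positions differ: the stored string holds a low quotation mark and an f with hook where the Polish letters l with stroke and n with acute should be.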

So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain

In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'

or

In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'

?
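One way to answer that question is to read the file in binary mode and look for each byte sequence directly. This is only a sketch: the function names classify and classify_file are hypothetical, and it assumes the phrase appears in the file exactly as spelled here.

```python
import io

# UTF-8 bytes of the intended text and of the garbled text from the question.
GOOD_BYTES = u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
BAD_BYTES = u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')

def classify(raw):
    """Report which form of the phrase the raw bytes contain."""
    if GOOD_BYTES in raw:
        return 'correct'
    if BAD_BYTES in raw:
        return 'already corrupted'
    return 'phrase not found'

def classify_file(path):
    """Read the file as bytes (no decoding) and classify its contents."""
    with io.open(path, 'rb') as f:
        return classify(f.read())

# Example call, using the path from the question:
# print(classify_file('C:/temp/list_test.txt'))
```

If this reports 'already corrupted', the wrong characters were written into the file before Python ever read it, and no choice of codec on the output side will repair them.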


PS. You may want to change

temp + '\n'

to

temp + u'\n'

to make it clear that you are concatenating two unicode objects to form a unicode. Both expressions give the same result in Python 2, where a str operand is implicitly decoded (as ASCII) before concatenation. In Python 3, '\n' is itself unicode, so this particular expression still works, but mixing text with bytes raises a TypeError. I think the challenge in transitioning to Python 3 will be in changing one's mental attitude toward mixing unicode and str: in Python 2 the conversion is silently attempted for you; in Python 3 it is disallowed.
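A small sketch of that difference, written to run on either major version. The variable name mixing_allowed is just for illustration.

```python
import sys

temp = u'Dziennik zak\u0142\xf3ce\u0144'

# Text + text: valid on both Python 2 and Python 3.
line = temp + u'\n'

# Text + bytes: Python 2 decodes the bytes as ASCII and concatenates;
# Python 3 refuses and raises TypeError.
try:
    temp + b'\n'
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(repr(line))
print('Python %d allows text + bytes: %s' % (sys.version_info[0], mixing_allowed))
```

On Python 2 the implicit decode succeeds here only because b'\n' is pure ASCII; a byte string with non-ASCII bytes would raise UnicodeDecodeError instead.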

6 Comments

You bring up a great point. It is stored that way. I save an ascii file as UTF-8 in Notepad, and I don't think you can do that.
Do I need to do a special remapping before saving as UTF-8?
Your Python code looks fine overall (see PS above). There may be an indentation problem in the post, but besides that, I don't see a problem.
OK, so the problem is my data file and the way it is stored, correct?
Yes, stay away from Notepad. :)