Python: same character, different behavior

Question

I'm generating file names from a list pulled from a Postgres DB with Python 2.7.9. Some words in this list contain special characters. Normally I use ''.join() to build the name and pass it to my loader, but there is one name that won't be recognized. The .py file is set to UTF-8 coding, but the words are in Portuguese, which I think is Latin-1 coding.

from pydub import AudioSegment
from pydub.playback import play
templist = ['+ Orégano','- Búfala','+ Rúcola']
count_ins = (len(templist)-1)
while (count_ins >= 0 ):
    kot_istructions = AudioSegment.from_ogg('/home/effe/voice_orders/Voz/'+"".join(templist[count_ins])+'.ogg')
    count_ins-=1
    play(kot_istructions)

The first two files are loaded:

/home/effe/voice_orders/Voz/+ Orégano.ogg

/home/effe/voice_orders/Voz/- Búfala.ogg

The third should be:

/home/effe/voice_orders/Voz/+ Rúcola.ogg

But python is trying to load

/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg

Why just this one? I've tried to use normalize() to remove the accent, but since this is a string the method didn't work. Printing works fine, as do db updates. Only file name creation doesn't work as expected. Suggestions?

Unicode strings require the "u" prefix in Python 2: [u'+ Orégano', u'- Búfala', u'+ Rúcola'].
– dlask, Commented Jun 25, 2015 at 18:25
Please use iteration instead of manually counting indices. A simple for word in templist suffices. Then get rid of the join call; it only works incidentally here because its single argument is a string, so it's not really doing what you think it is. The string representation looks like proper UTF-8 encoding; the question is: is your filesystem's encoding UTF-8?
– deets, Commented Jun 25, 2015 at 18:33
Have you considered using Python 3? The Unicode handling was redone.
– A. L. Flanagan, Commented Jun 25, 2015 at 18:47
@dlask: I can't edit the list because it is generated in real time and used in various parts of the program. @deets: Sometimes I need to edit an index count "on the fly", and that is quicker to edit than a for/in loop. My filesystem (like my db) is set to LANG=pt_BR.UTF-8. @A.L.Flanagan: I can't. I'm using Python with Odoo and I need Python 2.7.
– Federico Leoni, Commented Jun 25, 2015 at 19:35
Don't put an answer (the sentence after "Solved") into the question. Post it as your own answer instead.
– jfs, Commented Jun 26, 2015 at 21:06

Danver Braganza · Accepted Answer · 2015-06-25 18:47:49Z


It seems the root cause might be that the encoding of these names is inconsistent within your database.

If you run:

>>> 'R\xc3\xbacola'.decode('utf-8')

You get

u'R\xfacola'

which is in fact a Python unicode object, correctly representing the name. So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whack-a-mole: try to decode the raw string from your db using utf-8 and, failing that, latin-1. It would look something like this:

try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')
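The same fallback could be wrapped in a small helper so it is applied in one place. This is only a sketch: the name to_unicode and its encodings parameter are made up for illustration, and it uses byte literals so that it runs on Python 2.7 as well as Python 3. Note that latin-1 can decode any byte sequence, so it has to come last and will never raise.

```python
def to_unicode(raw, encodings=('utf-8', 'latin-1')):
    """Decode a byte string by trying each encoding in order.

    Hypothetical helper, not part of any library. Text that is already
    unicode is returned unchanged.
    """
    if not isinstance(raw, bytes):
        return raw  # already a unicode object, nothing to decode
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError('could not decode %r with %r' % (raw, encodings))


# UTF-8 bytes and Latin-1 bytes for the same word end up as the same text.
from_utf8 = to_unicode(b'R\xc3\xbacola')
from_latin1 = to_unicode(b'R\xfacola')
```

Both from_utf8 and from_latin1 compare equal to u'R\xfacola'. The guess can still be wrong for Latin-1 data that happens to be valid UTF-8, which is why this remains a workaround and not a fix for the database.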

As a general rule, always work with clean unicode objects within your own code, and only convert to an encoding when saving data out. Also, don't let people insert data into a database without specifying the encoding; enforcing that would keep you from having this problem in the first place.
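Applied to the file names in the question, that rule might look roughly like the sketch below. The directory comes from the question; voice_path is a hypothetical helper, and the list is assumed to hold unicode objects already. It iterates over the list directly, so add reversed() if the original last-to-first order matters.

```python
import os

VOICE_DIR = u'/home/effe/voice_orders/Voz'  # directory taken from the question


def voice_path(name, directory=VOICE_DIR):
    """Build the .ogg path as text; refuse byte strings (hypothetical helper)."""
    if isinstance(name, bytes):
        raise TypeError('decode %r to unicode before building a path' % (name,))
    return os.path.join(directory, name + u'.ogg')


# Inside the program: unicode only, written with escapes to stay ASCII-safe.
templist = [u'+ Or\xe9gano', u'- B\xfafala', u'+ R\xfacola']
paths = [voice_path(name) for name in templist]

# At the boundary: encode explicitly only if an API really needs bytes.
encoded_paths = [path.encode('utf-8') for path in paths]
```

Failing loudly on byte strings is a design choice here: it pushes any decoding to the point where the data enters the program, so a mixed list is caught early.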

Hope that helps!


3 Comments

Federico Leoni Over a year ago

With .encode()/.decode() I get the same error, but with a different encoding: IOError: [Errno 2] No such file or directory: u'/home/effe/voice_orders/Voz/+ R\xfacola.ogg'. This means your conversion is indeed working, but it is not the result I need. Again, why just with ú in this position and not in the other cases?

Danver Braganza Over a year ago

If you're looking for a file that already exists, what is the actual encoding in its name? Can you browse to it and see what its name is? Try running os.listdir('/home/effe/voice_orders/Voz/') and see how it's represented in your system.
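Printed names can look identical while differing underneath, so it may help to dump the code points of each directory entry. The following is a sketch with made-up helper names (code_points, describe_directory); passing a unicode path makes os.listdir return text names on both Python 2 and Python 3.

```python
import os


def code_points(name):
    """Return the code points of a name, e.g. 'U+00FA' for a precomposed u-acute."""
    return ['U+%04X' % ord(ch) for ch in name]


def describe_directory(directory):
    """Map every entry of `directory` to the list of its code points."""
    return dict((name, code_points(name)) for name in os.listdir(directory))


# Two spellings that print the same way but are different strings:
precomposed = code_points(u'R\xfacola')    # one character for the accented u
combining = code_points(u'Ru\u0301cola')   # plain 'u' followed by U+0301
```

Calling describe_directory(u'/home/effe/voice_orders/Voz/') on the real directory and comparing the result with code_points() of the name coming from the database would show whether the two actually match.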

Federico Leoni Over a year ago

os.listdir didn't report anything out of place. Anyway, this WAS a problem with the file. Deleting the file and creating a new one solved the problem. Please don't ask me why. Thank you for pointing it out; +1 on your reply because, even if that was not the cause, your method enlightened me on how encode/decode works.

Federico Leoni · Accepted Answer · 2015-06-27 00:37:55Z


Solved: It was a problem with the file itself. Deleting it and creating it again did the job.
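The thread never establishes why recreating the file helped. One possible explanation, offered here only as an assumption, is that the old file name used a different Unicode normalization form: ú can be stored as the single code point U+00FA or as a plain u followed by the combining accent U+0301, and the two look the same when printed. The sketch below shows how unicodedata.normalize could be used to compare such names; same_text and find_lookalike are hypothetical helpers. In Python 2, normalize() accepts only unicode objects, which would also explain why it failed on a byte string.

```python
import unicodedata

composed = u'R\xfacola'        # NFC: U+00FA, one code point for the accented u
decomposed = u'Ru\u0301cola'   # NFD: 'u' plus combining acute accent U+0301


def same_text(a, b):
    """Compare two unicode strings after normalizing both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


def find_lookalike(wanted, candidates):
    """Return the first candidate equal to `wanted` up to normalization, else None."""
    target = unicodedata.normalize('NFC', wanted)
    for candidate in candidates:
        if unicodedata.normalize('NFC', candidate) == target:
            return candidate
    return None


# The raw strings differ, and so do their UTF-8 bytes, yet they are the same text.
differs_raw = composed != decomposed
matches_normalized = same_text(composed, decomposed)
```

If this were the cause, find_lookalike(name + u'.ogg', os.listdir(u'/home/effe/voice_orders/Voz/')) would return the on-disk spelling to open, assuming name is already unicode.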
