Converting character codes to unicode [Python]

Question

I have a large CSV of French verbs that I am using in a program. In the CSV, verbs with accented characters show codes instead of the actual accents:

être is Ãªtre, for example (at least when I open the file in Excel)

Here is the csv:

https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv

In Chrome and Firefox, at least, the codes are converted to the correct accents. I was wondering, once the string is imported into a Python variable, e.g.

...
for row in reader:
        inf_lst.append(row[0])
verb = inf_lst[2338]

#(verb = Ãªtre)

whether there is a straightforward, built-in way to print it with the correct Unicode characters, giving "être"?

I am aware that you could do this by replacing the Ãª's with ê's in each string but since this would have to be done for each different possible accent, I was wondering if there was an easier way. Thanks,

Which version of Python are you using? Is it in a file or online like your example? Can you boil this down to a single line example text holding the word you are interested in?
– tdelaney, Commented Dec 16, 2016 at 20:52

Sean Kennedy · Accepted Answer · 2016-12-16 20:48:03Z

1

In Python 2 you can write a Unicode string literal by prefixing the string with 'u'.

>>> foo = u'être'
>>> print foo
être


tdelaney Over a year ago

That's how you write unicode literals in python 2.x but I don't see how it helps when reading data from a file or the web.

tdelaney · Accepted Answer · 2016-12-16 21:25:14Z

It all comes down to the character encoding of the data. It's possible that the file is UTF-8 encoded and you are viewing it in a Windows tool that uses your local code page, which displays the same bytes differently. How to read and write files is covered in the csv doc examples.
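To make the code page mismatch concrete, here is a minimal Python 3 sketch that reproduces the garbled form and then reverses it. It assumes the viewing tool decodes with Windows-1252 (`cp1252`), the usual Western European ANSI code page; other locales would produce different garbage.

```python
# Python 3 sketch: reproduce the garbling, then undo it.
original = "être"

# UTF-8 stores ê as two bytes (0xC3 0xAA).
utf8_bytes = original.encode("utf-8")

# A tool that assumes Windows-1252 shows each byte as its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)   # Ãªtre

# Reversing the wrong decode recovers the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # être
```

The round trip is only a repair for strings that have already been mis-decoded; decoding the file as UTF-8 in the first place avoids the problem.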

You've pointed us at a compressed, UTF-8 encoded web resource, and the requests module is good at handling that sort of thing. So you could read the CSV with:

>>> import requests
>>> import csv
>>> resp = requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
...     stream=True)
>>> try:
...     # decode_unicode=True yields text lines decoded with the response's declared encoding
...     inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
...     # release the streamed connection explicitly
...     resp.close()
... 
>>> len(inf_lst)
5362
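If the CSV has been saved to disk instead of being fetched over HTTP, a similar result can be had by naming the encoding when opening the file. The following is a sketch for Python 3; the file name `verbs_sample.csv` and its two data rows are made up so the example runs on its own.

```python
import csv
import os
import tempfile

# Stand-in for the downloaded CSV: a tiny UTF-8 file with hypothetical rows.
sample_path = os.path.join(tempfile.mkdtemp(), "verbs_sample.csv")
with open(sample_path, "wb") as f:
    f.write("infinitive,note\nêtre,to be\nprésumer,to presume\n".encode("utf-8"))

# Name the encoding explicitly instead of relying on the platform default.
with open(sample_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

# First column of every row, as in the question's loop.
inf_lst = [row[0] for row in rows]
print(inf_lst[1])  # être
```

On Windows the default for `open()` is typically the ANSI code page, which is why leaving out `encoding="utf-8"` can reproduce the Ãªtre form.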

Mark Tolonen · Accepted Answer · 2016-12-17 07:23:45Z

0

You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.

To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:

import io
with io.open('out.csv','w',encoding='utf-8-sig') as f:
    f.write(u'être')
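As a quick check of what `utf-8-sig` does, this Python 3 sketch writes the same string to a temporary file (the path is chosen only for illustration), inspects the raw bytes, and reads the text back.

```python
import codecs
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# utf-8-sig writes the BOM once at the start of the file, then plain UTF-8.
with open(out_path, "w", encoding="utf-8-sig") as f:
    f.write("être")

with open(out_path, "rb") as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))  # True

# Reading back with utf-8-sig drops the BOM; plain utf-8 would keep it as \ufeff.
with open(out_path, encoding="utf-8-sig") as f:
    text = f.read()
print(text)  # être
```

The three BOM bytes (EF BB BF) are the marker Excel looks for when deciding to treat the file as UTF-8.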
