
I have a large CSV of French verbs that I am using in a program. In the CSV, verbs with accented characters contain codes instead of the actual accents:

être is Ãªtre, for example (at least when I open the file in Excel)

Here is the csv:

https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv

In Chrome and Firefox at least, the codes are converted to the correct accents. I was wondering if, once the string is imported into a variable in Python, e.g.

...
for row in reader:
        inf_lst.append(row[0])
verb = inf_lst[2338]

#(verb = Ãªtre)

if there was a straightforward/built-in method for printing it out with the correct Unicode to give "être"?

I am aware that you could do this by replacing the Ãª's with ê's in each string, but since this would have to be done for each possible accent, I was wondering if there was an easier way. Thanks.

  • Have you read this? docs.python.org/3/howto/unicode.html Commented Dec 16, 2016 at 20:43
  • Which version of Python are you using? Is it in a file or online like your example? Can you boil this down to a single-line example text holding the word you are interested in? Commented Dec 16, 2016 at 20:52

3 Answers


You can write a Unicode string literal by prefixing the string with 'u'.

>>> foo = u'être'
>>> print foo
être


1 Comment

That's how you write Unicode literals in Python 2.x, but I don't see how it helps when reading data from a file or the web.

It all comes down to the character encoding of the data. It's possible that the file is UTF-8 encoded and you are viewing it in a Windows tool that uses your local code page, which displays the same bytes differently. Reading and writing files is covered in the examples in the csv module documentation.
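To make the mismatch concrete, here is a minimal Python 3 sketch that reproduces the garbling and then reverses it. It assumes the viewing tool used Windows-1252 (cp1252) as its code page, which the question does not state; other code pages would produce different garbage.

```python
# Python 3 sketch: reproduce the garbling, then undo it.
original = "être"

# UTF-8 stores "ê" as the two bytes 0xC3 0xAA.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Windows-1252 (an assumed code page) turns
# each of the two bytes into its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # Ãªtre

# Reversing the wrong step recovers the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # être
```

The round trip is only a repair for strings that were already decoded with the wrong code page. Decoding the file as UTF-8 in the first place avoids the problem.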

You've given us a link to a zipped, UTF-8 encoded web page, and the requests module is good at handling that sort of thing. So you could read the CSV with:

>>> import requests
>>> import csv
>>> resp=requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
...     stream=True)
>>> try:
...     inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
...     resp.close()
... 
>>> len(inf_lst)
5362
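If the CSV has already been saved to disk, a similar result can be had without requests. The following is a sketch for Python 3 only; `read_infinitives` and the sample rows are made up for illustration, and the temporary file stands in for a local copy of the real data.

```python
import csv
import os
import tempfile


def read_infinitives(path, encoding="utf-8"):
    """Return the first column of every non-empty row in a CSV file."""
    # newline="" is what the csv module docs recommend for file objects.
    with open(path, encoding=encoding, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]


# Demo on a tiny stand-in file (made-up rows, not the real data set).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows([["infinitive"], ["être"], ["avoir"]])

inf_lst = read_infinitives(sample_path)
print(inf_lst[1])  # être
```

Passing the encoding explicitly matters on Windows, where `open` otherwise falls back to the locale's code page and produces the garbled text from the question.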



You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.

To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:

import io
with io.open('out.csv','w',encoding='utf-8-sig') as f:
    f.write(u'être')
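The same encoding works with the csv module. The sketch below assumes Python 3; the rows and the temporary output path are illustrative only. It writes a small CSV with `utf-8-sig`, checks that the file starts with the BOM bytes, and reads it back.

```python
import csv
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# "utf-8-sig" writes the BOM (bytes EF BB BF) before the first character.
with open(out_path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows([["être", "suis"], ["naître", "nais"]])

# The raw file starts with the BOM bytes.
with open(out_path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with "utf-8-sig" strips the BOM again; plain "utf-8" would
# leave U+FEFF attached to the first field.
with open(out_path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))

print(starts_with_bom, rows[0][0])  # True être
```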
