I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, I get the error below. How do I avoid it?

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
  File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>
  • I haven't fully read your question, but... If you have a load of bytes, you can decode them into a string using your_bytes.decode("UTF-8"). Commented Dec 1, 2016 at 19:08
    Which Python version? Commented Dec 1, 2016 at 19:09
  • @AntonisChristofides - From the traceback, I infer Python3. Commented Dec 1, 2016 at 19:42
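
To illustrate the first comment's point about decoding bytes, here is a minimal sketch. The sample word is an assumption chosen for illustration, not text from the asker's file. Its UTF-8 encoding happens to contain the byte 0x8d, which cp1252 leaves undefined, so decoding as cp1252 fails the same way as in the traceback, while decoding as UTF-8 succeeds.

```python
# Illustrative sample only (the word "Tamil" in Tamil script), not the asker's data.
text = "தமிழ்"

# Encode to the raw bytes that a UTF-8 file on disk would contain.
data = text.encode("utf-8")

# The byte 0x8d appears in the UTF-8 encoding of this word.
has_undefined_byte = 0x8D in data

# Decoding with the correct codec recovers the original string.
decoded = data.decode("utf-8")

# Decoding with cp1252 raises, because cp1252 has no mapping for 0x8d.
try:
    data.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
```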

1 Answer

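Since you are using Python 3, just add the encoding parameter to open(). Without it, open() decodes the file with the platform's default encoding, which on your machine is cp1252 (as the traceback shows) rather than UTF-8: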

corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
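
As a variation, the same read can be wrapped in a with block so the file is closed promptly. This is only a sketch: read_utf8 is a hypothetical helper name, and the demo writes its own temporary file with an assumed sample word instead of touching the path from the question.

```python
import os
import tempfile


def read_utf8(path):
    """Return the whole file as a str, decoding its bytes as UTF-8."""
    # encoding="utf-8" overrides the platform default (cp1252 in the question).
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demo on a temporary file; the sample word is an assumption for illustration.
sample = "தமிழ்"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(sample)

corpus = read_utf8(demo_path)
os.remove(demo_path)
```

With the real file, the call would be read_utf8(r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt").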

1 Comment

This only works in Python 3, where the built-in open() accepts an encoding parameter. In Python 2, use codecs.open (or io.open) instead.
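
A minimal sketch of the Python 2 route, written so that it also runs unchanged on Python 3. It uses io.open, which accepts encoding on Python 2.6+ as well; read_utf8_compat is a hypothetical helper name and the sample word is an assumption for illustration.

```python
import io
import os
import tempfile


def read_utf8_compat(path):
    """Read a UTF-8 text file on either Python 2.6+ or Python 3."""
    # io.open takes encoding= on both major versions;
    # codecs.open(path, encoding="utf-8") is the older alternative.
    with io.open(path, encoding="utf-8") as f:
        return f.read()


# Demo on a temporary file; escapes keep the source ASCII-only for Python 2.
sample = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

text = read_utf8_compat(demo_path)
os.remove(demo_path)
```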
