How to read a utf-8 encoded text file using Python

Question

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, I get the error below. How do I avoid it?

corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
  File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

I haven't fully read your question, but... If you have a load of bytes, you can decode them into a string using your_bytes.decode("UTF-8").
– byxor, Commented Dec 1, 2016 at 19:08
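A minimal sketch of what this comment describes, assuming the file is opened in binary mode so that read() returns bytes. The helper name read_bytes_then_decode and the temporary demo file are made up for illustration:

```python
import os
import tempfile

def read_bytes_then_decode(path):
    """Read raw bytes, then decode them explicitly as UTF-8."""
    with open(path, "rb") as f:   # "rb" = binary mode, no implicit decoding
        raw = f.read()
    return raw.decode("UTF-8")

# Demo on a throwaway file (hypothetical stand-in for the real corpus).
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil" in Tamil script
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("UTF-8"))

try:
    decoded = read_bytes_then_decode(demo_path)
finally:
    os.remove(demo_path)
```

Because the file is read as bytes, the platform's default text encoding never gets involved.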

Antonis Christofides · Accepted Answer · 2019-08-13 15:47:01Z


Since you are using Python 3, just add the encoding parameter to open():

corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
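Why this helps: the traceback shows Python decoding the file with cp1252, which is the default it picked up from the Windows locale because open() was called without an encoding. Byte 0x8d has no mapping in cp1252, hence the error. Here is a small sketch that reproduces the failure and the fix on a temporary file. The helper name read_utf8 and the sample text are assumptions for illustration, not part of the original question:

```python
import os
import tempfile

def read_utf8(path):
    """Read a whole text file as UTF-8, closing it afterwards."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Hypothetical sample: a Tamil word whose UTF-8 bytes include 0x8d.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(sample)

try:
    # Explicit UTF-8 decoding round-trips the text.
    text = read_utf8(demo_path)

    # Forcing cp1252 mimics the Windows default from the traceback.
    try:
        with open(demo_path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(demo_path)
```

Using a with block instead of a bare open(...).read() also makes sure the file is closed promptly.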

edited Aug 13, 2019 at 15:47

answered Dec 1, 2016 at 19:14

Antonis Christofides


1 Comment

Mark Ransom Over a year ago

Only works in Python 3+. For Python 2, use codecs.open.
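A rough sketch for code that has to run on Python 2 as well. It uses io.open rather than the codecs.open mentioned in the comment; that swap is my assumption, chosen because io.open accepts the same encoding argument on Python 2.6+ and is the same function as the built-in open() on Python 3. On Python 2, codecs.open(path, encoding="utf-8") is called the same way. The helper name and demo file are made up for illustration:

```python
import io
import os
import tempfile

def read_utf8_compat(path):
    """Read a UTF-8 text file on both Python 2 and Python 3."""
    with io.open(path, encoding="utf-8") as f:
        return f.read()   # unicode on Python 2, str on Python 3

# Demo on a throwaway file (hypothetical stand-in for the real corpus).
sample = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

try:
    round_trip = read_utf8_compat(demo_path)
finally:
    os.remove(demo_path)
```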
