chardet in python3 and unknown file encoding

Question

I use chardet to detect my file's encoding, but I get this error:

fh= open("file", mode="r")
sc= chardet.detect(fh)

Traceback (most recent call last):
  File "/home/alireza/test.py", line 19, in <module>
    sc= chardet.detect(fh)
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 65, in feed
    aLen = len(aBuf)
TypeError: object of type '_io.TextIOWrapper' has no len()

And I can't read the file in text mode without knowing its encoding:

fh= open("file", mode="r").read()
sc= chardet.detect(fh)

Traceback (most recent call last):
  File "/home/alireza/workspacee/makecdown/test.py", line 21, in <module>
    fh= open("910.srt", mode="r").read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 34: invalid continuation byte

How can I use chardet without opening the file? Or is there any way to find out a file's encoding before or after opening it?

user1251007 · Accepted Answer · 2012-11-21 12:42:02Z

1

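Try opening the file in binary mode and passing the bytes you read to chardet.detect, which expects bytes rather than a file object:

fh = open("file", mode="rb").read()
sc = chardet.detect(fh)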
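As a minimal sketch of the binary-mode approach: the helper names `read_sample` and `guess_encoding` and the 64 KiB `limit` are illustrative assumptions, not part of chardet. Only `chardet.detect`, which takes bytes and returns a dict containing `encoding` and `confidence`, comes from the library.

```python
def read_sample(path, limit=64 * 1024):
    """Return up to `limit` raw bytes from the start of the file.

    `limit` is an arbitrary sample size chosen for this sketch.
    """
    # Binary mode ("rb") skips decoding entirely, so reading cannot raise
    # UnicodeDecodeError, and the result is a bytes object that has a len().
    with open(path, mode="rb") as fh:
        return fh.read(limit)


def guess_encoding(path):
    """Return chardet's guess: a dict with 'encoding' and 'confidence' keys."""
    # chardet is a third-party package (pip install chardet); it is imported
    # here so that read_sample() stays usable without it.
    import chardet

    return chardet.detect(read_sample(path))
```

A call such as `guess_encoding("910.srt")` returns the guess together with its confidence. A low confidence value (like the 0.30 reported in the comments) suggests the guess should not be trusted without checking the decoded text.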

Command Line Tool

If this does not work, try chardet's command-line tool. Description from https://github.com/erikrose/chardet:

chardet comes with a command-line script which reports on the encodings of one or more files:
% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

3 Comments

alireza Over a year ago

I used "rb" mode and it works, but chardet detects the wrong encoding: MacCyrillic (confidence: 0.30), and converting to UTF-8 from that encoding gives unusable output. The real encoding is windows-1256, and converting from that to UTF-8 worked. Is there another way to find out a file's encoding and convert it to UTF-8?

user1251007 Over a year ago

What about the files you are trying to convert? chardet guesses the encoding based on the language, so if the text is not meaningful even with the correct encoding, chardet might fail.

alireza Over a year ago

I use Python 3, and the command-line tool gives the same output as calling chardet from code (see my first comment). chardet does not work for my language (Persian) with windows-1256 or other Arabic-encoded text. Thanks for your support. Is there another way to find out a file's encoding and convert it to UTF-8?
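When the encoding is already known, as with the windows-1256 file described in these comments, the conversion to UTF-8 needs no detection step. A rough sketch using only the standard library follows; the function name `convert_to_utf8` and its parameters are hypothetical, and `cp1256` is Python's codec name for windows-1256.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1256"):
    """Re-encode a text file from src_encoding to UTF-8.

    The default "cp1256" is an assumption taken from the discussion above.
    """
    # Read the raw bytes and decode them explicitly with the known encoding.
    with open(src_path, mode="rb") as src:
        text = src.read().decode(src_encoding)
    # Write the bytes back out as UTF-8; binary mode keeps line endings as-is.
    with open(dst_path, mode="wb") as dst:
        dst.write(text.encode("utf-8"))
```

Note that a wrong `src_encoding` does not always raise an error: another single-byte codec can decode the same bytes into different, unreadable characters, which may be why converting from the MacCyrillic guess produced unusable text.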

pepr · Accepted Answer · 2012-11-21 15:41:40Z

0

Not a direct answer, but you can find a description of how it works in Python 3 here: http://getpython3.com/diveintopython3/case-study-porting-chardet-to-python-3.html. After studying that, you may find a way to detect another specific encoding.

The code was originally derived from Mozilla Seamonkey, so you may find more information there as well, or look for a more advanced Python package related to Seamonkey.
