
I use chardet to detect my file's encoding, but this error occurs:

fh= open("file", mode="r")
sc= chardet.detect(fh)

Traceback (most recent call last):
  File "/home/alireza/test.py", line 19, in <module>
    sc= chardet.detect(fh)
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 65, in feed
    aLen = len(aBuf)
TypeError: object of type '_io.TextIOWrapper' has no len()

And I can't read the file without knowing its encoding:

fh= open("file", mode="r").read()
sc= chardet.detect(fh)

Traceback (most recent call last):
  File "/home/alireza/workspacee/makecdown/test.py", line 21, in <module>
    fh= open("910.srt", mode="r").read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 34: invalid continuation byte

How can I use chardet without opening the file? Or is there any way to find out a file's encoding before or after opening it?

2 Answers


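Try opening the file in binary mode, so that Python does not attempt to decode it. chardet.detect expects bytes, so pass it the result of fh.read() rather than the file object: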

fh= open("file", mode="rb")
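
A minimal sketch of that suggestion, assuming the chardet package is installed. The helper name detect_file_encoding and its sample_size parameter are illustrative and not part of chardet; only chardet.detect is the library's own call.

```python
def detect_file_encoding(path, sample_size=64 * 1024):
    """Return chardet's guess for the file at `path` (hypothetical helper)."""
    # Imported here so the sketch stays importable without chardet installed.
    import chardet

    # Binary mode: no decoding happens, so no UnicodeDecodeError can be raised.
    with open(path, "rb") as fh:
        raw = fh.read(sample_size)

    # chardet.detect takes bytes and returns a dict that includes
    # the "encoding" and "confidence" keys.
    return chardet.detect(raw)

# Example call (the file name is a placeholder):
#   guess = detect_file_encoding("file")
#   print(guess["encoding"], guess["confidence"])
```

Reading only a sample keeps detection fast on large files, but a short sample may lower the confidence of the guess.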

Command Line Tool

If this does not work, try the command line tool of chardet. Description from https://github.com/erikrose/chardet:

chardet comes with a command-line script which reports on the encodings of one or more files:

% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0

3 Comments

I used "rb" mode and it works, but chardet detects the wrong encoding: MacCyrillic (confidence: 0.30), and converting from that to UTF-8 gives unusable output. The real encoding is windows-1256, and converting from it to UTF-8 works. Is there another way to find out a file's encoding and convert it to UTF-8?
What about the files you are trying to convert? chardet guesses the encoding based on the language, so if the text is not meaningful even with the correct encoding, chardet might fail.
I use Python 3, and the command-line tool gives the same output as calling chardet from code (see my first comment). chardet does not work for my language (Persian) with windows-1256 or other Arabic encodings. Thanks for your support. Is there another way to find out a file's encoding and convert it to UTF-8?
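
For the case raised in these comments, where the detector's guess is unreliable but the likely encoding is known, one possible approach is to skip detection and try a short list of candidate encodings in order. This is only a sketch using the standard library: the function name convert_to_utf8 and the candidate list are assumptions, and "cp1256" is Python's codec name for windows-1256. Order matters, because a single-byte codec such as cp1256 accepts almost any byte sequence, so the stricter UTF-8 must be tried first.

```python
def convert_to_utf8(src_path, dst_path, candidates=("utf-8", "cp1256")):
    """Decode src_path with the first candidate that works and write UTF-8.

    Hypothetical helper: returns the name of the encoding that succeeded.
    """
    with open(src_path, "rb") as fh:
        raw = fh.read()

    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not this encoding; try the next candidate.
        # Write bytes directly so line endings are preserved as they were.
        with open(dst_path, "wb") as out:
            out.write(text.encode("utf-8"))
        return encoding

    raise ValueError("none of the candidate encodings could decode the file")
```

A successful decode does not prove the guess is right, so it is worth reading the converted file to confirm that the text is meaningful.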

Not a direct answer, but you can find a description of how chardet works in Python 3 at http://getpython3.com/diveintopython3/case-study-porting-chardet-to-python-3.html. After studying it, you may find a way to detect other specific encodings.

The code was originally derived from Mozilla Seamonkey, so you may find more information there as well, or look for a more advanced Python package related to Seamonkey.
