5

I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is those files don't have the same encoding. They could be UTF-8, UTF-8 with a BOM, or UTF-16.

Is there any way to read those files without knowing their encoding?

1 Comment

  • In the most general sense, no. But you can use various heuristics to have a good go at it; it's very dependent on your specific data set. Commented Dec 23, 2015 at 3:25

2 Answers

6

You can read those files in binary mode and use the chardet library to detect the character encoding. Once chardet has guessed the encoding, you can decode the bytes with it. Note that chardet only guesses, so the module has limitations.

As an example:

from chardet import detect

with open('your_file.txt', 'rb') as ef:
    raw = ef.read()

guess = detect(raw)   # a dict with 'encoding' and 'confidence' keys
text = raw.decode(guess['encoding'])

1 Comment

Thanks Andrey. That does help.
0

If it is indeed always one of these three, then it is easy: if the file decodes as UTF-8, it is probably UTF-8; otherwise it will be UTF-16. Python's utf-8-sig codec also discards the BOM automatically if one is present.

You can use a try ... except block to try both:

try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')   # 'from' is a reserved word, so use src/dst
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')

If other encodings are present as well (like ISO-8859-1), then forget it: there is no 100% reliable way of figuring out the encoding. But you can guess; see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?

1 Comment

@ClaytonWahlstrom yes, that is also what the linked question says. But for this simple case it is not necessary.
