5

I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is those files don't have the same encoding. They could be UTF-8, UTF-8 with a BOM, or UTF-16.

Is there any way to read those files without knowing their encoding?

1 Comment

  • In the most general sense, no. But you can use various heuristics to have a good go at it; it's very dependent on your specific data set. Commented Dec 23, 2015 at 3:25

2 Answers

6

You can read those files in binary mode and use the chardet library to detect the character encoding. Once chardet has guessed the encoding, you can decode the bytes with it. Note that chardet only guesses, so the module has limitations.

As an example:

from chardet import detect

with open('your_file.txt', 'rb') as ef:
    raw = ef.read()

guess = detect(raw)   # a dict with 'encoding' and 'confidence' keys
text = raw.decode(guess['encoding'])

1 Comment

Thanks Andrey. That does help.
0

If it is indeed always one of these three, then it is easy: if the file decodes as UTF-8, it is probably UTF-8; otherwise it will be UTF-16. Python's utf-8-sig codec also discards the BOM automatically if one is present.

You can use a try ... except block to try both:

try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')   # 'from' is a reserved word, so use src/dst
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')

If other encodings are present as well (like ISO-8859-1), then forget it: there is no 100% reliable way of figuring out the encoding. But you can guess; see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?

1 Comment

@ClaytonWahlstrom yes, that is also what the linked question says. But for this simple case it is not necessary.
