Detecting char encoding on direct file uploads PHP

Question

on my site I allow for direct text file uploads. These files are then stored on the server, and displayed on the website. I use UTF-8 on the site.

Now I run into trouble when people upload non-UTF-8 files which contain special chars, such as é.

I've been doing some testing. Made 2 text files, both containing the same word fiancée. One encoded UTF-8 and one encoded ISO 8859-2.

The UTF-8 one uploads fine, and shows the text correct, but the ISO 8859-2 shows as fianc�e.

Now I've tried to detect the uploaded file content with mb_detect_encoding, but whatever file I throw at it, it always detect UTF-8.

I noticed that I can use utf8_encode to convert the ISO 8859-2 files to valid UTF-8, but this only works on non-UTF files. And as I currently cannot detect non-UTF files, I cannot use the utf8_encode function, as it messes up valid UTF-8 files.

Hope this makes sense :)

So my question is, how can I detect files that are for sure not UTF-8 encoded to start with, so that I can use the utf8_encode function on them.

deceze · Accepted Answer · 2016-11-10 10:46:26Z

3

You cannot. Welcome to encodings.

Seriously though, files are just binary blobs. The bits and bytes in the file could mean anything at all; it could be images, CAD data or, perhaps, text. It depends on how you interpret the bytes. For text files that specifically means with which encoding you interpret them. There's nothing in the files themselves that tells you the correct encoding, you have to know it. Typically you want to know it from metadata accompanying the file. In the case of random user uploads though, there is no metadata, and/or it wouldn't be reliable. So you cannot "know".

The next step would be to guess, but that is obviously not foolproof. You can rule out certain encodings, for example if a file does not validate as UTF-8 (mb_check_encoding($data, 'UTF-8') == false), then it cannot be UTF-8. However, any single byte encoding will validate as any other single byte encoding. It's impossible to distinguish ISO-8859-1 from ISO-8859-2 this way, the bytes are equally valid in both. It's just that the characters that show up may not be the ones you want. To detect that automatically you need a statistical language analyser which can tell you that this character probably shouldn't show up in that word for it to be grammatical. Obviously for that to work you need to know the language used in the file, or you need to detect that first… And even then this is hardly foolproof.

The sanest way is to ask the user. Accept the upload, perhaps do some upfront testing on which encodings can be ruled out, then ask the user which of a bunch of possible encodings the file is in. Present them the result, what the file looks like when interpreted as the chosen encoding, let the user confirm that it looks alright. Many decent text editors do this when you open a file with an ambiguous encoding.

answered Nov 10, 2016 at 10:46

deceze♦

525k89 gold badges806 silver badges954 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Mr.Boon Over a year ago

Thank you. I was afraid of this. In the mean time I found: github.com/neitanod/forceutf8 which is actually working pretty well for me.

deceze Over a year ago

By definition that library cannot work correctly and you'll run into all sorts of edge cases where it won't. Just so you're aware of it…

Collectives™ on Stack Overflow

Detecting char encoding on direct file uploads PHP

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related