
I need to process strings in my PHP script using regular expressions, but there is a problem: different strings have different encodings. If a string contains only ASCII characters, mb_detect_encoding returns 'ASCII'. But if it contains Russian characters, for example, mb_detect_encoding returns 'UTF-8'. Checking the encoding of each string manually is not a good idea, I suppose. So the question is: is it correct to use preg_replace with the Unicode modifier (u) on ASCII strings? Is it right to write code such as preg_replace("/[^_a-z]/u", "", $string); for both ASCII and UTF-8 strings?

3 Answers

3

This would be no problem if the only two possibilities were "UTF-8" and "ASCII", but that's not the case.

If PHP isn't using UTF-8, older versions typically fall back to ISO-8859-1, which is NOT ASCII. It is a superset of ASCII: the first 128 code points (0-127) are identical, and the upper half adds further characters. Some characters, for example the Swedish å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but with different byte sequences! For a pattern that only deals with ASCII characters this may not matter much without the u modifier, but with it the subject must be valid UTF-8, so please keep this in mind when working with different encodings.
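
To make the difference concrete, here is a minimal sketch that prints the bytes of "å" in both encodings. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// "å" is U+00E5. ISO-8859-1 stores it as the single byte 0xE5,
// while UTF-8 stores it as the two bytes 0xC3 0xA5.
$latin1 = "\xE5";
$utf8   = "\xC3\xA5";

echo bin2hex($latin1), "\n"; // e5
echo bin2hex($utf8), "\n";   // c3a5

// Converting between the encodings changes the bytes, not the character.
var_dump(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8); // bool(true)

// Plain ASCII letters are the same single bytes in both encodings.
var_dump(mb_convert_encoding('abc', 'UTF-8', 'ISO-8859-1') === 'abc');   // bool(true)
```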

You should really, really try to know which character set your strings are in, without relying on the magic of mb_detect_encoding (it is not a guarantee, just a good guess). For example, strings fetched over HTTP usually come with a character set specified in the Content-Type header.
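
One way to act on this, sketched under assumptions: instead of guessing with mb_detect_encoding, validate the input as UTF-8 and convert only when validation fails. The helper name to_utf8 and the ISO-8859-1 fallback are hypothetical choices for illustration; the fallback should be whatever encoding you actually know your legacy data to be in.

```php
<?php
// Hypothetical helper: normalise a string to UTF-8 at the edge of the script.
// $assumedLegacy is an assumption about the source of non-UTF-8 data.
function to_utf8(string $s, string $assumedLegacy = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8 (pure ASCII is valid UTF-8 too)
    }
    return mb_convert_encoding($s, 'UTF-8', $assumedLegacy);
}

var_dump(to_utf8('abc') === 'abc');          // bool(true), ASCII passes through
var_dump(to_utf8("\xC3\xA5") === "\xC3\xA5"); // bool(true), valid UTF-8 is untouched
var_dump(to_utf8("\xE5") === "\xC3\xA5");     // bool(true), lone 0xE5 is converted
```

Note that a check like this cannot tell which legacy encoding a string was in; it only tells you the bytes are not valid UTF-8.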


4 Comments

>strings fetched through HTTP do have a character set specified in the HTTP header. I send the <meta http-equiv="content-type" content="text/html; charset=utf-8"> header. Still, if I use mb_detect_encoding on $_POST variables, it returns 'ASCII'. Does that mean the guess is wrong and the string is UTF-8 encoded?
No, if it says ASCII, it is most probably ASCII, meaning that all characters have code points below 128 (almost every encoding out there shares these code points for backwards compatibility). This means that ASCII detection should be entirely reliable, while detection of other encodings may not be. But be aware that there are other encodings as well, and that PHP has traditionally assumed ISO-8859-1 (a superset of ASCII that also defines characters in the 128-255 range) when not using UTF-8. ISO-8859-1 has also historically been the default on the web at large when no encoding is specified.
Sorry, I still don't understand. Does sending the above-mentioned header <meta http-equiv="content-type" content="text/html; charset=utf-8"> mean that my server receives all user data UTF-8 encoded? 1) If yes, why are some of the strings ASCII encoded? If that is because PHP tries to allocate less memory when possible, I guess a string can be either ASCII or UTF-8 encoded, nothing else. If so, I have no more questions. 2) If no, how can I "disable" all encodings except Unicode?
If you are talking about form posts, yes, the posted form data should be in the same charset as your web page (all the major browsers do that). 1) UTF-8 and ASCII overlap in their first 128 code points, so for example the letters a-z and the digits 0-9 have exactly the same code points, and the same bytes, in both encodings. mb_detect_encoding thus cannot distinguish the two charsets for such strings, because they produce exactly the same binary data. With all this said, if the strings all come from your own web forms, you can count on them being in the same charset as your web page.
0

Yes, sure. You can always use the Unicode modifier on ASCII strings; it will affect neither the results nor, noticeably, the performance.
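
A quick sketch to check this with the pattern from the question. The sample strings are made up for illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// ASCII input: the result is the same with and without the u modifier.
$ascii   = 'Hello_World 123';
$plain   = preg_replace('/[^_a-z]/', '', $ascii);
$unicode = preg_replace('/[^_a-z]/u', '', $ascii);

var_dump($plain);              // string(9) "ello_orld"
var_dump($plain === $unicode); // bool(true)

// UTF-8 input: "привет_abc", written with escapes so the file encoding
// does not matter. With the u modifier each Cyrillic letter is matched
// as one character and removed.
$russian       = "\u{43f}\u{440}\u{438}\u{432}\u{435}\u{442}_abc";
$russianResult = preg_replace('/[^_a-z]/u', '', $russian);

var_dump($russianResult);      // string(4) "_abc"
```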

3 Comments

Does preg_replace automatically convert all ASCII parameters to Unicode?
ASCII characters (code points 0-127) are encoded identically in UTF-8, so no conversion is needed.
Now I understand. I thought the comparison was not by the characters' code points, but byte by byte in the case of an ASCII string and by every 2 bytes in the case of UTF-8.
0

The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.

However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8, the characters with the leftmost bit set (values 0x80-0xFF) are not encoded the same way in UTF-8, and it would not be appropriate to use the PREG "u" modifier on such a string unless you convert it to UTF-8 first.
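
A small sketch of what goes wrong in that case: with the "u" modifier, preg_replace returns NULL when the subject is not valid UTF-8, and preg_last_error() reports the reason. The sample string is made up, and the source encoding passed to mb_convert_encoding is an assumption you would have to know from context.

```php
<?php
// "smörgås_bord" encoded as ISO-8859-1: ö is byte 0xF6, å is byte 0xE5.
$latin1 = "sm\xF6rg\xE5s_bord";

// These bytes are not valid UTF-8, so the call fails instead of replacing.
$result = preg_replace('/[^_a-z]/u', '', $latin1);
var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// After converting to UTF-8 (assuming the source really is ISO-8859-1),
// the same pattern works and removes ö and å as whole characters.
$utf8  = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$fixed = preg_replace('/[^_a-z]/u', '', $utf8);
var_dump($fixed);                                    // string(10) "smrgs_bord"
```

Checking for a NULL return value is worthwhile here, since a NULL silently cast to an empty string can look like a legitimate result.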
