I need to process strings in my PHP script using regular expressions, but there is a problem: different strings have different encodings. If a string contains only ASCII characters, mb_detect_encoding returns 'ASCII'. If it contains Russian characters, for example, mb_detect_encoding returns 'UTF-8'. Checking the encoding of each string manually doesn't seem like a good idea.
So the question is: is it correct to use preg_replace with the Unicode (u) modifier on ASCII strings? Is it right to write code such as preg_replace("/[^_a-z]/u", "", $string); for both ASCII and UTF-8 strings?
3 Answers
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII (it is a superset of ASCII in that the first 128 characters, 0 to 127, are the same). Some characters, for example the Swedish å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but with different byte sequences! I don't think this matters much for preg_* functions, so it may not be applicable to your question, but please keep it in mind when working with different encodings.
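To make the byte-level difference concrete, here is a small sketch (assuming the mbstring extension is loaded; the variable names are just for illustration). It takes "ä" as a single ISO-8859-1 byte and converts it to UTF-8:

```php
<?php
// "ä" is stored as the single byte 0xE4 in ISO-8859-1.
$latin1 = "\xE4";

// The same character takes two bytes once converted to UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n"; // e4
echo bin2hex($utf8), "\n";   // c3a4

// A lone 0xE4 byte is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```

So the same text can be perfectly valid in one encoding and invalid in the other, which is why guessing is risky.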
You should really, really try to know which character set your strings are in, without relying on the magic of mb_detect_encoding (it is not a guarantee, just a good guess). For example, strings fetched through HTTP usually have a character set specified in the HTTP headers.
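One way to act on this, as a rough sketch only: when the source tells you its charset, validate against that charset and convert to UTF-8 once, at the boundary. The helper name to_utf8 and its parameters are made up for this example; mb_check_encoding and mb_convert_encoding are the real mbstring functions.

```php
<?php
// Hypothetical helper: normalise input to UTF-8 using the charset
// that the source declared (e.g. taken from an HTTP header).
function to_utf8(string $input, string $declaredCharset = 'UTF-8'): string
{
    // Validate against the declared charset instead of guessing.
    if (!mb_check_encoding($input, $declaredCharset)) {
        throw new InvalidArgumentException("Input is not valid $declaredCharset");
    }
    return mb_convert_encoding($input, 'UTF-8', $declaredCharset);
}

// "café" as ISO-8859-1 bytes, converted to UTF-8.
$converted = to_utf8("caf\xE9", 'ISO-8859-1');
echo bin2hex($converted), "\n"; // 636166c3a9
```

After this step everything downstream can assume UTF-8, so the u modifier is safe to use.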
4 Comments
<meta http-equiv="content-type" content="text/html; charset=utf-8"> header. Anyway, if I use mb_detect_encoding for $_POST variables, it returns 'ascii'. Does that mean that the guess is wrong and the string is UTF-8 encoded?
<meta http-equiv="content-type" content="text/html; charset=utf-8"> mean that my server receives all user data UTF-8 encoded? 1) If yes, why are some of them ASCII encoded? If that is because PHP tries to allocate less memory when possible, I guess that a string can be either ASCII or UTF-8 encoded, nothing else. If so, I have no more questions. 2) If no, how can I "disable" all encodings except Unicode?
Yes, sure, you can always use the Unicode modifier, and it will affect neither the results nor the performance.
3 Comments
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.
However, if you have an 8-bit extension of ASCII such as ISO-8859-1, Windows-1252 or HP-Roman8, the characters with the high bit set (values 0x80 to 0xFF) are not encoded the same way in UTF-8, and it would not be appropriate to use the PREG "u" modifier.
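A quick sketch of both cases, using the pattern from the question (the sample strings are made up). With the "u" modifier, PCRE validates the subject as UTF-8, so a stray high-bit byte makes the call fail rather than silently misbehave:

```php
<?php
$pattern = '/[^_a-z]/u';

// Pure ASCII: the u modifier changes nothing.
$ascii = preg_replace($pattern, '', 'hello_world 123');
var_dump($ascii); // string(11) "hello_world"

// Valid UTF-8 (two Cyrillic letters written as escapes): works as expected.
$utf8 = preg_replace($pattern, '', "abc_\u{043F}\u{0440}");
var_dump($utf8); // string(4) "abc_"

// ISO-8859-1 byte 0xE4 is not valid UTF-8: preg_replace returns NULL.
$latin1 = preg_replace($pattern, '', "abc_\xE4");
var_dump($latin1);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

So it is worth checking the return value for NULL whenever the input encoding is not guaranteed.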