php preg_replace: unicode modifier for ascii strings

Question

I need to process strings in my PHP script using regular expressions, but the strings have different encodings. If a string contains only ASCII characters, mb_detect_encoding returns 'ASCII'. If it contains Russian characters, for example, mb_detect_encoding returns 'UTF-8'. Checking the encoding of each string manually does not seem like a good idea. So the question is: is it correct to use preg_replace with the Unicode modifier (u) on ASCII strings? Is it right to write code such as preg_replace("/[^_a-z]/u", "", $string); for both ASCII and UTF-8 strings?

Emil Vikström · Accepted Answer · 2012-04-02 14:46:06Z

3

This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.

If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII: it's a superset of ASCII in that the first 128 characters are identical. Some characters, for example the Swedish å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but as different byte sequences! I don't think this matters much for the preg_* functions, so it may not be applicable to your question, but please keep it in mind when working with different encodings.
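
To make the ISO-8859-1 versus UTF-8 difference concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and the characters are written as explicit bytes so the result does not depend on how the source file itself is saved.

```php
<?php
// "å" as explicit bytes in each encoding.
$latin1 = "\xE5";       // ISO-8859-1: one byte
$utf8   = "\xC3\xA5";   // UTF-8: two bytes

// Same character (U+00E5), different byte sequences.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n";     // e5
echo bin2hex($converted), "\n";  // c3a5
var_dump($converted === $utf8);  // bool(true)
```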

You should really, really try to know which character set your strings are in, without relying on the magic of mb_detect_encoding (its result is not a guarantee, just a good guess). For example, strings fetched over HTTP usually have their character set specified in the HTTP Content-Type header.
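
One possible way to act on this, sketched under the assumption that the application expects UTF-8 everywhere: validate the input against that one encoding instead of guessing among many. The helper name requireUtf8 is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: accept a string only if it is valid UTF-8.
// Pure ASCII input passes too, since ASCII bytes are valid UTF-8.
function requireUtf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

$ascii    = requireUtf8('hello_world');          // plain ASCII is accepted
$cyrillic = requireUtf8("\xD0\xBF\xD1\x80");     // two Cyrillic letters in UTF-8

try {
    requireUtf8("\xE5");                         // lone ISO-8859-1 byte
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;                            // invalid UTF-8 is refused
}
```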

4 Comments

user1235446 Over a year ago

Regarding the character set specified in the HTTP header: my page includes <meta http-equiv="content-type" content="text/html; charset=utf-8">. Still, if I run mb_detect_encoding on $_POST variables, it returns 'ASCII'. Does that mean the guess is wrong and the string is UTF-8 encoded?

Emil Vikström Over a year ago

No, if it says ASCII, it's most probably ASCII, meaning that all characters have code points less than 128 (almost every encoding out there shares these code points for backwards compatibility). This means that ASCII detection should be entirely correct, while detection of other encodings may not be. But be aware that there are other encodings as well, and that the default in PHP is ISO-8859-1 (a superset of ASCII defining characters 128-255) if not UTF-8. ISO-8859-1 is also the default on the web at large if no encoding is specified.

user1235446 Over a year ago

Sorry, I still don't understand. Does sending the above-mentioned <meta http-equiv="content-type" content="text/html; charset=utf-8"> mean that my server receives all user data UTF-8 encoded? 1) If yes, why are some strings ASCII encoded? If that is because PHP tries to allocate less memory when possible, I guess a string can be either ASCII or UTF-8 encoded, nothing else. If so, I have no more questions. 2) If no, how can I "disable" all encodings except Unicode?

Emil Vikström Over a year ago

If you are talking about form posts, yes, the posted form data should be in the same charset as your webpage (all the major browsers do that). 1) UTF-8 and ASCII overlap in their first 128 code points, so for example the letters a-z and the digits 0-9 are encoded as exactly the same bytes in both. mb_detect_encoding therefore cannot distinguish the two charsets for such strings, because they produce exactly the same binary data. With all this said, if the strings all come from your own web forms, you can count on them being in the same charset as your webpage.

Alex Amiryan · Accepted Answer · 2012-04-02 14:43:27Z

0

Yes, sure. You can always use the Unicode modifier, and it will affect neither the results nor performance.
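
A quick check of this claim for the pattern from the question, on a pure ASCII subject. This is only a sketch; the sample string is made up for illustration.

```php
<?php
$string = 'Hello_World 123';

// Pattern from the question, with and without the u modifier.
$withU    = preg_replace('/[^_a-z]/u', '', $string);
$withoutU = preg_replace('/[^_a-z]/',  '', $string);

var_dump($withU);                // string(9) "ello_orld"
var_dump($withU === $withoutU);  // bool(true)
```

Note that this only demonstrates equal results for ASCII input. It says nothing about subjects that are not valid UTF-8.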

3 Comments

user1235446 Over a year ago

Does preg_replace automatically convert all ASCII parameters to Unicode?

Emil Vikström Over a year ago

ASCII characters (code points 0-127) are encoded identically in UTF-8, so no conversion is needed.

user1235446 Over a year ago

Now I understand. I thought the comparison was done not by the characters' code points, but byte by byte for an ASCII string and two bytes at a time for a UTF-8 string.

Murray McDonald · Accepted Answer · 2012-04-02 14:46:57Z

0

The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.

However, if you have a "supplemented" 8-bit character set based on ASCII, such as ISO-8859-1, Windows-1252 or HP-Roman8, the characters with the leftmost bit set (values 0x80-0xFF) are not encoded the same way in UTF-8, and it would not be appropriate to use the PREG "u" modifier on such strings.
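
A small sketch of what goes wrong in that case. With the u modifier the subject must be valid UTF-8, otherwise preg_replace returns null. The sample word and the conversion step are illustrative assumptions: the source encoding has to be known, and mbstring must be available.

```php
<?php
// "blåbär_x" encoded as ISO-8859-1: å = 0xE5, ä = 0xE4 (not valid UTF-8).
$latin1 = "bl\xE5b\xE4r_x";

// Invalid UTF-8 subject with the u modifier: null plus an error code.
$failed = preg_replace('/[^_a-z]/u', '', $latin1);
$error  = preg_last_error();
var_dump($failed);                         // NULL
var_dump($error === PREG_BAD_UTF8_ERROR);  // bool(true)

// Converting to UTF-8 first makes the same pattern work.
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$result = preg_replace('/[^_a-z]/u', '', $utf8);
var_dump($result);                         // string(6) "blbr_x"
```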
