PHP preg_replace with UTF-8 not working

Question

Why is this preg_replace not working?

FYI, the PHP script is saved as UTF-8 without BOM. For easier testing, the function here removes all matches of the pattern (what I will actually do is the inverse: remove all non-matches). Note also that the ā character is not in my regex, so it should be the only character left behind.

$string='The Story of Jewād';
echo preg_replace('@([!"#$&’\(\)\*\+,\-\./0123456789:;<=>\?ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_‘abcdefghijklmnopqrstuvwxyz\{\|\}~¡¢£⁄¥ƒ§¤“«‹›ﬁﬂ–†‡·¶•‚„”»…‰¿`´ˆ˜¯˘˙¨˚¸˝˛ˇ—ÆªŁØŒºæıłøœß÷¾¼¹×®Þ¦Ð½−çð±Çþ©¬²³™°µ ÁÂÄÀÅÃÉÊËÈÍÎÏÌÑÓÔÖÒÕŠÚÛÜÙÝŸŽáâäàåãéêëèíîïìñóôöòõšúûüùýÿž€\'])@u','',$string);

The result I get is $string unchanged. Why would this be?

Try \pL+ instead of listing accented letters individually.
– mario, Commented Mar 16, 2013 at 15:54
Might it not be easier to write a regex that matches the characters you do want to allow, rather than listing all the disallowed ones? Also, for digits you can use \d, and for contiguous ranges you can use things like A-Z. That will make the expression shorter and easier to manage.
– Spudley, Commented Mar 16, 2013 at 15:56
@Spudley, yes, that is what I am doing. The example above is inverted for easy testing.
– Alasdair, Commented Mar 16, 2013 at 16:09
@mario, I can't use \pL+ because this list is specific: it is all the characters available in a specific font I am using.
– Alasdair, Commented Mar 16, 2013 at 16:09

Mostafa Shahverdy · Accepted Answer · 2013-03-16 16:20:01Z

3

The reverse case works: a pattern that matches only ā removes it as expected:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
<?php 

$string='The Story of Jewād';
// The u modifier makes PCRE treat ā as one UTF-8 character rather than two bytes
echo preg_replace('@([ā])@u','',$string);

?>
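
One likely candidate for the failure in the question's pattern, offered as a hypothesis rather than a confirmed diagnosis, is the `\[\\\]` sequence. Inside a single-quoted PHP string, `\\` collapses to one backslash before PCRE ever sees it, so the regex engine receives `\[\\]` and the character class closes early. Everything after it then has to match as literal text, which never happens, so the string comes back unchanged. The sketch below reproduces this with a reduced pattern; the sample string and both patterns are made up for illustration.

```php
<?php
// Hypothetical reduced test case; the sample string and patterns are
// illustrative, not the asker's full character list.
$string = 'a[b\\c]d'; // the literal text: a[b\c]d

// In a single-quoted PHP string, two backslashes collapse to one, so
// this pattern reaches PCRE as [\[\\]x] : the class ends at the first
// unescaped ']' and the rest ('x]') is literal text that must follow.
$broken = '@[\[\\\]x]@u';

// Doubling the backslashes sends [\[\\\]x] to PCRE, which keeps '[',
// the backslash, ']' and 'x' all inside the class.
$fixed = '@[\[\\\\\]x]@u';

$brokenResult = preg_replace($broken, '', $string);
$fixedResult  = preg_replace($fixed, '', $string);

echo $brokenResult, "\n"; // no match, so the input is returned as-is
echo $fixedResult, "\n";  // brackets and backslash are removed
```

If this is indeed the cause, writing the literal backslash as `\\\\` in the original pattern should be enough to make it match.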

So there is just a syntax problem somewhere in your pattern. It isn't a good idea to list every character individually in a regex. You can use ranges instead, something like this:

ltrChars : 'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF'+'\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF';
rtlChars : '\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC';
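
Note that the `\uXXXX` escapes above are JavaScript-style; PHP's PCRE functions do not accept them. A minimal sketch of the same idea in PHP, assuming only the first few ranges from `ltrChars` are needed, uses `\x{XXXX}` together with the `u` modifier. The variable names here are hypothetical.

```php
<?php
// Illustrative only: the first three ranges above rewritten in PCRE syntax.
// \x{....} code points above 0xFF need the u modifier (UTF-8 mode).
$latinPattern = '@[A-Za-z\x{00C0}-\x{00D6}\x{00D8}-\x{00F6}]@u';

$string = 'The Story of Jewād';

// Remove every character covered by the ranges; ā (U+0101) is outside
// them, so it survives along with the three spaces.
$latinRemoved = preg_replace($latinPattern, '', $string);

echo $latinRemoved, "\n";
```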

edited Mar 16, 2013 at 16:20

answered Mar 16, 2013 at 16:08


3 Comments

Alasdair Over a year ago

I need to list all the characters out specifically because these are all the characters I have in a font.

Mostafa Shahverdy Over a year ago

Well, at least I can see some ranges out there; like A-Z or 0-9

Alasdair Over a year ago

Your method here did not work exactly, but with a small change it did:

@([^\x{0020}-\x{007E}\x{FB01}\x{FB02}\x{00A1}-\x{00AC}\x{00AE}-\x{00FF}\x{0160}\x{0161}\x{0192}\x{2013}\x{2018}-\x{201A}\x{2020}-\x{2022}\x{2026}\x{2030}\x{2039}\x{2044}\x{201C}-\x{201E}\x{203A}\x{02C6}\x{02D8}-\x{02DD}\x{02C7}\x{2014}\x{0141}\x{0142}\x{0131}\x{0152}\x{0153}\x{2212}\x{2122}\x{0178}\x{017D}\x{017E}\x{20AC}])@u
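
For reference, a sketch of how the pattern from this comment could be applied to the string from the question. The pattern is copied from the comment and only split across lines for readability; the variable names are assumptions. Because the class is negated, it removes every character that is not in the font's list.

```php
<?php
// Pattern taken from the comment above, split with concatenation only
// so the lines stay readable. The leading ^ negates the class.
$notInFont = '@([^\x{0020}-\x{007E}\x{FB01}\x{FB02}\x{00A1}-\x{00AC}\x{00AE}-\x{00FF}'
    . '\x{0160}\x{0161}\x{0192}\x{2013}\x{2018}-\x{201A}\x{2020}-\x{2022}'
    . '\x{2026}\x{2030}\x{2039}\x{2044}\x{201C}-\x{201E}\x{203A}\x{02C6}'
    . '\x{02D8}-\x{02DD}\x{02C7}\x{2014}\x{0141}\x{0142}\x{0131}\x{0152}'
    . '\x{0153}\x{2212}\x{2122}\x{0178}\x{017D}\x{017E}\x{20AC}])@u';

$string = 'The Story of Jewād';

// ā (U+0101) is not in any listed range, so it is the only character removed.
$cleaned = preg_replace($notInFont, '', $string);

echo $cleaned, "\n";
```

Using code point escapes also sidesteps the quoting problems that come with typing characters such as `\`, `'` and `]` literally into the class.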
