I have created a JavaScript regular expression to validate comments entered by users in my app. The regex allows letters, numbers, some special symbols, and a range of emojis.

I received help here to correctly format my JavaScript regular expression, and the final expression I am using is as follows:

JavaScript regex:

commentRegex =    /^(?:[A-Za-z0-9\u00C0-\u017F\u20AC\u2122\u2150\u00A9 \/.,\-_$!\'&*()="?#+%:;\<\[\]\r\r\n]|(?:\ud83c[\udf00-\udfff])|(?:\ud83d[\udc00-\ude4f\ude80-\udeff]))*$/;

I was advised to perform the same validation on the server side (with PHP), so I am trying to do something similar using preg_replace().

I would like to replace every character that is not allowed by the regex with the empty string. Here is my attempt; however, it is not working. Thanks for any help.

PHP

$commentText = preg_replace('#^(?:[A-Za-z0-9\u00C0-\u017F\u20AC\u2122\u2150\u00A9 \/.,\-_$!\'&*()="?#+%:;\<\[\]\r\r\n]|(?:\ud83c[\udf00-\udfff])|(?:\ud83d[\udc00-\ude4f\ude80-\udeff]))*$#', '', $commentText);

Edit:

After taking your advice in the comments I now have the following regex.

$postText = preg_replace('/^(?:[A-Za-z0-9\x{00C0}-\x{017F}\x{20AC}\x{2122}\x{2150}\x{00A9} \/.,\-_$!\'&*()="?\#\+%:;\<\[\]\r\n]|(?:\x{d83c}[\x{df00}-\x{dfff}])|(?:\x{d83d}[\x{dc00}-\x{de4f}\x{de80}-\x{deff}]))*$/', '', $postText);

However, I am getting a warning:

Warning: preg_replace(): Compilation failed: character value in \x{} or \o{} is too large at offset 30 in submit_post.php on line 37
  • What are you excluding, and why? Commented Jan 15, 2017 at 21:52
  • It would help us if you would add $commentText on what you specifically want replaced. Commented Jan 15, 2017 at 21:53
  • In PHP PCRE, you need to turn all \uXXXX into \x{XXXX}. Also, there is no need to write \r twice in [\r\r\n]. BTW, "it is not working" is a poor problem description; you should always provide the exact behavior you get. Commented Jan 15, 2017 at 21:57
  • @Xorifelse I was hoping I could go with an approach where I have a whitelist of allowed characters, and everything that is not in the whitelist gets replaced with the empty string. Is this possible? I can tell you the allowed characters then... Commented Jan 15, 2017 at 21:59

3 Answers

1

In short: use

$re = '/[^A-Za-z0-9\x{00C0}-\x{017F}\x{20AC}\x{2122}\x{2150}\x{00A9} \/.,\-_$!\'&*()="?#+%:;<[\]\r\n\x{1F300}-\x{1F3FF}\x{1F400}-\x{1F64F}\x{1F680}-\x{1F6FF}]+/u';
$text = 'test>><<<®¥§';
echo preg_replace($re, '', $text);

See the PHP demo.

A bit of an explanation:

  • Escape only special regex metacharacters inside the pattern and the regex delimiters (if you choose # as the regex delimiter, escape the # in the pattern, and then there is no need to escape /).
  • \uXXXX in PCRE must be replaced with the \x{XXXX} notation.
  • Since the text to be processed is Unicode and the pattern contains characters outside the ASCII range, you have to use the /u (UTF-8) modifier.
  • Since most emojis lie outside the BMP, and the string is now treated as a sequence of Unicode code points, these symbols must be written as single code points using the extended \x{XXXXX} notation, not as the surrogate pairs (two \uXXXX code units) used in JavaScript.
  • Your 3 alternatives can be merged into 1 big character class, which you then negate by adding ^ at its start to make it a negated character class.
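
The surrogate-pair point can be checked with a little arithmetic. The following is a minimal sketch in JavaScript (the language of the original pattern), not part of the answer itself; the helper names surrogatePairToCodePoint and toPcreEscape are made up for illustration. It combines each surrogate pair from the question's pattern into a single code point and prints it in the \x{...} form that PCRE accepts with the /u modifier.

```javascript
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogatePairToCodePoint(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Hypothetical helper: format a code point as a PCRE \x{...} escape.
function toPcreEscape(codePoint) {
  return '\\x{' + codePoint.toString(16).toUpperCase() + '}';
}

// Emoji ranges from the JavaScript pattern: [high surrogate, low start, low end].
const jsRanges = [
  [0xD83C, 0xDF00, 0xDFFF],
  [0xD83D, 0xDC00, 0xDE4F],
  [0xD83D, 0xDE80, 0xDEFF],
];

const pcreRanges = jsRanges.map(([high, lowStart, lowEnd]) =>
  toPcreEscape(surrogatePairToCodePoint(high, lowStart)) + '-' +
  toPcreEscape(surrogatePairToCodePoint(high, lowEnd)));

console.log(pcreRanges.join(''));
// \x{1F300}-\x{1F3FF}\x{1F400}-\x{1F64F}\x{1F680}-\x{1F6FF}
```

The printed ranges are the same ones that appear at the end of the character class in the PHP pattern above.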

6 Comments

Stribizew, hiya. I have a question related to this and thought I'd ask here before posting another question. This regex works perfectly. When I type a message, it allows, for example, double quotes. However, I have discovered that if I copy and paste text (containing double or single quotes) from another webpage into the textarea in my form and submit it, the quotes are replaced with the empty string. I suspect it has to do with the font of the text I pasted, the encoding, or something else. Do you know any more than me about this behaviour? Thanks.
That just means the quotes are matched with the regex and are removed with preg_replace.
But remember, our solution replaces everything BUT what's in the regex, as I wanted. It does allow quotes when I type them in, just not when I copy and paste. It's OK, I will read more about it.
Thanks. If you look here, I added some characters to your text (I copied and pasted the quotes from a web page): ideone.com/trSge1
The curly single quotes are \x{2019}\x{2018}. Add them to the negated character class to avoid removing them - ideone.com/oNIk9p.
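
To see why typed and pasted quotes behave differently, it helps to look at their code points. Below is a small JavaScript sketch, assuming a reduced whitelist for illustration rather than the full pattern from the answer; punctuationCodePoints is a hypothetical helper. The curly double quotes U+201C and U+201D are included alongside the single ones mentioned above, since pasted text often contains both.

```javascript
// Straight quotes typed on a keyboard vs. curly quotes pasted from a web page.
const typed = 'it\'s "fine"';
const pasted = 'it\u2019s \u201Cfine\u201D';

// Hypothetical helper: list the distinct code points that are not
// ASCII letters, digits, or spaces.
function punctuationCodePoints(text) {
  const seen = new Set();
  for (const ch of text) {
    if (!/[A-Za-z0-9 ]/.test(ch)) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      seen.add('U+' + hex);
    }
  }
  return [...seen];
}

console.log(punctuationCodePoints(typed));  // [ 'U+0027', 'U+0022' ]
console.log(punctuationCodePoints(pasted)); // [ 'U+2019', 'U+201C', 'U+201D' ]

// Reduced whitelists for illustration only (not the full pattern).
const strict = /[^A-Za-z0-9 '"]+/gu;
const relaxed = /[^A-Za-z0-9 '"\u2018\u2019\u201C\u201D]+/gu;

console.log(pasted.replace(strict, ''));             // its fine
console.log(pasted.replace(relaxed, '') === pasted); // true
```

The straight quotes pass the strict whitelist, while the curly ones are different characters and are stripped until they are added to the class.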
1

In PHP, the regex is surrounded by a delimiter character. In your case you are using the hash (#), but the delimiter must not occur unescaped inside the regex itself, which it does.

You have to escape this character inside the pattern or use another delimiter. Why did you not use the same "/" as in the JS version? The benefit is that it is already escaped.

I have not checked whether the rest works, but I think so.

$commentText = preg_replace('/^(?:[A-Za-z0-9\u00C0-\u017F\u20AC\u2122\u2150\u00A9 \/.,\-_$!\'&*()="?#+%:;\<\[\]\r\r\n]|(?:\ud83c[\udf00-\udfff])|(?:\ud83d[\udc00-\ude4f\ude80-\udeff]))*$/', '', $commentText);

should work.


1

Convert the \u.... sequences to \x{....}, and the result appears to be a valid PHP regular expression.

pattern: \\u(\w{4})

replace: \\x{$1}

regex101 demo
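
The same search-and-replace can be sketched in JavaScript. This is an illustration only: jsEscapesToPcre is a hypothetical helper, and it restricts the four characters to hex digits, which is slightly stricter than the \w{4} used above.

```javascript
// Hypothetical helper: rewrite JavaScript \uXXXX escapes as PCRE \x{XXXX}.
function jsEscapesToPcre(source) {
  return source.replace(/\\u([0-9A-Fa-f]{4})/g, '\\x{$1}');
}

// Part of the character class from the question, kept as raw text.
const jsClass = String.raw`[A-Za-z0-9\u00C0-\u017F\u20AC\u2122\u2150\u00A9]`;

console.log(jsEscapesToPcre(jsClass));
// [A-Za-z0-9\x{00C0}-\x{017F}\x{20AC}\x{2122}\x{2150}\x{00A9}]
```

Note that the conversion is purely mechanical: surrogate halves such as \ud83c simply become \x{d83c}, so the emoji ranges still have to be rewritten as single code points, and the resulting pattern needs the u modifier.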

2 Comments

Thanks, I have changed it like you said, but I am getting a warning now: Warning: preg_replace(): Compilation failed: character value in \x{} or \o{} is too large at offset 30 in submit_post.php on line 37
Add UTF-8 parsing with the u flag? For example, preg_match('/pattern/u', input). Check this answer: stackoverflow.com/a/32375905/244811
