On one of my PHP sites, I use this regular expression to automatically remove phone numbers from strings:

$text = preg_replace('/\+?[0-9][0-9()\s+-]{4,20}[0-9]/', '[removed]', $text);

However, when users post long URLs that contain digit sequences, the preg_replace matches those digits as well, which breaks the URL.

How can I ensure the above preg_replace does not alter URLs contained in $text?

EDIT:

As requested, here is an example of a URL being broken by the preg_replace above:

$text = 'Please help me with my question here: https://stackoverflow.com/questions/20589314/ Thanks!';
$text = preg_replace('/\+?[0-9][0-9()\s+-]{4,20}[0-9]/', '[removed]', $text);
echo $text; 

//echoes: Please help me with my question here: https://stackoverflow.com/questions/[removed]/ Thanks!
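
To see exactly which substring gets replaced, the matches can be listed with preg_match_all. This is only a diagnostic sketch: the variable names are illustrative, and the hyphen is kept last in the character class so that it is always read as a literal hyphen.

```php
<?php
// Diagnostic sketch: list what the phone-number pattern matches in the sample text.
$text = 'Please help me with my question here: https://stackoverflow.com/questions/20589314/ Thanks!';

// Phone-number pattern from the question, hyphen placed last in the class.
$phonePattern = '/\+?[0-9][0-9()\s+-]{4,20}[0-9]/';

preg_match_all($phonePattern, $text, $found);

// $found[0] holds every full match; here the only match is the question ID inside the URL.
print_r($found[0]);
```

The single match is the run of digits in the URL path, which is why the URL ends up containing "[removed]".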
7
  • Simply check that the offending text doesn't start with "http". Commented Dec 14, 2013 at 23:12
  • @nietonfir: But what if the URL is in the middle of the text? Commented Dec 14, 2013 at 23:17
  • I think you have to parse the url AND the phone number, like /(?: url \K | phone number)/ Commented Dec 14, 2013 at 23:39
  • Please provide several examples of URLs that contain phone numbers, and show how they get broken. Commented Dec 15, 2013 at 0:01
  • @sln: How would I do that? If it helps, there is a URL regex here: stackoverflow.com/a/8234912/869849 Commented Dec 15, 2013 at 0:16

3 Answers

2

I think you have to parse the url AND the phone number, like /(?: url \K | phone number)/ - sln
@sln: How would I do that? If it helps, there is a URL regex here: stackoverflow.com/a/8234912/869849 – ProgrammerGirl

Here is an example that combines the URL regex you linked with your phone-number regex:

PHP test case

 $text = 'Please help me with my +44-83848-1234 question here: http://stackoverflow.com/+44-83848-1234questions/20589314/ phone #:+44-83848-1234-Thanks!';
 $str = preg_replace_callback('~((?:(?:[a-zA-Z]{3,9}:(?://)?)(?:[;:&=+$,\w-]+@)?[a-zA-Z0-9.-]+|(?:www\.|[;:&=+$,\w-]+@)[a-zA-Z0-9.-]+)(?:(?:/[+\~%/.\w-]*)?\??[+=&;%@.\w-]*\#?\w*)?)|(\+?[0-9][0-9()\s+-]{4,20}[0-9])~',
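                   function( $matches ){
                        // Group 1 (URL) matched: return the URL unchanged.
                        if ( $matches[1] != "" ) {
                             return $matches[1];
                        }
                        // Otherwise group 2 (phone number) matched: replace it.
                        return '[removed]';
                   },
                   $text);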

 print $str;

Output >>

 Please help me with my [removed] question here: http://stackoverflow.com/+44-83848-1234questions/20589314/ phone #:[removed]-Thanks!

Regex, processed with RegexFormat

 # '~((?:(?:[a-zA-Z]{3,9}:(?://)?)(?:[;:&=+$,\w-]+@)?[a-zA-Z0-9.-]+|(?:www\.|[;:&=+$,\w-]+@)[a-zA-Z0-9.-]+)(?:(?:/[+\~%/.\w-]*)?\??[+=&;%@.\w-]*\#?\w*)?)|(\+?[0-9][0-9()\s+-]{4,20}[0-9])~'

     (                                  # (1 start), URL
          (?:
               (?:
                    [a-zA-Z]{3,9} :
                    (?: // )?
               )
               (?: [;:&=+$,\w-]+ @ )?
               [a-zA-Z0-9.-]+ 
            |  
               (?: www \. | [;:&=+$,\w-]+ @ )
               [a-zA-Z0-9.-]+ 
          )
          (?:
               (?: / [+~%/.\w-]* )?
               \??
               [+=&;%@.\w-]* 
               \#?
               \w* 
          )?
     )                                  # (1 end)
  |  
     (                                  # (2 start), Phone Num
          \+? 
          [0-9] 
          [0-9()\s+-]{4,20} 
          [0-9] 
     )                                  # (2 end)
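
The same "consume the URL first" idea can also be sketched as a single preg_replace call by using the PCRE backtracking verbs (*SKIP)(*FAIL). This is a different technique from the callback above, not a rewrite of it: when the URL branch matches, the verbs throw that match away and resume scanning after the URL, so only the phone branch can ever produce a replacement. The function name removePhoneNumbers and the deliberately simplified URL pattern are assumptions made for illustration; the longer URL pattern above could be substituted.

```php
<?php
// Sketch: skip URLs with (*SKIP)(*FAIL) so that only phone numbers get replaced.
// Assumption: a URL is "http://", "https://" or "www." followed by non-space characters.
function removePhoneNumbers(string $text): string
{
    $url   = '(?:https?://|www\.)\S+';
    $phone = '\+?[0-9][0-9()\s+-]{4,20}[0-9]';

    // URL branch: match, then force a failure and skip past the matched text.
    // Phone branch: the only branch that can actually succeed.
    $pattern = '~' . $url . '(*SKIP)(*FAIL)|' . $phone . '~i';

    // preg_replace returns null on error; fall back to the untouched input.
    return preg_replace($pattern, '[removed]', $text) ?? $text;
}

echo removePhoneNumbers(
    'Call +44-83848-1234 or see https://stackoverflow.com/questions/20589314/ Thanks!'
), "\n";
```

Because \S+ runs up to the next whitespace, trailing punctuation is treated as part of the URL. That is harmless here, since the URL text is left untouched either way.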

4 Comments

Very interesting, thank you! Is there a way to do this using just 1 line of preg_replace?
Instead of 1 line of preg_replace_callback? Depends on what the replacement is. As I said earlier, preg_replace /(?: url \K | phone number)/ with "".
I tried what you mentioned in your comment, and it correctly ignores URLs; however, it then appends "[removed]" to the end of each URL. Do you know how to fix that?
That is the dilemma. If you replace with the empty string, it can be done with a simple preg_replace. The URL must be consumed independently in order to pass over it, because the phone number is a subset of it. There is no practical way to use assertions in this case. Within regex engines, a callback is just one extra function call, a really imperceptible amount of overhead. If you want to get the job done, I suggest using this method.
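
The behaviour discussed in these comments can be reproduced with a short sketch. \K resets the start of the reported match, so when the URL branch succeeds the match is an empty string sitting just after the URL. An empty replacement therefore leaves the URL as it was, while any other replacement text is inserted at that point. The simplified URL pattern and the variable names are assumptions made for illustration.

```php
<?php
// Sketch of the "url \K | phone number" idea with a simplified URL pattern.
$pattern = '~(?:https?://|www\.)\S+\K|\+?[0-9][0-9()\s+-]{4,20}[0-9]~';
$text    = 'See https://stackoverflow.com/questions/20589314/ or call +44-83848-1234';

// Empty replacement: the zero-length match after the URL changes nothing,
// and the phone number is removed.
$withEmpty = preg_replace($pattern, '', $text);

// Non-empty replacement: the marker is also inserted at the zero-length
// match, i.e. directly after the URL.
$withMarker = preg_replace($pattern, '[removed]', $text);

echo $withEmpty, "\n", $withMarker, "\n";
```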
1

It takes a bit more code, but instead of scratching your head you'll get to stroke your ego!

<?php
    $text = "This is my number20558789yes with no spaces
    and this is yours 254785961
    But this 20558474 is within http://stackoverflow.com/questions/20558474/
    So I don't remove it
    and this is another url http://stackoverflow.com/questions/20589314/ 
    Thanks!";
    $up = "(https?://[-.a-zA-Z0-9]+\.[a-zA-Z]{2,3}/\S*)"; // matches URLs
    $np = "(\+?[0-9][0-9()\s+-]{4,20}[0-9])"; // you know this pattern already
    preg_match_all("#{$up}|{$np}#", $text, $matches); // match both patterns together ($matches[1] contains URLs, $matches[2] contains numbers)
    preg_match_all("#{$np}#", implode(" ", array_filter($matches[1])), $urls_numbers); // extract the numbers that occur inside the matched URLs, if there are any
    $diff = array_diff(array_filter($matches[2]), $urls_numbers[0]); // numbers that do not occur in any URL, i.e. the ones to replace
    $text = str_replace($diff, "[removed]", $text); // replace them
    echo $text; // here you are

And then The Output:

This is my number[removed]yes with no spaces
and this is yours [removed]
But this 20558474 is within http://stackoverflow.com/questions/20558474/
So I don't remove it
and this is another url http://stackoverflow.com/questions/20589314/ 
Thanks!

0

Would it be fair to assume that phone numbers are usually either preceded by whitespace or at the start of a line? If so, the following would stop you from changing URLs accidentally, since URLs never contain whitespace or newlines:

$text = preg_replace('/(^|\s)\+?[0-9][0-9()\s+-]{4,20}[0-9]/m', '$1[removed]', $text);
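
A small check of this idea, as a sketch only. It uses the m flag so that ^ also matches at line starts, keeps the hyphen last in the character class, and puts the captured leading whitespace back via $1. The function name stripStandalonePhoneNumbers is hypothetical.

```php
<?php
// Sketch: only treat digit runs as phone numbers when they follow whitespace
// or start a line, so digits inside a URL are never touched.
function stripStandalonePhoneNumbers(string $text): string
{
    $pattern = '/(^|\s)\+?[0-9][0-9()\s+-]{4,20}[0-9]/m';

    // $1 re-inserts the whitespace (or empty line start) consumed by the group.
    return preg_replace($pattern, '$1[removed]', $text) ?? $text;
}

// The URL keeps its digits because they follow "/" rather than whitespace.
echo stripStandalonePhoneNumbers('See https://stackoverflow.com/questions/20589314/ Thanks!'), "\n";

// A number that follows a space is replaced.
echo stripStandalonePhoneNumbers('Call me at +44-83848-1234 today'), "\n";

// A number glued to a preceding letter is not matched at all.
echo stripStandalonePhoneNumbers('Call me at x+44-83848-1234'), "\n";
```

The last call shows the trade-off of this approach: anything that is not preceded by whitespace is left alone, whether it sits in a URL or not.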

1 Comment

The problem with your solution is that it can easily (and accidentally!) be circumvented by simply preceding a phone number with a letter. Ideally, I'm looking for a solution that will only ignore the regex if the sequence of numbers occurs inside a URL, but I have no idea how to do that.
