I am having trouble writing a regex in PHP to extract all URLs that look like these:

http://hello.hello/asefaesasef my name is 
https://aw3raw.com/asdfase/
www.aer.com/afseaegfefsesef\
domain.com/afsegaesga"

I need to basically extract the URL until I hit a white space, a backslash (\) or a double quote (").

I have the following code:

$column = "adsfahttp://hello.hello/asefaesas\"ef asefa aweoija weeij asd sa https://aw3raw.com/asdfase/ asdafewww.aer.com/afseaegfefsesef\ even ashafueh domain.com/afsegaesga\"asdfasda";
preg_match_all("/(http|https):\/\/\S+[^(\"|\\)]+/",$column,$urls);
echo "Url = \n";
print_r($urls);

So I need it to extract the following:

http://hello.hello/asefaesasef
https://aw3raw.com/asdfase
www.aer.com/afseaegfefsesef
domain.com/afsegaesga

I'm struggling to get my head around it as my result is showing as:

Url =
Array
(
[0] => Array
    (
        [0] => http://hello.hello/asefaesas"ef asefa aweoija weeij asd sa https://aw3raw.com/asdfase/ asdafewww.aer.com/afseaegfefsesef\ even ashafueh domain.com/afsegaesga
    )

[1] => Array
    (
        [0] => http
    )

)
  • How do you know "domain.com/afsegaesga" is a URL? Commented Dec 9, 2016 at 23:18

2 Answers

First, you've got the syntax of character classes wrong. Within the square brackets, you don't need parentheses for grouping or pipes for alternation. Just list the characters you're interested in--or in this case, that you want to exclude.

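What you're doing now is matching one or more non-whitespace characters (including \ and "), followed by one or more characters that aren't quotes (whitespace included). The backslash isn't actually excluded either: after PHP's string escaping, the class reaches the regex engine as [^("|\)], where \) is just an escaped parenthesis. That is why your match runs straight through the spaces and the backslash and only stops at the last quote. You need to combine all the criteria into one negated character class: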

preg_match_all("~https?://[^\"\s\\\\]+~", $column, $urls);
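To see the effect on the sample string from the question, here is a minimal sketch. It holds the data in a nowdoc so the backslash and quotes need no escaping, and it writes the pattern in a single-quoted string; both are choices made for this illustration, not part of the original code.

```php
<?php
// Sample data from the question, held in a nowdoc so that \ and " stay literal.
$column = <<<'TXT'
adsfahttp://hello.hello/asefaesas"ef asefa aweoija weeij asd sa https://aw3raw.com/asdfase/ asdafewww.aer.com/afseaegfefsesef\ even ashafueh domain.com/afsegaesga"asdfasda
TXT;

// One negated class: stop at a double quote, whitespace, or a backslash.
// In a single-quoted PHP string, \\\\ reaches PCRE as \\ (one literal backslash).
$pattern = '~https?://[^"\s\\\\]+~';

preg_match_all($pattern, $column, $urls);
print_r($urls[0]);
```

Only the two URLs with a protocol are found, and the trailing slash of the second one is kept because a slash is not in the excluded set. If that slash is unwanted, rtrim($url, '/') on each match would drop it.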

Notice that this only matches the URLs starting with http:// or https://. You can make the protocol optional ("~(?:https?://)?[^\"\s\\\\]+~"), but then the regex will match almost anything, making it useless. Are all your URLs at the beginning of a line, the way you showed them? If so, you can use an anchor instead:

preg_match_all('/(?m)^[^\"\s\\\\]+/', $column, $urls);
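The anchored variant only helps if each candidate really sits at the start of its own line. A sketch under that assumption, using the four sample lines from the question as input (the $text variable is made up for this example):

```php
<?php
// Assumption: one candidate URL at the start of each line, as in the question's list.
$text = <<<'TXT'
http://hello.hello/asefaesasef my name is
https://aw3raw.com/asdfase/
www.aer.com/afseaegfefsesef\
domain.com/afsegaesga"
TXT;

// (?m) makes ^ match at the start of every line, not just the start of the string.
$pattern = '/(?m)^[^"\s\\\\]+/';

preg_match_all($pattern, $text, $urls);
print_r($urls[0]);
```

Because \s also covers newlines, a match can never run across a line break, so each line contributes at most one result.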

You just need to add \s to the character class in your regex: /(http|https):\/\/\S+[^(\"|\\)\s]+/ so that it doesn't match whitespace.
