I'm new to Stack Overflow and I'm from South Korea.

I'm having difficulty with a regex in PHP.

I want to select all the URLs in user-submitted HTML source.

The restrictions I want to apply are as follows.

Select URLs EXCEPT

  • URLs that are within tags. For example, if the HTML source is like below,

    <a href="http://aaa.com">http://aaa.com</a>

    Neither occurrence of http://aaa.com should be selected.

  • URLs that come right after " or =

Here is my current regex:

/(?<![\"=])https?\:\/\/[^\"\s<>]+/i

But with this regex, I can't satisfy the first rule.

I tried adding a negative lookahead at the end of my current regex, like this:

/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i

It still selects part of the second URL in the a tag, as shown below.

http://aaa.co

We don't have a developer Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!

  • The real issue is that you're not choosing the right tool for the job. Parsing HTML with regex is not a good idea; use a parser like DOMDocument. Commented Aug 12, 2014 at 9:57
  • Thanks for the response, Elias, but even if I use that kind of parsing class, don't I still have to retrieve the URLs from the text in some way? I'm learning regex, so I'm just looking for some help solving this issue with regex. Commented Aug 12, 2014 at 10:06
  • Well, of course you extract the URL from the markup: $links = $dom->getElementsByTagName('a'); gives you all the link elements. Then simply loop over them and get each link with $link->getAttribute('href'), which returns the attribute value as a string. If certain URLs should be skipped, that's where a regex fits in. To get the link text, $link->nodeValue should work. Commented Aug 12, 2014 at 10:09
  • Elias, I guess my question was unclear. I'm not selecting the href within the A tag. I want to select URLs EXCEPT the URLs within A tags. Commented Aug 12, 2014 at 10:12
  • Added an answer: you can get a node's text through the textContent property of DOMNode, or you can simply strip the markup tags from your HTML by calling strip_tags. Commented Aug 12, 2014 at 10:17

3 Answers

Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error-prone, more work, and slower.

The DOM works just like in the browser and you can use getElementsByTagName to get all links.

I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):

<?php

$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $link) {
    var_dump($link->getAttribute('href'));
    // Output: http://aaa.com
}

1 Comment

Thank you for the reply, chh, but I guess my question was unclear. I'm not selecting the href within the A tag. I want to select URLs EXCEPT the URLs within A tags.

Don't use regex. Use the DOM.

$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    if($a->hasAttribute('href')){
        echo $a->getAttribute('href');
    }
    //$a->nodeValue; // If you want the text in <a> tag
}

1 Comment

I guess my question was unclear. I'm not selecting the href within the A tag. I want to select URLs EXCEPT the URLs within A tags.

Seeing as you're not trying to extract URLs that are the href attribute of an a node, you'll want to start by getting the actual text content of the DOM. This can easily be done like so:

$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // outermost content element; for a full document this is <body>
$text = $root->textContent; // plain text only: no tags, no attributes
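
Note that textContent still includes the text inside <a> elements, so the link text from the question's example would survive. A minimal sketch of one way to leave it out, assuming PHP 7 or later: query only the text nodes that have no <a> ancestor with DOMXPath, then run a simple URL regex over each. The function name findUrlsOutsideLinks and the sample URLs are made up for illustration.

```php
<?php
// Hypothetical helper: collect URLs from text nodes that are not inside an <a> element.
function findUrlsOutsideLinks(string $html): array
{
    $dom = new DOMDocument();
    // User-submitted markup is often imperfect; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $urls = [];
    // Only text nodes without an <a> ancestor. Attribute values are not text nodes,
    // so href="..." and src="..." are never visited.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
        if (preg_match_all('~https?://[^\s<>"]+~i', $textNode->nodeValue, $matches)) {
            $urls = array_merge($urls, $matches[0]);
        }
    }
    return $urls;
}

$html = '<p>See http://bbb.com for details.</p>'
      . '<a href="http://aaa.com">http://aaa.com</a>';
print_r(findUrlsOutsideLinks($html));
```

With this sample input, only http://bbb.com should be collected. Text inside <script> or <style> elements is still visited, so the XPath expression may need further conditions for real pages.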

An alternative approach would be this:

$text = strip_tags($htmlString); // gets rid of the markup
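
Keep in mind that strip_tags keeps the text inside <a> elements, so the link text from the question's example is still present in $text. For anyone who wants to stay with a regex, as the asker did: the lookahead attempt in the question fails because the engine backtracks by one character, so http://aaa.co is followed by m rather than </a>, and the lookahead is satisfied. One possible workaround, sketched here on the assumption that the markup is reasonably well formed, uses the PCRE verbs (*SKIP)(*FAIL) to consume whole <a>...</a> elements and any other tags before looking for URLs. The function name findUrlsWithRegex and the sample input are hypothetical.

```php
<?php
// Hypothetical helper: regex-only variant that skips <a>...</a> elements and all other tags.
function findUrlsWithRegex(string $html): array
{
    $pattern = '~<a\b[^>]*>.*?</a>(*SKIP)(*FAIL)'  // consume a whole anchor element, then discard it
             . '|<[^>]*>(*SKIP)(*FAIL)'            // consume any other tag (and its attributes), then discard it
             . '|(?<!["=])https?://[^\s<>"]+~is';  // the URL pattern from the question
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

$html = 'Visit http://bbb.com or <a href="http://aaa.com">http://aaa.com</a> '
      . '<img src="http://ccc.com/x.png">';
print_r(findUrlsWithRegex($html));
```

With this sample input, only http://bbb.com should be matched. An unclosed or nested anchor defeats the first alternative, which is the kind of case where a parser is more reliable.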

2 Comments

Thank you, Elias! I hadn't thought of the strip_tags function! I think that could solve my situation. However, I will definitely look into the first one as well.
@user3859822: happy to help. Looking into the entire DOMDocument business is definitely worthwhile. The API feels clunky at times, but it is the only way to write reliable code that processes markup.
