php regex selecting url from html source

Question

I'm new to Stack Overflow and I'm from South Korea.

I'm having difficulties with regex in PHP.

I want to select all the URLs from user-submitted HTML source.

The restrictions I want to apply are as follows.

Select URLs EXCEPT:

1. URLs within tags. For example, if the HTML source is like below,

<a href="http://aaa.com">http://aaa.com</a>

neither http://aaa.com should be selected.

2. URLs right after " or =

Here is my current regex.

/(?<![\"=])https?\:\/\/[^\"\s<>]+/i

But with this regex, I can't satisfy the first rule.

I tried adding a negative lookahead to the end of my current regex, like this:

/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i

It still selects the second URL in the a tag, as shown below.

http://aaa.co

We don't have a developer Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!
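The truncated match http://aaa.co can be reproduced with a short script. The likely cause is backtracking: the greedy character class first consumes the whole URL, the lookahead fails because </a> follows, and the engine then gives back one character so that the lookahead sees m</a> instead. The sketch below shows both patterns from the question. The third pattern uses a possessive quantifier (++), which is not from the original post and is only one possible adjustment; the variable names are assumptions for illustration.

```php
<?php

$html = '<a href="http://aaa.com">http://aaa.com</a>';

// Pattern 1 (from the question): the href value is skipped because it
// follows a quote, but the link text is still matched in full.
preg_match_all('/(?<!["=])https?:\/\/[^"\s<>]+/i', $html, $plain);

// Pattern 2 (from the question): the lookahead fails after "http://aaa.com",
// so the engine backtracks one character and matches "http://aaa.co".
preg_match_all('/(?<!["=])https?:\/\/[^<>"\s]+(?!<\/a>)/i', $html, $lookahead);

// Pattern 3 (assumption, not from the question): "++" is possessive, so the
// engine cannot give characters back and the whole attempt fails instead.
preg_match_all('/(?<!["=])https?:\/\/[^<>"\s]++(?!<\/a>)/i', $html, $possessive);

var_dump($plain[0]);
var_dump($lookahead[0]);
var_dump($possessive[0]);
```

Even with the possessive variant, the pattern only looks at the characters right after the URL, so a link whose text has anything else before </a> would still be matched. That is one reason the replies below suggest a parser.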

The real issue you have is that you're not choosing the right tool for the job. Parsing HTML with regex is not a good idea; use a parser like DOMDocument.
– Elias Van Ootegem, Commented Aug 12, 2014 at 9:57
Thanks for the response, Elias, but even if I use that kind of parsing class, wouldn't I still have to retrieve the URLs from the text in some way? I'm learning regex, so I'm just looking for some help solving this issue with regex.
– user3859822, Commented Aug 12, 2014 at 10:06
Well, of course you extract the URL from the markup: $links = $dom->getElementsByTagName('a'); gives you all the link elements. Then simply loop over them, and get the links by doing $link->getAttribute('href');. If certain URLs should be skipped, then that's where a regex fits in. To get the link text, $link->nodeValue should work.
– Elias Van Ootegem, Commented Aug 12, 2014 at 10:09
Elias, I guess my question was unclear. I'm not selecting the href within the a tag. I want to select URLs EXCEPT the URLs within a tags.
– user3859822, Commented Aug 12, 2014 at 10:12
Added answer: you can get at the text content of a node through the textContent property of a DOMNode instance, or you can simply strip the markup tags from your HTML by calling strip_tags.
– Elias Van Ootegem, Commented Aug 12, 2014 at 10:17

chh · Accepted Answer · 2014-08-12 10:07:49Z

1

Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error-prone, more work, and slower.

The DOM works just like in the browser and you can use getElementsByTagName to get all links.

I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):

<?php

$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $link) {
    var_dump($link->getAttribute('href'));
    // Output: string(14) "http://aaa.com"
}


1 Comment

user3859822 Over a year ago

Thank you for the reply, chh, but I guess my question was unclear. I'm not selecting the href within the a tag. I want to select URLs EXCEPT the URLs within a tags.

Linga · Accepted Answer · 2014-08-12 10:08:45Z

1

Don't use Regex. Use DOM

$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
    if($a->hasAttribute('href')){
        echo $a->getAttribute('href');
    }
    //$a->nodeValue; // If you want the text in <a> tag
}


1 Comment

user3859822 Over a year ago

I guess my question was unclear. I'm not selecting the href within the a tag. I want to select URLs EXCEPT the URLs within a tags.

Elias Van Ootegem · Accepted Answer · 2014-08-12 10:17:00Z

0

Seeing as you're not trying to extract URLs that are the href attribute of an a node, you'll want to start by getting the actual text content of the DOM. This can be done like so:

$dom = new DOMDocument;
$dom->loadHTML($htmlString);
// Get the outer tag; for a full document this is body.
// item(0) is used instead of [0] so this also works on older PHP versions.
$root = $dom->getElementsByTagName('body')->item(0);
$text = $root->textContent; // text only: no tags, no attributes
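One caveat: textContent also includes the text inside a elements, so a URL used as link text would still show up in $text. If those need to be excluded as well, a possible variation is to query only the text nodes that have no a ancestor with DOMXPath and run the URL pattern on each. This is a minimal sketch, not part of the original answer; the sample input, the simplified pattern, and the variable names are assumptions.

```php
<?php

// Hypothetical sample input: one plain URL and one URL used as link text.
$htmlString = '<p>See http://bbb.com and <a href="http://aaa.com">http://aaa.com</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($htmlString);

$xpath = new DOMXPath($dom);
$urls = [];

// Select text nodes under body that are not inside any <a> element.
foreach ($xpath->query('//body//text()[not(ancestor::a)]') as $textNode) {
    // Attribute values never appear in text nodes, so no lookbehind is needed.
    if (preg_match_all('/https?:\/\/[^\s<>"]+/i', $textNode->nodeValue, $m)) {
        $urls = array_merge($urls, $m[0]);
    }
}

var_dump($urls);
```

With the sample input above, only http://bbb.com should be collected, because the other URL appears either as an attribute value or inside the link.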

An alternative approach would be this:

$text = strip_tags($htmlString); // gets rid of markup
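To show how the stripped text could then be searched, here is a small sketch that combines strip_tags with preg_match_all. The sample input and the simplified pattern are assumptions for illustration; since attributes are gone after stripping, the lookbehind for " and = from the question is left out.

```php
<?php

// Hypothetical sample input: a URL in plain text and one in an attribute.
$htmlString = '<p>Visit http://bbb.com today</p><img src="http://ccc.com/x.png">';

// Tags and their attributes are removed; only the text between tags remains.
$text = strip_tags($htmlString);

// Collect every http/https URL left in the plain text.
preg_match_all('/https?:\/\/[^\s<>"]+/i', $text, $matches);

var_dump($text);
var_dump($matches[0]);
```

Keep in mind that strip_tags keeps the text between tags, so a URL used as the visible text of a link would still be found this way.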


2 Comments

user3859822 Over a year ago

Thank you, Elias! I hadn't thought of the strip_tags function. I think that could solve my situation. However, I will definitely look into the first approach as well.

Elias Van Ootegem Over a year ago

@user3859822: happy to help. Looking into the entire DOMDocument business is definitely worthwhile. The API feels clunky at times, but it is the only way to write reliable code that processes markup.
