How to extract URLs and text from HTML markup with regex

Question

<!-- This Div repeated in HTML with different properties value -->

<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">

<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">

<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">

    <!-- This Div also repeated multiple in HTML -->

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>

</DIV>

We have very messy HTML markup, generated by some program or application. We want to extract the URLs from this code, as well as the link text.

The href attributes use two types of URL. URL 1 pattern: 'http://link.domain.com/?id=123'. URL 2 pattern: 'http://domain.com/song-title.mp3'.

The first type follows a specific pattern, but the second has no pattern beyond ending with the '.mp3' extension.

Is there a function that extracts the URLs matching these patterns, along with the text?

Note: without DOM, is there any way to match the href and the text inside the link with a regular expression, e.g. preg_match?

There's no magic function that does the whole job for you. You'll have to write code that does what you want. Use a DOM parser such as DOMDocument to accomplish this.
– Amal, Commented Feb 8, 2014 at 10:58

Shankar Narayana Damodaran · Accepted Answer · 2014-02-08 11:39:33Z

2

Use the DOMDocument class, like this:

$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {

    echo $tag->getAttribute('href'), "\n"; // the link URL
    echo trim($tag->nodeValue), "\n";      // the text between the <a> tags

}
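
Markup as messy as the sample in the question tends to make loadHTML() emit parser warnings. Here is a minimal sketch of one way to handle that, assuming you want the results back as an array instead of echoed. The function name extractLinks and the sample string are hypothetical; libxml_use_internal_errors() is the standard PHP call for keeping libxml warnings out of the output.

```php
<?php
// Hypothetical helper (not from the original answer): returns each link's URL and text.
function extractLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $links[] = [
            'url'  => $tag->getAttribute('href'),
            // Collapse the whitespace left behind by the nested FONT/DIV markup.
            'text' => trim(preg_replace('/\s+/', ' ', $tag->nodeValue)),
        ];
    }
    return $links;
}

// Sample input modelled on the markup in the question.
$html = '<div><a href="http://link.domain.com/?id=123" target="_parent">'
      . '<FONT style="font-size:10pt" color=#000000 face="Tahoma">'
      . '<DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV></FONT></a></DIV>';

print_r(extractLinks($html));
```

Returning an array keeps the extraction separate from the output, so the same result can be filtered or stored later.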

edited Feb 8, 2014 at 11:39

answered Feb 8, 2014 at 11:04

Shankar Narayana Damodaran

1 Comment

Grant Over a year ago

Just tried this, it works great. Though you might want to change this line to: echo $tag->getAttribute('href');

Grant · Accepted Answer · 2014-02-08 11:38:28Z

1

Expanding on @Shankar Damodaran's answer:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}

Then do the same for the MP3:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {

    // The question says these URLs end with ".mp3", so test the end of the href
    $href = $tag->getAttribute('href');
    if (strtolower(substr($href, -4)) === '.mp3') {
        echo $href . "<br>\n";
    }

}
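
Both checks can also run in a single pass over the hrefs instead of parsing the file twice. The following is only a sketch, assuming the two URL shapes from the question. The function name classifyUrl and the third sample URL are hypothetical; parse_url() and parse_str() are standard PHP functions, used here so the checks look at the query string and the path separately.

```php
<?php
// Hypothetical helper: label a URL as 'id', 'mp3', or null for anything else.
function classifyUrl(string $url): ?string
{
    // parse_url() returns null for a missing component; cast to an empty string.
    $query = (string) parse_url($url, PHP_URL_QUERY);
    $path  = (string) parse_url($url, PHP_URL_PATH);

    parse_str($query, $params);
    if (isset($params['id'])) {
        return 'id';   // e.g. http://link.domain.com/?id=123
    }
    if (strtolower(substr($path, -4)) === '.mp3') {
        return 'mp3';  // e.g. http://domain.com/song-title.mp3
    }
    return null;
}

$urls = [
    'http://link.domain.com/?id=123',
    'http://domain.com/song-title.mp3',
    'http://domain.com/about.html', // made-up URL that matches neither pattern
];

foreach ($urls as $url) {
    echo $url, ' => ', var_export(classifyUrl($url), true), "\n";
}
```

Inside the DOMDocument loop, classifyUrl($tag->getAttribute('href')) would then decide which group each link belongs to.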

answered Feb 8, 2014 at 11:38

Grant

4 Comments

Ahmed iqbal Over a year ago

Thanks, but it shows warnings like "Warning: DOMDocument::loadHTML(): Unexpected end tag : td Notice: DOMDocument::loadHTML(): Namespace prefix fb Warning: DOMDocument::loadHTML(): Tag fb:comment"

Grant Over a year ago

You need to load the $html file contents correctly.

Grant Over a year ago

Try saving the page you are reading the URLs from as an .html file and opening it with file_get_contents('source.htm') to debug it first. Delete the unnecessary parts to make it simpler to debug.

Ahmed iqbal Over a year ago

Without DOM, is there any way to match the href and the text inside the link with a regular expression, e.g. preg_match?
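
For the regex route asked about here, a rough sketch with preg_match_all() might look like the following. It assumes every link resembles the sample in the question (a double-quoted href and no nested <a> tags). The pattern and the function name extractLinksWithRegex are illustrative only and not taken from either answer. Regex matching on HTML breaks easily when the markup varies, which is why both answers recommend DOMDocument.

```php
<?php
// Hypothetical regex-based alternative; assumes double-quoted href attributes.
function extractLinksWithRegex(string $html): array
{
    // Group 1: the href value. Group 2: everything up to the closing </a>.
    $pattern = '~<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Drop the inner tags, then collapse runs of whitespace.
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($m[2])));
        $links[] = ['url' => $m[1], 'text' => $text];
    }
    return $links;
}

// Sample input modelled on the markup in the question.
$html = '<a href="http://domain.com/song-title.mp3" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>';

print_r(extractLinksWithRegex($html));
```

The s modifier lets .*? span line breaks inside the link, and the i modifier accepts upper-case tags such as </A>.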
