<!-- This Div repeated in HTML with different properties value -->

<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">

<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">

<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">

    <!-- This Div also repeated multiple in HTML -->

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>

</DIV>

We have very messy HTML markup, generated by some program or application. We want to extract both the URLs and the link text from it.

The href attributes contain two types of URL. Pattern 1: 'http://link.domain.com/?id=123'. Pattern 2: 'http://domain.com/song-title.mp3'.

The first type follows a specific pattern, but the second has no fixed pattern; those URLs simply end with the '.mp3' extension.

Is there a function that extracts the URLs matching these patterns, along with the text?

Note: without DOM, is there any way to match the href and the text between the tags with a regular expression, e.g. preg_match?

  • There's no magic function that does all the job for you. You'll have to write code that does what you want. Use a DOM parser such as DOMDocument to accomplish this. Commented Feb 8, 2014 at 10:58

2 Answers

Use the DOMDocument class, like this:

$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('href'), "\n"; // the URL
    echo trim($tag->nodeValue), "\n";      // the text between the tags
}
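
Since the question asks for the text as well as the URL, here is a minimal sketch that collects both into an array instead of echoing them. The function name `extractLinks` and the sample `$html` string are assumptions made for illustration. `libxml_use_internal_errors(true)` is included because `loadHTML()` tends to emit warnings on messy markup like this.

```php
<?php
// Hypothetical helper: returns [['url' => ..., 'text' => ...], ...] for every <a href>.
function extractLinks(string $html): array
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        if (!$tag->hasAttribute('href')) {
            continue; // skip anchors that have no target
        }
        $links[] = [
            'url'  => $tag->getAttribute('href'),
            // Collapse the whitespace left over from the nested FONT/DIV markup.
            'text' => trim(preg_replace('/\s+/', ' ', $tag->textContent)),
        ];
    }
    return $links;
}

// Sample markup modelled on the question (assumed, shortened).
$html = '<div><a href="http://link.domain.com/?id=123" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a></DIV>';

print_r(extractLinks($html));
```

Returning an array rather than echoing keeps the extraction separate from whatever you do with the results afterwards.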

1 Comment

Just tried this, it works great. Though you might want to change this line to: echo $tag->getAttribute('href');

Expanding on @Shankar Damodaran's answer:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}

Then do the same for the MP3 links:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'.mp3') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}
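
The two loops above can probably be merged into a single pass. The sketch below is one way to do that, assuming a hypothetical helper named `classifyLinks` and made-up sample markup. It also checks that the URL path actually ends in `.mp3`, whereas `strstr()` matches `.mp3` anywhere in the string.

```php
<?php
// Hypothetical helper: sorts hrefs into the two URL patterns from the question.
function classifyLinks(string $html): array
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $result = ['id' => [], 'mp3' => []];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        // parse_url() may return null or false; the cast turns both into ''.
        $path = (string) parse_url($href, PHP_URL_PATH);

        if (strpos($href, '?id=') !== false) {
            $result['id'][] = $href;
        } elseif (strcasecmp(substr($path, -4), '.mp3') === 0) {
            // Only the path is checked, so "song.mp3?x=1" still counts
            // while "/mp3-player.html" does not.
            $result['mp3'][] = $href;
        }
    }
    return $result;
}

// Assumed sample input, kept well-formed so no parser warnings are expected.
$html = '<div><a href="http://link.domain.com/?id=123">One</a></div>'
      . '<div><a href="http://domain.com/song-title.mp3">Two</a></div>'
      . '<div><a href="http://domain.com/about.html">Three</a></div>';

print_r(classifyLinks($html));
```

Parsing the document once avoids reading and loading `source.htm` twice.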

4 Comments

Thanks, but it shows warnings like "Warning: DOMDocument::loadHTML(): Unexpected end tag : td Notice: DOMDocument::loadHTML(): Namespace prefix fb Warning: DOMDocument::loadHTML(): Tag fb:comment"
You need to load the $html file contents correctly.
Try saving the page you are reading the urls from as an .html file and open it with file_get_contents('source.htm') to debug it first. Delete the unnecessary stuff to make it simpler to debug.
Without DOM, is there any way to match the href and the text between the tags with a regular expression, e.g. preg_match?
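
For what it's worth, a regex-only version is possible for markup as regular as the sample, though it is more fragile than a DOM parser. The sketch below uses `preg_match_all()` with a hypothetical pattern that assumes double-quoted `href` values and no nested or unclosed `<a>` tags. The sample `$html` is also an assumption.

```php
<?php
// Hypothetical pattern: assumes double-quoted href values and no nested <a> tags.
$pattern = '~<a\s[^>]*?href="([^"]+)"[^>]*>(.*?)</a>~is';

// Sample markup modelled on the question (assumed, shortened).
$html = '<div><a href="http://domain.com/song-title.mp3" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a></DIV>';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $links[] = [
            // Decode entities such as &amp; that may appear inside the attribute.
            'url'  => html_entity_decode($m[1]),
            // Drop the inner tags, then collapse runs of whitespace.
            'text' => trim(preg_replace('/\s+/', ' ', strip_tags($m[2]))),
        ];
    }
}

print_r($links);
```

If an `<a>` tag is left unclosed, the lazy `(.*?)` runs on to the next `</a>` and swallows the following link, so results on the real pages should be spot-checked.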
