PHP Regex HTML - Extract URL

Question

I am trying to extract multiple URLs from an HTML file with regex. There are other URLs in the file, so the only pattern I have is "tableentries." at the beginning and "</a>" after the URL.

HTML code example:

<tr class="tableentries2">
  <td>
    <a href="http://example.com/all-files/files/00000000789/">Click Here</a>
  </td>

PHP I wrote:

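$html = "value of the code above";
// Use # as the delimiter so the / in </td> does not end the pattern early.
// The s modifier lets . match newlines; the lazy .*? stops at the first </td>.
if (preg_match_all('#<td>.*?</td>#s', $html, $match)) {
    foreach ($match[0] as $x) {
        echo $x . "<br>";
    }
}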

What is your question exactly? What does this code get you? Why does it not work?
– talnicolas, Commented Nov 16, 2011 at 18:45
Quotes are missing around your HTML attributes. <tr class="tableentries2"> ... <a href="http://example.com/..."> (edited your question)
– Maxime Pacary, Commented Nov 16, 2011 at 18:46
Maybe use a DOM parser like simplehtmldom.sourceforge.net
– Maxime Pacary, Commented Nov 16, 2011 at 18:48

sdleihssirhc · Accepted Answer · 2011-11-16 19:13:40Z

11

Why not just look for href values? (Updated because the edited code now has quotation marks.)

preg_match_all('/href="([^\s"]+)/', $html, $match);

Then the first URI would be in $match[1][0], with any further matches in $match[1][1] and so on.
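
As a minimal sketch of how that fits together, the snippet below runs the pattern over the sample row from the question and loops over every captured URL. The closing </tr> and the $urls variable are additions for illustration. Note that this collects every href on the page, not only those inside particular rows.

```php
<?php
// Sample markup from the question (closing </tr> added for illustration).
$html = '<tr class="tableentries2">
  <td>
    <a href="http://example.com/all-files/files/00000000789/">Click Here</a>
  </td>
</tr>';

// Capture everything after href=" up to the next whitespace or double quote.
$count = preg_match_all('/href="([^\s"]+)/', $html, $match);

// $match[1] holds the captured URLs, one per href found.
$urls = $count ? $match[1] : [];
foreach ($urls as $url) {
    echo $url, "\n";
}
```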

edited Nov 16, 2011 at 19:13

answered Nov 16, 2011 at 18:44

sdleihssirhc


1 Comment

Rajesh Muntari Over a year ago

Problem is there are also other URLs on the page, so the only pattern I have is "tableentries." at the beginning and "</a>" after the URL. Thanks for helping!

Jonathan Kuhn · Accepted Answer · 2011-11-16 19:13:36Z

5

You really shouldn't use regex to parse HTML. DOMDocument is actually very easy to use for this type of thing. Here is a simple example.

<?php
error_reporting(E_ALL);
$html = "
<table>
    <tr>
        <td>
            <a href='http://www.test1-1.com'>test1-1</a>
        </td>
        <td>
            <a href='http://www.test1-2.com'>test1-2</a>
        </td>
        <td>
            <a href='http://www.test1-3.com'>test1-3</a>
        </td>
    </tr>
    <tr>
        <td>
            <a href='http://www.test2-1.com'>test2-1</a>
        </td>
        <td>
            <a href='http://www.test2-2.com'>test2-2</a>
        </td>
        <td>
            <a href='http://www.test2-3.com'>test2-3</a>
        </td>
    </tr>
</table>";

$DOM = new DOMDocument();
//load the html string into the DOMDocument
$DOM->loadHTML($html);
//get a list of all <A> tags
$a = $DOM->getElementsByTagName('a');
//loop through all <A> tags
foreach($a as $link){
    //echo out the href attribute of the <A> tag.
    echo $link->getAttribute('href').'<br />';
}
?>

This would output:

http://www.test1-1.com
http://www.test1-2.com
http://www.test1-3.com
http://www.test2-1.com
http://www.test2-2.com
http://www.test2-3.com
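
If only the links inside certain rows are wanted, as with the tableentries rows in the question, the same DOM approach can probably be narrowed with DOMXPath. This is a sketch under assumptions, not a tested solution for the real page: it assumes the rows of interest have a class attribute that begins with "tableentries", and the sample markup (including the second, unrelated row) is made up for illustration.

```php
<?php
// Hypothetical sample: one row of interest and one unrelated row.
$html = '
<table>
    <tr class="tableentries2">
        <td><a href="http://example.com/all-files/files/00000000789/">Click Here</a></td>
    </tr>
    <tr class="other">
        <td><a href="http://example.com/unrelated/">Elsewhere</a></td>
    </tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Select the href attribute of every <a> under a <tr> whose class starts with "tableentries".
$nodes = $xpath->query('//tr[starts-with(@class, "tableentries")]//a/@href');

$urls = [];
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

The starts-with() test is only one way to express the match. If the class attribute can hold several class names, a contains() test would be the looser alternative.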

answered Nov 16, 2011 at 19:13

Jonathan Kuhn


3 Comments

Rajesh Muntari Over a year ago

Problem is there are also other URLs on the page, so the only pattern I have is "tableentries." at the beginning and "</a>" after the URL. Thanks for helping!

thevoipman Over a year ago

How do you also grab the titles of the links (e.g. test1-2) as well?

Jonathan Kuhn Over a year ago

@thevoipman There is a nodeValue property you can use, something like $link->nodeValue. Here is an example: codepad.viper-7.com/JBsfP1
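
For readers without access to that link, here is a rough sketch of the nodeValue idea. It is not the code from the codepad paste. The one-link sample markup and the $links array are assumptions for illustration.

```php
<?php
// One link taken from the answer's sample table.
$html = "<table><tr><td><a href='http://www.test1-2.com'>test1-2</a></td></tr></table>";

$dom = new DOMDocument();
$dom->loadHTML($html);

// Map each href to the text content of its <a> element.
$links = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $links[$link->getAttribute('href')] = trim($link->nodeValue);
}

foreach ($links as $href => $text) {
    echo $href, ' => ', $text, "\n";
}
```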

Andrew W · Accepted Answer · 2015-04-11 12:10:39Z

0

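<?php
// $html is assumed to already hold the page source.
// Named groups: 'path' captures the href value, 'name' the link text.
preg_match_all("#<a\s[^>]*href\s*=\s*[\'\"]??\s*?(?'path'[^\'\"\s]+?)[\'\"\s]{1}[^>]*>(?'name'[^>]*)<#simU", $html, $hrefs, PREG_SET_ORDER);

foreach ($hrefs as $urls) {
    print $urls['path'] . "<br>";
}
?>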
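
To illustrate how named groups and PREG_SET_ORDER work together, here is a small sketch. It uses a deliberately simpler pattern than the one above, written for this example only, and it assumes quoted href values and links whose text contains no nested tags. The sample $html is hypothetical.

```php
<?php
// Hypothetical sample input with a single link.
$html = '<td><a href="http://example.com/all-files/files/00000000789/">Click Here</a></td>';

// Simplified illustrative pattern: 'path' is the quoted href value, 'name' the link text.
$pattern = '#<a\s[^>]*href\s*=\s*["\'](?<path>[^"\']+)["\'][^>]*>(?<name>[^<]*)</a>#i';

// PREG_SET_ORDER groups the results per match, so each entry has its own 'path' and 'name'.
preg_match_all($pattern, $html, $hrefs, PREG_SET_ORDER);

foreach ($hrefs as $match) {
    echo $match['path'], ' => ', $match['name'], "\n";
}
```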

answered Apr 11, 2015 at 12:10

Andrew W

