2

I am trying to extract multiple URLs from an HTML file with a regex. There are other URLs in the file, so the only pattern I have to go on is "tableentries." at the beginning and "</a>" after the URL.

HTML code example:

<tr class="tableentries2">
  <td>
    <a href="http://example.com/all-files/files/00000000789/">Click Here</a>
  </td>
</tr>

PHP I wrote:

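$html = "value of the code above"; // placeholder for the HTML shown above
// "#" is the delimiter so the "/" in "</td>" does not end the pattern;
// the "s" flag lets "." match newlines and ".*?" keeps each match to one cell.
if (preg_match_all('#<td>.*?</td>#s', $html, $match)) {
    foreach ($match[0] as $x) {
        echo $x . "<br>";
    }
}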
3
  • What is your question exactly? What does this code get you? Why does it not work? Commented Nov 16, 2011 at 18:45
  • Quotes are missing around your HTML attributes. <tr class="tableentries2"> ... <a href="http://example.com/..."> (edited your question) Commented Nov 16, 2011 at 18:46
  • Maybe use a DOM parser like simplehtmldom.sourceforge.net Commented Nov 16, 2011 at 18:48

3 Answers

11

Why not just look for href values? (Updated because the edited code now has quotation marks.)

preg_match_all('/href="([^\s"]+)/', $html, $match);

The captured URIs are then in $match[1]; the first one is $match[1][0].
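If only the links inside rows whose class starts with "tableentries" are wanted, the pattern can be anchored on that row first. The following is a rough sketch, not a robust parser: the sample markup and the extra "Other" link are made up for illustration, and it assumes double-quoted attributes with class as the first attribute of the tr.

```php
<?php
// Hypothetical sample: one link inside a "tableentries" row, one unrelated link.
$html = <<<HTML
<a href="http://example.com/other/">Other</a>
<table>
<tr class="tableentries2">
  <td>
    <a href="http://example.com/all-files/files/00000000789/">Click Here</a>
  </td>
</tr>
</table>
HTML;

// Match a <tr> whose class starts with "tableentries", then lazily skip
// ahead to the next href. The "s" flag lets "." span newlines.
$pattern = '#<tr\s+class="tableentries[^"]*">.*?href="([^"]+)"#s';

preg_match_all($pattern, $html, $match);

// $match[1] holds the captured URLs.
foreach ($match[1] as $url) {
    echo $url, "\n";
}
```

This picks up only the first href after each matching row opening, and a matching row without a link would make the lazy ".*?" run on into later markup, so it is fragile on real pages.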


1 Comment

The problem is that there are also other URLs on the page, so the only pattern I have is "tableentries." at the beginning and "</a>" after the URL. Thanks for helping!
5

You really shouldn't use regex to parse HTML. DOMDocument is actually very easy to use for this type of thing. Here is a simple example.

<?php
error_reporting(E_ALL);
$html = "
<table>
    <tr>
        <td>
            <a href='http://www.test1-1.com'>test1-1</a>
        </td>
        <td>
            <a href='http://www.test1-2.com'>test1-2</a>
        </td>
        <td>
            <a href='http://www.test1-3.com'>test1-3</a>
        </td>
    </tr>
    <tr>
        <td>
            <a href='http://www.test2-1.com'>test2-1</a>
        </td>
        <td>
            <a href='http://www.test2-2.com'>test2-2</a>
        </td>
        <td>
            <a href='http://www.test2-3.com'>test2-3</a>
        </td>
    </tr>
</table>";

$DOM = new DOMDocument();
//load the html string into the DOMDocument
$DOM->loadHTML($html);
//get a list of all <A> tags
$a = $DOM->getElementsByTagName('a');
//loop through all <A> tags
foreach($a as $link){
    //echo out the href attribute of the <A> tag.
    echo $link->getAttribute('href').'<br />';
}
?>

This would output:

http://www.test1-1.com
http://www.test1-2.com
http://www.test1-3.com
http://www.test2-1.com
http://www.test2-2.com
http://www.test2-3.com
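The loop above returns every link in the document. To keep only links inside rows whose class starts with "tableentries", one option is to query the same DOMDocument with DOMXPath. This is a minimal sketch; the sample markup, including the "other" row, is invented for illustration.

```php
<?php
// Hypothetical sample: one "tableentries" row and one unrelated row.
$html = <<<HTML
<table>
  <tr class="tableentries2">
    <td><a href="http://example.com/all-files/files/00000000789/">Click Here</a></td>
  </tr>
  <tr class="other">
    <td><a href="http://example.com/ignored/">Ignored</a></td>
  </tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Select <a> elements with an href that sit anywhere inside a <tr>
// whose class attribute starts with "tableentries".
$xpath = new DOMXPath($dom);
$links = $xpath->query('//tr[starts-with(@class, "tableentries")]//a[@href]');

$urls = [];
foreach ($links as $link) {
    $urls[] = $link->getAttribute('href');
}

print_r($urls);
```

Because the filter is applied to the parsed tree rather than to raw text, it does not depend on quoting style or on whitespace between the tags.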

3 Comments

The problem is that there are also other URLs on the page, so the only pattern I have is "tableentries." at the beginning and "</a>" after the URL. Thanks for helping!
How do you also grab the link titles (e.g. test1-2)?
@thevoipman There is a nodeValue property you can use, something like $link->nodeValue. Here is an example: codepad.viper-7.com/JBsfP1
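As a rough illustration of the nodeValue suggestion (this is not the code behind the codepad link, and the one-link sample is made up), the link text can be read next to the href like this:

```php
<?php
// Hypothetical one-link sample in the style of the answer's table.
$html = "<table><tr><td><a href='http://www.test1-2.com'>test1-2</a></td></tr></table>";

$dom = new DOMDocument();
$dom->loadHTML($html);

// Map each href to the text content of its <a> element.
$pairs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $pairs[$link->getAttribute('href')] = trim($link->nodeValue);
}

foreach ($pairs as $href => $title) {
    echo $href, ' => ', $title, "\n";
}
```

Keying the array by href drops duplicate URLs; if the same URL can appear with different titles, a list of pairs would be the safer shape.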
0
<?php
preg_match_all("#<a\s[^>]*href\s*=\s*[\'\"]??\s*?(?'path'[^\'\"\s]+?)[\'\"\s]{1}[^>]*>(?'name'[^>]*)<#simU", $html, $hrefs, PREG_SET_ORDER);

foreach ($hrefs as $urls) {
    print $urls['path'] . "<br>";
}
?>
