3

I need to capture all links in a given piece of HTML.

Here is a sample:

<div class="infobar">
    ... some code goes here ...
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
    ... some code goes here ...
</div>

I need to get all links inside div.infobar that start with /link/

I tried this:

preg_match_all('#<div class="infobar">.*?(href="/link/(.*?)") .*?</div>#is', $raw, $x);

but it gives me only the first match.

Thanks for any advice.

6
  • Maybe there's an html parser that will do this more easily for you? Commented Jun 23, 2011 at 23:33
  • I am already doing it in two steps: first getting the inside of div.infobar with preg_match, then getting the links with preg_match_all. But since regex offers more flexibility, why shouldn't I use it? I just need a good pattern. I want to know how to accomplish that with just one preg_match_all. Commented Jun 23, 2011 at 23:35
  • You cannot do that with a single regex. You first need to isolate the div and then extract the desired links from it. What the stubby comments are about: you can extract the links more easily with phpQuery or QueryPath, using foreach (qp($html)->find("div.infobar a") as $a) { print $a->attr("href"); }. Using a specific regex is really only appropriate for performance reasons, and only if the input is a known, coherent HTML blob. Commented Jun 23, 2011 at 23:35
  • HTML is not a regular language, so it is unwise to use a regular expression to parse HTML. Commented Jun 24, 2011 at 0:28
  • @stereofrog, fair point; there's no way I can improve upon anubhava's answer for this specific case, and I think a little levity is a fantastic way to show that trying to use the wrong tool for the job can lead to incredible frustration. Commented Jun 24, 2011 at 1:51

4 Answers

7

I would suggest using DOMDocument for this purpose rather than regex. Consider the following simple code:

$content = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/link/another-text">link 2</a>
    <a href="/link/blabla">link 3</a>
    <a href="/link/whassup">link 4</a>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content);

// To hold all your links...
$links = array();

// Get all divs
$divs = $dom->getElementsByTagName("div");
foreach($divs as $div) {
  // Check the class attr of each div
  $cl = $div->getAttribute("class");
  if ($cl == "infobar") {
    // Find all hrefs and append it to our $links array
    $hrefs = $div->getElementsByTagName("a");
    foreach ($hrefs as $href)
       $links[] = $href->getAttribute("href");
  }
}
var_dump($links);

OUTPUT

array(4) {
  [0]=>
  string(15) "/link/some-text"
  [1]=>
  string(18) "/link/another-text"
  [2]=>
  string(12) "/link/blabla"
  [3]=>
  string(13) "/link/whassup"
}
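
The loop above collects every link inside div.infobar, while the question asks only for those starting with /link/. A minimal sketch of one way to add that filter is to query the document with DOMXPath. The helper name infobarLinks and the extra /other/page link in the sample are assumptions made for illustration, and the class test is an exact match, so it would miss something like class="infobar wide":

```php
<?php
// Hypothetical helper (not part of the original answer): returns the href of
// every <a> inside <div class="infobar"> whose href starts with /link/.
function infobarLinks(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Exact class match only; starts-with() is a standard XPath 1.0 function.
    $query = '//div[@class="infobar"]//a[starts-with(@href, "/link/")]';

    $links = array();
    foreach ($xpath->query($query) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

$content = '
<div class="infobar">
    <a href="/link/some-text">link 1</a>
    <a href="/other/page">not wanted</a>
    <a href="/link/blabla">link 3</a>
</div>';

print_r(infobarLinks($content));
```

With this sample, only the two /link/ hrefs should be returned and /other/page is skipped.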
3 Comments

Let's see if the OP still thinks regexes are better :D
What is the difference in execution time between this and regex? I can do this with just two preg_match_all calls.
Execution time will be comparable to (or even better than) regex-based code, but more importantly, DOM-based code will NOT break unexpectedly the way regex code can.
2

I have revised my previous answer. You'll need to do it in two steps:

// The first step grabs the contents of the div.
preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $div);

// The second step grabs all of the links from those contents.
// A separate $div variable keeps the input and the result ($x) apart.
preg_match_all('#href="/link/(.*?)"#is', $div[0], $x);
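
For reference, the two steps can be wrapped into a single function, sketched below. The function name infobarLinksRegex and the sample input are assumptions for illustration. This version captures the whole href rather than only the part after /link/, and it stops at the first </div>, so a nested div inside the infobar would cut the match short:

```php
<?php
// Hypothetical wrapper around the two-step regex approach.
function infobarLinksRegex(string $raw): array
{
    // Step 1: isolate the contents of the first div.infobar.
    if (!preg_match('#<div class="infobar">(.*?)</div>#is', $raw, $div)) {
        return array();
    }

    // Step 2: collect every href inside it that starts with /link/.
    preg_match_all('#href="(/link/[^"]*)"#i', $div[1], $m);
    return $m[1];
}

$raw = '<a href="/link/outside">x</a>'
     . '<div class="infobar">'
     . '<a href="/link/some-text">link 1</a> '
     . '<a href="/other">no</a> '
     . '<a href="/link/whassup">link 4</a>'
     . '</div>';

print_r(infobarLinksRegex($raw));
```

The link outside the div and the /other link are both ignored, because the second pattern only ever sees the text captured in step 1.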

2 Comments

Thanks, but this time it gets only the last one :D
I split it into two steps. The div gets matched the first time, and then can't be matched again.
2

http://simplehtmldom.sourceforge.net/ :

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 

0

Try this (I added a +):

preg_match_all('#<div class="infobar">.*?(href="/link/(?:.*?)")+ .*?</div>#is', $raw, $x);
