1

I am working with simple web crawler. Below is simple html code i used to learn.

input.php

<ul id="nav">
    <li>
        <a href="www.google.com">Google</a>
        <ul>
            <li>
                <a href="mail.gmail.com">Gmail</a>
            </li>
        </ul>
    </li>
    <li>
        <a href="www.yahoo.com">Yahoo</a>
        <ul>
            <li>
                <a href="mail.yahoo.com">Yahoo Mail</a>
            </li>
        </ul>
    </li>
</ul>

I need to crawl the first anchor tag in ul[id=nav]->li. The code i used to crawl input.php is

<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');

    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            echo $navUL_LI->find('a',0)->outertext."<br>";              
        }
    }
?>

It Displays all the anchor tag in my input.php. I need to display only google and yahoo. How can i achieve this?

6 Answers 6

1

In this case you can directly point it out with children() method. Example:

foreach($html->find('ul#nav') as $ul) {
    foreach($ul->children() as $li) {
        echo $li->children(0)->outertext . '<br/>';
    }
}

Alternatively, you can use DOMDocument + DOMXpath for this too:

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXpath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');

foreach($links as $a) {
    echo $a->nodeValue . '<br/>';
}
Sign up to request clarification or add additional context in comments.

Comments

1
<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');

    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            if(strpos($navUL_LI,'google')||strpos($navUL_LI,'google')){
                echo $navUL_LI->find('a',0)->outertext."<br>";
                       }

        }
    }
?>

Comments

0

i have done the same work in Objective-c.

You can use the XML or HTML api's to serialize your html object.

If you want to do this form cold hand... find open tag and the close tag.

After this get first child, then the second and so on...

Comments

0

Try this:

// get the children of the element #nav, i.e. the top level lis
$lis = $html->getElementById("#nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
    $a = $li->find('a',0);
    // retrieve the link text itself.
    echo "link text: " . $a->innertext() . "\n";
}

See the simple-html-dom manual for details of all these methods.

Comments

0

you can simply achieve that by:

<?php
      foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            echo $navUL_LI->find('a',-2)->outertext."<br>";              
        }
    }
?>

3 Comments

Can you explain me what does the parameter -2 do in the method find()?
@DharanBro -2 parameter find latest two anchor and returns element object or null if not found (zero based).. for more info: simplehtmldom.sourceforge.net/manual.htm If it helps you, please do upVotes :)
I have n number of anchor tags..In such case how can i do this??
0
<?php
$in = '<style>      .catalog-product-view .product.attribute.overview ul {         margin-top: 10px;     } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';

function parseTags($input, $callback) {
    $len = strlen($input);
    $stack = [];

    $tag = "";
    $data = "";
    $isTag = false;
    $isString = false;
    for ($i=0; $i<$len; $i++) {
       $char = $input[$i];
       if ($char == '<') {
           $isTag = true;
           $tag .= $char;
       } else if ($char == '>') {
           $tag .= $char;
           if (substr($tag, 0, 2) == '</') {
               $close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 1)[0]));
               $open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 1)[0]));
               if ($open == $close) {
                   $callback($tag, $data, $stack, $i, false);
                   array_pop($stack);
               }
           } else if (substr($tag, -2) == '/>') {
               $callback($tag, $data, $stack, $i, false);
           } else {
               $callback($tag, $data, $stack, $i, true);
               $stack[] = $tag;
           }
           $tag = "";
           $data = "";
           $isTag = false;
       } else if ($char == '"' || $char == "'") {
           if ($isString == false) {
               $isString = $char;
           } else if ($isString == $char && $input[$i-1] != '\\') {
               $isString = false;
           }
       } else if ($isTag) { 
           $tag .= $char; 
       } else { 
           $data .= $char; 
       }
    }
}

parseTags($in, function($tag, $data, $stack, $position, $isOpen) use (&$out) {
    print_r(func_get_args());
});

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.