
I'm scraping data from a site that uses pagination, and I need to get all the pages. The link to the next page is given in a link tag with rel=next. If there are no more pages, the link tag is missing. I created a function called getAll which should call itself again and again for as long as the link tag is present.

function getAll($url, &$links) {
    $dom = file_get_html ($url); // create dom object from $url
    $tmp = $dom->find('link[rel=next]', 0); // find link rel=next
    if(is_object($tmp)){ // is there the link tag?
        $link = $tmp->getAttribute('href'); // get url of next page - href attribute
        $links[] = $link; // insert url into array
        getAll($link, $links); // call self
    }else{
        return $links; // there are no more urls, return the array
    }
}

// usage
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links); // dump the links

But I have a problem: when I run the script, the message "No data received" appears in Chrome. I get no error message or any other hint about what went wrong. The function itself should work, because when it doesn't call itself it returns one link, the one to the second page.

I think the problem is bad syntax or incorrect use of the reference parameter.

Could you please help me?

  • Your function is getAll, but you call getLinks inside it and in the initial call. Commented Dec 2, 2013 at 18:46
  • Sorry, I renamed it while I was writing this. Edited. That isn't the problem. Commented Dec 2, 2013 at 18:49

2 Answers

I don't know what file_get_html or find should do, but this should work:

<?php

function getAll($url, &$links) {
    // Load the page into a DOMDocument
    $dom = new DOMDocument();
    $dom->loadHTML(file_get_contents($url));

    // Look through every <link> element for rel="next"
    $linkElements = $dom->getElementsByTagName('link');
    foreach ($linkElements as $element) {
        if ($element->hasAttribute('rel') && $element->getAttribute('rel') === 'next') {
            $nextURL = $element->getAttribute('href');
            $links[] = $nextURL;      // remember the next page's URL
            getAll($nextURL, $links); // recurse; stops when a page has no rel="next" link
        }
    }
}

$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links);
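
DOMDocument::loadHTML() reports malformed markup as PHP warnings, and real-world pages often contain some. Below is a minimal sketch of one way to collect those parse errors internally instead of printing them, using libxml_use_internal_errors(). The helper name findNextHref and the sample markup are assumptions made for illustration; they are not part of the code above.

```php
<?php
// Hypothetical helper: return the href of the first <link rel="next">, or null.
function findNextHref(string $html): ?string {
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('link') as $element) {
        if ($element->getAttribute('rel') === 'next') {
            return $element->getAttribute('href');
        }
    }
    return null;
}

// Sample markup with a stray closing </a> tag (made up for this example).
$html = '<html><head><link rel="next" href="/page/2"></head>'
      . '<body></a><p>content</p></body></html>';

var_dump(findNextHref($html));
```

The parse errors remain available through libxml_get_errors() if they are needed for logging; the sketch simply discards them.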

2 Comments

Warning: DOMDocument::loadHTML(): Unexpected end tag : a in Entity, line: 1 in E:\var\www\_github\BEOWULF\index.php on line 6
I suppose that's a problem in the zbozi.cz source, right? Your solution works well, thank you!

Firstly, debugging this could be easier. Without an error message the cause could be anything from a DNS error to a corrupted space character inside your file. So if you haven't already, try adding this to the top of your script:

error_reporting(E_ALL);
ini_set("display_errors", "1");

It should reveal any error that might have taken place. But if that doesn't work I have two ideas:

You can't have a syntax error because then the script wouldn't even run. You said that removing the recursion yielded a result so the script must work.

One possibility is that it's timing out. This depends on the server configuration. Try adding

echo $url, "<br>";
flush();

to the top of getAll. If any of the links are printed before the script dies, a timeout is the likely cause. The limit can be lifted by calling set_time_limit(0).

Another possibility is a connection error. This could be a transient failure or a server configuration limit. I can't be certain, but I know some hosting providers limit file_get_contents and curl requests. It's possible your scripts are limited to one external request per execution.
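
To tell a failed request apart from a page that simply has no next link, the return value of file_get_contents can be checked, since it returns false on failure. This is only a sketch: the wrapper name fetchPage is hypothetical, and the demo reads a local path that is assumed not to exist, so it runs without network access.

```php
<?php
// Hypothetical wrapper: returns the response body, or null when the request fails.
function fetchPage(string $url): ?string {
    // @ silences the PHP warning; the failure is reported explicitly below.
    $html = @file_get_contents($url);

    if ($html === false) {
        $error = error_get_last();
        $reason = $error['message'] ?? 'unknown error';
        echo "Request for $url failed: $reason\n";
        return null;
    }
    return $html;
}

// Demo: a path that should not exist stands in for an unreachable URL.
$missing = sys_get_temp_dir() . '/no-such-page-' . uniqid() . '.html';
var_dump(fetchPage($missing));
```

If the first request succeeds and the second one prints a failure message, that points to a request limit or connection problem and not to the recursion itself.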

Besides that, there is nothing I can think of that could really go wrong with your script. You could remove the recursion and run the function in a while loop, but unless you expect a lot of pages there is no need for such a modification.
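
For reference, the while-loop variant could look roughly like the sketch below. It uses DOMDocument in place of the library from the question, and several things in it are assumptions for illustration: the function name collectNextLinks, the $fetch callable (which stands in for the HTTP request so the example runs offline), the $maxPages cap, and the example.test URLs. It also assumes the href values are absolute URLs.

```php
<?php
/**
 * Follow rel="next" links iteratively and return the URLs found.
 * $fetch is a hypothetical callable: it takes a URL and returns HTML, or null on failure.
 */
function collectNextLinks(string $startUrl, callable $fetch, int $maxPages = 100): array {
    $links = [];
    $seen = [$startUrl => true]; // guards against pages that link back in a cycle
    $url = $startUrl;

    while (count($links) < $maxPages) {
        $html = $fetch($url);
        if ($html === null || $html === '') {
            break; // request failed or empty page
        }

        // Parse leniently: collect markup errors instead of emitting warnings.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Find the first <link rel="next"> on the page.
        $next = null;
        foreach ($dom->getElementsByTagName('link') as $element) {
            if ($element->getAttribute('rel') === 'next') {
                $next = $element->getAttribute('href');
                break;
            }
        }

        if ($next === null || $next === '' || isset($seen[$next])) {
            break; // last page reached, or a URL that was already visited
        }
        $seen[$next] = true;
        $links[] = $next;
        $url = $next;
    }
    return $links;
}

// Demo with in-memory pages instead of real HTTP requests.
$pages = [
    'http://example.test/p1' => '<html><head><link rel="next" href="http://example.test/p2"></head><body></body></html>',
    'http://example.test/p2' => '<html><head><link rel="next" href="http://example.test/p3"></head><body></body></html>',
    'http://example.test/p3' => '<html><head></head><body>last page</body></html>',
];
$fetch = function (string $url) use ($pages): ?string {
    return $pages[$url] ?? null;
};

print_r(collectNextLinks('http://example.test/p1', $fetch));
```

With real pages, $fetch could simply wrap file_get_contents. The loop keeps memory use flat regardless of how many pages there are, and the $seen check prevents an endless run if two pages point at each other.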

And finally, the library you are using for DOM parsing will either return a DOM element object or null. So you can change if(is_object($tmp)){ to if($tmp){. And since you are passing the result by reference, returning a value is pointless. You can safely remove the else statement.

I wish you good luck.

1 Comment

Thank you, I will use your advice in future debugging :)
