PHP pointers - no data received

Question

I'm mining data from a site that uses a paginator, and I need to get all pages. The link to the next page is written in a link tag with rel=next. If there are no more pages, the link tag is missing. I created a function called getAll which should call itself again and again as long as the link tag is present.

function getAll($url, &$links) {
    $dom = file_get_html ($url); // create dom object from $url
    $tmp = $dom->find('link[rel=next]', 0); // find link rel=next
    if(is_object($tmp)){ // is there the link tag?
        $link = $tmp->getAttribute('href'); // get url of next page - href attribute
        $links[] = $link; // insert url into array
        getAll($link, $links); // call self
    }else{
        return $links; // there are no more urls, return the array
    }
}

// usage
$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links); // dump the links

But I have a problem: when I run the script, the message "No data received" appears in Chrome. I have no idea what the error is. The function should work, because when it doesn't call itself again it returns one link, to the second page.

I think the problem is in bad syntax or bad pointer usage.

Could you please help me?

Your function is getAll, but you call getLinks inside it and initially.
– Tim Withers, Commented Dec 2, 2013 at 18:46
Sorry, I renamed it while I was writing this. Edited. That isn't the problem.
– Northys, Commented Dec 2, 2013 at 18:49

Harri · Accepted Answer · 2013-12-02 20:14:26Z


I don't know what file_get_html or find should do, but this should work:

<?php

function getAll($url, &$links) {
    // Load the page into PHP's built-in DOM parser
    $dom = new DOMDocument();
    $dom->loadHTML(file_get_contents($url));
    // Look through every <link> element for rel="next"
    $linkElements = $dom->getElementsByTagName('link');
    foreach ($linkElements as $element) {
        if ($element->hasAttribute('rel') && $element->getAttribute('rel') === 'next') {
            $nextURL = $element->getAttribute('href');
            $links[] = $nextURL; // collect the URL of the next page
            getAll($nextURL, $links); // recurse into the next page
        }
    }
}

$links = array();
getAll('http://www.zbozi.cz/vyrobek/apple-iphone-5/', $links);
print_r($links);


2 Comments

Northys Over a year ago

Warning: DOMDocument::loadHTML(): Unexpected end tag : a in Entity, line: 1 in E:\var\www\_github\BEOWULF\index.php on line 6

Northys Over a year ago

I suppose that's a problem in the zbozi.cz source, right? Your solution works well, thank you!
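
The warning quoted above comes from libxml complaining about malformed markup in the fetched page. A minimal sketch of one way to keep such warnings out of the output, assuming the built-in DOM extension is available: libxml_use_internal_errors(true) makes libxml collect parse errors internally instead of raising PHP warnings. The helper name findNextLink and the example.com markup are made up for illustration and are not part of the answer.

```php
<?php
// Hypothetical helper: returns the href of the first <link rel="next">, or null.
function findNextLink(string $html): ?string
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();                 // discard the collected parse errors
    libxml_use_internal_errors($previous); // restore the earlier setting

    foreach ($dom->getElementsByTagName('link') as $element) {
        if ($element->getAttribute('rel') === 'next') {
            $href = $element->getAttribute('href');
            return $href !== '' ? $href : null;
        }
    }
    return null;
}

// Deliberately malformed markup (stray </a>), used only as sample input.
$html = '<html><head><link rel="next" href="http://example.com/?page=2"></head>'
      . '<body><p>text</a></p></body></html>';
var_dump(findNextLink($html));
```

The parse still succeeds on sloppy HTML either way; this only changes whether the complaints are printed.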

Metod Medja · Accepted Answer · 2013-12-02 20:02:39Z

First, debugging this could be easier. Without an error message the cause could be anything from a DNS error to a corrupted space character inside your file. So if you haven't already, try adding this to the top of your script:

error_reporting(E_ALL);
ini_set("display_errors", "1");

It should reveal any error that might have taken place. But if that doesn't work I have two ideas:

You can't have a syntax error because then the script wouldn't even run. You said that removing the recursion yielded a result so the script must work.

One possibility is that it's timing out. This depends on the server configuration. Try adding

echo $url, "<br>";
flush();

to the top of getAll. If some of the links are printed before the page fails, a timeout is the likely cause. You can lift the limit by calling a function like set_time_limit(0).

Another possibility is a connection error. This could be caused by coincidence or a server configuration limit. I can't be certain but I know some hosting providers limit file_get_contents and curl requests. There is a possibility your scripts are limited to one external request per execution.
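
To tell a failed request apart from an empty page, a rough sketch like the following may help. It assumes file_get_contents is used for fetching; the helper name fetchPage and the 10-second default are made up for illustration. It sets a timeout through a stream context and logs the last PHP error when the call returns false.

```php
<?php
// Hypothetical helper: returns the response body, or null if the request failed.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // The 'timeout' option of the http wrapper is given in seconds.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // @ hides the warning; the failure is detected via the false return value.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        $error = error_get_last();
        error_log('Request failed for ' . $url . ': ' . ($error['message'] ?? 'unknown error'));
        return null;
    }
    return $html;
}
```

If the second and later requests consistently come back as null while the first succeeds, that would point toward a per-script request limit rather than a bug in the recursion.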

Besides that, I can't think of anything else that could really go wrong with your script. You could remove the recursion and run the function in a while loop, but unless you expect a lot of pages there is no need for such a modification.
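
For reference, the while-loop variant could look roughly like this. It is a sketch under assumptions, not a drop-in replacement: it uses the built-in DOMDocument rather than the library from the question, and the names collectNextLinks, $fetch and $maxPages are hypothetical. The page source is passed in as a callable so the loop can be tried against in-memory pages, and a "seen" set plus a page cap guard against a site whose rel=next links loop back on themselves.

```php
<?php
// Hypothetical: $fetch is any callable returning the HTML for a URL, or null on failure.
function collectNextLinks(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $links = [];
    $seen = [$startUrl => true]; // URLs already visited, to avoid endless loops
    $url = $startUrl;

    while (count($links) < $maxPages) {
        $html = $fetch($url);
        if ($html === null || $html === '') {
            break; // request failed or returned nothing
        }

        $previous = libxml_use_internal_errors(true); // silence malformed-HTML warnings
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $next = null;
        foreach ($dom->getElementsByTagName('link') as $element) {
            if ($element->getAttribute('rel') === 'next') {
                $next = $element->getAttribute('href');
                break;
            }
        }
        if ($next === null || $next === '' || isset($seen[$next])) {
            break; // last page reached, or the link points to a page already seen
        }

        $seen[$next] = true;
        $links[] = $next;
        $url = $next;
    }
    return $links;
}

// Usage with made-up in-memory pages instead of real HTTP requests.
$pages = [
    'http://example.com/p1' => '<html><head><link rel="next" href="http://example.com/p2"></head><body></body></html>',
    'http://example.com/p2' => '<html><head><link rel="next" href="http://example.com/p3"></head><body></body></html>',
    'http://example.com/p3' => '<html><head></head><body></body></html>',
];
$fetch = function (string $url) use ($pages): ?string {
    return $pages[$url] ?? null;
};
print_r(collectNextLinks('http://example.com/p1', $fetch));
```

In real use, $fetch would wrap file_get_contents or curl. The loop keeps only one parsed document alive at a time, whereas the recursive version holds one per page until the last call returns.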

And finally, the library you are using for DOM parsing will either return a DOM element object or null. So you can change if(is_object($tmp)){ to if($tmp){. And since you are passing the result by reference, returning a value is pointless. You can safely remove the else statement.

I wish you good luck.
