
I have implemented a web crawler that retrieves content from the .edu TLD. The HTML of each page is inserted into MySQL tables as the page's source code. With a large number of seed URLs, the script can run for hours on a decent internet connection. My problem is that the script halts after crawling a number of links without reporting any errors. I have used exception handling to deal with the "MySQL server has gone away" error, which has already eliminated a lot of problems, and I have added if conditions that echo errors when they are encountered. However, I am not getting any errors; the script simply halts, whether I run it in the browser, in Eclipse PDT, or from the CLI. It is worth noting that the number of links crawled differs somewhat between the three methods. I have altered max_execution_time and other php.ini directives, but this has not helped in any way.

I have coded the script so that it resumes crawling from where it halted, but I want it to continue without halting so that I don't have to monitor whether it is still running.
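For reference, a long-running crawl usually needs its runtime settings applied at the top of the script itself rather than only in php.ini (CLI and Apache may read different ini files). This is a minimal sketch; the functions are standard PHP, but the log file path is illustrative:

```php
<?php
// Disable the execution time limit entirely for this run.
set_time_limit(0);

// Keep running even if the browser that launched the script disconnects.
ignore_user_abort(true);

// Surface every notice and warning instead of failing silently.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Also log to a file, so browser, Eclipse, and CLI runs leave the same trail.
ini_set('log_errors', '1');
ini_set('error_log', __DIR__ . '/crawler_errors.log');  // illustrative path
```

Applying these in-script rules out the possibility that the halt is simply a per-SAPI time limit that the php.ini edit never reached.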

Should I make changes to my Apache httpd.conf file? If yes, what should those settings be?

The descriptions of my web crawler at these links may help.

This is the code that retrieves the HTML from a URL; it is from simple_html_dom.

function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
    // For SourceForge users: uncomment the next line and comment out the
    // retrieve_url_contents line two lines down, if that is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents, as we want to control the timeout.
    //    $contents = retrieve_url_contents($url);
    if (empty($contents))
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
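Since the log below shows `file_get_contents()` stalling on HTTPS-through-proxy connections, a cURL-based fetch with explicit timeouts is one hedged alternative. This is a sketch, not a drop-in from simple_html_dom: the function name, timeout values, and user-agent string are all illustrative, and the result would replace the `$contents` produced by `file_get_contents()` above before calling `$dom->load($contents)`:

```php
<?php
// Fetch a page with explicit timeouts so one slow or unreachable host
// cannot stall the whole crawl. Returns the body as a string, or false
// on any failure (DNS error, timeout, proxy refusal, etc.).
function fetch_url_curl($url, $connectTimeout = 10, $totalTimeout = 30)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds to establish the connection
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds for the whole transfer
        CURLOPT_USERAGENT      => 'MyEduCrawler/1.0', // illustrative
    ));
    $contents = curl_exec($ch);
    if ($contents === false) {
        // Log the failure instead of halting; the crawl loop can skip this URL.
        error_log('curl failed for ' . $url . ': ' . curl_error($ch));
    }
    curl_close($ch);
    return $contents;
}
```

The key difference from the plain stream wrapper is that every failure mode ends in a logged `false` within a bounded time, rather than an indefinitely blocked read.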

Here is the error log for the following links:

And the crawler stopped after crawling this link:

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:39] PHP Warning: file_get_contents(http://lms.nust.edu.pk) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:54:41] PHP Warning: file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated twice) ...

[01-Jan-2012 22:55:58] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ipo) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:58] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#tto) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:59] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#ilo) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:55:59] PHP Warning: file_get_contents(http://www.nust.edu.pk/usr/oricdic.aspx#mco) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:56:05] PHP Warning: file_get_contents(http://www.nust.edu.pk/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated 18 times) ...

[01-Jan-2012 22:57:33] PHP Warning: file_get_contents(http://www.nust.edu.pk/#ctl00_SiteMapPath1_SkipLink) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:57:33] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 22:57:55] PHP Warning: file_get_contents(http://www.harvard.edu/#skip) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:21] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#undergrad) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:22] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#grad) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:24] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#continue) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 22:58:25] PHP Warning: file_get_contents(http://www.harvard.edu/admissions-aid#summer) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:04] PHP Warning: file_get_contents(http://www.harvard.edu/#) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found

in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

... (same error repeated once) ...

[01-Jan-2012 23:00:11] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:41] PHP Warning: file_get_contents(http://directory.berkeley.edu) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:00:47] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:53] PHP Warning: file_get_contents(http://students.berkeley.edu/uga/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:01:57] PHP Warning: file_get_contents(http://publicservice.berkeley.edu/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:00] PHP Warning: file_get_contents(http://students.berkeley.edu/osl/leadprogs.asp) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:17] PHP Notice: Undefined variable: parts in D:\wamp\www\crawler1\AbsoluteUrl\url_to_absolute.php on line 330

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:25] PHP Warning: file_get_contents(http://bearfacts.berkeley.edu/bearfacts) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents() [streams.crypto]: this stream does not support SSL/crypto in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

[01-Jan-2012 23:02:28] PHP Warning: file_get_contents(http://career.berkeley.edu/) [function.file-get-contents]: failed to open stream: Cannot connect to HTTPS server through proxy in D:\wamp\www\crawler1\simplehtmldom_1_5\simple_html_dom.php on line 72

And this is the error log from php-cgi.exe:

Problem signature:
  Problem Event Name:   APPCRASH
  Application Name: php-cgi.exe
  Application Version:  5.3.8.0
  Application Timestamp:    4e537939
  Fault Module Name:    php5ts.dll
  Fault Module Version: 5.3.8.0
  Fault Module Timestamp:   4e537a04
  Exception Code:   c0000005
  Exception Offset: 0000c793
  OS Version:   6.1.7601.2.1.0.256.48
  Locale ID:    1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

Please help me in this regard.

  • Did you set error_reporting() and display_errors? Did you change recursion to a flat list in your code? Commented Jan 1, 2012 at 22:59
  • Yes, I did change it. And I have made sure that the MySQL server errors are eliminated. Commented Jan 1, 2012 at 23:01
  • Don't use file_get_contents(); use cURL to get web page contents, because cURL is better suited to tasks like this. Insert error_log('what is going on') into your code and try again to see exactly where the script crashes. You could also dump memory usage into the error log. Commented Jan 1, 2012 at 23:22
  • @piotrekkr How am I supposed to change the code at this point to use cURL instead of simple_html_dom? Commented Jan 1, 2012 at 23:29
  • @piotrekkr Yeah, this is obviously my code, but I am comfortable with simple_html_dom and have never used cURL; I would need to alter a lot of code for that. If there are no other alternatives, can you explain what I need to know about cURL? And looking at the log file, can you suggest what is causing this? Commented Jan 1, 2012 at 23:35
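Picking up the error_log() suggestion from the comments above, logging the URL and memory usage before each fetch means the last log line pinpoints where a silent crash occurred. A minimal sketch, with a hypothetical `$queue` standing in for the crawler's actual link list:

```php
<?php
// Hypothetical work queue; in the real crawler this would be the list
// of links pulled from MySQL.
$queue = array('http://www.example.edu/');

foreach ($queue as $url) {
    // Log before fetching: if the process dies mid-fetch, this line
    // identifies the URL and the memory level at the time of the crash.
    error_log(sprintf('fetching %s (mem: %d bytes, peak: %d bytes)',
        $url, memory_get_usage(true), memory_get_peak_usage(true)));
    // ... fetch and parse $url here ...
}
```

A steadily climbing peak-memory figure in the log would also point at a leak (e.g. simple_html_dom objects not being freed with `$dom->clear()`), which can kill a process without a PHP-level error.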

1 Answer


You should check the call stack of the PHP process (if running as CGI or CLI) or of the Apache httpd process (if running as mod_php).

Then you will see in which module/procedure execution halted. You can also check the active TCP/IP connections made by your script; maybe there is an ongoing I/O operation that caused it to halt.

I hope this helps.


9 Comments

Can you please elaborate further, especially on the TCP/IP connection part?
This depends on which OS your server is running. Just a note: do you have this problem with all sites you try to crawl, or just specific ones?
Well, the trend of halting is somewhat constant, I must say: the crawling stops after a certain link, but when I refresh the script in the browser, it resumes. And there are one or more such links in nearly every site. However, no exception is thrown; otherwise I would have known which link caused it. I am still unclear whether the problem is with the links or somewhere in the configuration. I am waiting for the log to complete and will post it here. It would be very kind of you to check it out.
Are you connecting through an HTTP proxy? Which function are you using to contact the URL, fopen? Maybe you should set shorter timeout values.
I think you should check the documentation of simple_html_dom, especially regarding "context"; I think there should be settings such as a timeout. Try setting it to one second or something like that. By the way, did you try to debug it? You can download a Zend Studio trial (if you don't have it) and debug it there. Then you can profile your code and find the bottleneck. I hope this helps you.
