I'm trying to scrape a search form with cURL (via PHP). I thought everything was correct, or close to it, but that doesn't seem to be the case. For background, I'm trying to collect data from a search form where the user enters a date range and then submits the search. The results are then shown below the search inputs. The page uses AJAX/JavaScript to load the data.

When I run the PHP script, I get nothing back. I've added print_r to see the results, but nothing shows.

Here's my script. All advice is welcome.

<?php
    class Scraper {

        // Class constructor method
        function __construct() {
            $this->useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';
            $handle = fopen('cookie.txt', 'w') or exit('Unable to create or open cookie.txt file.'."\n");   // Opening or creating cookie file
            fclose($handle);    // Closing cookie file
            $this->cookie = 'cookie.txt';    // Setting a cookie file to store cookie
            $this->timeout = 30; // Setting connection timeout in seconds
        }

        // Method to search and scrape search details
        public function scrapePersons($searchString = '') {

            $searchUrl = 'https://virre.prh.fi/novus/publishedEntriesSearch';

            $postValues = array(
                'businessId' => '',
                'startDate' => '07072016',
                'endDate' => '08072016',
                'registrationTypeCode' => 'kltu.U',
                '_todayRegistered' => 'on',
                'domicileCode' => '091',
                '_domicileCode' => '1',
                '_eventId_search' => 'Search',
                'execution' => 'e2s1',
                '_defaultEventId' => '',
            );

            $search = $this->curlPostFields($searchUrl, $postValues);

            return $search;
        }

        // Method to make a POST request using form fields
        public function curlPostFields($postUrl, $postValues) {
            $_ch = curl_init(); // Initialising cURL session

            // Setting cURL options
            curl_setopt($_ch, CURLOPT_SSL_VERIFYPEER, FALSE);   // Prevent cURL from verifying SSL certificate
            curl_setopt($_ch, CURLOPT_FAILONERROR, TRUE);   // Script should fail silently on error
            curl_setopt($_ch, CURLOPT_COOKIESESSION, TRUE); // Use cookies
            curl_setopt($_ch, CURLOPT_FOLLOWLOCATION, TRUE);    // Follow Location: headers
            curl_setopt($_ch, CURLOPT_RETURNTRANSFER, TRUE);    // Returning transfer as a string
            curl_setopt($_ch, CURLOPT_COOKIEFILE, $this->cookie);    // Setting cookiefile
            curl_setopt($_ch, CURLOPT_COOKIEJAR, $this->cookie); // Setting cookiejar
            curl_setopt($_ch, CURLOPT_USERAGENT, $this->useragent);  // Setting useragent
            curl_setopt($_ch, CURLOPT_URL, $postUrl);   // Setting URL to POST to
            curl_setopt($_ch, CURLOPT_CONNECTTIMEOUT, $this->timeout);   // Connection timeout
            curl_setopt($_ch, CURLOPT_TIMEOUT, $this->timeout); // Request timeout
            curl_setopt($_ch, CURLOPT_POST, TRUE);  // Setting method as POST
            curl_setopt($_ch, CURLOPT_POSTFIELDS, $postValues); // Setting POST fields (array)

            $results = curl_exec($_ch); // Executing cURL session
            curl_close($_ch);   // Closing cURL session

            return $results;
        }

        // Class destructor method
        function __destruct() {
            // Empty
        }
    }

    $testScrape = new Scraper();   // Instantiating new object

    $data = json_decode($testScrape->scrapePersons());   // Scraping people records
    print_r($data);
?>

1 Answer

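First, I'd check that you are allowed to do this.

Assuming you are, the issue is that the server responds with a security check form. In a browser, that form is submitted automatically by its JavaScript onload handler. cURL doesn't run JavaScript, so you'll need to replicate that submission yourself to make it work. Because the response is HTML rather than JSON, json_decode() returns null and print_r() prints nothing, which is why you see no output.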

The output I get is as follows.

<html>
<head>
  <title>Security Check</title></head>
<body onLoad="document.security_check_form.submit()">
<form name="security_check_form" action="j_security_check" method="POST">
<input type="hidden" value="prhanonymous" name="j_username"/>
<input type="hidden" value="*=AQICr82J28VvM2ljVarKvWv3LuibH7WPDyc8EVKuXdfytXrEv/o/MzMP3KfIEq+1Wda1ICP/pDLJQqniyBaRXTXnJGJCJhi2gVIoM0e+rwGEczxoah2+PsKOEnSI6A9j2MQO6/Q4i/vaXHVA7gfjjH7qvz0Fc+Pr7fPiBtJt+2YF3YghUN39cFhoK2O8mjRwTKORojRwcguq74B8Ttd0xyUlYld68t/mplsWv5npwMfT/wfv2XMidoVmB5k/p2rp3XbwlnHpJI3gJJcb5VV58M7frCB0vricZYv3xrKuco6qpMlX9wJeCqrhrMotY0+lisAvmEDIR3YpobE=" name="j_password"/>
</form>
</body>
</html>
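
As a rough sketch of how the values in that form could be read from the response, the snippet below parses the HTML with PHP's DOM extension. The function name extractSecurityForm() is hypothetical and not part of the original script, and it assumes the form keeps the name security_check_form shown above.

```php
<?php
// Sketch only: extractSecurityForm() is a hypothetical helper.
// It returns the form's action and its named inputs, or null when the
// given HTML does not contain the security check form.
function extractSecurityForm(string $html): ?array
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string.
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $form = $xpath->query('//form[@name="security_check_form"]')->item(0);
    if (!$form instanceof DOMElement) {
        return null; // Some other page, e.g. the real search results.
    }

    // Gather every named input inside the form (j_username, j_password).
    $fields = [];
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return [
        'action' => $form->getAttribute('action'),
        'fields' => $fields,
    ];
}
```

Reading the values on each run, rather than hard-coding them, is the safer choice here because the j_password value looks like a generated token and may differ between sessions.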

2 Comments

Thanks for your reply! You mentioned replicating the above code. Do you mean adding that code to the PHP script, or something else? Forgive me for being daft. My brain is fried.
You'll need to implement the security check in cURL as well. Get the form's action (which shouldn't change) along with the j_username and j_password values, and POST those to it. In the end you'll make two POST requests: the one you have now to run the search, and a second to get past the security check.
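
The second request could look roughly like the sketch below. Both helpers, resolveFormAction() and postForm(), are hypothetical names rather than part of the original script. The sketch assumes the relative action j_security_check resolves against the URL that served the form, and that both requests share the same cookie file. Whether the search must be sent again after the security check is something to confirm against the actual responses.

```php
<?php
// Sketch only: both helpers are hypothetical.

// Resolve a form action against the URL of the page that served the form.
// Handles absolute, root-relative and same-directory actions only.
function resolveFormAction(string $pageUrl, string $action): string
{
    if (preg_match('#^https?://#i', $action)) {
        return $action; // Already absolute.
    }
    $parts = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($action !== '' && $action[0] === '/') {
        return $origin . $action; // Root-relative.
    }
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $action;
}

// POST form fields and return the body, the final URL and any cURL error.
function postForm(string $url, array $fields, string $cookieFile): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        // A URL-encoded string is sent as application/x-www-form-urlencoded,
        // like a plain browser form; passing an array sends multipart/form-data.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // Read cookies from earlier requests.
        CURLOPT_COOKIEJAR      => $cookieFile, // Save cookies for later requests.
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch); // false on failure
    return [
        'body'  => $body,
        'url'   => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'error' => curl_error($ch),
    ];
}

// Outline of the flow (makes live requests, so it is left commented out):
// $first  = postForm($searchUrl, $postValues, 'cookie.txt');
// ...read the action and the j_username/j_password fields from $first['body']...
// $second = postForm(resolveFormAction($first['url'], $action), $fields, 'cookie.txt');
```

Returning the cURL error alongside the body makes a failed request visible instead of silently yielding an empty result.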
