
I bought a book on web scraping with PHP. In it the author logs into https://www.packtpub.com/. The book is out of date, so I can't really test its ideas out, because the webpage has changed since release. This is the modified code I am using, but the logins are unsuccessful, which I concluded from the "Account Options" string not being in the $results variable. What should I change? I believe the error comes from incorrectly specifying destination.

<?php
// Function to submit a form using the cURL POST method
function curlPost($postUrl, $postFields, $successString) {
    $useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5;
       en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';  // Set the user agent of a popular browser
    $cookie = 'cookie.txt';  // Set a cookie file to store the cookie
    $ch = curl_init();  // Initialise the cURL session
    // Set the cURL options
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);  // Prevent cURL from verifying the SSL certificate
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);  // Script should fail silently on error
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);  // Use cookies
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // Follow Location: headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // Return the transfer as a string
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);  // Set the cookie file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);  // Set the cookie jar
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);  // Set the user agent
    curl_setopt($ch, CURLOPT_URL, $postUrl);  // Set the URL to POST to
    curl_setopt($ch, CURLOPT_POST, TRUE);  // Set the method to POST
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));  // Set the POST fields
    $results = curl_exec($ch);  // Execute the cURL session
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo "$httpcode";
    curl_close($ch);  // Close the cURL session
    // Check whether the login was successful by looking for the success string
    if (strpos($results, $successString)) {
        echo "I'm in.";
        return $results;
    } else {
        echo "Nope, sth went wrong.";
        return FALSE;
    }
}

$userEmail = '[email protected]';  // Your email address for site login
$userPass = 'yourpass';  // Your password for site login
$postUrl = 'https://www.packtpub.com';  // URL to POST to
// Form input fields as 'name' => 'value'
$postFields = array(
        'email' => $userEmail,
        'password' => $userPass,
        'destination' => 'https://www.packtpub.com',
        'form_id' => 'packt-user-login-form'
);
$successString = 'Account Options';
$loggedIn = curlPost($postUrl, $postFields, $successString);  // Execute curlPost login and store the results page in $loggedIn

EDIT: post request:

(screenshot of the POST request captured in the browser's network monitor)

I replaced the line

'destination' => 'https://www.packtpub.com'
with    

'op' => 'Login'

added

'form_build_id' => ''

and edited

$postUrl = 'https://www.packtpub.com/register';

since that is the URL I get when I choose "Copy as cURL" and paste it into an editor.

I am still getting the "Nope, sth went wrong." message. I think it is because $successString never ends up in the cURL response in the first place. What is form_build_id supposed to be set to? It changes every time I log in.

  • form_build_id may be a CSRF token. If it is, you will have to make a request to the login page (GET request), then parse the HTML to extract this value. It's likely in a hidden form field. Try to replay the request in Firefox with a blank form_build_id and check the response. Commented Feb 5, 2016 at 16:12
  • It appears form_build_id is a CSRF token. They seem to be using Drupal. I don't have time right now to write the cURL request in PHP. If I have time when I return home I will knock up an example for you. Here's some useful information on what a CSRF token is, and why they are used: owasp.org/index.php/Cross-Site_Request_Forgery_%28CSRF%29 Commented Feb 5, 2016 at 16:16
  • Also note, you've used - instead of _ in the form_id :p Commented Feb 5, 2016 at 16:20
  • ahh, my bad about the hyphen. I went on and inspected logins on other pages. Pretty much all of them use CSRF tokens. Commented Feb 5, 2016 at 19:48
  • The CSRF token isn't difficult to grab. Try phpQuery for parsing the HTML. Personally, I find Python much more elegant for writing web automation tools and haven't really used PHP for this sort of thing in a while. cURL takes some getting used to; I often use it in C/C++, but those are generally for developing specialist tools. Keep in mind you don't always need to pass all of the headers (sometimes, such as with FB, you don't need all of the post data either). Commented Feb 5, 2016 at 21:46
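The approach suggested in the comments (make a GET request to the login page, then pull the hidden form_build_id field out of the HTML) can be sketched with PHP's built-in DOM extension; the field name form_build_id comes from the question, and the helper function name here is illustrative, not from the book:

```php
<?php
// Sketch of the approach from the comments: parse a page's HTML and read
// the value of a hidden input (the likely CSRF token). Uses only PHP's
// built-in DOM extension; extractHiddenField() is a name I made up.
function extractHiddenField($html, $fieldName) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);  // suppress warnings on real-world HTML
    $xpath = new DOMXPath($dom);
    $node = $xpath->query("//input[@name='" . $fieldName . "']")->item(0);
    return $node ? $node->getAttribute('value') : FALSE;
}

// In practice you would fetch the login page first, with CURLOPT_COOKIEJAR
// set so the session cookie and the token stay paired, e.g.:
// $ch = curl_init('https://www.packtpub.com/register');
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
// $html = curl_exec($ch);
// $token = extractHiddenField($html, 'form_build_id');
```
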

2 Answers


The book you're using is old, and Packt Publishing have changed their website. It now includes a CSRF token, and without passing this you will never be able to log in.

I've developed a working solution. It uses pQuery for parsing the HTML. You can install this using Composer, or download the package and include it in your application. If you do the latter, remove the require __DIR__ . '/vendor/autoload.php'; line and replace it with the location of the pQuery package on your system.

To test via the command line simply run: php packt_example.php.

You will also notice that many of the headers are not even required, such as the user agent. I have left these out.

<?php

require __DIR__ . '/vendor/autoload.php';

$email = '[email protected]';
$password = 'mypassword';

# Initialize a cURL session.
$ch = curl_init('https://www.packtpub.com/register');

# Set the cURL options.
$options = [
    CURLOPT_COOKIEFILE      => 'cookies.txt',
    CURLOPT_COOKIEJAR       => 'cookies.txt',
    CURLOPT_RETURNTRANSFER  => 1
];

# Set the options
curl_setopt_array($ch, $options);

# Execute
$html = curl_exec($ch);

# Grab the CSRF token from the HTML source
$dom = pQuery::parseStr($html);
$csrfToken = $dom->query('[name="form_build_id"]')->val();

# Now we have the form_build_id (aka the CSRF token) we can
# proceed with making the POST request to login. First,
# lets create an array of post data to send with the POST
# request.
$postData = [
    'email'         => $email,
    'password'      => $password,
    'op'            => 'Login',
    'form_build_id' => $csrfToken,
    'form_id'       => 'packt_user_login_form'
];


# Convert the post data array to URL encoded string
$postDataStr = http_build_query($postData);

# Append some fields to the CURL options array to make a POST request.
$options[CURLOPT_POST] = 1;
$options[CURLOPT_POSTFIELDS] = $postDataStr;
$options[CURLOPT_HEADER] = 1;

curl_setopt_array($ch, $options);

# Execute
$response = curl_exec($ch);

# Extract the headers from the response
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$headers = substr($response, 0, $headerSize);

# Close cURL handle
curl_close($ch);

# If login is successful, the headers will contain a location header
# to the url http://www.packtpub.com/index
if(!strpos($headers, 'packtpub.com/index'))
{
    print 'Login Failed';
    exit;
}

print 'Logged In';

12 Comments

You should submit errata to that book! :P Thank you!
What is the title and version of the book, and what page does the code example appear on? I'd be interested in submitting an erratum.
Instant PHP Web Scraping. I think there is only one version. The source code is free. packtpub.com/web-development/instant-php-web-scraping-instant
After initial trouble with pquery I eventually managed to run this script. I am getting "Login Failed" message. I echoed out the $headers and there is indeed no packtpub.com/account in there.
I replaced $ch = curl_init('packtpub.com/register'); with $ch = curl_init('packtpub.com');. Also, I think the target location indicating successful login is 'packtpub.com/index'.

I'm posting this answer as I think it may help you in the future when faced with such problems. I do this a lot when I am writing web scrapers.

  1. Open Firefox and press CTRL + SHIFT + Q
  2. Click the Network tab
  3. Go to the website. You will notice the HTTP requests being monitored
  4. Log in successfully while the HTTP requests are being monitored
  5. Once logged in, right-click the HTTP request that logged you in, and choose Copy as cURL.

Now you have the CURL request. Replicate the HTTP request using PHP's cURL. And test again.
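As a sketch of that translation (the URL and field names below are just the ones from the question, with placeholder credentials, and are not verified against the live site), a copied request maps onto PHP's cURL roughly like this:

```php
<?php
// Rough PHP equivalent of a browser "Copy as cURL" login request.
// The URL, field names, and credentials are illustrative and may not
// match the live site.
$postFields = http_build_query([
    'email'    => 'you@example.com',
    'password' => 'yourpass',
    'op'       => 'Login',
    'form_id'  => 'packt_user_login_form',
]);

$ch = curl_init('https://www.packtpub.com/register');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => TRUE,          // return the body instead of printing it
    CURLOPT_POST           => TRUE,
    CURLOPT_POSTFIELDS     => $postFields,
    CURLOPT_COOKIEFILE     => 'cookie.txt',  // reuse cookies from the earlier GET
    CURLOPT_COOKIEJAR      => 'cookie.txt',
    CURLOPT_CONNECTTIMEOUT => 5,
    CURLOPT_TIMEOUT        => 10,
]);
$response = curl_exec($ch);
curl_close($ch);
```

Compare the fields you send against the ones in the captured request; anything the browser sent that you omit (cookies, tokens, hidden fields) is a candidate cause when the replay fails.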

For web scraping you should be very familiar with monitoring HTTP headers. You can use:

  • Network monitor (Chrome, Firefox)

  • Fiddler

  • Wireshark

  • MITMProxy

  • Charles

etc ...

1 Comment

Thank you! Some really useful data. I added an image of what I'm currently observing.
