2

I am trying to scrape a supplier's Magento site to save some time, because there are around 2000 products I need to gather info for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page.

The problem is:

You need to be logged in to view the product page. It's a standard Magento login, so how can I get round this in my screen scraper? I don't require a full script, just advice on a method.

4
  • It is illegal. Ask them to send you a price list in any appropriate format. Commented Jan 4, 2011 at 3:40
  • 1
    It's not illegal. It depends entirely on the source and the statutory permissions you are given, including any terms of use, T&Cs or express permission from the content creator. The screen scrape will be a way of automating product updates. Thanks anyway. Commented Jan 4, 2011 at 4:02
  • Can you give us the URL of the site so we can see the TOS that allows screen scraping? Commented Jan 4, 2011 at 4:13
  • 3
    He doesn't need to show you the site or the TOS. Regardless of whether he is allowed to scrape the site or not, his question is legitimate. He wants to know how to send cookies with the file_get_contents method. Edit: Plus he will learn something about headers and cookies if he doesn't already know how a login works. Commented Jan 4, 2011 at 4:15

2 Answers 2

2

Using stream_context_create you can specify headers to be sent when calling your file_get_contents.

What I'd suggest is: open your browser and log in to the site. Open up Firebug (or your favorite cookie viewer), grab the cookies, and send them with your request.

Edit: Here's an example from PHP.net:

<?php
// Create a stream
$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>
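If the browser shows more than one cookie, they all go into a single Cookie header, separated by "; ". The following is only a sketch of that step: the helper names buildCookieHeader and buildScrapeContext, the cookie name and value, and the URL in the usage comment are placeholders invented for illustration, not anything taken from the supplier's site.

```php
<?php
// Hypothetical helper: turn a name => value map of cookies (copied from the
// browser) into a single Cookie request header line.
function buildCookieHeader(array $cookies): string
{
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs) . "\r\n";
}

// Hypothetical helper: build a stream context that sends those cookies.
function buildScrapeContext(array $cookies)
{
    return stream_context_create(array(
        'http' => array(
            'method' => 'GET',
            'header' => buildCookieHeader($cookies),
        ),
    ));
}

// Usage (placeholder cookie value and URL):
// $context = buildScrapeContext(array('PHPSESSID' => 'abc123'));
// $html = file_get_contents('http://www.example.com/', false, $context);
```

Keeping the cookies in an array means that when the session expires, only the array passed in has to change.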

Edit (2): This is outside the scope of your question, but if you are wondering how to scrape the website afterwards, you could look into the DOMDocument::loadHTML method. Once the page is loaded, you get the functions you need to pull out the data (e.g. getElementsByTagName, getElementById, and XPath queries through DOMXPath).

If you want to scrape something simple, you can also use RegEx with preg_match_all.
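As a rough illustration of both approaches, here is a sketch that parses a small inline HTML string rather than a real product page. The markup and the class names product-name and price are made up for the example; a real Magento theme will use its own structure, so the XPath expressions and the regex would need adjusting.

```php
<?php
// Sample markup standing in for a fetched product page; the class names
// are invented for this example.
$html = '<html><body>'
      . '<h1 class="product-name">Sample Widget</h1>'
      . '<span class="price">$19.99</span>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// XPath queries run through a DOMXPath object bound to the document.
$xpath = new DOMXPath($doc);
$nameNode  = $xpath->query('//h1[@class="product-name"]')->item(0);
$priceNode = $xpath->query('//span[@class="price"]')->item(0);

$name  = $nameNode  ? trim($nameNode->textContent)  : null;
$price = $priceNode ? trim($priceNode->textContent) : null;

// The same price pulled out with a regular expression instead.
preg_match_all('/<span class="price">([^<]+)<\/span>/', $html, $matches);
$regexPrices = $matches[1];
```

The regex version is shorter but breaks as soon as the tag gains another attribute or the markup is reordered, which is why the DOM route tends to be more robust on anything beyond very simple pages.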


3 Comments

Very useful, thank you. I'm using preg_match to parse the page, so that's all I needed. You're a star!
I just thought: I've not looked too closely at how the login works yet, but what if the login is registered using session variables?
If Magento uses sessions, the session id will (in most cases) be stored in a cookie. PHP's default name for it is PHPSESSID, but an application can change that, so use whatever cookie name your browser shows. So basically all you have to do is put 'header' => "Cookie: PHPSESSID=...\r\n" and it should log you in. Keep in mind that sessions can expire, so if you are scraping for a long time you might need to update the cookie eventually.
0

If you're familiar with cURL, this should be relatively simple to do in a day or so. I've created some similar apps that log in to banks to retrieve data, which of course also requires authentication.

Below is a link with an example of how to use cURL with cookies for authentication purposes:

http://coderscult.com/php/php-curl/2008/05/20/php-curl-cookies-example/

If you can grab the output of the page you can parse for your results with a regex. Alternatively, you can use a class like Snoopy to do this work for you:

http://sourceforge.net/projects/snoopy/
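For orientation, a cURL login with a cookie jar might look roughly like the sketch below. It is not taken from the linked article. The function name fetchAsLoggedInUser is hypothetical, and the login URL and form field names in the usage comment are assumptions that have to be checked against the site's actual login form (some sites also require a hidden token field from the form).

```php
<?php
// Hypothetical helper: log in once, then fetch a page on the same handle so
// the session cookie set by the login response is sent with the second request.
function fetchAsLoggedInUser($loginUrl, array $loginFields, $pageUrl, $cookieJar)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect after login
        CURLOPT_COOKIEJAR      => $cookieJar,  // file the cookies are saved to
        CURLOPT_COOKIEFILE     => $cookieJar,  // file the cookies are read from
    ));

    // Step 1: POST the credentials to the login form's action URL.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($loginFields));
    if (curl_exec($ch) === false) {
        curl_close($ch);
        return false;
    }

    // Step 2: switch back to GET and request the protected page.
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $html = curl_exec($ch);
    curl_close($ch);

    return $html;
}

// Usage (every value below is a placeholder or an assumption):
// $html = fetchAsLoggedInUser(
//     'http://www.example.com/customer/account/loginPost/',
//     array('login[username]' => 'me@example.com', 'login[password]' => 'secret'),
//     'http://www.example.com/some-product.html',
//     sys_get_temp_dir() . '/scraper-cookies.txt'
// );
```

Compared with copying a cookie out of the browser by hand, this lets the script log in again by itself when the session expires.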
