PHP Magento Screen Scraping

Question

I am trying to scrape a supplier's Magento site to save some time, because there are around 2000 products I need to gather information for. I'm totally OK with writing a screen scraper for pretty much anything, but I've encountered a major problem. I'm using file_get_contents to gather the HTML of the product page.

The problem is:

You need to be logged in to view the product page. It's a standard Magento login, so how can I get around this in my screen scraper? I don't require a full script, just advice on a method.

It is illegal. Ask them to send you a price list in any appropriate format.
– zerkms, Commented Jan 4, 2011 at 3:40
It's not illegal. It depends entirely on the source and the statutory permissions you are given, including any terms of use, T&Cs, or express permission from the content creator. The screen scraper will be a way of automating product updates. Thanks anyway.
– gunwin, Commented Jan 4, 2011 at 4:02
Can you give us the URL of the site so we can see the TOS that allows screen scraping?
– zerkms, Commented Jan 4, 2011 at 4:13
He doesn't need to show you the site or the TOS. Regardless of whether he is allowed to scrape the site or not, his question is legitimate. He wants to know how to send cookies with the file_get_contents function. Edit: Plus he will learn something about headers and cookies if he doesn't already know how a login works.
– Christian Joudrey, Commented Jan 4, 2011 at 4:15

Community · Accepted Answer · 2023-11-17 19:22:09Z

2

Using stream_context_create you can specify headers to be sent when calling your file_get_contents.

What I'd suggest is: open your browser and log in to the site. Open up Firebug (or your favorite cookie viewer), grab the cookies, and send them with your request.

Edit: Here's an example from PHP.net:

<?php
// Create a stream
$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>

Edit (2): This is outside the scope of your question, but if you are wondering how to scrape the website afterwards, you could look into the DOMDocument::loadHTML method. This gives you the functions you need (e.g. getElementsByTagName and getElementById on the document, and XPath queries through DOMXPath) to scrape what you need.
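As a rough illustration of the DOM approach, here is a minimal sketch. The markup, the class names (product-name, price) and the helper extractProduct are hypothetical and only stand in for whatever the real product page contains.

```php
<?php
// Hypothetical product-page markup; a real Magento page will differ.
$html = <<<'HTML'
<html><body>
  <div class="product-name"><h1>Example Widget</h1></div>
  <span class="price">12.50</span>
</body></html>
HTML;

// Hypothetical helper: pull a name and a price out of one page.
function extractProduct(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $name  = $xpath->query('//div[@class="product-name"]/h1')->item(0);
    $price = $xpath->query('//span[@class="price"]')->item(0);

    return [
        'name'  => $name  ? trim($name->textContent)  : null,
        'price' => $price ? trim($price->textContent) : null,
    ];
}

$product = extractProduct($html);
```

The XPath expressions would need adjusting to the selectors you actually see when you inspect the supplier's page.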

If you want to scrape something simple, you can also use RegEx with preg_match_all.
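For the regex route, a small sketch might look like the following. The listing markup and the pattern are made up for illustration and assume very regular HTML; anything less predictable is usually easier to handle with a DOM parser.

```php
<?php
// Hypothetical listing markup, used only for illustration.
$html = '<li class="item"><a href="/widget-a.html">Widget A</a></li>'
      . '<li class="item"><a href="/widget-b.html">Widget B</a></li>';

// PREG_SET_ORDER groups the captures per match: [full match, href, link text].
preg_match_all('#<a href="([^"]+)">([^<]+)</a>#', $html, $matches, PREG_SET_ORDER);

// Map each product URL to its link text.
$links = [];
foreach ($matches as $m) {
    $links[$m[1]] = $m[2];
}
```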

edited Nov 17, 2023 at 19:22

CommunityBot

answered Jan 4, 2011 at 4:13

Christian Joudrey

3 Comments

gunwin Over a year ago

Very useful, thank you. I'm using preg_match to parse the page; that's all I needed. You're a star!

gunwin Over a year ago

I just thought: I've not looked too closely at how the login works yet, but what if the login is tracked using session variables?

Christian Joudrey Over a year ago

If Magento uses sessions, the session id (in most cases) will be stored in the cookie PHPSESSID. So basically all you have to do is put 'header' => "Cookie: PHPSESSID=...\r\n" and it should log you in. Keep in mind sessions can expire, so if you are scraping for a long time you might need to update the cookie eventually.
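The session-cookie idea, including a check for an expired session, could be sketched roughly as below. The helper names (buildCookieHeader, looksLoggedOut, fetchWithCookies) are hypothetical, and the logged-out check is only a heuristic that assumes the login form's action contains customer/account/loginPost. Use whatever cookie name your browser actually shows for the site; PHPSESSID appears here only because it is the name mentioned above.

```php
<?php
// Build a Cookie header from name => value pairs copied from the browser.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs) . "\r\n";
}

// Heuristic (assumption): if the response contains the login form,
// the session has probably expired.
function looksLoggedOut(string $html): bool
{
    return stripos($html, 'customer/account/loginPost') !== false;
}

// Returns the page HTML, or false if the request failed or the
// cookie appears to need refreshing.
function fetchWithCookies(string $url, array $cookies)
{
    $context = stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => buildCookieHeader($cookies),
        ],
    ]);
    $html = file_get_contents($url, false, $context);
    if ($html === false || looksLoggedOut($html)) {
        return false;
    }
    return $html;
}
```

A long-running scrape could call fetchWithCookies for each product URL and stop as soon as it returns false, so you can paste in a fresh cookie and resume.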

Transition · Accepted Answer · 2011-01-04 04:37:21Z

0

If you're familiar with cURL this should be relatively simple to do in a day or so. I've created some similar apps that log in to banks to retrieve data, which of course also require authentication.

Below is a link with an example of how to use CURL with cookies for authentication purposes:

http://coderscult.com/php/php-curl/2008/05/20/php-curl-cookies-example/
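In case that link is unavailable, here is a minimal sketch of the same idea: log in with cURL, let it keep the cookies, then request the product page on the same handle. The function name loginAndFetch is hypothetical, and the login path and field names are assumptions based on a stock Magento 1 login form, so check them against the site's actual form. Some Magento versions also expect a hidden form_key value from the login page, which this sketch does not handle.

```php
<?php
// Assumptions: login path and field names follow a stock Magento 1
// login form; verify both against the real site before relying on this.
function loginAndFetch(string $baseUrl, string $email, string $password, string $productUrl)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect after login
        CURLOPT_COOKIEFILE     => '',   // empty string enables in-memory cookie handling
    ]);

    // Step 1: post the credentials; the session cookie is stored on the handle.
    curl_setopt($ch, CURLOPT_URL, rtrim($baseUrl, '/') . '/customer/account/loginPost/');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
        'login' => ['username' => $email, 'password' => $password],
    ]));
    if (curl_exec($ch) === false) {
        return false;
    }

    // Step 2: fetch the product page with the same handle (and cookies).
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $productUrl);
    return curl_exec($ch); // HTML string, or false on failure
}
```

Reusing one handle for all product pages avoids logging in again for each of the roughly 2000 requests.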

If you can grab the output of the page you can parse for your results with a regex. Alternatively, you can use a class like Snoopy to do this work for you:

http://sourceforge.net/projects/snoopy/

answered Jan 4, 2011 at 4:37

Transition
