PHP Simple HTML DOM Parser - blocked?

Question

Thanks for looking into this. Though it may be a simple problem, I am too new at scraping pages to understand why this simple code returns 'false'. Most examples I see online use the base URL, but I am trying to scrape a specific product page. Using 'http://www.google.com/' works fine. Could it be that I am being blocked? If so, how would one get around it in PHP? In Python one would rotate User-Agents and proxies. Any nuggets of knowledge will be appreciated. Here is the basic code with the specific link.

require_once($_SERVER['DOCUMENT_ROOT'].'/includes/simple_html_dom.php');

$url = 'https://www.lowes.com/pd/Frigidaire-Gallery-22-cu-ft-Counter-depth-Side-by-Side-Refrigerator-with-Ice-Maker-Fingerprint-Resistant-Black-Stainless-Steel/1000368269';
$html = file_get_html($url);

Thanks guys.

Using file_get_contents instead of file_get_html shows that Lowes is returning a 403 Forbidden; they probably have some anti-scraping technology in place.
– Jacob Mulquin, Commented Oct 19, 2021 at 21:37
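
One way to see that status code for yourself is sketched below. It assumes PHP's HTTP stream wrapper, which fills `$http_response_header` after a `file_get_contents` call. The names `status_from_headers` and `fetch_with_status` are hypothetical helpers made up for this illustration, not part of Simple HTML DOM.

```php
<?php

// Hypothetical helper: pull the numeric status out of the header lines that
// PHP's HTTP wrapper stores in $http_response_header.
function status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 403 Forbidden". After redirects
        // there can be several, so keep the last one seen.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical wrapper: fetch a URL and return [status, body].
function fetch_with_status(string $url): array
{
    // ignore_errors keeps the body of 4xx/5xx responses instead of returning false
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper creates $http_response_header in the local scope
    $headers = $http_response_header ?? [];
    return [status_from_headers($headers), $body];
}
```

Calling `fetch_with_status($url)` on the product URL and printing the first element would show whether the server answered with 403 rather than 200.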

Jacob Mulquin · Accepted Answer · 2021-10-20 19:03:29Z

2

Lowes has some anti-scraping technology in place, so you cannot rely on file_get_html. However, you can fetch the page with PHP's curl functions and then parse the result with str_get_html from Simple HTML DOM.

<?php

require_once($_SERVER['DOCUMENT_ROOT'].'/includes/simple_html_dom.php');

$url = 'https://www.lowes.com/pd/Frigidaire-Gallery-22-cu-ft-Counter-depth-Side-by-Side-Refrigerator-with-Ice-Maker-Fingerprint-Resistant-Black-Stainless-Steel/1000368269';

// From https://gist.github.com/fijimunkii/952acac988f2d25bef7e0284bc63c406
$user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
];

// Get random user agent
$user_agent = $user_agents[rand(0,count($user_agents)-1)];

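$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
$exec = curl_exec($ch); // false if the request itself failed

// Only parse when curl returned a body
$html = ($exec !== false) ? str_get_html($exec) : false;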
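
Because a blocked request can still return a body (an error page), it may help to look at the HTTP status before parsing. The following is a minimal sketch under that assumption. `looks_blocked` and `fetch_page` are hypothetical helper names, and the list of status codes treated as a block is a guess, not something Lowes documents.

```php
<?php

// Hypothetical helper: treat common refusal statuses as a sign of blocking.
function looks_blocked(int $status): bool
{
    return in_array($status, [401, 403, 429], true);
}

// Hypothetical helper: fetch a page with curl and return [status, body].
// The body is false when the transfer itself failed.
function fetch_page(string $url, string $user_agent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_USERAGENT      => $user_agent,
        CURLOPT_TIMEOUT        => 15,          // seconds; arbitrary choice
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$status, $body];
}

// Possible usage with the variables from the answer:
// [$status, $body] = fetch_page($url, $user_agent);
// $html = ($body !== false && !looks_blocked($status)) ? str_get_html($body) : false;
```

Checking the status first avoids handing an "Access Denied" page to the parser as if it were the product page.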

edited Oct 20, 2021 at 19:03

answered Oct 19, 2021 at 21:43

Jacob Mulquin


3 Comments

Jacob Mulquin Over a year ago

That was a typo; it was meant to say file_get_html.

napierjohn Over a year ago

Thanks, mulquin. Thanks for pointing me in the right direction: 'curl'. I looked into curl and several tutorials so I can understand and use it. I added a random user agent and the proxies we subscribe to. I am now getting a response, so you got me past that hurdle to the point where I see 'Access Denied'. Now my goal is to get past that. One edit on your code: the count needs to be 'count($user_agents)-1'. Thanks, mulquin!

Jacob Mulquin Over a year ago

@napierjohn You're welcome, and thanks for the edit; I have updated the answer. Be sure to mark the answer as accepted if you are able :) Good luck with it!
