
Thanks for looking into this. It may be a simple problem, but I am too new at scraping pages to understand why this simple code returns false. Most examples I see online use the base URL, but I am trying to scrape a specific product page. Using 'http://www.google.com/' works fine. Could it be that I am being blocked? If so, how would one get around it in PHP? In Python one would rotate User-Agents and proxies. Any nuggets of knowledge will be appreciated. Here is the basic code with the specific link.

require_once($_SERVER['DOCUMENT_ROOT'].'/includes/simple_html_dom.php');

$url = 'https://www.lowes.com/pd/Frigidaire-Gallery-22-cu-ft-Counter-depth-Side-by-Side-Refrigerator-with-Ice-Maker-Fingerprint-Resistant-Black-Stainless-Steel/1000368269';
$html = file_get_html($url);

Thanks guys.

  • Using file_get_contents instead of file_get_html tells us that Lowes is returning a 403 Forbidden; they probably have some anti-scraping technology at play. Commented Oct 19, 2021 at 21:37
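For anyone who wants to reproduce that check, here is a minimal sketch of reading the status code after a file_get_contents call. The helper name status_from_headers is made up for illustration. It relies on PHP filling $http_response_header after an http(s) request, and on the ignore_errors context option so that the body is returned even for a 4xx response.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from the header
// lines PHP exposes in $http_response_header after an http(s) request.
function status_from_headers(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Redirects produce several status lines; keep the last one seen.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Usage sketch (needs network access, so it is left commented out):
// $context = stream_context_create(['http' => ['ignore_errors' => true]]);
// $body = file_get_contents($url, false, $context);
// echo status_from_headers($http_response_header ?? []);
```

A null result means no status line was found, for example when the request never reached the server.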

1 Answer


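Lowes appears to use some anti-scraping technology, so you cannot rely on file_get_html, which does not send browser-like request headers by default. However, you can make the request with PHP's curl functions, set a browser User-Agent, and then parse the response with str_get_html from Simple HTML DOM.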

<?php

require_once($_SERVER['DOCUMENT_ROOT'].'/includes/simple_html_dom.php');

$url = 'https://www.lowes.com/pd/Frigidaire-Gallery-22-cu-ft-Counter-depth-Side-by-Side-Refrigerator-with-Ice-Maker-Fingerprint-Resistant-Black-Stainless-Steel/1000368269';

// From https://gist.github.com/fijimunkii/952acac988f2d25bef7e0284bc63c406
$user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
];

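// Pick a random user agent
$user_agent = $user_agents[rand(0, count($user_agents) - 1)];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
$exec = curl_exec($ch);

// curl_exec returns false on a transport error (DNS, TLS, timeout)
if ($exec === false) {
    die('cURL error: ' . curl_error($ch));
}

$html = str_get_html($exec);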
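
Since the question also mentions proxies, the request could be wrapped roughly as follows. This is only a sketch: build_curl_options and fetch_page are hypothetical helper names, the "host:port" proxy format is an assumption about your proxy provider, and the 30 second timeout is arbitrary. It also returns the HTTP status, so a 403 or an 'Access Denied' page can be told apart from a successful fetch before parsing.

```php
<?php
// Hypothetical helper: build the cURL options for one request.
// $proxy is an assumed "host:port" string; pass null to connect directly.
function build_curl_options(string $url, string $user_agent, ?string $proxy = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_USERAGENT      => $user_agent,
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 30,    // arbitrary limit, in seconds
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy;
    }
    return $options;
}

// Hypothetical helper: run the request and report the status with the body.
function fetch_page(array $options): array
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $body = curl_exec($ch); // false on a transport error
    return [
        'body'   => $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => curl_error($ch),
    ];
}

// Usage sketch (needs network access, so it is left commented out):
// $page = fetch_page(build_curl_options($url, $user_agent));
// if ($page['body'] !== false && $page['status'] === 200) {
//     $html = str_get_html($page['body']);
// }
```

Keeping the option building separate from the request makes it easy to swap the user agent or proxy per attempt. Note that a changed User-Agent alone may not be enough against this kind of protection.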

3 Comments

That was a typo; it was meant to say file_get_html.
Thanks, mulquin, for pointing me in the right direction: curl. I looked into curl and several tutorials so I can understand and use it. I added a random user agent and the proxies we subscribe to. I am now getting a response, so you got me past that hurdle, to the point where I see 'Access Denied'. Now my goal is to get past that. One edit on your code: the count needs to be 'count($user_agents)-1'. Thanks, mulquin!
@napierjohn You're welcome, and thanks for the edit; I have updated the answer. Be sure to mark the answer as accepted if you are able :) Good luck with it!
