-1

I've got double foreach loop. Script takes urls from one file and tries to find it in html code of pages from another file. Of course that reading so many pages is pretty hard for server so I want to optimize script but how can I do it?

Here is the code:

<?php
$sites_raw = file('https://earnmoneysafe.com/script/sites.txt');
$sites = array_map('trim', $sites_raw);
$urls_raw = file('https://earnmoneysafe.com/script/4toiskatj.txt');
$urls = array_map('trim', $urls_raw);

function file_get_contents_curl($url) {
    $ch = curl_init();
    $config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';

    curl_setopt($curl, CURLOPT_USERAGENT, $config['useragent']);
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);       

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

foreach ($sites as $site){
    $homepage = file_get_contents_curl($site);
    foreach ($urls as $url){
        $needle   = $url;
        if (strpos($homepage, $needle) !== false) {
            echo 'true';
        }
    }
}
?>

9
  • 1
    You could use curl_multi_exec() to fetch all the URLs in parallel. Commented Feb 6, 2023 at 17:11
  • 1
    FYI, if you're using trim() to remove the newlines, you can do that automatically with the FILE_IGNORE_NEW_LINES flag to the file() function. Commented Feb 6, 2023 at 17:12
  • @Barmar I'm newbie with cUrl. I tried to do it with cUrl but got 403 from every page Commented Feb 6, 2023 at 17:13
  • I can't see why the same request would return an error from curl but would work with file_get_contents(). Did you use the same user agent? Post your curl attempt. Commented Feb 6, 2023 at 17:15
  • 1
    Different sites use different techniques to prevent web scraping. They might be using a cookie. Commented Feb 6, 2023 at 17:35

2 Answers 2

1

Use curl_multi_exec() to fetch all the URLs in parallel.

$urls = file('https://earnmoneysafe.com/script/4toiskatj.txt', FILE_IGNORE_NEW_LINES);
$sites = file('https://earnmoneysafe.com/script/sites.txt', FILE_IGNORE_NEW_LINES);
foreach ($sites as $site) {
    $curl_handles[$site] = get_curl($site);
}
$mh = curl_multi_init();
foreach ($curl_handles as $ch) {
    curl_multi_add_handle($mh, $ch);
}

do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

foreach ($curl_handles as $site => $ch) {
    $homepage = curl_multi_getcontent($ch);
    foreach ($urls as $needle) {
        if (strpos($homepage, $needle) !== false) {
            echo 'true';
        }
    }
    curl_multi_remove_handle($mh, $ch);
}

curl_multi_close($mh);
    
function get_curl($url) {
    $ch = curl_init();
    $config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';

    curl_setopt($ch, CURLOPT_USERAGENT, $config['useragent']); // edited  
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);       

    return $ch;
}
Sign up to request clarification or add additional context in comments.

9 Comments

Thank you for reply. But where should I put Urls which I have to find in these pages. In original code I had $urls_raw = file('https://earnmoneysafe.com/script/4toiskatj.txt');
Sorry, I confused $sites and $urls.
I added that to the answer.
I cannot see results though some sites have urls from that file. For example https://www.golo.com/ has https://www.googletagmanager.com/gtag/js in it's code
echo $homepage returns null. Something is wrong with the code
|
0

I think this, This code is cleaner

<?php

const SITES_URL = 'https://earnmoneysafe.com/script/sites.txt';
const URLS_URL = 'https://earnmoneysafe.com/script/4toiskatj.txt';

function readFileLines($url) {
    $file_contents = file_get_contents($url);
    $lines = explode("\n", $file_contents);
    $filtered_lines = array_filter($lines, function($line) {
        return !empty(trim($line));
    });

    return $filtered_lines;
}

function checkSiteUrls($site, $urls) {
    $homepage = file_get_contents($site);
    foreach ($urls as $url) {
        if (strpos($homepage, $url) !== false) {
            echo 'true';
        }
    }
}

$sites = readFileLines(SITES_URL);
$urls = readFileLines(URLS_URL);

foreach ($sites as $site) {
    checkSiteUrls($site, $urls);
}

?>

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.