Optimize a nested foreach loop in PHP

Question

I have a nested foreach loop. The script reads URLs from one file and searches for each of them in the HTML of the pages listed in another file. Fetching that many pages puts a heavy load on the server, so I want to optimize the script. How can I do that?

Here is the code:

<?php
$sites_raw = file('https://earnmoneysafe.com/script/sites.txt');
$sites = array_map('trim', $sites_raw);
$urls_raw = file('https://earnmoneysafe.com/script/4toiskatj.txt');
$urls = array_map('trim', $urls_raw);

function file_get_contents_curl($url) {
    $ch = curl_init();
    $config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';

    curl_setopt($ch, CURLOPT_USERAGENT, $config['useragent']); // was $curl, which is undefined here
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);       

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

foreach ($sites as $site){
    $homepage = file_get_contents_curl($site);
    foreach ($urls as $url){
        $needle   = $url;
        if (strpos($homepage, $needle) !== false) {
            echo 'true';
        }
    }
}
?>

You could use curl_multi_exec() to fetch all the URLs in parallel.
– Barmar, Commented Feb 6, 2023 at 17:11
FYI, if you're using trim() to remove the newlines, you can do that automatically with the FILE_IGNORE_NEW_LINES flag to the file() function.
– Barmar, Commented Feb 6, 2023 at 17:12
@Barmar I'm a newbie with cURL. I tried to do it with cURL but got a 403 from every page.
– Regular User, Commented Feb 6, 2023 at 17:13
I can't see why the same request would return an error from curl but would work with file_get_contents(). Did you use the same user agent? Post your curl attempt.
– Barmar, Commented Feb 6, 2023 at 17:15
Different sites use different techniques to prevent web scraping. They might be using a cookie.
– Barmar, Commented Feb 6, 2023 at 17:35
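
The FILE_IGNORE_NEW_LINES suggestion in the comments above can be tried without any network access. The following is a minimal sketch, assuming a temporary file stands in for sites.txt; the sample URLs and the extra FILE_SKIP_EMPTY_LINES flag are illustrative additions, not part of the original script. It also shows that file() keeps surrounding spaces, so trim() may still be useful if the list contains them.

```php
<?php
// Write a small sample list to a temporary file (a stand-in for sites.txt).
$path = tempnam(sys_get_temp_dir(), 'urls');
file_put_contents($path, "https://a.example/\n\n https://b.example/ \n");

// The flags strip the trailing newline from each line and skip blank lines.
$lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// file() does not remove surrounding spaces, so trim() handles those.
$clean = array_map('trim', $lines);

unlink($path);
var_dump($lines, $clean);
```

If the list never contains stray spaces, the array_map('trim', ...) step can be dropped and the flags alone are enough.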

Regular User · Accepted Answer · 2023-02-06 18:46:46Z

1

Use curl_multi_exec() to fetch all the URLs in parallel.

$urls = file('https://earnmoneysafe.com/script/4toiskatj.txt', FILE_IGNORE_NEW_LINES);
$sites = file('https://earnmoneysafe.com/script/sites.txt', FILE_IGNORE_NEW_LINES);
foreach ($sites as $site) {
    $curl_handles[$site] = get_curl($site);
}
$mh = curl_multi_init();
foreach ($curl_handles as $ch) {
    curl_multi_add_handle($mh, $ch);
}

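// Keep driving the transfers until none are still active;
// a single curl_multi_exec() call only starts them.
do {
    $mrc = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($active && $mrc == CURLM_OK);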

foreach ($curl_handles as $site => $ch) {
    $homepage = curl_multi_getcontent($ch);
    foreach ($urls as $needle) {
        if (strpos($homepage, $needle) !== false) {
            echo 'true';
        }
    }
    curl_multi_remove_handle($mh, $ch);
}

curl_multi_close($mh);
    
function get_curl($url) {
    $ch = curl_init();
    $config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';

    curl_setopt($ch, CURLOPT_USERAGENT, $config['useragent']); // edited  
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);       

    return $ch;
}
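
Printing a bare 'true' does not say which site contained which URL. As a minimal sketch of the matching step only, the helper below (find_matches is a hypothetical name, not part of the answer) returns the needles found in a page, so the caller can report them. The sample page source is made up for illustration and stands in for the fetched content.

```php
<?php
// Hypothetical helper: return the needles that occur in the page source.
function find_matches(string $html, array $needles): array {
    $found = [];
    foreach ($needles as $needle) {
        // Skip empty needles: on PHP 8 strpos() treats '' as a match,
        // and on PHP 7 it raises a warning.
        if ($needle !== '' && strpos($html, $needle) !== false) {
            $found[] = $needle;
        }
    }
    return $found;
}

// Sample data standing in for one fetched page and the URL list.
$site = 'https://www.golo.com/';
$homepage = '<script src="https://www.googletagmanager.com/gtag/js"></script>';
$urls = ['https://www.googletagmanager.com/gtag/js', 'https://example.com/other.js', ''];

$matches = find_matches($homepage, $urls);
foreach ($matches as $match) {
    echo "$site contains $match\n";
}
```

In the loop over $curl_handles, the same call could take curl_multi_getcontent($ch) as its first argument, with the array key $site used for the report.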

answered Feb 6, 2023 at 18:11 by Barmar
edited Feb 6, 2023 at 18:46 by Regular User

9 Comments

Regular User Over a year ago

Thank you for the reply. But where should I put the URLs that I have to find in these pages? In the original code I had $urls_raw = file('https://earnmoneysafe.com/script/4toiskatj.txt');

Barmar Over a year ago

Sorry, I confused $sites and $urls.

Barmar Over a year ago

I added that to the answer.

Regular User Over a year ago

I cannot see any results, though some sites contain URLs from that file. For example, https://www.golo.com/ has https://www.googletagmanager.com/gtag/js in its code.

Regular User Over a year ago

echo $homepage returns null. Something is wrong with the code

Luis Flores · Accepted Answer · 2023-02-06 18:22:24Z

0

I think this code is cleaner:

<?php

const SITES_URL = 'https://earnmoneysafe.com/script/sites.txt';
const URLS_URL = 'https://earnmoneysafe.com/script/4toiskatj.txt';

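function readFileLines($url) {
    $file_contents = file_get_contents($url);
    // Trim every line (this also strips "\r" from Windows line endings),
    // then drop the lines that are empty after trimming.
    $lines = array_map('trim', explode("\n", $file_contents));
    $filtered_lines = array_filter($lines, function($line) {
        return $line !== '';
    });

    return $filtered_lines;
}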

function checkSiteUrls($site, $urls) {
    $homepage = file_get_contents($site);
    foreach ($urls as $url) {
        if (strpos($homepage, $url) !== false) {
            echo 'true';
        }
    }
}

$sites = readFileLines(SITES_URL);
$urls = readFileLines(URLS_URL);

foreach ($sites as $site) {
    checkSiteUrls($site, $urls);
}

?>

answered Feb 6, 2023 at 18:20 by Luis Flores
edited Feb 6, 2023 at 18:22
