
I would appreciate any help.

I have 7 separate arrays with approximately 90,000 numbers in each (let's call them array1 through array7). There are no duplicate numbers within any single array, but there can be duplicates between the arrays. For example, array2 has no duplicates within itself, but it may have numbers in common with array3 and array4.

The Problem: I am trying to identify all of the numbers that appear exactly 3 times once all 7 arrays are merged.

I must do this calculation 1,000 times, and it takes 15 minutes, which is not acceptable because I then have to run the whole process 40 times. If you know of another language that is better suited to this type of calculation, please let me know. Any extension suggestions, such as Redis or Gearman, are also welcome.

The code:

for ($kj = 1; $kj <= 1000; $kj++) {
    $result = array_merge($files_array1, $files_array2, $files_array3, $files_array4,
                          $files_array5, $files_array6, $files_array7);

    $result = array_count_values($result);

    $fp_lines = fopen("equalTo3.txt", "w");

    foreach ($result as $key => $val) {
        if ($result[$key] == 3) {
            fwrite($fp_lines, $key."\r\n");
        }
    }
    fclose($fp_lines);
}

I have also tried the code below with strings, but the array_map() call and the array_count_values() call take 17 minutes:

for ($kj = 1; $kj <= 1000; $kj++) {

    $result = '';

    for ($ii = 0; $ii < 7; $ii++) {
        $result .= $files_array[$hello_won[$ii]].'\r\n';
    }

    $result2 = explode("\n", $result);        // 5 mins
    $result2 = array_map("trim", $result2);   // 11 mins
    $result2 = array_count_values($result2);  // 4-6 mins

    $fp_lines = fopen("equalTo3.txt", "w");

    foreach ($result2 as $key => $val) {
        if ($result2[$key] == 3) {
            fwrite($fp_lines, $key."\r\n");
        }
    }
    fclose($fp_lines);

    unset($result2);
}
  • You're writing to a file inside your loop. Isn't that what's taking the time? How often do you need to sort these 90,000 elements? If it's just once, maybe just do it no matter how slow it is. If it needs to be done regularly, you need some way to optimize as you go (maybe by maintaining a counter for each element when it is added to the database). Commented Apr 28, 2014 at 18:20
  • I thought the same as you, that it was the numerous write calls, but when I investigated, the write time was trivial. The issue belongs entirely to array_merge. @this.lau_ Commented Apr 28, 2014 at 19:43
  • Any chance you can give us a bit more insight into your data to help us find the best solution? e.g. are there 7 arrays for (Sun, Mon, Tue, Wed, Thu, Fri, Sat)? How big are the numbers in the arrays? e.g. 34, 47, 21 or 337894294529243592, 38439434949238, 39859893922242? What percentage of the combined 7 arrays would you guess are duplicates? Triplicates, etc.? And what do you mean by needing to do this 1,000 times, and 40 times? Commented Apr 28, 2014 at 20:54
  • @scunliffe Sure! Hope this info makes things clearer. The 7 arrays have numbers that are between 5 and 10 characters long, meaning a number could look like 3456777890 or 34567. I would guess the percentage of duplicates, triplicates, etc. to be about 30%. The 7 array combinations are NEVER the same, and there are 1000 combinations that have to be checked for their triplicates. The other 40 times are of no concern to the solution. array_merge takes 40 mins; the rest of my code takes 1 min combined with the fwrite calls. fwrite is trivial and definitely not the issue. Appreciate the help. Commented Apr 28, 2014 at 21:18
  • This question would be more clear with a minimal reproducible example. Commented Sep 8, 2022 at 23:20

3 Answers


array_merge() becomes significantly slower as the arrays get larger because (from php.net):

If the input arrays have the same string keys, then the later value for that key will overwrite the previous one. If, however, the arrays contain numeric keys, the later value will not overwrite the original value, but will be appended.

Values in the input array with numeric keys will be renumbered with incrementing keys starting from zero in the result array.

So this function is actually doing some conditional work for every element. You can replace array_merge() with plain appending: a loop (foreach or any other) and the [] operator. You can write a function imitating array_merge(), like this (using references so the array is not copied):

function imitateMerge(&$array1, &$array2) {
    foreach($array2 as $i) {
        $array1[] = $i;
    }
}

And you should see a significant increase in speed.
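For what it's worth, here is a minimal usage sketch under the assumption that the seven arrays from the question ($files_array1 through $files_array7) are already loaded; the $all accumulator variable is my own, not part of the original answer:

// Sketch only: accumulate all seven arrays into one, then count occurrences.
$all = $files_array1;               // start from a copy of the first array
imitateMerge($all, $files_array2);
imitateMerge($all, $files_array3);
imitateMerge($all, $files_array4);
imitateMerge($all, $files_array5);
imitateMerge($all, $files_array6);
imitateMerge($all, $files_array7);

$result = array_count_values($all); // value => how many arrays it appeared in

Since each source array contains no internal duplicates, a count of 3 here means the value appeared in exactly three of the seven arrays.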


9 Comments

@ailvenge - this is very interesting. I am trying to implement this now to see how it works, but there are some things I am not clear on with the function: do I need to call it 7 times, and will $array1 be the final array? Why do you not have a value such as $array1[0.....90000] in the loop? Does the [] stay empty? Thank you!!
@ailvenge - OK, I think I got part of it... you are saying to pass the arrays in by reference.
@ailvenge - I think I got the second part: $arr[] means array_push. Will try this.
[] adds the element at the end of the array. The function I wrote merges 2 arrays only, so if you want to merge 7 arrays you can change the function a bit so it merges X arrays. Also, the reference is important, because normally functions in PHP get a copy of the variable, so you would copy the array with 90k numbers X times, which also slows the application down; we pass a reference to save memory.
FOR THE RECORD: I benchmarked it (PHP 8 x64) and it is not faster (it's around 3x slower). It also returns a different value than array_merge().

I suggest replacing

foreach ($result as $key => $val)
{
    if ($result[$key] == 3)
    {
        fwrite($fp_lines, $key."\r\n");
    }
}

With something like

$res = array_keys(array_filter($result, function($val){return $val == 3;}));
fwrite($fp_lines, implode("\r\n", $res));
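As a side note (my own addition, not part of this answer), array_keys() with a search value can pull out the keys whose count equals 3 without the closure, which may be slightly faster on large arrays:

$triplicates = array_keys($result, 3);   // keys of array_count_values() whose count is exactly 3

$fp_lines = fopen("equalTo3.txt", "w");
fwrite($fp_lines, implode("\r\n", $triplicates));
fclose($fp_lines);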

Comments


This is probably all wrong; see the last edit below.

I also think array_merge is the problem, but my suggestion would be to implement a function that counts the values across the several arrays directly instead of merging first. This depends a little on how much overlap there is between your arrays. If the overlap is very small, this might not be much faster than merging, but with significant overlap (I used rand(0, 200000) to fill the arrays when I tried) it will be much faster.

function arrValues($arrs) {
    $values = array();

    foreach ($arrs as $arr) {
        foreach ($arr as $val) {
            if (array_key_exists($val, $values)) {
                $values[$val]++;
            } else {
                $values[$val] = 1;
            }
        }
    }
    return $values;
}

var_dump(arrValues(array
    ($files_array1
    ,$files_array2
    ,$files_array3
    ,$files_array4
    ,$files_array5
    ,$files_array6
    ,$files_array7
    )));

The computation takes about 0.5s on my machine, then another 2s for printing the stuff.

-edit-

It's also not clear to me why you do the same thing 1000 times. Are the arrays different each time, or something? Saying a bit about the reason might give people additional ideas...

-edit again-

After some more exploration I don't believe array_merge is at fault any more. You don't have enough overlap to benefit that much from counting everything directly. Have you investigated available memory on your machine? For me, merging 7 arrays with 90k elements each takes ~250M. If you have allowed PHP to use this much memory (which I assume you have, since you do not get any allocation errors), then maybe the problem is that the memory is simply not available and you get a lot of page faults. If that is not the problem, then what kind of machine and what PHP version are you using? I've tested your original code on 5.5 and 5.4, and after fixing the memory issue it also runs in about 0.5s. That is per iteration, mind you. Now, if you do this 1000 times in the same PHP script, it will take a while, even more so considering that you allocate all this memory each time.

I believe you really should consider putting this data in a database. Given your numbers, it seems you have ~500M rows in total. That is an awful lot to handle in PHP; a database makes it easy.
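For illustration only, a rough sketch of what the database route could look like using SQLite through PDO (table and column names are made up, and this assumes the pdo_sqlite extension is available):

// Load the seven arrays into one table and let the database count triplicates.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE nums (value INTEGER)');

$insert = $pdo->prepare('INSERT INTO nums (value) VALUES (?)');
$pdo->beginTransaction();
foreach (array($files_array1, $files_array2, $files_array3, $files_array4,
               $files_array5, $files_array6, $files_array7) as $arr) {
    foreach ($arr as $value) {
        $insert->execute(array($value));
    }
}
$pdo->commit();

// Values that occur in exactly three of the seven arrays
// (each array has no internal duplicates, so COUNT(*) is the number of arrays).
$stmt = $pdo->query('SELECT value FROM nums GROUP BY value HAVING COUNT(*) = 3');
$triplicates = $stmt->fetchAll(PDO::FETCH_COLUMN);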

3 Comments

Yes! I have to do it 1000 times because the arrays are different each time. The seven files are never the same. @monocell
So you have 7000 files with 90k lines in each, and you want to find values that occur more than 3 times in groups of 10 files. This sounds like a database might be exactly what you want. They are really good at this stuff...
Thank you so much for your help with this, I wish I could have accepted two answers!!! I will take your suggestion and try to put this in a database. Much appreciated. @monocell
