
I would appreciate any help.

I have 7 separate arrays with approximately 90,000 numbers in each (let's call them array1 through array7). There are no duplicate numbers within any single array, but there can be duplicates between the arrays. For example, array2 has no duplicates within itself, but it may have numbers in common with array3 and array4.

The Problem: I am trying to identify all of the numbers that appear exactly 3 times once all 7 arrays are merged.

I must do this calculation 1,000 times, and it takes 15 minutes, which is not acceptable because I then have to run the whole process 40 times. If you know of another language that is better suited to this type of calculation, please let me know. Any extension suggestions, such as Redis or Gearman, are also welcome.

The code:

for ($kj = 1; $kj <= 1000; $kj++) {
    $result = array_merge($files_array1, $files_array2, $files_array3, $files_array4,
                          $files_array5, $files_array6, $files_array7);

    $result = array_count_values($result);

    $fp_lines = fopen("equalTo3.txt", "w");

    foreach ($result as $key => $val) {
        if ($result[$key] == 3) {
            fwrite($fp_lines, $key."\r\n");
        }
    }
    fclose($fp_lines);
}

I have also tried the code below with strings, but the array_map() call and the array_count_values() call take 17 minutes:

for ($kj = 1; $kj <= 1000; $kj++) {

    $result = '';

    for ($ii = 0; $ii < 7; $ii++) {
        $result .= $files_array[$hello_won[$ii]].'\r\n';
    }

    $result2 = explode("\n", $result);        // 5 mins
    $result2 = array_map("trim", $result2);   // 11 mins
    $result2 = array_count_values($result2);  // 4-6 mins

    $fp_lines = fopen("equalTo3.txt", "w");

    foreach ($result2 as $key => $val) {
        if ($result2[$key] == 3) {
            fwrite($fp_lines, $key."\r\n");
        }
    }
    fclose($fp_lines);

    unset($result2);
}
  • You're writing to a file inside your loop. Isn't that what's taking the time? How often do you need to sort these 90,000 elements? If it's just once, maybe just do it no matter how slow it is. If it needs to be done regularly, you need some way to optimize as you go (maybe by maintaining a counter for each element when it is added to the database). Commented Apr 28, 2014 at 18:20
  • I thought the same as you, that it was the numerous write calls, but when I investigated, the write time was trivial. The issue belongs entirely to array_merge. @this.lau_ Commented Apr 28, 2014 at 19:43
  • Any chance you can give us a bit more insight into your data to help us find the best solution? e.g. are there 7 arrays for (Sun, Mon, Tue, Wed, Thu, Fri, Sat)? How big are the numbers in the arrays? e.g. 34, 47, 21 or 337894294529243592, 38439434949238, 39859893922242? What percentage of the combined 7 arrays would you guess are duplicates? Triplicates, etc.? And what do you mean by needing to do this 1,000 times, and 40 times? Commented Apr 28, 2014 at 20:54
  • @scunliffe Sure! Hope this info makes things clearer. The 7 arrays have numbers that are between 5 and 10 characters long, meaning a number could look like 3456777890 or 34567. I would guess the percentage of duplicates, triplicates, etc. to be about 30%. The 7 array combinations are NEVER the same, and there are 1000 combinations that have to be checked for their triplicates. The other 40 times are of no concern to the solution. array_merge takes 40 mins; the rest of my code takes 1 min combined with the fwrite calls. fwrite is trivial and definitely not the issue. Appreciate the help. Commented Apr 28, 2014 at 21:18
  • This question would be more clear with a minimal reproducible example. Commented Sep 8, 2022 at 23:20

3 Answers


array_merge() becomes significantly slower as the arrays get larger because (from php.net):

If the input arrays have the same string keys, then the later value for that key will overwrite the previous one. If, however, the arrays contain numeric keys, the later value will not overwrite the original value, but will be appended.

Values in the input array with numeric keys will be renumbered with incrementing keys starting from zero in the result array.

So this function is actually doing some conditional work for every element. You can replace array_merge() with plain appending: a loop (foreach or any other) and the [] operator. You can write a function imitating array_merge(), like this (using references so the array is not copied):

function imitateMerge(&$array1, &$array2) {
    foreach($array2 as $i) {
        $array1[] = $i;
    }
}

And you should see a significant increase in speed.
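For what it's worth, here is a minimal usage sketch under the assumption that the seven arrays from the question ($files_array1 through $files_array7) are already loaded; the $all accumulator variable is my own, not part of the original answer:

// Sketch only: accumulate all seven arrays into one, then count occurrences.
$all = $files_array1;               // start from a copy of the first array
imitateMerge($all, $files_array2);
imitateMerge($all, $files_array3);
imitateMerge($all, $files_array4);
imitateMerge($all, $files_array5);
imitateMerge($all, $files_array6);
imitateMerge($all, $files_array7);

$result = array_count_values($all); // value => how many arrays it appeared in

Since each source array contains no internal duplicates, a count of 3 here means the value appeared in exactly three of the seven arrays.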


9 Comments

@ailvenge - this is very interesting. I am trying to implement this now to see how it works, but there are some things I am not clear on with the function: do I need to call it 7 times, and will $array1 be the final array? Why do you not have a value such as $array1[0.....90000] in the loop? Does the [] stay empty? Thank you!!
@ailvenge - OK, I think I got part of it... you are saying to pass the arrays in by reference.
@ailvenge - I think I got the second part: $arr[] means array_push. Will try this.
[] adds the element at the end of the array. The function I wrote merges 2 arrays only, so if you want to merge 7 arrays you can change the function a bit so it merges X arrays. Also, the reference is important, because normally functions in PHP get a copy of the variable, so you would copy the array with 90k numbers X times, which also slows the application down; we pass a reference to save memory.
FOR THE RECORD: I benchmarked it (PHP 8 x64) and it is not faster (it's around 3x slower). It also returns a different value than array_merge().

I suggest replacing

foreach ($result as $key => $val)
{
    if ($result[$key] == 3)
    {
        fwrite($fp_lines, $key."\r\n");
    }
}

With something like

$res = array_keys(array_filter($result, function($val){return $val == 3;}));
fwrite($fp_lines, implode("\r\n", $res));
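As a side note (my own addition, not part of this answer), array_keys() with a search value can pull out the keys whose count equals 3 without the closure, which may be slightly faster on large arrays:

$triplicates = array_keys($result, 3);   // keys of array_count_values() whose count is exactly 3

$fp_lines = fopen("equalTo3.txt", "w");
fwrite($fp_lines, implode("\r\n", $triplicates));
fclose($fp_lines);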

Comments


This is probably all wrong; see the last edit below.

I also think array_merge is the problem, but my suggestion would be to implement a function that counts the values across the several arrays directly instead of merging first. This depends a little on how much overlap there is between your arrays. If the overlap is very small, this might not be much faster than merging, but with significant overlap (I used rand(0, 200000) to fill the arrays when I tried) it will be much faster.

function arrValues($arrs) {
    $values = array();

    foreach ($arrs as $arr) {
        foreach ($arr as $val) {
            if (array_key_exists($val, $values)) {
                $values[$val]++;
            } else {
                $values[$val] = 1;
            }
        }
    }
    return $values;
}

var_dump(arrValues(array
    ($files_array1
    ,$files_array2
    ,$files_array3
    ,$files_array4
    ,$files_array5
    ,$files_array6
    ,$files_array7
    )));

The computation takes about 0.5s on my machine, then another 2s for printing the stuff.

-edit-

It's also not clear to me why you do the same thing 1000 times. Are the arrays different each time, or something? Saying a bit about the reason might give people additional ideas...

-edit again-

After some more exploration I don't believe array_merge is at fault any more. You don't have enough overlap to benefit that much from counting everything directly. Have you investigated available memory on your machine? For me, merging 7 arrays with 90k elements each takes ~250M. If you have allowed PHP to use this much memory (which I assume you have, since you do not get any allocation errors), then maybe the problem is that the memory is simply not available and you get a lot of page faults. If that is not the problem, then what kind of machine and what PHP version are you using? I've tested your original code on 5.5 and 5.4, and after fixing the memory issue it also runs in about 0.5s. That is per iteration, mind you. Now, if you do this 1000 times in the same PHP script, it will take a while, even more so considering that you allocate all this memory each time.

I believe you really should consider putting this data in a database. Given your numbers, it seems you have ~500M rows in total. That is an awful lot to handle in PHP; a database makes it easy.
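For illustration only, a rough sketch of what the database route could look like using SQLite through PDO (table and column names are made up, and this assumes the pdo_sqlite extension is available):

// Load the seven arrays into one table and let the database count triplicates.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE nums (value INTEGER)');

$insert = $pdo->prepare('INSERT INTO nums (value) VALUES (?)');
$pdo->beginTransaction();
foreach (array($files_array1, $files_array2, $files_array3, $files_array4,
               $files_array5, $files_array6, $files_array7) as $arr) {
    foreach ($arr as $value) {
        $insert->execute(array($value));
    }
}
$pdo->commit();

// Values that occur in exactly three of the seven arrays
// (each array has no internal duplicates, so COUNT(*) is the number of arrays).
$stmt = $pdo->query('SELECT value FROM nums GROUP BY value HAVING COUNT(*) = 3');
$triplicates = $stmt->fetchAll(PDO::FETCH_COLUMN);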

3 Comments

Yes! I have to do it 1000 times because the arrays are different each time. The seven files are never the same. @monocell
So you have 7000 files with 90k lines in each, and you want to find values that occur more than 3 times in groups of 10 files. This sounds like a database might be exactly what you want. They are really good at this stuff...
Thank you so much for your help with this, I wish I could have accepted two answers!!! I will take your suggestion and try to put this in a database. Much appreciated. @monocell
