11

I have the following simple code to test against collision on a primary key I am creating:

$machine_ids = array();

for($i = 0; $i < 100000; $i++) {
    //Generate machine id returns a 15 character alphanumeric string
    $mid = Functions::generate_machine_id();

    if(in_array($mid, $machine_ids)) {
        die("Collision!");
    } else {
        $machine_ids[] = $mid;  
    }
}

die("Success!");

Any idea why this is taking many minutes to run? Anyway to speed it up?

2
  • 5
    Have you profiled that in_array is the culprit and not Functions::generate_machine_id()? Commented May 23, 2011 at 7:03
  • You have the code for Functions::generate_machine_id handy? Commented May 23, 2011 at 7:05

5 Answers 5

14
for($i = 0; $i < 100000; $i++) 
{
  //Generate machine id returns a 15 character alphanumeric string
  $mid = Functions::generate_machine_id();
  if (isset($machine_ids[$mid]))
  {
    die("Collision!");
  }
  $machine_ids[$mid] = true;
}
Sign up to request clarification or add additional context in comments.

Comments

12

For this, use $mid as keys, and dummy value as value. Specifically, instead of

if(in_array($mid, $machine_ids)) {
    die("Collision!");
} else {
    $machine_ids[] = $mid;  
}

use

if(isset($machine_ids[$mid])) {
    die("Collision!");
} else {
    $machine_ids[$mid] = 1;  
}

At the end you can extract the array you originally wanted with array_keys($machine_ids).

This should be much faster. If it is still slow, then your Functions::generate_machine_id() is slow.

EDITED to add isset as per comments.

8 Comments

Beat me to it. :) Though you should use isset($machine_ids[$mid]).
Agree with deceze, and I believe if($machine_ids[$mid]) will return false (not really a collision) if the value is integer 0, empty string, NULL, etc. In any case isset() is the way to go. However, I just noticed OP said it will always be a 15 char alphanum.
@deceze: Any reason? Is the performance hit of isset lower than performance hit of implicit conversion to boolean? Or is it theoretic - since in PHP 0 is also false, which would differ from isset? (in this case, all we have are empties and ones, so it's safe)
@Amadan: It will also generate a notice when $machine_ids[$mid] is not defined.
@Amadan As Wesley said, you'll get tons of notices this way, which take time to trigger as well. This triggering may slow down the script quite significantly and annoy the watching user to boot.
|
3

Checking for array membership is a O(n) operation, since you have to compare the value to every element in the array. After you add a whole bunch of stuff to the array, naturally it gets slower.

If you need to do a whole bunch of membership tests, as is the case here, you should use a different data structure that supports O(1) membership tests, such as a hash.

Comments

2

Refactor your code so that it uses a associated array to hold the machine IDs and use isset to check

if( isset($machine_id[$mid]) ) die("Collision");

$machine_ids[$mid] = $mid;

Using isset should be faster

Comments

1

If you need best performance for your case, you need store your data as array key and use isset or array_key_exists(since php >= 7.4 array_key_exists is now as fast as isset) instead in_array.

Attention. It is true that isset on a hash map is faster than searching through an array for a value (in_array), but keep in mind that converting an array of values, ["foo", "bar", "baz"], to a hash map, ["foo" => true, "bar" => true, "baz" => true], incurs a memory cost (as well as potentially constructing the hash map, depending on how and when you do it). As with all things, you'll have to weigh the pros & cons for each case to determine if a hash map or array (list) of values works best for your needs. This isn't specific to PHP but more of a general problem space of computer science.

And some performance tests from https://gist.github.com/alcaeus/536156663fac96744eba77b3e133e50a

<?php declare(strict_types = 1);

function testPerformance($name, Closure $closure, $runs = 1000000)
{
    $start = microtime(true);
    for (; $runs > 0; $runs--)
    {
        $closure();
    }
    $end = microtime(true);

    printf("Function call %s took %.5f seconds\n", $name, $end - $start);
}

$items = [1111111];
for ($i = 0; $i < 100000; $i++) {
    $items[] = rand(0, 1000000);
}
$items = array_unique($items);
shuffle($items);

$assocItems = array_combine($items, array_fill(0, count($items), true));

$in_array = function () use ($items) {
    in_array(1111111, $items);
};

$isset = function () use ($assocItems) {
    isset($items[1111111]);
};

$array_key_exists = function () use ($assocItems) {
    array_key_exists(1111111, $assocItems);
};

testPerformance('in_array', $in_array, 100000);
testPerformance('isset', $isset, 100000);
testPerformance('array_key_exists', $array_key_exists, 100000);

Output:

Function call in_array took 5.01030 seconds
Function call isset took 0.00627 seconds
Function call array_key_exists took 0.00620 seconds

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.