
In a very tight loop I need to access tens of thousands of values in an array containing millions of elements. The key can be undefined: in that case it shall be legal to return NULL without any error message:

  • Array key exists: return value of element.
  • Array key does not exist: return null.

I do know multiple solutions:

if (isset($lookup_table[$key])) {
    return $lookup_table[$key];
} else {
    return;
}

or

return @$lookup_table[$key];

or

error_reporting(0);
$return = $lookup_table[$key];
error_reporting(E_ALL);
return $return;

All solutions are far from optimal:

  • The first one requires two lookups in the B-tree: one to check existence, another to retrieve the value. That effectively doubles the runtime.
  • The second one uses the error suppression operator, and thus incurs massive overhead on that line.
  • The third one still calls the error handler (which checks the error_reporting setting and then displays nothing) and thereby incurs overhead.

Did I miss a way to avoid error handling and yet work with a single B-tree lookup?

To answer some questions:

The array caches the results of a complex calculation - too complex to be done in real time. Out of billions of possible values, only millions yield a valid result. The array looks like 1234567 => 23457, 1234999 => 74361, .... It is saved to a PHP file of several megabytes and include_once-d at the beginning of execution. Initial load time does not matter.

If the key is not found, it simply means that this specific value will not return a valid result. The trouble is to do this 50k+ times per second.
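For illustration, here is a minimal sketch of such a cached table with a single-lookup accessor (values and function name are made up; on PHP 7+ the null coalescing operator returns null for a missing key without raising a warning):

```php
<?php
// Hypothetical miniature of the cached lookup table described above.
$lookup_table = [
    1234567 => 23457,
    1234999 => 74361,
];

// Single-lookup accessor: ?? yields null for missing keys, no warning.
function lookup(array $table, int $key): ?int
{
    return $table[$key] ?? null;
}

var_dump(lookup($lookup_table, 1234567)); // int(23457)
var_dump(lookup($lookup_table, 42));      // NULL
```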

  • isset($lookup_table[$key]) will return false if $lookup_table[$key] is null ... isn't this the opposite of what you want? Not sure what its performance is like, but array_key_exists will return true if the key exists but its value is null. Commented May 21, 2013 at 17:16
  • @sgroves: The example did reflect my intention. I updated the question to clarify: if the key exists, return its value, otherwise return null. (Values of existing keys will never be null.) array_key_exists is, frankly, slower than the isset version, so I skipped it in my examples. Commented May 21, 2013 at 17:22
  • Quickhash might be good for your case. Commented May 21, 2013 at 17:24
  • Mike, Mathew, I updated my question to answer yours. Sean, that's not a matter of belief. Of course native C is faster than PHP. I am simply unwilling to spend a month writing code I can write overnight in PHP. :) Commented May 21, 2013 at 17:53
  • @sean: It's 8pm in Germany :) (In other words - the only constant when asking for help building a house is that you can't use advice on how to build a boat) Commented May 21, 2013 at 18:01

8 Answers


First of all, PHP arrays are not implemented as B-trees; they are hash tables: an array of buckets (indexed via a hash function), each with a linked list of actual values (in case of hash collisions). This means that lookup times depend on how well the hash function has "spread" the values across the buckets, i.e. the number of hash collisions is an important factor.

PHP < 7

Your original statement would be the most performant code, simply because isset() is optimised as a VM opcode.

PHP >= 7

This statement is the most correct:

return array_key_exists($key, $table) ? $table[$key] : null;

Since this change was merged, it's no longer a function call but a VM opcode, and it is as performant as isset() (if not more so); but because you're accessing the array twice when the key exists, the performance is comparable to your original idea.
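The distinction matters when values can be null; a quick illustration of how isset() and array_key_exists() diverge:

```php
<?php
// isset() reports false for a key holding null; array_key_exists()
// only asks whether the key is present at all.
$table = ['a' => 1, 'b' => null];

var_dump(isset($table['b']));            // bool(false)
var_dump(array_key_exists('b', $table)); // bool(true)
```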

Shorthand alternative

You can also accomplish what you want with less code using the null coalescing operator (keeping in mind that it will skip existing array elements with null values, the same way isset() would):

return $table[$key] ?? null;

This should be the most performant between the three options (the third being your own suggestion).


15 Comments

I was skeptical about this approach, so I generated a random array of about 5000 keys in the range of 0-100,000, and ran through the array from 0-100,000. I tried your method with a reference, a method without a reference but still an auxiliary variable, and a simple approach with neither. Your reference approach took more than twice as long as the other ones. The naive approach was marginally faster than the auxiliary variable approach.
@kba The answer mentions that it modifies the array (if the item doesn't exist), so obviously that takes time. It also mentions that without it, the first solution given by OP is likely the fastest.
In what way does your approach modify the array? I don't see you making any changes to $tmp or $lookup_table[$key].
@ZsoltSzilagy When you create a reference on a non-existent variable, PHP creates the necessary structures to make it a valid reference; in this case, it creates an array entry with value null. It works well if the item will always be there of course :)
Still, however, my tests imply that this doesn't improve the performance after all. Even after having run through every item in the array once, the lookup time is still worse. My tests could be bad.
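A small sketch of the side effect discussed in these comments: taking a reference to a missing key silently creates that key.

```php
<?php
// Taking a reference to a non-existent element auto-vivifies it:
// PHP inserts the key with a null value so the reference is valid.
$arr = ['present' => 1];
$tmp = &$arr['missing'];

var_dump(array_key_exists('missing', $arr)); // bool(true)
var_dump($arr['missing']);                   // NULL
```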

I did some benchmarking with the following code:

set_time_limit(100);

$count = 2500000;
$search_index_end = $count * 1.5;
$search_index_start = $count * .5;

$array = array();
for ($i = 0; $i < $count; $i++)
    $array[md5($i)] = $i;

$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = isset($array[$key]) ? $array[$key] : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";

$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = array_key_exists($key, $array) ? $array[$key] : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";


$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = @$array[$key];
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";

$error_reporting = error_reporting();
error_reporting(0);
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = $array[$key];
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
error_reporting($error_reporting);

$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $tmp = &$array[$key];
    $test = isset($tmp) ? $tmp : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";

and I found that the fastest running test was the one that uses isset($array[$key]) ? $array[$key] : null, followed closely by the solution that just disables error reporting.

4 Comments

Even with error reporting off, the errors are still generated.
Correct @Jack. Although counterintuitive, it is called error_reporting off, not error_generation.
I wonder if a try/catch and discarded exception is faster than turning off error reporting.
Warnings aren't thrown, so a try/catch won't do anything of use in this case.

This works for me:

{{ isset($array['key']) ? $array['key']: 'Default' }} 

but this is faster:

{{ $array['key'] or 'Default' }}

2 Comments

The second one generates an error
It's about speed, not about merely "working". Plus, it seems you use a template engine, which is not related to the question.

The @ operator and the error_reporting methods will both be slower than using isset. Both of these methods modify the error reporting setting for PHP, but PHP's error handler will still be called. The error handler checks against the error_reporting setting and exits without reporting anything, but this still takes time.
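A rough micro-benchmark sketch of that overhead (timings vary by machine and PHP version; this only illustrates the shape of the comparison):

```php
<?php
// Compare isset() against the @ operator on a missing key.
$table = [1 => 'x'];
$n = 100000;

$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $v = isset($table[2]) ? $table[2] : null;
}
$issetTime = microtime(true) - $start;

$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $v = @$table[2]; // the error is still generated, then suppressed
}
$suppressTime = microtime(true) - $start;

printf("isset: %.4fs, @: %.4fs\n", $issetTime, $suppressTime);
```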

Comments


Just a sudden idea that would have to be tested, but did you try using array_intersect_key() to get the existing values and array_merge() to fill in the rest? It would remove the need for a loop to access the data. Something like this:

$searched_keys = array('key1' => null, 'key2' => null); // the list of the keys to find

$existing_values = array_intersect_key($lookup_table, $searched_keys);
$all_values = array_merge($searched_keys, $existing_values);

Please note that I did not try it performance-wise.
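A runnable version of the sketch above (sample data made up), showing how array_merge() fills the missing keys with the null defaults:

```php
<?php
// Batch lookup: intersect picks the keys that exist in the table,
// merge fills in null for the ones that don't.
$lookup_table  = ['key1' => 10, 'key3' => 30];
$searched_keys = ['key1' => null, 'key2' => null];

$existing_values = array_intersect_key($lookup_table, $searched_keys);
$all_values      = array_merge($searched_keys, $existing_values);

var_dump($all_values); // ['key1' => 10, 'key2' => null]
```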

1 Comment

Nice approach. Unfortunately $all_values would be considered a new array, thus it would build a new index.

There are two typical approaches to this.

  1. Define defaults for an undefined key.
  2. Check for undefined key.

Here is how to perform the first with as little code as possible.

$data = array_merge(array($key=>false),$data);
return $data[$key];

Here is how to perform the second.

return isset($data[$key]) ? $data[$key] : false;
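One caveat worth noting for the first approach: with integer keys like the ones in the question, array_merge() renumbers the keys. The + (union) operator preserves them and still supplies the default; a sketch:

```php
<?php
// Union keeps integer keys intact; entries already in $data win,
// so the default only fills in when $key is missing.
$data = [1234567 => 23457];
$key  = 1234999;

$data = $data + [$key => false];

var_dump($data[1234567]); // int(23457)
var_dump($data[1234999]); // bool(false)
```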

4 Comments

Thank you. The second version with the ternary operator is a double lookup like in my first snippet. The first version is exciting, but I have little hope: it temporarily creates a copy of a huge array, and while there is only one btree lookup, there is a complete (temporary) array to be balanced. Runtime rises from 2log(n) to 2n.
Do you have the option to re-organize this data for performance? An octree lookup would be more efficient than a linear search. You could group by the first letter of each key, and then by the second letter and so on. en.wikipedia.org/wiki/Octree
Great approach! PHP internally handles arrays as fully balanced trees. I assume that won't be outperformed by an octree, but I am not that deep into algorithm analysis. Do you have experience with that comparison? Thanks!
The only way PHP could be faster is if the keys are sorted internally, and an optimized lookup is used to find a key. I've never read anything that says PHP uses optimized indexes for array keys.

I prefer using the isset function instead of suppressing the error. I made a function that checks whether the key exists and, if not, returns a default value; in the case of nested arrays you just need to add the other keys in order:

Nested array lookup:

/**
 * Lookup array value.
 *
 * @param array $array
 * @param array $keys
 * @param mixed $defaultValue
 *
 * @return mixed
 */
public static function array_key_lookup($array, $keys, $defaultValue)
{
    $value = $array;
    foreach ($keys as $key) {
        if (isset($value[$key])) {
            $value = $value[$key];
        } else {
            $value = $defaultValue;
            break;
        }
    }

    return $value;
}

Usage example:

$array = [
    'key1' => 'value1',
    'key2' => 'value2',
    'key3' => [
        'key3a' => 'value3a',
        'key3b' => 'value3b'
    ]
];

array_key_lookup($array, ['key3', 'key3a'], 'default')
'value3a'

array_key_lookup($array, ['key2', 'key2a'], 'default')
'default'

array_key_lookup($array, ['key2'], 'default')
'value2'

array_key_lookup($array, ['key5'], 'default')
'default'

Escaping the error:

$value = @$array[$key1][$key2] ?: $defaultValue;
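On PHP 7+, the null coalescing operator covers the nested case too, without the error suppression operator; for example:

```php
<?php
// ?? walks the nested access without warnings on missing keys.
$array = ['key3' => ['key3a' => 'value3a']];

var_dump($array['key3']['key3a'] ?? 'default'); // string(7) "value3a"
var_dump($array['key2']['key2a'] ?? 'default'); // string(7) "default"
```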

1 Comment

Hey, that's a nice approach from the usability perspective, but performance-wise it's actually worse than the alternatives. The function call implicitly creates an array, the foreach makes a copy of it, the first line duplicates a variable, etc. It would be much faster to call $value = isset($array[$key1][$key2]) ? $array[$key1][$key2] : $defaultValue;

First, re-organize the data for performance by saving a new array where the data is sorted by the keys, but the new array contains a regular numeric index.

This part will be time consuming, but only done once.

 // first sort the array by it's keys
 ksort($data);

 // second create a new array with numeric index
 $tmp = array();
 foreach($data as $key=>$value)
 {
    $tmp[] = array('key'=>$key,'value'=>$value);
 }
 // now save and use this data instead
 save_to_file($tmp);

Once that is done it should be quick to find the key using a Binary Search. Later you can use a function like this.

  function findKey($key, $data, $start, $end)
  { 
    if($end < $start) 
    { 
        return null; 
    } 

    $mid = (int)(($end - $start) / 2) + $start; 

    if($data[$mid]['key'] > $key) 
    { 
        return findKey($key, $data, $start, $mid - 1); 
    } 
    else if($data[$mid]['key'] < $key) 
    { 
        return findKey($key, $data, $mid + 1, $end); 
    } 

    return $data[$mid]['value'];
 }

To perform a search for a key you would do this.

 $result = findKey($key, $data, 0, count($data) - 1);
 if($result === null)
 {
      // key not found.
 }

If count($data) is computed every time, you could cache that in the file where you stored the array data.

I suspect this method will be a lot faster than a regular linear search repeated against $data. I can't promise it's faster. Only an octree would be quicker, but the time to build the octree might cancel out the search performance (I've experienced that before). It depends on how much searching of the data you have to do.

Comments
