8

I have the following condition:

if(in_array($needle, $haystack) ||
    in_array($needle . "somePostfix", $haystack) ||
    in_array($needle . "someOtherPostfix", $haystack) ||
    // and some more) {
    // do something
}

My haystack contains more than 10k elements, and this check takes about 400ms. I know that in_array has to iterate over the whole array multiple times. In my case the common case is that the element is not found. and I tried to improve this by creating the following method that only iterates once over the haystack:

function wildcardInArray($needle, $haystack) {
    foreach ($haystack as $value) {
        if (true === fnmatch($needle . '*', $haystack)) {
            return true;
        }
    }
    return false;
}

But this decreases my performance even more, seems to me that fnmatch is the bottleneck.

Is there any improvement for this case of array search?

11
  • Would array_diff and/or array_intersect help here? Commented Feb 4, 2016 at 15:46
  • 1
    What if, instead of fnmatch you do: $value === $needle || $value === $needle . "somePostfix" || $value === $needle . "someOtherPostfix"? Any sort of pattern matching is going to be slower than a straight up === Commented Feb 4, 2016 at 15:46
  • Have you tried using the array as a map instead (with like 1 as value)? It was quite some time since I did php, but that would probably be a lot faster. Commented Feb 4, 2016 at 15:49
  • Have you tried array_reduce($array, function($found, $el){ return $found && ($el === 'val1' || $el === 'val2'); }) ? Commented Feb 4, 2016 at 15:54
  • @Halcyon you are right I didn't think about this, but is there also any generic solution? I have not tryed to use a map or array_reduce Commented Feb 4, 2016 at 15:57

2 Answers 2

4

This is a very interesting question that doesn't appear to have a great answer. I did some very unscientific bench-marking and I was not able to get any faster than in_array for a $haystack with 100000 elements.

PHP 5.5.9-1ubuntu4.14 (cli) (built: Oct 28 2015 01:34:46) 
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies
    with Xdebug v2.2.3, Copyright (c) 2002-2013, by Derick Rethans

Sorting Time*:    0.19367408752441
Imploding Time**: 0.0207359790802
preg_match:       0.10927486419678
needle ===:       0.083639144897461
in_array:         0.019428968429565
array_flip:       0.028955936431885
array_intersect:  0.15198707580566
array_diff:       0.15532493591309

//*sort without search (binary search wouldn't add much time)
//**time it took to implode the array 
//     (no search was performed, this search WOULD take significant time if implemented)

As you can see, only three of these methods took less than 100ms, needle ===, in_array and array_flip. And out of these three, in_array was clearly the fastest. Now the question is how many postfix-es do you have? The running time on in_array will be O(n*m) (n is size of your haystack, m is the number of postfixes), which is a problem if m is also very large. If m is significantly large, sorting the data once and performing a binary search on the sorted list will be O(m*log(n)) which grows much slower, but has a higher initial overhead as shown in the sorting time above. Even better, if you have a very large m would probably be array_flip as each search should only take O(1) lookup after the initial flip.

CODE

Haystack creation

$haystack = array();

function getRandomWord($len = 10) {
        $len = rand(3,10);
        $word = array_merge(range('a', 'z'), range('A', 'Z'));
            shuffle($word);
            return substr(implode($word), 0, $len);
}

$numWords = 100000;
for($i = 0; $i < $numWords; $i++) {
    $haystack[] = getRandomWord();
}

TESTs

//*Sorting*    
$copy = $haystack;
sort($copy);


//implode    
$copy = implode($haystack, " ");


//*preg_match_test*
function preg_match_test($regex, $haystack) {
    $matches = false;
    foreach($haystack as $value) {
        if (preg_match($regex, $value)) {
            $matches = true;
            break;
        }
    }
    return $matches;
}

//needle ===
function equalsNeedle($needles, $haystack) {
    $matches = false;
    foreach ($haystack as $value) {
        foreach($needles as $needle) {
            if ($needle === $value) {
                $matches = true;
                break 2;
            }
        }
    }
    return $matches;
}

//in_array
function baseCase($needles, $haystack) {
    $matches = false;
    foreach($needles as $needle) {
        if (in_array($needle, $haystack)) {
            $matches = true;
            break;
        }
    }
    return $matches;
}

//array_flip
function arrayFlipping($needles, $haystack) {
    $matches = false;
    $copy = array_flip($haystack);
    foreach ($needles as $needle) {
        if (array_key_exists($needle, $copy)) {
            $matches = true;
        }
    }
    return $matches;
}

//array_intersect
function arrayIntersect($needles, $haystack) {
    if (count(array_intersect($needles, $haystack)) > 0) {
        return true;
    }
    return false;
}

//array_diff
function arrayDiff($needles, $haystack) {
    if (count(array_diff($needles, $haystack)) !== count($needles)) {
        return true;
    }
    return false;
}

Calling Code

$array = array("foo","foobar","foobazz","foobuzz");
$base = "foo";
$regex = "/^$base(bizz|bazz|buzz|)$/";

echo "preg_match: ";
preg_match_test($regex, $haystack);
echo "needle === ";
equalsNeedle($array, $haystack);
echo "in_array:  ";
baseCase($array, $haystack);
echo "array_flip:  ";
arrayFlipping($array, $haystack);
echo "array_intersect:  ";
arrayIntersect($array, $haystack);
echo "array_diff:  ";
arrayDiff($array, $haystack);

All tests were wrapped with timing code using microtime(true).

Sign up to request clarification or add additional context in comments.

3 Comments

How should your function baseCase even work? I see return $t; but $t is never declared. And arrayFlipping will always return false.
@bpoiss, you are correct, I was a little sloppy when I copied them over. I wrote most of them as proof of concepts so there were a few little errors. I will make the edits for completeness. Thanks for bringing them to my attention.
You should disable both opcache as well as xdebug to even have some way of a benchmark that makes sense.
4

You can use your array as 'keys', i.e:

$arr = ['a', 'b', 'c', … ]; to $arr = ['a' => true, 'b' => true, …].

You will consume more memory but you will have an instant result with isset($arr[$key]);.

Fastest but biggest in memory, you can use stdClass and isset($obj->$key);

$obj = new stdClass();
$obj->{'a'} = true;
$obj->{'b'} = true;
$obj->{'…'} = true;

If you can't change your array structure, tell us if you can sort manually the array contents?

// generic
$length = strlen($needle);
$char = $needle[0];
$found = false;
$suffixes = [ false, 'somePostfix', 'someOtherPostfix' ];

foreach($haystack as $entry) {
  if ($char === $entry[0] && $needle === substr($entry, 0, $length)) {
    $suffix = substr($entry, $length);
    if (in_array($suffix, $suffixes, true)) {
      $found = true;
      break;
    }
  }
}

1 Comment

How much performance improvments will I get? Have you any benchmark tests for this?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.