13

I have a PHP script which reads a large CSV and performs certain actions, but only if the "username" field is unique. The CSV is used in more than one script, so changing the input from the CSV to only contain unique usernames is not an option.

The very basic program flow (which I'm wondering about) goes like this:

$allUsernames = array();
while($row = fgetcsv($fp)) {
    $username = $row[0];
    if (in_array($username, $allUsernames)) continue;
    $allUsernames[] = $username;
    // process this row
}

Since this CSV could actually be quite large, it's that in_array bit which has got me thinking. The most ideal situation when searching through an array for a member is if it is already sorted, so how would you build up an array from scratch, keeping it in order? Once it is in order, would there be a more efficient way to search it than using in_array(), considering that it probably doesn't know the array is sorted?

4 Answers 4

9

Not keeping the array in order, but how about this kind of optimization? I'm guessing isset() for an array key should be faster than in_array() search.

$allUsernames = array();
while($row = fgetcsv($fp)) {
  $username = $row[0];

  if (isset($allUsernames[$username])) {
    continue;
  } else {
    $allUsernames[$username] = true;

    // do stuff
  }
}
Sign up to request clarification or add additional context in comments.

3 Comments

True, you could do that too, but I'd still benchmark the differences between those two. You never know with PHP - one might be O(1), while the other O(n)... (referring to the "do array_flip() twice" trick)
I'd say it is "array_key_exists()". PHP arrays are hashes, they are optimized for this kind of random access stuff.
in_array has bad performance .. Using a hash as suggested here, is a good idea.
7

The way to build up an array from scratch in sorted order is an insertion sort. In PHP-ish pseudocode:

$list = []
for ($element in $elems_to_insert) {
     $index = binary_search($element, $list);
     insert_into_list($element, $list, $index);
}

Although, it might actually turn out to be faster to just create the array in unsorted order and then use quicksort (PHP's builtin sort functions use quicksort)

And to find an element in a sorted list:

function binary_search($list, $element) {
    $start = 0;
    $end = count($list);
    while ($end - $start > 1) {
        $mid = ($start + $end) / 2;
        if ($list[$mid] < $element){
            $start = $mid;
        }
        else{
            $end = $mid;
        }
    }
    return $end;
}

With this implementation you'd have to test $list[$end] to see if it is the element you want, since if the element isn't in the array, this will find the point where it should be inserted. I did it that way so it'd be consistent with the previous code sample. If you want, you could check $list[$end] === $element in the function itself.

4 Comments

I'm curious as to why you left out the insert_into_list implementation. I'm guessing it is because it is not simple in PHP. You'd have Roll Your Own list (ie linked list), because even SPLDoublyLinkedList doesn't support inserting at an index. And the PHP array, is a hashmap, not a linked list. Adding / inserting to an array, just adds to the end of the array's 'natural' order (ie the order it returns it's elements on iteration over them). I'm voting you up though coz you're the only one who began to address the 'Keep an array sorted' part of the question.
@Jason I left it out because it's irrelevant to the answer. The question wasn't asking how to insert an item into a list, and it's clear from the function name what it does; there's no need to have a reference implementation to understand the exact behavior.
For those looking to do the insert_into_list part; $list = array_splice($list, $index, 0, $element);
@AlexBarker array_splice runs in O(n), I believe, which defeats the whole purpose of binary search.
1

The array type in php is an ordered map (php array type). If you pass in either ints or strings as keys, you will have an ordered map...

Please review item #6 in the above link.

4 Comments

Do you mean example #6? The way I read that is that arrays are ordered maps, which does not necessarily equate to being sorted: they just have an order to them.
@nickf: PHP arrays are hash maps, indexed by the array key. Internal order is irrelevant for accessing the values.
ok that makes sense, but that's just the keys of the array, right? that won't help you try to find a particular value in the array.
Well, I would suggest that you check if the lookup in your stored values really takes so long that you need a different datastructure and/or some 'optimization' on the array. I think you'll find the array type fast enough.
0

in_array() does not benefit from having a sorted array. PHP just walks along the whole array as if it were a linked list.

1 Comment

i figured that - read the last sentence of the question again.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.