0

I am trying to remove duplicates from a set of tokens using unset (not considering array_unique for now), however I am getting a few issues.

$keywords = parseTweet ( $tweet );
$term_freq = array(count($keywords));

for($i = 0; $i < count($keywords); $i++){
    $term_freq[$i] = 1;
    for($j = 0; $j < count($keywords); $j++){
        if (($i != $j) && (strcmp($keywords[$i],$keywords[$j]) == 0)){
            unset ( $keywords [$j] );
            unset ( $term_freq [$j] );          
            $term_freq[$i]++; 
        }
    }
}

print_r ( $keywords );
print_r ( $term_freq );

I am aware of why I am getting an error; while the duplicate $j is removed, the for loop still has to reloop for the rest of the keywords and hence will fail when it encounters the missing $j. Checking the contents of the array, I found out that the index of the array skips the index $j. So it reads; [1], [2], [4], ... etc where $j = [3]

I thought that unset also rebalances the array index, am I doing something wrong or missing something completely? I am new to PHP so please bear with me!

7
  • This doesn't makes sense: unset ( $term_freq [$j] ); $term_freq[$j]++; you unset it and want to increase it? Commented Jan 17, 2015 at 1:26
  • 3
    Why are you initialzing the first element of $term_freq with count($keywords)? the first iteration of the for loop is going to overwrite this. Commented Jan 17, 2015 at 1:28
  • 1
    If you use foreach instead of for, it will just see the elements that still exist. Commented Jan 17, 2015 at 1:29
  • @Rizier123 Mistakenly referred to the deleted element rather than the intended one! Commented Jan 17, 2015 at 1:35
  • @Barmar Unfortunately I made a very short-sighted assumption and wrongly thought that I actually initiated the size of the array and not the first element. Commented Jan 17, 2015 at 1:36

2 Answers 2

2
  1. Check if the index is set or not.
  2. You're making needless, repetitive comparisons. Basically n² comparisons when at most n²/2 are required to compare every value in an array to every other value.

So:

$c = count($keywords)
for($i = 0; $i < $c; $i++){
    $term_freq[$i] = 1;
    for($j = $i+1; $j < $c; $j++){ // magic is $j = $i+1
        if( ! isset($keywords[$j]) { continue; } // skip unset indices
        else if ( strcmp($keywords[$i],$keywords[$j]) == 0 ){
            unset ( $keywords [$j] );
            unset ( $term_freq [$j] );          
            $term_freq[$i]++; 
        }
    }
}

Basically you know you've already checked everything prior to $i, so you can start your inner loop at $i+1 instead of zero.

Also, you only need to count $keywords once, not n² times.

Sign up to request clarification or add additional context in comments.

3 Comments

@cHao corrected. Re-considering it, this method will still scale quite terribly. It would be better to sort the array [with an efficient algo] and then walk over it to count/eliminate duplicate neighbours.
Might be better to just use array_count_values or array_unique, if i'm understanding the goal of the original code. But that's on them. :)
I ended up using 'array_count_values' instead, thanks a lot. I really need to get used to all these little predefined functions that php already has available. I imagine that these kind of predefined functions are fully optimized and it's no use making your own, right?
2

Use foreach instead of for.

foreach ($keywords as $i => $kw1){
    $term_freq[$i] = 1;
    foreach ($keywords as $j => $kw2){
        if (($i != $j) && ($kw1 == $kw2){
            unset ( $keywords [$j] );
            unset ( $term_freq [$j] );          
            $term_freq[$i]++; 
        }
    }
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.