6

I'm having trouble finding the most efficient way to remove duplicates from a list of strings (List&lt;string&gt;).

My current implementation is a pair of nested foreach loops, checking that each object occurs only once and removing the second occurrence otherwise.

I know there are MANY other questions out there, but the best solutions in all of them require something above .NET 2.0, which is the build environment I'm currently working in. (GM and Chrysler are very resistant to change ... :) )

This limits the possible solutions by ruling out LINQ and HashSet.

The code I'm writing is Visual C++, but a C# solution will work just fine as well.

Thanks!

5 Answers

15

This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...

Do you have control over this? If so, all you'd need to do is a myList.Contains(currentItem) call before you add the item, and you're set.
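A minimal sketch of that idea (the BuildUnique helper and the sample input are hypothetical, just to show the check-before-add pattern on .NET 2.0):

```csharp
using System;
using System.Collections.Generic;

class DedupOnInsert
{
    // Build the list while skipping duplicates, instead of cleaning up afterwards.
    static List<string> BuildUnique(IEnumerable<string> source)
    {
        List<string> result = new List<string>();
        foreach (string item in source)
        {
            // List<T>.Contains is a linear scan, so this is O(n^2) overall --
            // fine for small lists, as discussed in the comments.
            if (!result.Contains(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    static void Main()
    {
        List<string> unique = BuildUnique(new string[] { "a", "b", "a", "c", "b" });
        Console.WriteLine(string.Join(",", unique.ToArray())); // a,b,c
    }
}
```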


9 Comments

Hah, I never thought about that, I do have control over the initial list generation!
Be aware this approach doesn't scale very well as the size of the list increases...
If size is a concern, I'd think you'd be fine doing the same as above, but using a SortedList as opposed to a standard List
It's O(n^2) since List<T>.Contains is O(n). You need to borrow Jared's dictionary to keep track of the items that you've added, giving O(1) checks and O(n) overall.
Luckily, scale isn't a concern in this particular situation
9

You could do the following.

List<string> list = GetTheList();
Dictionary<string,object> map = new Dictionary<string,object>();
int i = 0;
while ( i < list.Count ) {
  string current = list[i];
  if ( map.ContainsKey(current) ) {
    list.RemoveAt(i);
  } else {
    i++;
    map.Add(current,null);
  }
}

This has the overhead of building a Dictionary&lt;TKey,TValue&gt; object, which will duplicate the set of unique values in the list. But it's fairly efficient speed-wise.

2 Comments

+1 The first thing that popped into mind was comparing each value to every other while removing duplicates as they're found, but the complexity of that is O(N^2). Jared's solution is much nicer, since using a Dictionary data structure makes use of hashing and therefore very fast lookups. Complexity = O(N log N)?
If speed matters, you'd be better creating a new list of the unique values rather than removing the duplicates from the original list, since RemoveAt is O(n) but Add is O(1) when you know the maximum length in advance.
1

I'm no Comp Sci PhD, but I'd imagine using a dictionary, with the items in your list as the keys would be fast.

Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of iteration.
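A minimal sketch of that idea on .NET 2.0 (the Distinct helper name and bool value type are arbitrary choices; only the keys matter):

```csharp
using System;
using System.Collections.Generic;

class DictionaryAsSet
{
    // .NET 2.0 has no HashSet<T>, but Dictionary keys are unique,
    // so the key collection serves as the set of distinct strings.
    static List<string> Distinct(List<string> input)
    {
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        foreach (string s in input)
        {
            seen[s] = true; // the indexer overwrites rather than throwing on a repeat key
        }
        return new List<string>(seen.Keys);
    }

    static void Main()
    {
        List<string> unique = Distinct(new List<string>(new string[] { "x", "y", "x", "z" }));
        Console.WriteLine(unique.Count); // 3
    }
}
```

One caveat: Dictionary makes no guarantee about key order, so if the original ordering matters, append to a separate result list as you insert keys rather than reading them back from the dictionary.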


1

Just remember, when the list holds a custom class, to override the Equals() method so that Contains() functions as required (and, if you use a Dictionary-based approach, override GetHashCode() as well).

Example

List<CustomClass> clz = new List<CustomClass>();

public class CustomClass {

    public override bool Equals(object obj) {
        //Put equality code here...
    }
}


1

If you're going the route of "just don't add duplicates", then checking List.Contains before adding an item works, but it's O(n^2), where n is the number of strings you want to add. It's no different from your current solution using two nested loops.

You'll have better luck using a hash set to store items you've already added, and since you're on .NET 2.0, a Dictionary can substitute for a HashSet:

static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    Dictionary<T, object> hashSet = new Dictionary<T, object>();
    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}

This runs in O(n) and uses O(n) additional space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings -- if you really need maximum performance, you can exploit more powerful data structures like tries to make lookups even faster.

4 Comments

HashSet's are .net 3.5+, which is out of the scope of this question.
My code doesn't use HashSet; it uses a Dictionary, which substitutes for a HashSet.
I should have read your code more thoroughly, I just saw the word HashSet, and skipped over it.
What about null values? Dictionary will throw an ArgumentNullException for a null key.
