6

I'm having trouble finding the most efficient way to remove duplicates from a list of strings (List&lt;string&gt;).

My current implementation is a pair of nested foreach loops, checking that each object occurs only once and removing the second occurrence otherwise.

I know there are MANY other questions out there, but the best solutions in all of them require something above .NET 2.0, which is the build environment I'm currently working in. (GM and Chrysler are very resistant to change ... :) )

This limits the possible solutions by ruling out LINQ and HashSet.

The code I'm writing is Visual C++, but a C# solution will work just fine as well.

Thanks!

5 Answers

15

This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...

Do you have control over this? If so, all you'd need to do is a myList.Contains(currentItem) call before you add the item, and you're set.
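A minimal sketch of that idea (the BuildUnique helper and the sample input are hypothetical, just to show the check-before-add pattern on .NET 2.0):

```csharp
using System;
using System.Collections.Generic;

class DedupOnInsert
{
    // Build the list while skipping duplicates, instead of cleaning up afterwards.
    static List<string> BuildUnique(IEnumerable<string> source)
    {
        List<string> result = new List<string>();
        foreach (string item in source)
        {
            // List<T>.Contains is a linear scan, so this is O(n^2) overall --
            // fine for small lists, as discussed in the comments.
            if (!result.Contains(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    static void Main()
    {
        List<string> unique = BuildUnique(new string[] { "a", "b", "a", "c", "b" });
        Console.WriteLine(string.Join(",", unique.ToArray())); // a,b,c
    }
}
```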


9 Comments

Hah, I never thought about that, I do have control over the initial list generation!
Be aware this approach doesn't scale very well as the size of the list increases...
If size is a concern, I'd think you'd be fine doing the same as above, but using a SortedList as opposed to a standard List
It's O(n^2) since List<T>.Contains is O(n). You need to borrow Jared's dictionary to keep track of the items that you've added, giving O(1) checks and O(n) overall.
Luckily, scale isn't a concern in this particular situation
9

You could do the following.

List<string> list = GetTheList();
Dictionary<string,object> map = new Dictionary<string,object>();
int i = 0;
while ( i < list.Count ) {
  string current = list[i];
  if ( map.ContainsKey(current) ) {
    list.RemoveAt(i);
  } else {
    i++;
    map.Add(current,null);
  }
}

This has the overhead of building a Dictionary&lt;TKey,TValue&gt; object, which will duplicate the set of unique values in the list. But it's fairly efficient speed-wise.

2 Comments

+1 The first thing that popped into mind was comparing each value to every other while removing duplicates as they're found, but the complexity of that is O(N^2). Jared's solution is much nicer, since using a Dictionary data structure makes use of hashing and therefore very fast lookups. Complexity = O(N log N)?
If speed matters, you'd be better creating a new list of the unique values rather than removing the duplicates from the original list, since RemoveAt is O(n) but Add is O(1) when you know the maximum length in advance.
1

I'm no Comp Sci PhD, but I'd imagine using a dictionary, with the items in your list as the keys would be fast.

Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of iteration.
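A minimal sketch of that idea on .NET 2.0 (the Distinct helper name and bool value type are arbitrary choices; only the keys matter):

```csharp
using System;
using System.Collections.Generic;

class DictionaryAsSet
{
    // .NET 2.0 has no HashSet<T>, but Dictionary keys are unique,
    // so the key collection serves as the set of distinct strings.
    static List<string> Distinct(List<string> input)
    {
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        foreach (string s in input)
        {
            seen[s] = true; // the indexer overwrites rather than throwing on a repeat key
        }
        return new List<string>(seen.Keys);
    }

    static void Main()
    {
        List<string> unique = Distinct(new List<string>(new string[] { "x", "y", "x", "z" }));
        Console.WriteLine(unique.Count); // 3
    }
}
```

One caveat: Dictionary makes no guarantee about key order, so if the original ordering matters, append to a separate result list as you insert keys rather than reading them back from the dictionary.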


1

Just remember, when the list holds a custom class, to override the Equals() method so that Contains() functions as required (and, if you use a Dictionary-based approach, override GetHashCode() as well).

Example

List<CustomClass> clz = new List<CustomClass>();

public class CustomClass {

    public override bool Equals(object obj) {
        //Put equality code here...
    }
}


1

If you're going the route of "just don't add duplicates", then checking List.Contains before adding an item works, but it's O(n^2), where n is the number of strings you want to add. It's no different from your current solution using two nested loops.

You'll have better luck using a hash set to store items you've already added, and since you're on .NET 2.0, a Dictionary can substitute for a HashSet:

static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    Dictionary<T, object> hashSet = new Dictionary<T, object>();
    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}

This runs in O(n) and uses O(n) additional space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings -- if you really need maximum performance, you can exploit more powerful data structures like tries to make lookups even faster.

4 Comments

HashSet's are .net 3.5+, which is out of the scope of this question.
My code doesn't use HashSet; it uses a Dictionary, which substitutes for a HashSet.
I should have read your code more thoroughly, I just saw the word HashSet, and skipped over it.
What about null values? Dictionary will throw an ArgumentNullException for a null key.
