Finding duplicates in List<string>

Question

In a list with several hundred thousand entries, how does one go about comparing each entry with the rest of the list for duplicates? For example, List<string> fileNames contains both "00012345.pdf" and "12345.pdf", and these are considered duplicates. What is the best strategy for flagging this kind of duplicate?

Thanks

Update: The naming of files is restricted to numbers. They are padded with zeros. Duplicates are where the padding is missing. Thus, "123.pdf" & "000123.pdf" are duplicates.

First you have to create a list of 'duplication' rules. Depending on the rules, a complex or a simple answer can be given. Whatever you do, you should scan the list as little as possible!
– CodingBarfield, Commented Sep 27, 2011 at 10:07
It's probably fastest to sort the list and then go through it and compare with the previous item.
– Stefan Steinegger, Commented Sep 27, 2011 at 10:11
@Stefan Steinegger: Question seems to imply that 00012345.pdf and 12345.pdf should be considered duplicates (I might be misunderstanding). Ordering would not do the trick in this case.
– InBetween, Commented Sep 27, 2011 at 10:17
@aateeque I think that you should tell if "00012345.pdf" and "12345.pdf" are considered duplicates.... Hate to get answers downgraded because of badly formulated questions... and with no comments for that matter.
– Lysgaard, Commented Sep 27, 2011 at 11:26
Previous comment didn't make it....edited question for clarity
– aateeque, Commented Sep 27, 2011 at 11:33

Neil Fenwick · Accepted Answer · 2011-09-27 13:55:01Z

You probably want to implement your own substring comparer to test equality based on whether a substring is contained within another string.

This isn't necessarily optimised, but it will work. You could also consider using Parallel LINQ (PLINQ) if you are using .NET 4.0.

EDIT: Answer updated to reflect refined question after it was edited

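// Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
void Main()
{
    List<string> stringList = new List<string> { "00012345.pdf", "12345.pdf", "notaduplicate.jpg", "3453456363234.jpg" };

    IEqualityComparer<string> comparer = new NumericFilenameEqualityComparer();

    // Group names that compare equal under the comparer; keep only groups with more than one member.
    var duplicates = stringList.GroupBy(s => s, comparer).Where(grp => grp.Count() > 1);

    // do something with grouped duplicates...
}

// Not safe for nulls!
// NB: do your own parameter / null checks / string-case options etc.!
// The file extension is ignored: only the first run of digits is compared.
public class NumericFilenameEqualityComparer : IEqualityComparer<string>
{
    private static readonly Regex digitFilenameRegex = new Regex(@"\d+", RegexOptions.Compiled);

    // Numeric value of the first run of digits, or null if there is none (or it does not fit in a long).
    private static long? NumericValue(string value)
    {
        Match match = digitFilenameRegex.Match(value);
        long parsed;
        if (match.Success && long.TryParse(match.Value, out parsed)) return parsed;
        return null;
    }

    public bool Equals(string left, string right)
    {
        long? leftValue = NumericValue(left);
        long? rightValue = NumericValue(right);

        // Names without a usable number are equal only if the strings are identical.
        if (leftValue == null || rightValue == null) return left == right;
        return leftValue == rightValue;
    }

    public int GetHashCode(string value)
    {
        // Must agree with Equals: names that compare equal must produce the same hash code.
        long? number = NumericValue(value);
        return number.HasValue ? number.Value.GetHashCode() : value.GetHashCode();
    }
}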
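
As a minimal alternative sketch (not part of the original answer), the same grouping can be done without a custom comparer by grouping on a normalised key. It assumes the rule from the question's update, namely that duplicates differ only by leading zero padding. `DuplicateFinder`, `NormalizeKey` and `FindDuplicates` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateFinder
{
    // Hypothetical helper: strips the zero padding so that
    // "00012345.pdf" and "12345.pdf" share the key "12345.pdf".
    public static string NormalizeKey(string fileName)
    {
        string trimmed = fileName.TrimStart('0');
        // Keep a single zero for names such as "000.pdf", which would otherwise trim to ".pdf".
        return (trimmed.Length == 0 || trimmed[0] == '.') ? "0" + trimmed : trimmed;
    }

    // Returns every group of names that share a normalised key and has more than one member.
    public static List<List<string>> FindDuplicates(IEnumerable<string> fileNames)
    {
        return fileNames
            .GroupBy(name => NormalizeKey(name))
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }

    public static void Main()
    {
        var fileNames = new List<string> { "00012345.pdf", "12345.pdf", "notaduplicate.jpg", "3453456363234.jpg" };

        foreach (List<string> group in FindDuplicates(fileNames))
        {
            Console.WriteLine(string.Join(", ", group)); // prints "00012345.pdf, 12345.pdf"
        }
    }
}
```

Because the key is an ordinary string, the default string hashing spreads entries across buckets, so one pass over the list is enough. Unlike the regex comparer, this sketch treats the extension as part of the key, so "123.pdf" and "123.jpg" are not grouped together.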

InBetween · Answer · 2011-09-27 10:24:28Z

1

I understand you are looking for duplicates in order to remove them?

One way to go about it could be the following:

Create a class MyString which takes care of duplication rules. That is, overrides Equals and GetHashCode to recreate exactly the duplication rules you are considering. (I'm understanding from your question that 00012345.pdf and 12345.pdf should be considered duplicates?)

Make this class explicitly or implicitly convertible to string (or override ToString() for that matter).

Create a HashSet<MyString> and fill it up by iterating through your original List<string>, checking for duplicates as you go.

Might be dirty but it will do the trick. The only "hard" part here is correctly implementing your duplication rules.
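
The steps above could be sketched roughly as follows. This is only an illustration under the assumption that duplicates differ solely by leading zero padding; `PaddedFileName` (standing in for MyString) and `HashSetDuplicateFinder` are hypothetical names, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper standing in for "MyString": equality ignores zero padding.
public sealed class PaddedFileName : IEquatable<PaddedFileName>
{
    private readonly string original;
    private readonly string key;

    public PaddedFileName(string fileName)
    {
        if (fileName == null) throw new ArgumentNullException("fileName");
        original = fileName;
        string trimmed = fileName.TrimStart('0');
        // Keep a single zero for names such as "000.pdf".
        key = (trimmed.Length == 0 || trimmed[0] == '.') ? "0" + trimmed : trimmed;
    }

    public bool Equals(PaddedFileName other)
    {
        return other != null && string.Equals(key, other.key);
    }

    public override bool Equals(object obj) { return Equals(obj as PaddedFileName); }

    // Hash the normalised key so that equal instances always share a hash code.
    public override int GetHashCode() { return key.GetHashCode(); }

    public override string ToString() { return original; }
}

public static class HashSetDuplicateFinder
{
    // Returns the names whose normalised form was already seen earlier in the sequence.
    public static List<string> FindDuplicates(IEnumerable<string> fileNames)
    {
        var seen = new HashSet<PaddedFileName>();
        var duplicates = new List<string>();

        foreach (string fileName in fileNames)
        {
            // HashSet<T>.Add returns false when an equal element is already present.
            if (!seen.Add(new PaddedFileName(fileName)))
            {
                duplicates.Add(fileName);
            }
        }
        return duplicates;
    }
}
```

Calling `HashSetDuplicateFinder.FindDuplicates` on "123.pdf", "000123.pdf", "456.pdf" returns a list containing only "000123.pdf", since the first occurrence of each key is kept in the set and later ones are reported.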

edited Sep 27, 2011 at 10:24

answered Sep 27, 2011 at 10:14


3 Comments

Neil Fenwick Over a year ago

Is overriding Equals really a good idea if they are not really "equal"?

InBetween Over a year ago

@Neil Fenwick: As long as you document it correctly...And after all what is the meaning of Equals? If 00012345.pdf and 12345.pdf are the same file, should Equals return false? Obviously MyString is not a great choice for a meaningful name (understatement of the year) which can be misleading.

Jeremy McGee Over a year ago

I'm with @NeilFenwick here. His solution makes it explicitly clear that there are different rules for grouping and comparing in this particular case.

Zubair · Answer · 2019-03-14 08:39:51Z

0

I have a simple solution for finding duplicate words and characters in a string. For words:

import java.util.HashMap;
import java.util.Map;

public class Test {
    public static void main(String[] args) {
        findDuplicateWords("i am am a a learner learner learner");
    }

    // Counts how often each space-separated word occurs and prints the counts.
    private static void findDuplicateWords(String string) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : string.split(" ")) {
            // Insert 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        System.out.println(counts);
    }
}

For characters, loop over the string's length and read each character with charAt().

answered Mar 14, 2019 at 8:39


Lysgaard · Answer · 2011-09-27 10:14:53Z

-1

Maybe something like this:

List<string> theList = new List<string>() { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };

// Groups identical strings only; "00012345.pdf" and "12345.pdf" end up in different groups.
theList.GroupBy(txt => txt)
       .Where(grouping => grouping.Count() > 1)
       .ToList()
       .ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
                                               groupItem.Key,
                                               groupItem.Count(),
                                               string.Join(" ", groupItem.ToArray())));

answered Sep 27, 2011 at 10:14
