1

In a list with a few hundred thousand entries, how does one go about comparing each entry with the rest of the list to find duplicates? For example, the list fileNames contains both "00012345.pdf" and "12345.pdf", which are considered duplicates. What is the best strategy for flagging this kind of duplicate?

Thanks

Update: File names consist only of digits and are padded with zeros. Duplicates occur where the padding is missing. Thus, "123.pdf" and "000123.pdf" are duplicates.

5
  • First you have to create a list of 'duplication' rules. Depending on the rules, the answer can be simple or complex. Whatever you do, you should scan the list as few times as possible! Commented Sep 27, 2011 at 10:07
  • 1
    It's probably fastest to sort the list and then go through it and compare with the previous item. Commented Sep 27, 2011 at 10:11
  • 4
    @Stefan Steinegger: Question seems to imply that 00012345.pdf and 12345.pdf should be considered duplicates (I might be misunderstanding). Ordering would not do the trick in this case. Commented Sep 27, 2011 at 10:17
  • @aateeque I think that you should tell if "00012345.pdf" and "12345.pdf" are considered duplicates.... Hate to get answers downgraded because of badly formulated questions... and with no comments for that matter. Commented Sep 27, 2011 at 11:26
  • Previous comment didn't make it....edited question for clarity Commented Sep 27, 2011 at 11:33
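
For illustration, the sort-and-compare idea from the comments can be made to work if the list is sorted on a normalized key instead of the raw name. The following is only a sketch, assuming the rule from the question's update (duplicates differ only by leading zeros). The names SortedDuplicateScan, Normalize and FindDuplicates are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedDuplicateScan
{
    // Assumed duplication rule: names are equal once leading zeros are stripped.
    public static string Normalize(string fileName) => fileName.TrimStart('0');

    // Returns every entry whose normalized name matches the entry sorted just before it.
    public static List<string> FindDuplicates(IEnumerable<string> fileNames)
    {
        // OrderBy is a stable sort, so equal keys keep their original relative order.
        var sorted = fileNames
            .Select(name => new { Key = Normalize(name), Name = name })
            .OrderBy(entry => entry.Key, StringComparer.Ordinal)
            .ToList();

        var duplicates = new List<string>();
        for (int i = 1; i < sorted.Count; i++)
        {
            // After sorting, equal keys are adjacent, so one pass over neighbours is enough.
            if (string.Equals(sorted[i].Key, sorted[i - 1].Key, StringComparison.Ordinal))
                duplicates.Add(sorted[i].Name);
        }
        return duplicates;
    }

    public static void Main()
    {
        var fileNames = new List<string> { "00012345.pdf", "777.pdf", "12345.pdf", "0777.pdf" };
        Console.WriteLine(string.Join(", ", FindDuplicates(fileNames)));
        // prints: 12345.pdf, 0777.pdf
    }
}
```

The sort costs O(n log n), so for a few hundred thousand entries this should be workable, though a hash-based grouping does the same job in a single pass.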

4 Answers

4

You probably want to implement your own substring comparer to test equality based on whether a substring is contained within another string.

This isn't necessarily optimised, but it will work. You could also consider using Parallel LINQ (PLINQ) if you are on .NET 4.0.

EDIT: Answer updated to reflect refined question after it was edited

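using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Program
{
    public static void Main()
    {
        List<string> stringList = new List<string> { "00012345.pdf", "12345.pdf", "notaduplicate.jpg", "3453456363234.jpg" };

        IEqualityComparer<string> comparer = new NumericFilenameEqualityComparer();

        // Group names the comparer treats as equal and keep groups with more than one member.
        var duplicates = stringList.GroupBy(s => s, comparer).Where(grp => grp.Count() > 1);

        // do something with grouped duplicates...
        foreach (var grp in duplicates)
            Console.WriteLine(string.Join(", ", grp));
    }
}

// Not safe for nulls!
// NB: do your own parameter / null checks / string-case options etc!
public class NumericFilenameEqualityComparer : IEqualityComparer<string>
{
    private static readonly Regex digitFilenameRegex = new Regex(@"\d+", RegexOptions.Compiled);

    // Numeric value of the first run of digits in the name.
    // Names without digits all map to long.MaxValue, so they compare equal to each other.
    // long.Parse throws OverflowException if the digit run does not fit in a long.
    private static long NumericValue(string value)
    {
        Match digitsMatch = digitFilenameRegex.Match(value);
        return digitsMatch.Success ? long.Parse(digitsMatch.Value) : long.MaxValue;
    }

    public bool Equals(string left, string right)
    {
        return NumericValue(left) == NumericValue(right);
    }

    // Must agree with Equals: names with the same numeric value get the same hash code,
    // and names with different values are spread across buckets.
    public int GetHashCode(string value)
    {
        return NumericValue(value).GetHashCode();
    }
}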
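
A possible variation on the comparer approach is to group on a normalized string key instead of parsing a number. This is only a sketch, assuming the rule from the question's update (duplicates differ only by leading zeros). PaddingDuplicates and FindGroups are hypothetical names. Because it never calls long.Parse, very long digit runs cannot overflow, and names without digits are not lumped together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PaddingDuplicates
{
    // Groups names by their form without leading zeros and keeps groups with 2+ members.
    public static List<List<string>> FindGroups(IEnumerable<string> fileNames)
    {
        return fileNames
            .GroupBy(name => name.TrimStart('0'), StringComparer.Ordinal)
            .Where(group => group.Count() > 1)
            .Select(group => group.ToList())
            .ToList();
    }

    public static void Main()
    {
        var stringList = new List<string> { "00012345.pdf", "12345.pdf", "notaduplicate.jpg", "3453456363234.jpg" };
        foreach (var group in FindGroups(stringList))
            Console.WriteLine(string.Join(", ", group));
        // prints: 00012345.pdf, 12345.pdf
    }
}
```

Since the extension stays part of the key, "123.pdf" and "000123.jpg" would not be grouped in this sketch. Whether that is wanted depends on your rules.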

1

I understand you are looking for duplicates in order to remove them?

One way to go about it could be the following:

Create a class MyString that takes care of the duplication rules. That is, it overrides Equals and GetHashCode to reproduce exactly the duplication rules you are considering. (I understand from your question that 00012345.pdf and 12345.pdf should be considered duplicates?)

Make this class explicitly or implicitly convertible to string (or override ToString() for that matter).

Create a HashSet<MyString> and fill it by iterating through your original List<string>, checking for duplicates as you go.

Might be dirty but it will do the trick. The only "hard" part here is correctly implementing your duplication rules.
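
The steps above could look roughly like this. It is a sketch only, assuming the duplication rule is "equal after stripping leading zeros". The Deduplicator class and its FindDuplicates method are hypothetical names, and MyString is kept only because it is the name used in this answer.

```csharp
using System;
using System.Collections.Generic;

// Wraps a file name and encodes the assumed duplication rule in Equals/GetHashCode.
public sealed class MyString : IEquatable<MyString>
{
    private readonly string original;
    private readonly string key;

    public MyString(string value)
    {
        original = value ?? throw new ArgumentNullException(nameof(value));
        key = value.TrimStart('0'); // assumption: only leading zeros differ
    }

    public bool Equals(MyString other) => other != null && key == other.key;
    public override bool Equals(object obj) => Equals(obj as MyString);
    public override int GetHashCode() => StringComparer.Ordinal.GetHashCode(key);
    public override string ToString() => original;
}

public static class Deduplicator
{
    // Returns the entries whose normalized form was already seen earlier in the list.
    public static List<string> FindDuplicates(IEnumerable<string> fileNames)
    {
        var seen = new HashSet<MyString>();
        var duplicates = new List<string>();
        foreach (string name in fileNames)
        {
            // HashSet<T>.Add returns false when an equal element is already present.
            if (!seen.Add(new MyString(name)))
                duplicates.Add(name);
        }
        return duplicates;
    }

    public static void Main()
    {
        var fileNames = new List<string> { "00012345.pdf", "12345.pdf", "123.pdf", "000123.pdf", "42.pdf" };
        Console.WriteLine(string.Join(", ", FindDuplicates(fileNames)));
        // prints: 12345.pdf, 000123.pdf
    }
}
```

The list is scanned once, and each lookup in the set is a hash lookup, so the cost grows roughly linearly with the number of entries.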

3 Comments

Is overriding Equals really a good idea if they are not really "equal"?
@Neil Fenwick: As long as you document it correctly...And after all what is the meaning of Equals? If 00012345.pdf and 12345.pdf are the same file, should Equals return false? Obviously MyString is not a great choice for a meaningful name (understatement of the year) which can be misleading.
I'm with @NeilFenwick here. His solution makes it explicitly clear that there are different rules for grouping and comparing in this particular case.
0

I have a simple solution for finding duplicate words and characters in a string. For words:

import java.util.HashMap;

public class Test {
    public static void main(String[] args) {
        findDuplicateWords("i am am a a learner learner learner");
    }

    // Counts how often each space-separated word occurs and prints the counts.
    private static void findDuplicateWords(String string) {
        HashMap<String, Integer> hm = new HashMap<>();
        String[] s = string.split(" ");
        for (String tempString : s) {
            if (hm.get(tempString) != null) {
                hm.put(tempString, hm.get(tempString) + 1);
            } else {
                hm.put(tempString, 1);
            }
        }
        System.out.println(hm);
    }
}

For characters, use a for loop over the string's length and read each character with charAt().
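
The character case might be sketched as follows. It is written in C#, the language of the question, so the string indexer text[i] stands in for Java's charAt(i). CharacterCounter and CountCharacters are made-up names for this example.

```csharp
using System;
using System.Collections.Generic;

public static class CharacterCounter
{
    // Counts how often each character occurs in the given text.
    public static Dictionary<char, int> CountCharacters(string text)
    {
        var counts = new Dictionary<char, int>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // TryGetValue leaves current at 0 when the character has not been seen yet.
            counts.TryGetValue(c, out int current);
            counts[c] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        Dictionary<char, int> counts = CountCharacters("learner");
        Console.WriteLine(counts['e']); // prints: 2
        Console.WriteLine(counts['l']); // prints: 1
    }
}
```

Any character with a count above 1 is a duplicate.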


-1

Maybe something like this:

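using System;
using System.Collections.Generic;
using System.Linq;

List<string> theList = new List<string>() { "00012345.pdf", "00012345.pdf", "12345.pdf", "1234567.pdf", "12.pdf" };

// Groups on the exact string, so only "00012345.pdf" (listed twice) is reported;
// "12345.pdf" is not matched against it.
theList.GroupBy(txt => txt)
        .Where(grouping => grouping.Count() > 1)
        .ToList()
        .ForEach(groupItem => Console.WriteLine("{0} duplicated {1} times with these values {2}",
                                                 groupItem.Key,
                                                 groupItem.Count(),
                                                 string.Join(" ", groupItem.ToArray())));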

