4

I have a jagged array of strings and I need to find all the rows which are unique. For e.g.,

[ 
 ["A","B"] , 
 ["C","D","E"], 
 ["B", "A"],
 ["E","A"] 
]

This should return row 1 and row 3 as row 0 and row 2 are duplicates. How can this be done? Can I make use of hashets ?

5
  • Taken as arrays, row 0 and row 2 are not duplicates. They just have the same set of elements. Commented Dec 26, 2012 at 22:32
  • Yes, you can use HashSet. Either create a wrapper type for each Row or use an IEqualityComparer with the HashSet constructor. (Make sure to use the desired business rules: e.g. sort first before computing hash value or checking sequence equality.) Commented Dec 26, 2012 at 22:32
  • (Even if not using a HashSet, creating an IEqualityComparer is likely advisable and can be used with other approaches that require testing for the "equality" per the business rules.) Commented Dec 26, 2012 at 22:39
  • BTW, are rows [A, B, B] and [A, B, A] equal or not? Commented Dec 26, 2012 at 22:40
  • I have edited your title. Please see, "Should questions include “tags” in their titles?", where the consensus is "no, they should not". Commented Dec 26, 2012 at 22:51

4 Answers 4

2

First of all, taken as arrays, row 0 and row 2 are not duplicates. They just have the same set of elements. However, if you just want to remove those kind of rows, you could do something like:

string[][] GetNonDuplicates(string[][] jagged)
{
  //not a hashset, but a dictionary. A value of false means that the row 
  //is not duplicate, a value of true means that at least one dulicate was found
  Dictionary<string[], bool> dict = 
          new Dictionary<string[], bool>(new RowEqualityComparer());

  foreach(string[] row in jagged)
  {
    //if a duplicate is found - using the hash and the compare method
    if (dict.ContainsKey(row)) 
    {
       dict[row] = true;  //set value to true
    }
    else
    {
      dict.Add(row, false);  //first time we see this row, add it
    }
  }

  //just pop out all the keys which have a value of false
  string[][] result = dict.Where(item => !item.Value)
                          .Select(item => item.Key)
                          .ToArray();
  return result;
}

...
string[][] jagged = new []{new []{"A","B"} , 
                           new []{"C","D","E"}, 
                           new []{"B", "A"},
                           new []{"E","A"}};

string[][] nonDuplicates = GetNonDuplicates(jagged);

where RowEqualityComparer is:

class RowEqualityComparer : IEqualityComparer<string[]>
{
    public bool Equals(string[] first, string[] second)
    {
        // different legths - different rows
        if (first.Length != second.Length)
          return false;

        //we need to copy the arrays because Array.Sort 
        //will change the original rows
        var flist = first.ToList();
        flist.Sort();
        var slist = second.ToList();
        slist.Sort();

        //loop and compare one by one
        for (int i=0; i < flist.Count; i++)
        {
            if (flist[i]!=slist[i])
              return false;
        }
        return true;
    }

    public int GetHashCode(string[] row)
    {
       //I have no idea what I'm doing, just some generic hash code calculation
       if (row.Length == 0)
         return 0;
       int hash = row[0].GetHashCode();
       for (int i = 1; i < row.Length; i++)
         hash ^= row[i].GetHashCode();
       return hash;
    }

}
Sign up to request clarification or add additional context in comments.

2 Comments

I assume that not only the order is irrelevant but also the Length is not meaningful when duplicates in the arrays are not counted(a HashSet would eliminate them).
I'm working on the assumption that [A,B,B] and [A,A,B] would be considered different. In that case, this comparison makes sense. Else, a HashSet would be a correct approach.
2

Assuming that you want to ignore the order, duplicates(since you've already mentioned a HashSet) and the result should contain only arrays which don't have duplicates.

You could implement a custom IEqualityComparer<String[]> for Enumerable.GroupBy and select only arrays which are unique(group-count==1):

class IgnoreOrderComparer : IEqualityComparer<string[]>
{
    public bool Equals(string[] x, string[] y)
    {
        if (x == null || y == null) return false;
        return !x.Distinct().Except(y.Distinct()).Any();
    }

    public int GetHashCode(string[] arr)
    {
        if (arr == null) return int.MinValue;
        int hash = 19;
        foreach (string s in arr.Distinct())
        {
            hash = hash + s.GetHashCode();
        }
        return hash;
    }
}

The rest is straightforward:

String[][] uniques = arrays.GroupBy(arr => arr, new IgnoreOrderComparer())
                           .Where(g => g.Count() == 1)
                           .Select(g => g.First())
                           .ToArray();

Edit: Here's a possibly more efficient version using the same comparer:

IEqualityComparer<string[]> comparer = new IgnoreOrderComparer();
String[][] uniques = arrays.Where(a1 =>
    !arrays.Any(a2 => a1 != a2 && comparer.Equals(a1, a2)))
                           .ToArray();

Comments

1

As far as the algorithmic solution goes, I'd

  1. sort your rows (you can use any sorting metric you like so long as it distinguishes any 2 different rows.)
  2. pick the rows that don't have an identical adjacent row.

If you do that, you should be able to complete your requirement in O(m*n*lg(n)) where m is the length of your rows, and n is the number of rows

Given that sets of values implies equality, you can sort the cells of each row to help you sort the list of rows. this would result in O(n*m*lg(m) + m*n*lg(n))

1 Comment

Can you post some example?
0

I would compute the hash of each row as follows:

[ 
 ["A","B"] , // hash of this row :10 as example 
 ["C","D","E"], // hash of this row  : 20
 ["B", "A"], // hash of this row would be 10 as well
 ["E","A"] 
]

Since they are all string, you can calculate the hash values and create a hash per row.

The way you can use HashSet can be as follow, create hashset per row and then find difference of a row with every other row, if difference is empty then they are same.

You can use intersection as well, if the intersection is not empty that row is not unique.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.