
I am working on a data mining project and I've chosen the Apriori algorithm for the association-rules task. Briefly, I am not satisfied with the execution time of my implementation. I will describe just the problematic part of my code.

I have two Lists of lists.

List<List<int>> one;

List<List<int>> two;

I have to iterate through the elements of list one and check whether one[i] is a subset of two[j]:

foreach (List<int> items in one)
{
    foreach (List<int> items2 in two)
    {
        if (items2.ContainsSetOf(items))
        {
            // do something
        }
    }
}

I was wondering whether there is a way to reduce the execution time of such an approach (parallel execution, using dictionaries, etc.).

Do you guys have any idea how it could be reduced?

Thanks!

3 Answers


Make them lists of sets, and use set operations to check whether one set is a subset of another.

Example

HashSet<int> set1 = new HashSet<int>();
set1.Add(1);
set1.Add(2);

HashSet<int> set2 = new HashSet<int>();
set2.Add(1);
set2.Add(2);
set2.Add(3);

List<HashSet<int>> one = new List<HashSet<int>>();
one.Add(set1);
one.Add(set2);

List<HashSet<int>> two = new List<HashSet<int>>();
two.Add(set1);
two.Add(set2);

foreach (HashSet<int> setA in one) {
    foreach (HashSet<int> setB in two) {
        if (setA.IsSubsetOf(setB)) {
            // do something
        }
    }
}
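The question also mentions parallel execution. Since each pair (setA, setB) is checked independently, the outer loop could additionally be parallelized with Parallel.ForEach. This is a sketch, not part of the original answer, and it assumes the "do something" body is thread-safe (here it only counts matches with an atomic increment):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var one = new List<HashSet<int>> { new HashSet<int> { 1, 2 } };
        var two = new List<HashSet<int>> { new HashSet<int> { 1, 2, 3 }, new HashSet<int> { 4, 5 } };

        int matches = 0;

        // Parallelize the outer loop; iterations only read the shared lists,
        // so the body must not mutate `one` or `two`.
        Parallel.ForEach(one, setA =>
        {
            foreach (HashSet<int> setB in two)
            {
                if (setA.IsSubsetOf(setB))
                {
                    Interlocked.Increment(ref matches); // thread-safe "do something"
                }
            }
        });

        Console.WriteLine(matches); // {1,2} is a subset of {1,2,3} only
    }
}
```

Note that parallelism only reduces wall-clock time by a constant factor; the number of subset checks stays the same.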

4 Comments

Yes, I can use the IsSubsetOf() method, but the problem is not there: I still have to compare every element with every other, which is N^2. Maybe I misunderstand your solution. Could you provide a code example?
@JohnLatham: Having two List<HashSet<T>> will improve the execution time. The subset test is much cheaper on proper sets than on lists. You can also consider using indexing.
@Ibrahim, thanks. Do you know approximately how much faster the iteration should be?
If I understand your question correctly, the number of iterations is the same as before, because the requirement is to compare all sets in one to all sets in two. But you gain speed in the check itself, because sets are implemented in a way that makes the subset test more efficient. That said, there is probably room to bring the number of iterations below n^2, but that is hard to say without more details about the problem and the nature of the data stored in the sets.

C# code snippet

var dict = new Dictionary<int, HashSet<List<int>>>();

foreach (List<int> list2 in two) {
   foreach (int i in list2) {
      if (!dict.ContainsKey(i)) {
         // create an empty HashSet dict[i]
         dict.Add(i, new HashSet<List<int>>());
      }
      // add a reference to list2 to the HashSet dict[i]
      dict[i].Add(list2);
   }
}

foreach (List<int> list1 in one) {
   HashSet<List<int>> listsInTwoContainingList1 = null;
   foreach (int i in list1) {
      HashSet<List<int>> lists;
      if (!dict.TryGetValue(i, out lists)) {
         // i occurs in no list of two, so list1 is a subset of none
         listsInTwoContainingList1 = new HashSet<List<int>>();
         break;
      }
      if (listsInTwoContainingList1 == null) {
         listsInTwoContainingList1 = new HashSet<List<int>>(lists);
      } else {
         listsInTwoContainingList1.IntersectWith(lists);
      }
      if (listsInTwoContainingList1.Count == 0) {   // optimization :p
         break;
      }
   }
   if (listsInTwoContainingList1 == null) {
      continue; // list1 was empty
   }
   foreach (List<int> list2 in listsInTwoContainingList1) {
      // list2 contains list1
      // do something
   }
}

Example

L2= {
L2a = {10, 20, 30, 40}
L2b = {30, 40, 50, 60}
L2c = {10, 25, 30, 40}
}

L1 = {
L1a = {10, 30, 40}
L1b = {30, 25, 50}
}

After the first part of the code:

dict[10] = {L2a, L2c}
dict[20] = {L2a}
dict[25] = {L2c}
dict[30] = {L2a, L2b, L2c}
dict[40] = {L2a, L2b, L2c}
dict[50] = {L2b}
dict[60] = {L2b}

In the second part of the code:

L1a: dict[10] n dict[30] n dict[40] = {L2a, L2c}
L1b: dict[30] n dict[25] n dict[50] = { }

So L1a is included in L2a and L2c, but L1b in none.
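As a sanity check, the walk-through above can be run end-to-end. This sketch rebuilds the dictionary from the example data and prints, for each list in one, how many lists in two contain it (lists are compared by reference, as in the code above):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var L2a = new List<int> { 10, 20, 30, 40 };
        var L2b = new List<int> { 30, 40, 50, 60 };
        var L2c = new List<int> { 10, 25, 30, 40 };
        var two = new List<List<int>> { L2a, L2b, L2c };

        var L1a = new List<int> { 10, 30, 40 };
        var L1b = new List<int> { 30, 25, 50 };
        var one = new List<List<int>> { L1a, L1b };

        // First part: invert `two` into element -> set of lists containing it.
        var dict = new Dictionary<int, HashSet<List<int>>>();
        foreach (var list2 in two)
            foreach (int i in list2)
            {
                if (!dict.ContainsKey(i))
                    dict.Add(i, new HashSet<List<int>>());
                dict[i].Add(list2);
            }

        // Second part: intersect the sets dict[i] over the elements of each list1.
        foreach (var list1 in one)
        {
            HashSet<List<int>> result = null;
            foreach (int i in list1)
            {
                HashSet<List<int>> lists;
                if (!dict.TryGetValue(i, out lists))
                {
                    result = new HashSet<List<int>>();
                    break;
                }
                if (result == null) result = new HashSet<List<int>>(lists);
                else result.IntersectWith(lists);
                if (result.Count == 0) break;
            }
            Console.WriteLine(result.Count); // L1a -> 2 (L2a, L2c); L1b -> 0
        }
    }
}
```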

Complexity

Now regarding the algorithm complexity, suppose L1 has n1 elements, L2 has n2 elements, the average number of elements of the sublists of L1 is m1 and the average number of elements of the sublists of L2 is m2. Then:

  • the original solution is: O(n1 x n2 x m1 x m2), if the containsSetOf method does a nested loop, or, at best, O(n1 x n2 x (m1 + m2)), if it uses a HashSet. Is7aq's solution is also O(n1 x n2 x (m1 + m2)).

  • the proposed solution is: O(n2 x m2 + n1 x (m1 x nd + n2)), where nd is the average number of elements of the sets dict[i].

The efficiency of the proposed solution depends a lot on this nd:

  • If nd is large, close to n2 (when every integer is part of every sublist of L2), then it is just as slow as the original one.

  • However, if nd is expected to be small (i.e. the sublists of L2 are quite different from one another), then the proposed solution would typically be much faster, especially if n1 and n2 are large.

Comments


If you want to reduce the number of "is list in list" checks (or "is set a subset" checks), one way is to build a hierarchy (tree) of lists. Of course, the performance improvement (if any) depends on the data: if none of the lists contains another, you will still have to do all the checks, as now.

1 Comment

Thank you, Igor. I was thinking about something similar, but still had hope that it's possible to do it in O(N) time.
