1

I have a file containing space separated numbers in each line. Each line corresponds to a list of numbers.
Now there are around 300,000 such lines (each line containing around 100 numbers on an average).
I want to find mutual intersection of all such lists i.e. first list intersected with all other lists, then second list intersected with all other lists and so on.
I am using

set(a) & set(b)  

where a and b are lists I get iterating in a double loop.
But this is taking too much time. For ex : for first list intersected with all other lists, it took about 3 minutes.
How can I do this efficiently? (may be with some other language/tool)

5
  • Are you finding a intersect b intersect .... ? What do you mean by mutual intersection? Commented Jan 22, 2013 at 10:36
  • 1
    300,000 x 300,000 = 90 billion lists. Even if you manage to calculate all possible combinations, I'm wondering how you're going to store them. Commented Jan 22, 2013 at 11:06
  • @thg435 : I don't want to store lists. I just find intersection, do some computation, and throw away that list. Commented Jan 22, 2013 at 11:11
  • use set(a).intersection(b) Commented Jan 22, 2013 at 11:13
  • What fraction of the intersections do you expect to be empty? Commented Jan 22, 2013 at 11:18

3 Answers 3

5

You should use generator expressions here, they do lazy evaluation and save a lot of memory:

In [46]: from itertools import imap

In [47]: a = [[1,2,3], [2,3,4], [3,4,5]]

In [48]: reduce(set.intersection,imap(set,a))
Out[48]: set([3])

considering your file looks like this:

1 2 3
2 3 4
3 4 5

code: Using itertools.combinations():

with open("abc.txt") as f:
    lines=(map(int,x.split()) for x in f)
    for x in combinations(lines,2):
        print x,'-->',reduce(set.intersection,imap(set,x))
   ....:         
([1, 2, 3], [2, 3, 4]) --> set([2, 3])
([1, 2, 3], [3, 4, 5]) --> set([3])
([2, 3, 4], [3, 4, 5]) --> set([3, 4])
Sign up to request clarification or add additional context in comments.

3 Comments

I don't want intersection of all at once. I want intersection of two lists only at a time. So For ex : list1 & list2, then list1 & list3, then list2 & list1, then list2 & list3, then list3 & list1, and then list3 & list2.
@HappyMittal you're looking for itertools.combinations here.
@HappyMittal list1 & list2 and list2 & list1 are actually same thing.
1

First idea to come would be to first build all the sets once, if it all fit in memory, then intersect them.

If you really need all the intersections of 300000 lines with 300000 lines, it will take time anyway. Maybe you should rethink your problem.

Comments

1

I think you can optimize this by creating an inverted index, that is, a mapping number => list of rows that contain this number. For example, if 10 occurs on rows 5, 100, 200 then you'll have

10: [5, 100, 200]

To further optimize this, you can store the rowlist as a set of pairs:

10: set( (5,100), (5,200), (100,200) )

Then, to compute an intersection of list_a + list_b, just find all numbers whose associated rowlist contains (list_a, list_b).

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.