
I don't know how to compute an intersection between these two arrays:

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

result = [[288,24], [193, 12]]

So the intersection is keyed on the first element, and the second elements are summed. Any ideas how to do this efficiently?

OK, I made a mistake by not explaining what I mean by "efficient"; sorry. Consider the following naive implementation:

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
result = {}
for i, j in a:
    for k, l in b:
        if i == k:
            result[i] = j + l
print result

So I was trying to find a more efficient solution to my problem, more Pythonic in a way. That's why I needed your help.

Try these test cases (my code is included in them too):

Test Case

Running time: 28.6980509758
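For reference, a minimal timing harness along these lines (Python 3 syntax; the function names, data sizes, and generated data are illustrative, not the original test case):

```python
import random
import timeit

def naive(a, b):
    # O(len(a) * len(b)): compares every pair, like the loop above
    result = {}
    for i, j in a:
        for k, l in b:
            if i == k:
                result[i] = j + l
    return result

def dict_based(a, b):
    # O(len(a) + len(b)): hash lookups instead of nested loops
    da, db = dict(a), dict(b)
    return {k: da[k] + db[k] for k in set(da) & set(db)}

random.seed(0)
keys = random.sample(range(10**6), 2000)
a = [[k, random.randint(1, 100)] for k in keys[:1000]]
b = [[k, random.randint(1, 100)] for k in keys[1000:]]
b += a[:50]  # guarantee some overlap between the two lists

assert naive(a, b) == dict_based(a, b)
print("naive:     ", timeit.timeit(lambda: naive(a, b), number=5))
print("dict-based:", timeit.timeit(lambda: dict_based(a, b), number=5))
```

On lists of this size the nested-loop version is typically orders of magnitude slower than the dictionary version.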

  • What should the behavior be if there are two matching elements in a? Should they be summed too? For example, a = [[100, 1], [100, 2]], b = [[50, 1]] Commented Aug 21, 2013 at 15:37
  • 1
    I don't know that I would keep this data as list of lists. Perhaps a dictionary or better yet a counter would make more sense. Commented Aug 21, 2013 at 16:11
  • 1
    If you are using the word "efficiently" seriously, what alternatives have you tried, and why their performance was inadequate? Also, and very importantly, what is the size of the dataset? Commented Aug 21, 2013 at 16:14
  • 1
    @badc0re There seems to be an argument about using expressive (and easy to write, test, and understand) language features vs. efficiency (very important too). Why don't you help us settle this in this particular case? Can you run and time a test of the different answers and publish the results? Commented Aug 21, 2013 at 16:25
  • 1
    @badc0re And what are your expectations? Why is 29 seconds not good enough? This is very important. Most experienced developers will incrementally improve performance (and decrease readability) to the point where performance is acceptable, and not a single bit more. They value their time and the time of other developers who could come across their code. Commented Aug 21, 2013 at 17:06

10 Answers

4

This data seems like it would be better stored as a dictionary:

da = dict(a)
db = dict(b)

Once you have it in that form, you can just do:

result = [[k, da[k] + db[k]] for k in set(da.keys()).intersection(db.keys())]

Or, in Python 2.7, you can use viewkeys instead of building a set:

result = [[k, da[k] + db[k]] for k in da.viewkeys() & db]
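For completeness, a quick end-to-end check with the question's sample data (Python 3 syntax, where dict.keys() already returns a set-like view, so neither set() nor viewkeys is needed):

```python
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

da = dict(a)
db = dict(b)

# In Python 3, dict.keys() supports set operations directly.
result = [[k, da[k] + db[k]] for k in da.keys() & db.keys()]
print(sorted(result))  # [[193, 12], [288, 24]]
```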

2 Comments

I haven't downvoted; your solution has the best performance so far, measured at 0.0124080181122 seconds.
nice! pythonic and efficient
3
result = []
ms, mb = (dict(a),dict(b)) if len(a)<len(b) else (dict(b),dict(a))
for k in ms:
  if k in mb:
    result.append([k,ms[k]+mb[k]])
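Whether the size check buys anything depends on the data; with lists as asymmetric as the made-up ones below, the loop runs over the two keys of the smaller dict rather than the five of the larger one (Python 3 syntax):

```python
# Hypothetical asymmetric inputs, to show the point of the len() check:
a = [[125, 1], [193, 1], [288, 23], [300, 5], [301, 7]]
b = [[288, 1], [193, 11]]

result = []
# Iterate over the smaller dict; membership tests hit the larger one.
ms, mb = (dict(a), dict(b)) if len(a) < len(b) else (dict(b), dict(a))
for k in ms:
    if k in mb:
        result.append([k, ms[k] + mb[k]])

print(sorted(result))  # [[193, 12], [288, 24]]
```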

3 Comments

Sorry, I realized I missed the [] in the last statement.
Also added a nice optimization in case you have data where one array is really small, just my 2 cents.
That's pretty quick at 0.0067 seconds on my laptop, with the results verified to match the original. I think it's nicely readable, too. Well done!
2

Use a Counter:

from collections import Counter

c_sum = Counter()
c_len = Counter()
for elt in a:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

for elt in b:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

print [[k, c_sum[k]] for k, v in c_len.iteritems() if v > 1]
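A self-contained Python 3 version of the same idea, run against the question's data (iteritems() was removed in Python 3; items() takes its place):

```python
from collections import Counter

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

c_sum = Counter()  # running total per key
c_len = Counter()  # how many lists each key appears in
for elt in a:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1
for elt in b:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

result = [[k, c_sum[k]] for k, v in c_len.items() if v > 1]
print(sorted(result))  # [[193, 12], [288, 24]]
```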

Comments

2

Here you go

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
inter = []
for e in a:
    for e2 in b:
        if e[0] == e2[0]:
            inter.append([e[0], e[1]+e2[1]])
print inter

Outputs

[[193, 12], [288, 24]]

3 Comments

Maybe you missed the word "efficiently" in the question? I don't think an O(n^2) algorithm qualifies.
I think the OP used the word "efficiently" just out of habit. Or ignorance. If he really had meant it, he would have specified the size of the lists, the general time constraints, what other solutions he had tried with timing information, and even why he's using Python instead of other languages.
It's quite possible that the OP isn't that concerned about the time complexity of the solution, as they haven't shown any attempt to improve on a basic solution and are rather just asking a general question about their problem.
2

This solution works if you also want duplicate items within the lists to be counted.

from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

d = defaultdict(int)

for value, num in a+b:
    d[value] += num
result = filter(lambda x:x[1]>1, d.items())
result = map(list, result)  # If it's important that the result be a list of lists rather than a list of tuples
print result
# [[288, 24], [193, 12]]
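One caveat: filtering on the summed value (x[1] > 1) only works when every non-shared entry has value 1; a key that appears once with a value greater than 1 would slip through. Counting occurrences separately avoids that. A corrected Python 3 sketch (filter() and map() return lazy iterators in Python 3, so a comprehension is used instead):

```python
from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

totals = defaultdict(int)
seen = defaultdict(int)  # how many entries carried each key
for value, num in a + b:
    totals[value] += num
    seen[value] += 1

# Keep only keys that occurred more than once, regardless of their values.
result = [[value, totals[value]] for value in totals if seen[value] > 1]
print(sorted(result))  # [[193, 12], [288, 24]]
```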

Comments

2

In the first place, Python does not have arrays; it has lists. It's just a matter of naming, but it can be confusing. The one-liner for this is:

[ [ae[0],ae[1]+be[1]] for be in b for ae in a if be[0] == ae[0] ]

PS: As you say "intersection", I assume the lists are set-like ("bags", really) and that, as bags, they are properly normalized (i.e. they don't have repeated elements/keys).
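For reference, the one-liner run on the question's data (note that it scans all len(a) * len(b) pairs, which is the efficiency concern raised in the comments):

```python
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

# Nested comprehension: every (be, ae) pair is examined.
result = [[ae[0], ae[1] + be[1]] for be in b for ae in a if be[0] == ae[0]]
print(result)  # [[288, 24], [193, 12]]
```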

8 Comments

One-liners are overrated; this one is particularly inefficient.
When you understand them, they have an important place. And if you are so worried about performance, what are you doing developing in Python? Its main advantage over other languages is expressiveness, not performance. Please don't go around discrediting an important part of the language and preventing people from learning about it. Let them draw their own conclusions. In this particular case, I don't think the performance difference is much. Did you time them?
This fails to answer the question posed by the OP. This is not efficient. Making arguments against the language does not mitigate that your solution is an inefficient algorithm.
Efficiency is relative to the problem. If for you "expressive" and "not as efficient as other languages" are arguments against Python, then you are right. For me, the first is in favor of Python and the second just statement of a fact.
I would agree that for small data sets your solution works fine and is expressive. However, the OP wrote several times that an efficient algorithm was requested, leading me to believe that these lists can be very large. The complexity of your algorithm is O(n*m). Better can be done while still remaining expressive. I also disagree with the statement that just because someone is coding in Python they don't care about performance.
2

Here is how I would approach it, assuming uniqueness on a and b:

k = {} # store totals
its = {} # store intersections
for i in a + b:
    if i[0] in k:
        its[i[0]] = True
        k[i[0]] += i[1]
    else:
        k[i[0]] = i[1]
# then loop through intersections for results
result = [[i, k[i]] for i in its]
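Exercised on the question's data (Python 3 syntax; the its dict just plays the role of a set recording which keys were seen a second time, and the approach assumes keys are unique within each list, as stated above):

```python
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

k = {}    # running totals per key
its = {}  # keys seen more than once, i.e. the intersection
for i in a + b:
    if i[0] in k:
        its[i[0]] = True
        k[i[0]] += i[1]
    else:
        k[i[0]] = i[1]

result = [[i, k[i]] for i in its]
print(sorted(result))  # [[193, 12], [288, 24]]
```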

2 Comments

Also great performance (0.0174601078033), but I don't get the its[i[0]] = True part.
When it hits the key/id the second time, that means it's in the intersection, so we store it right there instead of doing another loop just to find out whether it's an intersection.
2

I got:

from collections import defaultdict
d = defaultdict(list)
for series in a, b:
    for key, value in series:
        d[key].append(value)
result2 = [(key, sum(values)) for key, values in d.iteritems() if len(values) > 1]

which runs in O(len(a)+len(b)), or about 0.02 seconds on my laptop vs 18.79 for yours. I also confirmed that it returned the same results as result.items() from your algorithm.
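The same approach in Python 3 (iteritems() becomes items()), run end to end on the question's data:

```python
from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

d = defaultdict(list)
for series in (a, b):
    for key, value in series:
        d[key].append(value)

# Keys that collected more than one value appeared in both lists.
result2 = [(key, sum(values)) for key, values in d.items() if len(values) > 1]
print(sorted(result2))  # [(193, 12), (288, 24)]
```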

Comments

1

This solution might not be the fastest, but it's probably the simplest implementation, so I decided to post it, for completeness.

from collections import Counter

aa = Counter(dict(a))
bb = Counter(dict(b))
cc = aa + bb
cc
=> Counter({288: 24, 193: 12, 108: 1, 125: 1})

list(cc.items())
=> [(288, 24), (193, 12), (108, 1), (125, 1)]

If you must only include the common keys:

[ (k, cc[k]) for k in set(aa).intersection(bb) ]
=> [(288, 24), (193, 12)]

Comments

1

numpy searchsorted(), argsort(), and intersect1d() are possible alternatives and can be quite fast. This example should also take care of the non-unique first-element issue.

>>> import numpy as np
>>> a=np.array([[125, 1], [193, 1], [288, 23]])
>>> b=np.array([[108, 1], [288, 1], [193, 11]])
>>> aT=a.T
>>> bT=b.T
>>> aS=aT[0][np.argsort(aT[0])]
>>> bS=bT[0][np.argsort(bT[0])]
>>> i=np.intersect1d(aT[0], bT[0])
>>> cT=np.hstack((aT[:,np.searchsorted(aS, i)], bT[:,np.searchsorted(bS, i)]))
>>> [[item,np.sum(cT[1,np.argwhere(cT[0]==item).flatten()])] for item in i]
[[193, 12], [288, 24]]  # not quite happy about this; can someone come up with a vectorized way of doing it?
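On newer NumPy (return_indices was added to intersect1d in 1.15), the matching row indices come back directly, which vectorizes the final step. A sketch, assuming first elements are unique within each array:

```python
import numpy as np

a = np.array([[125, 1], [193, 1], [288, 23]])
b = np.array([[108, 1], [288, 1], [193, 11]])

# return_indices=True (NumPy >= 1.15) yields the positions of the
# common keys in each input, so the sums need no Python-level loop.
common, ia, ib = np.intersect1d(a[:, 0], b[:, 0], return_indices=True)
result = np.column_stack((common, a[ia, 1] + b[ib, 1]))
print(result.tolist())  # [[193, 12], [288, 24]]
```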

1 Comment

Performance will depend on the dimension of a and b arrays, see this post: stackoverflow.com/questions/3650194/…
