
I don't know how to compute an intersection between these two arrays:

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

result = [[288,24], [193, 12]]

So the intersection is keyed on the first element, and the second elements are summed. Any ideas how to do this efficiently?

OK, I made a mistake by not explaining what I mean by "efficient"; sorry. Consider the following naive implementation:

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
result = {}
for i, j in a:
    for k, l in b:
        if i == k:
            result[i] = j + l
print result

So I was trying to find a more efficient solution to my problem, more Pythonic in a way. That's why I needed your help.

Try these test cases (my code is included in them too):

Test Case

Running time: 28.6980509758
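For reference, a minimal timing harness along these lines (Python 3 syntax; the function names, data sizes, and generated data are illustrative, not the original test case):

```python
import random
import timeit

def naive(a, b):
    # O(len(a) * len(b)): compares every pair, like the loop above
    result = {}
    for i, j in a:
        for k, l in b:
            if i == k:
                result[i] = j + l
    return result

def dict_based(a, b):
    # O(len(a) + len(b)): hash lookups instead of nested loops
    da, db = dict(a), dict(b)
    return {k: da[k] + db[k] for k in set(da) & set(db)}

random.seed(0)
keys = random.sample(range(10**6), 2000)
a = [[k, random.randint(1, 100)] for k in keys[:1000]]
b = [[k, random.randint(1, 100)] for k in keys[1000:]]
b += a[:50]  # guarantee some overlap between the two lists

assert naive(a, b) == dict_based(a, b)
print("naive:     ", timeit.timeit(lambda: naive(a, b), number=5))
print("dict-based:", timeit.timeit(lambda: dict_based(a, b), number=5))
```

On lists of this size the nested-loop version is typically orders of magnitude slower than the dictionary version.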

  • What should the behavior be if there are two matching elements in a? Should they be summed too? For example, a = [[100, 1], [100, 2]], b = [[50, 1]] Commented Aug 21, 2013 at 15:37
  • 1
    I don't know that I would keep this data as list of lists. Perhaps a dictionary or better yet a counter would make more sense. Commented Aug 21, 2013 at 16:11
  • 1
    If you are using the word "efficiently" seriously, what alternatives have you tried, and why their performance was inadequate? Also, and very importantly, what is the size of the dataset? Commented Aug 21, 2013 at 16:14
  • 1
    @badc0re There seems to be an argument about using expressive (and easy to write, test, and understand) language features vs. efficiency (very important too). Why don't you help us settle this in this particular case? Can you run and time a test of the different answers and publish the results? Commented Aug 21, 2013 at 16:25
  • 1
    @badc0re And what are your expectations? Why is 29 seconds not good enough? This is very important. Most experienced developers will incrementally improve performance (and decrease readability) to the point where performance is acceptable, and not a single bit more. They value their time and the time of other developers who could come across their code. Commented Aug 21, 2013 at 17:06

10 Answers

4

This data seems like it would be better stored as a dictionary:

da = dict(a)
db = dict(b)

Once you have it in that form, you can just do:

result = [[k, da[k] + db[k]] for k in set(da.keys()).intersection(db.keys())]

Or, in Python 2.7, you can use viewkeys instead of building a set:

result = [[k, da[k] + db[k]] for k in da.viewkeys() & db]
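For completeness, a quick end-to-end check with the question's sample data (Python 3 syntax, where dict.keys() already returns a set-like view, so neither set() nor viewkeys is needed):

```python
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

da = dict(a)
db = dict(b)

# In Python 3, dict.keys() supports set operations directly.
result = [[k, da[k] + db[k]] for k in da.keys() & db.keys()]
print(sorted(result))  # [[193, 12], [288, 24]]
```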

2 Comments

I haven't downvoted; your solution has the best performance so far, measured at 0.0124080181122 seconds.
nice! pythonic and efficient
3
result = []
ms, mb = (dict(a),dict(b)) if len(a)<len(b) else (dict(b),dict(a))
for k in ms:
  if k in mb:
    result.append([k,ms[k]+mb[k]])
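Whether the size check buys anything depends on the data; with lists as asymmetric as the made-up ones below, the loop runs over the two keys of the smaller dict rather than the five of the larger one (Python 3 syntax):

```python
# Hypothetical asymmetric inputs, to show the point of the len() check:
a = [[125, 1], [193, 1], [288, 23], [300, 5], [301, 7]]
b = [[288, 1], [193, 11]]

result = []
# Iterate over the smaller dict; membership tests hit the larger one.
ms, mb = (dict(a), dict(b)) if len(a) < len(b) else (dict(b), dict(a))
for k in ms:
    if k in mb:
        result.append([k, ms[k] + mb[k]])

print(sorted(result))  # [[193, 12], [288, 24]]
```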

3 Comments

Sorry, I realized I missed the [] in the last statement.
Also added a nice optimization in case you have data where one array is really small, just my 2 cents.
That's pretty quick at 0.0067 seconds on my laptop, with the results verified to match the original. I think it's nicely readable, too. Well done!
2

Use a Counter:

from collections import Counter

c_sum = Counter()
c_len = Counter()
for elt in a:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

for elt in b:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

print [[k, c_sum[k]] for k, v in c_len.iteritems() if v > 1]
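A self-contained Python 3 version of the same idea, run against the question's data (iteritems() was removed in Python 3; items() takes its place):

```python
from collections import Counter

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

c_sum = Counter()  # running total per key
c_len = Counter()  # how many lists each key appears in
for elt in a:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1
for elt in b:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

result = [[k, c_sum[k]] for k, v in c_len.items() if v > 1]
print(sorted(result))  # [[193, 12], [288, 24]]
```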

Comments

2

Here you go

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
inter = []
for e in a:
    for e2 in b:
        if e[0] == e2[0]:
            inter.append([e[0], e[1]+e2[1]])
print inter

Outputs

[[193, 12], [288, 24]]

3 Comments

Maybe you missed the word "efficiently" in the question? I don't think an O(n^2) algorithm qualifies.
I think the OP used the word "efficiently" just out of habit. Or ignorance. If he really had meant it, he would have specified the size of the lists, the general time constraints, what other solutions he had tried with timing information, and even why he's using Python instead of other languages.
It's quite possible that the OP isn't that concerned about the time complexity of the solution, as they haven't shown any attempt to improve on a basic solution and are rather just asking a general question about their problem.
2

This solution works if you also want duplicate items within the lists to be counted.

from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

d = defaultdict(int)

for value, num in a+b:
    d[value] += num
result = filter(lambda x:x[1]>1, d.items())
result = map(list, result)  # If it's important that the result be a list of lists rather than a list of tuples
print result
# [[288, 24], [193, 12]]
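One caveat: filtering on the summed value (x[1] > 1) only works when every non-shared entry has value 1; a key that appears once with a value greater than 1 would slip through. Counting occurrences separately avoids that. A corrected Python 3 sketch (filter() and map() return lazy iterators in Python 3, so a comprehension is used instead):

```python
from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

totals = defaultdict(int)
seen = defaultdict(int)  # how many entries carried each key
for value, num in a + b:
    totals[value] += num
    seen[value] += 1

# Keep only keys that occurred more than once, regardless of their values.
result = [[value, totals[value]] for value in totals if seen[value] > 1]
print(sorted(result))  # [[193, 12], [288, 24]]
```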

Comments

2

In the first place, Python does not have arrays; it has lists. It's just a matter of naming, but it can be confusing. The one-liner for this is:

[ [ae[0],ae[1]+be[1]] for be in b for ae in a if be[0] == ae[0] ]

PS: As you say "intersection", I assume the lists are set-like ("bags", really) and that, as bags, they are properly normalized (i.e. they don't have repeated elements/keys).
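For reference, the one-liner run on the question's data (note that it scans all len(a) * len(b) pairs, which is the efficiency concern raised in the comments):

```python
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

# Nested comprehension: every (be, ae) pair is examined.
result = [[ae[0], ae[1] + be[1]] for be in b for ae in a if be[0] == ae[0]]
print(result)  # [[288, 24], [193, 12]]
```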

8 Comments

One-liners are overrated; this one is particularly inefficient.
When you understand them, they have an important place. And if you are so worried about performance, what are you doing developing in Python? Its main advantage over other languages is expressiveness, not performance. Please don't go around discrediting an important part of the language and preventing people from learning about it. Let them draw their own conclusions. In this particular case, I don't think the performance difference is much. Did you time them?
This fails to answer the question posed by the OP. This is not efficient. Making arguments against the language does not mitigate that your solution is an inefficient algorithm.
Efficiency is relative to the problem. If for you "expressive" and "not as efficient as other languages" are arguments against Python, then you are right. For me, the first is in favor of Python and the second just statement of a fact.
I would agree that for small data sets your solution works fine and is expressive. However, the OP wrote several times that an efficient algorithm was requested, leading me to believe that these lists can be very large. The complexity of your algorithm is O(n*m). Better can be done while still remaining expressive. I also disagree with the statement that just because someone is coding in Python they don't care about performance.
2

Here is how I would approach it, assuming uniqueness on a and b:

k = {} # store totals
its = {} # store intersections
for i in a + b:
    if i[0] in k:
        its[i[0]] = True
        k[i[0]] += i[1]
    else:
        k[i[0]] = i[1]
# then loop through intersections for results
result = [[i, k[i]] for i in its]
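Exercised on the question's data (Python 3 syntax; the its dict just plays the role of a set recording which keys were seen a second time, and the approach assumes keys are unique within each list, as stated above):

```python
a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

k = {}    # running totals per key
its = {}  # keys seen more than once, i.e. the intersection
for i in a + b:
    if i[0] in k:
        its[i[0]] = True
        k[i[0]] += i[1]
    else:
        k[i[0]] = i[1]

result = [[i, k[i]] for i in its]
print(sorted(result))  # [[193, 12], [288, 24]]
```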

2 Comments

Also great performance (0.0174601078033), but I don't get the its[i[0]] = True part.
When it hits the key/id the second time, that means it's in the intersection, so we store it right there instead of doing another loop just to find out whether it's an intersection.
2

I got:

from collections import defaultdict
d = defaultdict(list)
for series in a, b:
    for key, value in series:
        d[key].append(value)
result2 = [(key, sum(values)) for key, values in d.iteritems() if len(values) > 1]

which runs in O(len(a)+len(b)), or about 0.02 seconds on my laptop vs 18.79 for yours. I also confirmed that it returned the same results as result.items() from your algorithm.
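The same approach in Python 3 (iteritems() becomes items()), run end to end on the question's data:

```python
from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

d = defaultdict(list)
for series in (a, b):
    for key, value in series:
        d[key].append(value)

# Keys that collected more than one value appeared in both lists.
result2 = [(key, sum(values)) for key, values in d.items() if len(values) > 1]
print(sorted(result2))  # [(193, 12), (288, 24)]
```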

Comments

1

This solution might not be the fastest, but it's probably the simplest implementation, so I decided to post it, for completeness.

from collections import Counter

aa = Counter(dict(a))
bb = Counter(dict(b))
cc = aa + bb
cc
=> Counter({288: 24, 193: 12, 108: 1, 125: 1})

list(cc.items())
=> [(288, 24), (193, 12), (108, 1), (125, 1)]

If you must only include the common keys:

[ (k, cc[k]) for k in set(aa).intersection(bb) ]
=> [(288, 24), (193, 12)]

Comments

1

numpy searchsorted(), argsort(), and intersect1d() are possible alternatives and can be quite fast. This example should also take care of the non-unique first-element issue.

>>> import numpy as np
>>> a=np.array([[125, 1], [193, 1], [288, 23]])
>>> b=np.array([[108, 1], [288, 1], [193, 11]])
>>> aT=a.T
>>> bT=b.T
>>> aS=aT[0][np.argsort(aT[0])]
>>> bS=bT[0][np.argsort(bT[0])]
>>> i=np.intersect1d(aT[0], bT[0])
>>> cT=np.hstack((aT[:,np.searchsorted(aS, i)], bT[:,np.searchsorted(bS, i)]))
>>> [[item,np.sum(cT[1,np.argwhere(cT[0]==item).flatten()])] for item in i]
[[193, 12], [288, 24]]  # not quite happy about this; can someone come up with a vectorized way of doing it?
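On newer NumPy (return_indices was added to intersect1d in 1.15), the matching row indices come back directly, which vectorizes the final step. A sketch, assuming first elements are unique within each array:

```python
import numpy as np

a = np.array([[125, 1], [193, 1], [288, 23]])
b = np.array([[108, 1], [288, 1], [193, 11]])

# return_indices=True (NumPy >= 1.15) yields the positions of the
# common keys in each input, so the sums need no Python-level loop.
common, ia, ib = np.intersect1d(a[:, 0], b[:, 0], return_indices=True)
result = np.column_stack((common, a[ia, 1] + b[ib, 1]))
print(result.tolist())  # [[193, 12], [288, 24]]
```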

1 Comment

Performance will depend on the dimension of a and b arrays, see this post: stackoverflow.com/questions/3650194/…
