Adding Numpy arrays like Counters

Question

Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.

Basically, I need to be able to take arrays like this:

array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])

array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])

And then "add" them so they look like this:

array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])

If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.

Thanks!

Edit: Ready for some speedtests? Intel win 64bit machine. All of the following values are in seconds; 20000 loops.

collections.Counter results: 2.131000, 2.125000, 2.125000

Divakar's union1d + masking results: 1.641000, 1.633000, 1.625000

Divakar's union1d + indexing results: 0.625000, 0.625000, 0.641000

Histogram results: 1.844000, 1.938000, 1.858000

Pandas results: 16.659000, 16.686000, 16.885000

Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!

Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).

Later Edit: I've been thinking about this a lot, and I've came to realize that, with Numpy, it might be more effective to fill each array with zeros so that the first column isn't even needed since we can just use the index, and that would also make it much easier to add multiple arrays together as well as do other functions. Additionally, Pandas makes more sense than Numpy since there would be no need to 0-fill, and it would definitely be more effective with large data sets (however, Numpy has the advantage of being compatible on more platforms, like GAE, if that matters at all). Lastly, the answer I checked was definitely the best answer for the exact question I asked--adding the two arrays in the way I showed--but I think what I needed was a change in perspective.

Because the rows of the resultant equals the number of unique first indexes in the union of the all first indexes of the arrays. Since in the top arrays, only the top one has a "2" and only the middle one has a "3", the bottom one has both a two and a three. — Zoojay
– Zoojay, Commented Jul 10, 2017 at 20:38
Yes, its presumed that I know the number range. For what I'm doing, the numbers are going to be a sort of ID. — Zoojay
– Zoojay, Commented Jul 10, 2017 at 20:47
Out of curiosity, what exactly are you doing with collections.Counter? — juanpa.arrivillaga
– juanpa.arrivillaga, Commented Jul 10, 2017 at 20:51

Divakar · Accepted Answer · 2017-07-10 21:17:32Z

2

Here's one approach with np.union1d and masking -

def app1(a,b):
    c0 = np.union1d(a[:,0],b[:,0])

    out = np.zeros((len(c0),2))
    out[:,0] = c0

    mask1 = np.in1d(c0,a[:,0])
    out[mask1,1] = a[:,1]

    mask2 = np.in1d(c0,b[:,0])
    out[mask2,1] += b[:,1]
    return out

Sample run -

In [174]: a
Out[174]: 
array([[  0.,   2.],
       [ 12.,   2.],
       [ 23.,   1.]])

In [175]: b
Out[175]: 
array([[  0.,   3.],
       [  1.,   1.],
       [ 12.,   5.]])

In [176]: app1(a,b)
Out[176]: 
array([[  0.,   5.],
       [  1.,   1.],
       [ 12.,   7.],
       [ 23.,   1.]])

Here's another with np.union1d and indexing -

def app2(a,b):
    n = np.maximum(a[:,0].max(), b[:,0].max())+1
    c0 = np.union1d(a[:,0],b[:,0])
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]

    out0[b[:,0].astype(int),1] += b[:,1]

    out = out0[c0.astype(int)]
    out[:,0] = c0
    return out

For the case where all indices are covered by the first column values in a and b -

def app2_specific(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    n = c0[-1]+1
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]        
    out0[b[:,0].astype(int),1] += b[:,1]
    out0[:,0] = c0
    return out0

Sample run -

In [234]: a
Out[234]: 
array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])

In [235]: b
Out[235]: 
array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])

In [236]: app2_specific(a,b)
Out[236]: 
array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])

edited Jul 10, 2017 at 21:17

answered Jul 10, 2017 at 20:47

Divakar

222k19 gold badges273 silver badges374 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Divakar Over a year ago

@juanpa.arrivillaga Would love to do that too, if and when OP posts that version.

Daniel F · Accepted Answer · 2017-07-10 21:05:00Z

1

If you know the number of fields, use np.bincount.

c = np.vstack([a, b])
counts = np.bincount(c[:, 0], weights = c[:, 1], minlength = numFields)
out = np.vstack([np.arange(numFields), counts]).T

This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing.

out = np.zeros(2, numFields)
out[:, 0] = np.arange(numFields)
np.add.at(out[:, 1], a[:, 0], a[:, 1])
np.add.at(out[:, 1], b[:, 0], b[:, 1])

edited Jul 10, 2017 at 21:05

answered Jul 10, 2017 at 20:52

Daniel F

14.5k2 gold badges34 silver badges59 bronze badges

2 Comments

Divakar Over a year ago

My last one was dangerously close to your second one. Think I will rollback to previous version of it. That range was great idea!

Divakar Over a year ago

Think you need to modify your solutions as OP says he/she might have gaps and they are looking to get an "union" version in the output.

Jan Christoph Terasa · Accepted Answer · 2017-07-10 22:34:43Z

1

You can use a basic histogram, this will deal with gaps, too. You can filter out zero-count entries if need be.

import numpy as np

x = np.array([[ 0.,  2.],
              [ 2.,  2.],
              [ 3.,  1.]])

y = np.array([[ 0.,  3.],
              [ 1.,  1.],
              [ 2.,  5.],
              [ 5.,  3.]])

c, w = np.vstack((x,y)).T
h, b = np.histogram(c, weights=w, 
                    bins=np.arange(c.min(),c.max()+2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0.  5.]
#  [ 1.  1.]
#  [ 2.  7.]
#  [ 3.  1.]
#  [ 4.  0.]
#  [ 5.  3.]]
r_nonzero = r[r[:,1]!=0]

answered Jul 10, 2017 at 22:34

Jan Christoph Terasa

6,00327 silver badges35 bronze badges

Comments

ilmarinen · Accepted Answer · 2017-07-10 21:07:04Z

0

Pandas have some functions doing exactly what you intend

import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)

Edit: If you actually need the data back in numpy format, just do

array_res = result.reset_index(name=1).values

edited Jul 10, 2017 at 21:07

answered Jul 10, 2017 at 20:47

ilmarinen

5,9173 gold badges19 silver badges14 bronze badges

Comments

Eelco Hoogendoorn · Accepted Answer · 2017-07-10 21:33:09Z

0

This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:

import numpy_indexed as npi
C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])

Note: its cleaner to maintain your label arrays as a seperate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not relaying all binary state. Better to use ints for that.

answered Jul 10, 2017 at 21:33

Eelco Hoogendoorn

10.8k1 gold badge46 silver badges43 bronze badges

4 Comments

Zoojay Over a year ago

is your numpy_indexed compatible with pyinstaller? Edit: and is it implemented in Python or are there C extensions?

Eelco Hoogendoorn Over a year ago

No c extensions, so in principle you can just copy the source to your project and run it; though I strongly recommend conda for python package management

Zoojay Over a year ago

I'm not sure where else to ask so: I get an error when I try to download numpy_indexed using pip, and when trying to download Conda using pip. With numpy_indexed, my error is: "Command "python setup.py egg_info" failed with error code 1" and the end of the path it gives is "appdata\local\temp\pip-build-u6xyi9\numpy-indexed\". When trying to download Conda, I get "Could not find a version that satisfies the requirement menuinst (from conda) (from versions: )" and under that "No matching distribution found for menuinst (from conda)". Please help? I'd like to test its speed.

Eelco Hoogendoorn Over a year ago

with pip, try in installing pyyaml first before numpy_indexed. That conda error I am not familiar with. Have you tried downloading miniconda3 from their website? conda.io/miniconda.html

Collectives™ on Stack Overflow

Adding Numpy arrays like Counters

5 Answers 5

1 Comment

2 Comments

Comments

Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

1 Comment

2 Comments

Comments

Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related