4

Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.

Basically, I need to be able to take arrays like this:

array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])

array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])

And then "add" them so they look like this:

array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])

If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.

Thanks!

Edit: Ready for some speedtests? Intel win 64bit machine. All of the following values are in seconds; 20000 loops.

collections.Counter results: 2.131000, 2.125000, 2.125000

Divakar's union1d + masking results: 1.641000, 1.633000, 1.625000

Divakar's union1d + indexing results: 0.625000, 0.625000, 0.641000

Histogram results: 1.844000, 1.938000, 1.858000

Pandas results: 16.659000, 16.686000, 16.885000

Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!

Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).

Later Edit: I've been thinking about this a lot, and I've came to realize that, with Numpy, it might be more effective to fill each array with zeros so that the first column isn't even needed since we can just use the index, and that would also make it much easier to add multiple arrays together as well as do other functions. Additionally, Pandas makes more sense than Numpy since there would be no need to 0-fill, and it would definitely be more effective with large data sets (however, Numpy has the advantage of being compatible on more platforms, like GAE, if that matters at all). Lastly, the answer I checked was definitely the best answer for the exact question I asked--adding the two arrays in the way I showed--but I think what I needed was a change in perspective.

15
  • Why does the result have 4 rows? Commented Jul 10, 2017 at 20:36
  • 1
    Because the rows of the resultant equals the number of unique first indexes in the union of the all first indexes of the arrays. Since in the top arrays, only the top one has a "2" and only the middle one has a "3", the bottom one has both a two and a three. Commented Jul 10, 2017 at 20:38
  • Do you know what the maximum value in the first column is? Commented Jul 10, 2017 at 20:45
  • Yes, its presumed that I know the number range. For what I'm doing, the numbers are going to be a sort of ID. Commented Jul 10, 2017 at 20:47
  • 1
    Out of curiosity, what exactly are you doing with collections.Counter? Commented Jul 10, 2017 at 20:51

5 Answers 5

2

Here's one approach with np.union1d and masking -

def app1(a,b):
    c0 = np.union1d(a[:,0],b[:,0])

    out = np.zeros((len(c0),2))
    out[:,0] = c0

    mask1 = np.in1d(c0,a[:,0])
    out[mask1,1] = a[:,1]

    mask2 = np.in1d(c0,b[:,0])
    out[mask2,1] += b[:,1]
    return out

Sample run -

In [174]: a
Out[174]: 
array([[  0.,   2.],
       [ 12.,   2.],
       [ 23.,   1.]])

In [175]: b
Out[175]: 
array([[  0.,   3.],
       [  1.,   1.],
       [ 12.,   5.]])

In [176]: app1(a,b)
Out[176]: 
array([[  0.,   5.],
       [  1.,   1.],
       [ 12.,   7.],
       [ 23.,   1.]])

Here's another with np.union1d and indexing -

def app2(a,b):
    n = np.maximum(a[:,0].max(), b[:,0].max())+1
    c0 = np.union1d(a[:,0],b[:,0])
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]

    out0[b[:,0].astype(int),1] += b[:,1]

    out = out0[c0.astype(int)]
    out[:,0] = c0
    return out

For the case where all indices are covered by the first column values in a and b -

def app2_specific(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    n = c0[-1]+1
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]        
    out0[b[:,0].astype(int),1] += b[:,1]
    out0[:,0] = c0
    return out0

Sample run -

In [234]: a
Out[234]: 
array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])

In [235]: b
Out[235]: 
array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])

In [236]: app2_specific(a,b)
Out[236]: 
array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])
Sign up to request clarification or add additional context in comments.

1 Comment

@juanpa.arrivillaga Would love to do that too, if and when OP posts that version.
1

If you know the number of fields, use np.bincount.

c = np.vstack([a, b])
counts = np.bincount(c[:, 0], weights = c[:, 1], minlength = numFields)
out = np.vstack([np.arange(numFields), counts]).T

This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing.

out = np.zeros(2, numFields)
out[:, 0] = np.arange(numFields)
np.add.at(out[:, 1], a[:, 0], a[:, 1])
np.add.at(out[:, 1], b[:, 0], b[:, 1])

2 Comments

My last one was dangerously close to your second one. Think I will rollback to previous version of it. That range was great idea!
Think you need to modify your solutions as OP says he/she might have gaps and they are looking to get an "union" version in the output.
1

You can use a basic histogram, this will deal with gaps, too. You can filter out zero-count entries if need be.

import numpy as np

x = np.array([[ 0.,  2.],
              [ 2.,  2.],
              [ 3.,  1.]])

y = np.array([[ 0.,  3.],
              [ 1.,  1.],
              [ 2.,  5.],
              [ 5.,  3.]])

c, w = np.vstack((x,y)).T
h, b = np.histogram(c, weights=w, 
                    bins=np.arange(c.min(),c.max()+2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0.  5.]
#  [ 1.  1.]
#  [ 2.  7.]
#  [ 3.  1.]
#  [ 4.  0.]
#  [ 5.  3.]]
r_nonzero = r[r[:,1]!=0]

Comments

0

Pandas have some functions doing exactly what you intend

import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)

Edit: If you actually need the data back in numpy format, just do

array_res = result.reset_index(name=1).values

Comments

0

This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:

import numpy_indexed as npi
C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])

Note: its cleaner to maintain your label arrays as a seperate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not relaying all binary state. Better to use ints for that.

4 Comments

is your numpy_indexed compatible with pyinstaller? Edit: and is it implemented in Python or are there C extensions?
No c extensions, so in principle you can just copy the source to your project and run it; though I strongly recommend conda for python package management
I'm not sure where else to ask so: I get an error when I try to download numpy_indexed using pip, and when trying to download Conda using pip. With numpy_indexed, my error is: "Command "python setup.py egg_info" failed with error code 1" and the end of the path it gives is "appdata\local\temp\pip-build-u6xyi9\numpy-indexed\". When trying to download Conda, I get "Could not find a version that satisfies the requirement menuinst (from conda) (from versions: )" and under that "No matching distribution found for menuinst (from conda)". Please help? I'd like to test its speed.
with pip, try in installing pyyaml first before numpy_indexed. That conda error I am not familiar with. Have you tried downloading miniconda3 from their website? conda.io/miniconda.html

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.