Python: faster way of counting occurrences in numpy arrays (large dataset)

Question

I am new to Python. I have a numpy.array of size 66049x1 (66049 rows and 1 column). The values are floats sorted from smallest to largest, and some of them are repeated.

I need to determine the cumulative frequency of each value (the number of observations less than or equal to it, i.e. X<=x in statistical terms) in order to later plot the sample cumulative distribution function.

The code I am currently using is shown below, but it is extremely slow because it has to loop 66049x66049=4362470401 times. Is there any way to speed up this piece of code? Would dictionaries help in any way? Unfortunately, I cannot change the size of the arrays I am working with.

+++Function header+++
...
...
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2]
x1=numpy.delete(x, 0, 0)
x2=numpy.zeros((x1.shape[0]))
x2=sorted(x1)
x3=numpy.around(x2, decimals=3)
count=numpy.zeros(len(x3))

#Iterates over the x3 array to find the number of occurrences of each value
for i in range(len(x3)):
    temp=x3[i]
    for j in range(len(x3)):
       if (temp<=x3[j]):
           count[j]=count[j]+1

#Creates a 2D array with (value, occurrences)
x4=numpy.zeros((len(x3), 2))
for i in range(len(x3)):
    x4[i,0]=x3[i]
    x4[i,1]=numpy.around((count[i]/x1.shape[0]),decimals=3)
...
...
+++Function continues+++

B. M. · Accepted Answer · 2015-10-07 17:21:12Z

3

If efficiency matters, you can use the NumPy function bincount, which requires non-negative integers:

import numpy as np
a=np.random.rand(66049).reshape((66049,1)).round(3)
z=np.bincount(np.int32(1000*a[:,0]))

It takes about 1 ms.

Regards.
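
The counts returned by bincount are per-value frequencies, while the question asks for cumulative ones. A minimal sketch of that extra step, assuming the data are rounded to three decimals as in the question. The names keys, counts, ecdf and values are illustrative, and np.rint is used here instead of a plain integer cast as a precaution against products such as 1000*a landing just below an integer:

```python
import numpy as np

# Stand-in data: 66049 floats rounded to 3 decimals (assumed to mimic the question).
rng = np.random.default_rng(0)
a = rng.random(66049).round(3)

# Map each value to a non-negative integer key; rint rounds to the nearest integer.
keys = np.rint(a * 1000).astype(np.int64)

# Per-value frequencies, then running totals.
counts = np.bincount(keys)
cum = np.cumsum(counts)

# Normalise so the last entry is 1: ecdf[k] is the fraction of samples <= k/1000.
ecdf = cum / a.size
values = np.arange(counts.size) / 1000.0
```

Plotting values against ecdf should then give the sample CDF without any Python-level loop.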


Leb · Accepted Answer · 2015-10-07 17:10:36Z

2

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

df_p = pd.DataFrame(cnt, index=['data'])

df_p.T.plot(kind='hist')

plt.show()

That whole script took only about 2 s to execute for a 100,000x1 array. I didn't time it precisely, but if you provide the time yours took, we can compare.

I used Counter from collections to count the number of occurrences; my experience with it has always been good in terms of speed. I converted the result into a DataFrame to plot it and used T to transpose.

Your data does replicate a bit, but you can try and refine it some more. As it is, it's pretty fast.

Edit

Create CDF using cumsum()

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

# Sort by value so that cumsum() accumulates in ascending order
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()


df_p['cumu'] = df_p['data'].cumsum()

df_p['cumu'].plot(kind='line')

plt.show()
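
A cumulative sum of raw counts still has to be divided by the sample size to become a CDF. A minimal sketch of the same Counter idea with that normalisation, using plain NumPy arrays rather than a DataFrame. The names values, counts and ecdf are illustrative, not taken from the answer:

```python
from collections import Counter

import numpy as np

# Stand-in data, same shape of problem as in the answer above.
rng = np.random.default_rng(0)
arr = rng.integers(0, 100, size=100_000)

# Count occurrences of each distinct value.
cnt = Counter(arr.tolist())

# Sort the distinct values so the running total accumulates in ascending order.
values = np.array(sorted(cnt))
counts = np.array([cnt[v] for v in values])

# Normalised cumulative frequency: fraction of samples <= each value.
ecdf = np.cumsum(counts) / arr.size
```

Under these assumptions, something like plt.step(values, ecdf, where='post') should draw the usual staircase shape of a sample CDF.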

Edit 2

For a scatter() plot you must specify (x, y) explicitly. Also, df_p['cumu'] returns a Series, not a DataFrame.

To properly display a scatter plot you'll need the following:

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

arr = np.random.randint(0, 100, (100000,1))

df = pd.DataFrame(arr)

cnt = Counter(df[0])

# Sort by value so that cumsum() accumulates in ascending order
df_p = pd.DataFrame(cnt, index=['data']).T.sort_index()


df_p['cumu'] = df_p['data'].cumsum()

df_p.plot(kind='scatter', x='data', y='cumu')

plt.show()

edited Oct 7, 2015 at 17:10

answered Oct 7, 2015 at 11:48


11 Comments

Leb Over a year ago

Look at my edit and let me know if that's what you're looking for.

Leb Over a year ago

I didn't use bins at all, I created the cumulative sum to represent the CDF, however I see a mistake where I didn't normalize it. The histogram I showed earlier was just proof of concept, didn't actually use its values

Leb Over a year ago

Let us continue this discussion in chat.

Leb Over a year ago

That's not the CDF; my last edit (Edit 2) is to show you how to plot as a scatter. You need to change df_p.plot() to plt.scatter(df_p.index, df_p['cumu']/100000) like I mentioned in chat.

Leb Over a year ago

Create a new question, since this post relates only to your original one. This prevents multiple questions from being answered in one post.


Ivana Balazevic · Accepted Answer · 2015-10-07 11:02:46Z

1

You should use np.where and then count the length of the obtained vector of indices:

indices = np.where(x3 <= value)
count = len(indices[0])
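
Calling np.where once per value still scans the whole array each time. Because x3 in the question is already sorted, one possible alternative is np.searchsorted with side="right", which returns for each query the number of elements less than or equal to it. A minimal sketch on stand-in data (the sample size and the brute-force check are illustrative only):

```python
import numpy as np

# Stand-in for the sorted, rounded array x3 from the question.
rng = np.random.default_rng(0)
x3 = np.sort(rng.random(2000).round(3))

# For every element, the number of elements <= it, via binary search.
count = np.searchsorted(x3, x3, side="right")

# Relative cumulative frequency, as in the second column of x4 in the question.
ecdf = count / x3.size

# Brute-force equivalent of the question's double loop, for comparison only.
brute = (x3[:, None] <= x3[None, :]).sum(axis=0)
```

This replaces the quadratic double loop with one binary search per element, so it should scale to the full 66049-element array comfortably.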


Koray · Accepted Answer · 2022-03-25 12:04:51Z

0

# for counting a single value
mask = (my_np_array == value_to_count).astype('uint8')
# or a condition
mask = (my_np_array <= max_value).astype('uint8')

count = np.sum(mask)
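
A small variation on the same idea, as a sketch: np.count_nonzero accepts the boolean mask directly, so the cast to uint8 can be skipped. The array contents below are made up for illustration, and for floats that come out of arithmetic an exact == comparison may miss values, in which case np.isclose is one option:

```python
import numpy as np

# Illustrative data only.
my_np_array = np.array([0.1, 0.2, 0.2, 0.5, 0.9])
value_to_count = 0.2
max_value = 0.2

# Count elements equal to a value (exact comparison).
n_equal = np.count_nonzero(my_np_array == value_to_count)

# Count elements satisfying a condition.
n_le = np.count_nonzero(my_np_array <= max_value)

# Tolerance-based equality, useful when values carry rounding error.
n_close = np.count_nonzero(np.isclose(my_np_array, value_to_count))
```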
