I have a pandas DataFrame holding row and column indices into a NumPy array. The array has to be set to 1 at those indices. I need to do this millions of times on a big NumPy array. Is there a more efficient way than the approach shown below?

from numpy import float32, uint
from numpy.random import choice
from pandas import DataFrame
from timeit import timeit

xy = 2000,300000
sz = 10000000
ind = DataFrame({"i":choice(range(xy[0]),sz),"j":choice(range(xy[1]),sz)}).drop_duplicates()
dtype = uint
repeats = 10

#original (~21s)
stmt = '''\
from numpy import zeros
a = zeros(xy, dtype=dtype)
a[ind.values[:,0],ind.values[:,1]] = 1'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

#suggested by @piRSquared (~13s)
stmt = '''\
from numpy import ones
from scipy.sparse import coo_matrix
i,j = ind.i.values,ind.j.values
a = coo_matrix((ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()  # shape=xy so the result always matches zeros(xy)
'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

I have edited the post above to include the approach suggested by @piRSquared and rewrote it to allow an apples-to-apples comparison. Irrespective of the data type (I tried uint and float32), the suggested approach cuts the time by roughly 40%.
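For what it's worth, a quick sanity check that both approaches produce the same array can be sketched on a much smaller shape. This is only an illustration: the helper names `fill_fancy` and `fill_coo`, the small `shape`, and the seeded generator are my own assumptions and not part of the timings above.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def fill_fancy(i, j, shape, dtype=np.uint8):
    # Original approach: allocate zeros, then set the cells via fancy indexing.
    a = np.zeros(shape, dtype=dtype)
    a[i, j] = 1
    return a


def fill_coo(i, j, shape, dtype=np.uint8):
    # Suggested approach: build a sparse COO matrix of ones and densify it.
    # shape is passed explicitly so it is not inferred from the largest index.
    data = np.ones(i.size, dtype=dtype)
    return coo_matrix((data, (i, j)), shape=shape, dtype=dtype).toarray()


# Small stand-in for the real problem size (assumed values, for a fast check).
rng = np.random.default_rng(0)
shape = (20, 300)
n = 1000
ind = pd.DataFrame({
    "i": rng.integers(0, shape[0], n),
    "j": rng.integers(0, shape[1], n),
}).drop_duplicates()

i, j = ind.i.values, ind.j.values
a_fancy = fill_fancy(i, j, shape)
a_coo = fill_coo(i, j, shape)
same = np.array_equal(a_fancy, a_coo)
print(same)
```

If `same` is true on the small case, the two timed statements are at least computing the same thing, which is what the comparison relies on.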

1 Answer

OP time

56.56 s

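I can only marginally improve on that by pulling the column arrays out of the DataFrame first (a is the zeros array from the question):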

i, j = ind.i.values, ind.j.values
a[i, j] = 1

New Time

52.19 s

However, you can considerably speed this up by using scipy.sparse.coo_matrix to instantiate a sparse matrix and then convert it to a numpy.array.

import timeit

stmt = '''\
import numpy, pandas
from scipy.sparse import coo_matrix

xy = 2000,300000

sz = 10000000
ind = pandas.DataFrame({"i":numpy.random.choice(range(xy[0]),sz),"j":numpy.random.choice(range(xy[1]),sz)}).drop_duplicates()

################################################
i, j = ind.i.values, ind.j.values
dtype = numpy.uint8
a = coo_matrix((numpy.ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()'''

timeit.timeit(stmt, number=10)

33.06471237000369
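One caveat worth spelling out: converting a coo_matrix to a dense array sums entries that share the same (i, j), so the drop_duplicates() call matters for this approach. A minimal sketch with made-up indices (the arrays below are illustrative, not taken from the benchmark):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Illustrative indices where the pair (0, 2) appears twice.
i = np.array([0, 0, 1])
j = np.array([2, 2, 0])
shape = (2, 3)

# COO -> dense adds duplicate entries together, so cell (0, 2) becomes 2.
dense = coo_matrix((np.ones(i.size, dtype=np.uint8), (i, j)), shape=shape).toarray()

# Fancy-index assignment just writes 1 to the cell twice.
fancy = np.zeros(shape, dtype=np.uint8)
fancy[i, j] = 1

# If duplicates cannot be removed up front, collapsing to 0/1 afterwards is one option.
fixed = (dense > 0).astype(np.uint8)

print(dense[0, 2], fancy[0, 2], fixed[0, 2])
```

With the indices deduplicated beforehand, as in the question, this does not come up and both approaches give the same array.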
3 Comments

Yes... a tiny bit. You forgo the overhead of creating the ind1 array. ind.i.values and ind.j.values are already there. ind.values is not and will be created.
@jezrael new time.
Thank you @piRSquared. I have updated the original post to show your method so the two can be compared easily.
