How to map nearest values from a dataframe to numpy array efficiently

Question

I have created a dataframe with a column for codes and another for discrete values. I also have a numpy array of experimental values. How do I create a numpy array of the codes associated with the nearest value defined in the dataframe?

The mapping between codes and values doesn't have to be a dataframe. I am much more familiar with pandas than numpy, so I tend to lean towards pandas dataframes, and I'm not sure what the best way to do this in numpy might be.

This is what I have tried, and it gives me the correct result. It's just too slow. My actual data set is 500x1500 and I have over 700 sets of data that this operation needs to be performed on, so efficiency and speed are paramount. Ideas? Thoughts? Suggestions? Thanks!

import numpy as np
import pandas as pd
from pandas import DataFrame


def main():
    npsize = (2,4)
    #Create an array of data between -0.75 and 0.25
    data = np.random.uniform(-0.75,0.25,npsize)
    
    #Pandas dataframe that creates a map
    codes = [np.array([1,2,3]),np.array([4,5,6]),np.array([7,8,9]),np.array([10,11,12]),np.array([13,14,15])]
    values = [-0.75,-0.5,-0.25,0,0.25]
    d = {'code':codes, 'value':values}
    data_map = pd.DataFrame(data=d)

    #I need to associate each element of data to the code within the data_map dataframe by looking up the nearest value
    #For example ... -0.05 ---> [10,11,12] 
    
    #Silly Looping approach ... surely there is a better/faster way to do this!
    mapped_data = np.zeros(shape=(2,4,3))
    xctr = 0
    yctr = 0

    while xctr < npsize[0]:
        #print(xctr)
        while yctr < npsize[1]:
            nearest_code = data_map.iloc[(data_map['value']-data[xctr,yctr]).abs().argsort()[:1]].code.iloc[0]
            mapped_data[xctr,yctr] = nearest_code
            yctr = yctr + 1
        yctr = 0
        xctr = xctr + 1

    print (mapped_data)

if __name__ == "__main__":
    main()

For future reference, this question is better suited to codereview.stackexchange.com, since you have an implementation that works but needs optimisation.
– Reinderien, Commented Nov 2, 2022 at 22:41

Reinderien · Accepted Answer · 2022-11-02 23:55:58Z

Don't create codes as a list of arrays of manual values; use a reshaped arange.

Don't create values manually; also use arange.

Avoid holding inner lists; expand your "code" to multiple columns and then an additional dimension in your output array.

And yes, don't loop. Numpy doesn't have a single built-in function for this kind of merge, but Pandas does: merge_asof. It requires both keys to be sorted, so if you don't need to restore the original order afterwards, your code will be faster.
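To see in isolation what merge_asof does with direction='nearest', here is a minimal sketch on a made-up three-row lookup table. The names `lookup` and `samples` and their values are illustrative assumptions, not part of the question's data.

```python
import pandas as pd

# Hypothetical lookup table; merge_asof needs the key sorted ascending.
lookup = pd.DataFrame({"value": [-0.5, 0.0, 0.5], "code": [1, 2, 3]})

# Hypothetical samples, also sorted ascending on the key.
samples = pd.DataFrame({"value": [-0.45, -0.1, 0.3]})

# For each sample, pick the lookup row whose "value" is closest.
nearest = pd.merge_asof(samples, lookup, on="value", direction="nearest")
print(nearest)
```

Each sample row gains the `code` of the closest lookup value. The full solution does the same thing, with the map's values stored in the index instead of a column.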

import numpy as np
import pandas as pd
from numpy.random import default_rng


def main() -> None:
    # Pandas dataframe that creates a map
    code_width = 3
    codes = np.arange(1, 16).reshape((-1, code_width))
    d = {
        f'code_{i}': col
        for i, col in enumerate(codes.T)
    }
    data_map = pd.DataFrame(d, index=np.arange(-0.75, 0.5, 0.25))

    rand = default_rng(seed=0)
    givens = rand.uniform(-0.75, 0.25, (2, 4))
    givens_flat = givens.ravel()
    # Givens need to be sorted. If you don't need to preserve original order, then replace argsort with sort.
    order = givens_flat.argsort()
    givens_flat = givens_flat[order]

    merged_givens = pd.merge_asof(
        left=pd.Series(givens_flat, name='givens'), right=data_map,
        left_on='givens', right_index=True, direction='nearest',
    )

    mapped_givens = np.empty((code_width, len(givens_flat)))
    mapped_givens[:, order] = merged_givens.iloc[:, 1:].T
    mapped_givens = mapped_givens.reshape((-1, *givens.shape))

    print('Map:')
    print(data_map)
    print()

    print('Givens:')
    print(givens)
    print()

    print('Mapped:')
    print(mapped_givens)


if __name__ == "__main__":
    main()

Map:
       code_0  code_1  code_2
-0.75       1       2       3
-0.50       4       5       6
-0.25       7       8       9
 0.00      10      11      12
 0.25      13      14      15

Givens:
[[-0.11303831 -0.48021329 -0.70902648 -0.73347236]
 [ 0.06327024  0.16275558 -0.14336422 -0.02050344]]

Mapped:
[[[10.  4.  1.  1.]
  [10. 13.  7. 10.]]

 [[11.  5.  2.  2.]
  [11. 14.  8. 11.]]

 [[12.  6.  3.  3.]
  [12. 15.  9. 12.]]]
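
As an aside, and not part of the approach above: because the map's values are sorted, a nearest lookup can also be sketched in plain NumPy with np.searchsorted, which avoids sorting the data at all. The helper name `map_nearest` is hypothetical, and sending ties to the lower value is an assumption of this sketch. The result has shape `data.shape + (code_width,)`, the layout the question used, rather than the code-first layout printed above.

```python
import numpy as np


def map_nearest(data, values, codes):
    """Return the row of `codes` whose entry in `values` is nearest each element.

    `values` must be 1-D and sorted ascending; `codes` has one row per value.
    """
    data = np.asarray(data)
    values = np.asarray(values)
    codes = np.asarray(codes)

    # Index of the first value >= each sample, clipped so that both
    # idx - 1 and idx are valid positions in `values`.
    idx = np.searchsorted(values, data).clip(1, len(values) - 1)
    left = values[idx - 1]
    right = values[idx]

    # Step back one position where the left neighbour is at least as close.
    idx = np.where((data - left) <= (right - data), idx - 1, idx)

    # Fancy indexing appends the code dimension: data.shape + (code_width,).
    return codes[idx]


values = np.arange(-0.75, 0.5, 0.25)
codes = np.arange(1, 16).reshape((-1, 3))
data = np.random.default_rng(seed=0).uniform(-0.75, 0.25, (2, 4))
mapped = map_nearest(data, values, codes)
print(mapped.shape)
```

If the code-first layout is wanted instead, `np.moveaxis(mapped, -1, 0)` rearranges the axes without copying the data.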
