I have a dataframe whose values, as a NumPy array, look like this:

array([[1374495, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
       [3002854, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
       [2710558, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
       ...,
       [1355976, 206200, 'prior', ..., 16.0, 'soy lactosefree',
        'dairy eggs'],
       [1909878, 206200, 'prior', ..., 16.0, 'soy lactosefree',
        'dairy eggs'],
       [943915, 206200, 'train', ..., 16.0, 'soy lactosefree', 'dairy eggs']], dtype=object)

The first element of every row is the order id, e.g. 1374495, 3002854, 2710558. I have a list of order ids that should be used to select rows from the array. For example, given the list [1355976, 1909878, 943915], I want to select the rows whose order id is in that list. How can I do this efficiently?

2 Answers

Approach #1

Here's one approach based on np.searchsorted -

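import numpy as np

def filter_rows(a, idx):
    # a   : input dataframe as a 2D array; column 0 holds the order ids
    # idx : list of order ids whose rows should be selected

    a_idx = a[:,0]
    idx_arr = np.sort(idx)
    # Position at which each order id would be inserted into the sorted ids
    pos_idx = np.searchsorted(idx_arr, a_idx)
    # Ids larger than every entry land past the end; point them at 0 so indexing is safe
    pos_idx[pos_idx == idx_arr.size] = 0
    # A row matches only if the id stored at that position equals its own id
    mask = idx_arr[pos_idx] == a_idx
    out = a[mask]
    return out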
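To make the steps inside filter_rows easier to follow, here is a small walkthrough of the intermediate values. The ids below are made-up toy numbers, not the order ids from the question, and the column is a plain integer array rather than an object array.

```python
import numpy as np

# Hypothetical toy data: a_idx stands in for the order-id column a[:, 0]
a_idx = np.array([10, 30, 20, 100, 40])
idx = [40, 10, 99]

idx_arr = np.sort(idx)                     # sorted lookup ids: [10, 40, 99]
pos_idx = np.searchsorted(idx_arr, a_idx)  # insertion position of each id: [0, 1, 1, 3, 1]
pos_idx[pos_idx == idx_arr.size] = 0       # 100 is past the end (3), so it is reset to 0
mask = idx_arr[pos_idx] == a_idx           # True only where the id really is in idx

print(pos_idx)  # [0 1 1 0 1]
print(mask)     # [ True False False False  True]
```

The reset to 0 does not create false matches: an id that is larger than every entry in idx_arr cannot equal idx_arr[0], so the final comparison still yields False for it.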

Approach #2

Here's another with np.in1d (on NumPy 1.13 and later, np.isin is the recommended equivalent) -

a[np.in1d(a[:,0], idx)]
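As a minimal sketch of the same membership test with np.isin, the example below uses a shortened, three-column version of the array from the question. Casting the id column to an integer dtype is an extra step assumed here, not part of the original one-liner; it avoids comparing Python objects element by element.

```python
import numpy as np

# Shortened stand-in for the array in the question (only three columns kept)
a = np.array([[1374495, 3, 'prior'],
              [1355976, 206200, 'prior'],
              [943915, 206200, 'train']], dtype=object)
idx = [943915, 1355976]

# Assumption: every value in column 0 is an integer, so the cast is safe
order_ids = a[:, 0].astype(np.int64)

mask = np.isin(order_ids, idx)  # one boolean per row
selected = a[mask]              # rows whose order id appears in idx
```

Note that the selected rows come back in the order they appear in a, not in the order of idx: here the row for 1355976 precedes the row for 943915.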

Sample runs -

In [83]: a
Out[83]: 
array([[1374495, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [3002854, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [2710558, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])

In [84]: idx
Out[84]: [1355976, 1909878, 943915]

In [85]: filter_rows(a, idx)
Out[85]: 
array([[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])

In [88]: a[np.in1d(a[:,0], idx)]
Out[88]: 
array([[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])
The numpy_indexed package (disclaimer: I am its author) contains efficient functionality for this type of operation:

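import numpy_indexed as npi
# id_column: the order-id column, e.g. a[:, 0]
# ids_to_get_index_of: the list of order ids to look up
row_idx = npi.indices(id_column, ids_to_get_index_of)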

It should perform about the same as the solution offered by Divakar, but it comes with some extra bells and whistles, such as kwargs to select various ways of dealing with missing values.
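For readers without the package installed, the idea of turning a list of ids into row indices can be sketched in plain NumPy. This is not the numpy_indexed implementation; lookup_row_indices is a hypothetical helper that assumes the id column is non-empty and holds unique ids, and it raises an error for ids that are missing.

```python
import numpy as np

def lookup_row_indices(id_column, ids):
    """Hypothetical helper: for each id in `ids`, return its row index in `id_column`.

    Assumes `id_column` is non-empty with unique ids; raises KeyError for missing ids.
    """
    id_column = np.asarray(id_column)
    ids = np.asarray(ids)
    order = np.argsort(id_column, kind="stable")  # row indices in sorted-id order
    sorted_ids = id_column[order]
    pos = np.searchsorted(sorted_ids, ids)        # candidate position of each id
    pos = np.minimum(pos, sorted_ids.size - 1)    # keep positions inside the array
    found = sorted_ids[pos] == ids
    if not found.all():
        raise KeyError(f"ids not found: {ids[~found].tolist()}")
    return order[pos]                             # map back to original row numbers

id_column = np.array([1374495, 3002854, 1355976, 1909878, 943915])
row_idx = lookup_row_indices(id_column, [943915, 1355976])
print(row_idx)  # [4 2]
```

Unlike a boolean mask, the resulting indices follow the order of the requested ids, so a[row_idx] returns the rows in that order.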
