I'm trying to subset a pandas DataFrame based on values that exist in a separate array. Below is a toy example that works and illustrates what I'm trying to do:

import pandas as pd
import numpy as np
mysubset = np.array([1,2,3,4])
d = {'col1': [1, 2, 3, 4, 5, 6], 'col2': [3, 4, 1, 3, 5, 5]}
df = pd.DataFrame(data=d)
df[df['col1'].isin(mysubset)]

Using that working code as a prototype, I'm applying (what I think is) the same process to my real data, but it doesn't work. My real data look like

>>> tmp.head()
   ItemID                  P0
44  26785         0.276844507
61  26534  1.4108438640000001
71  14107  1.0652574239999999
86  26530  1.1059459039999999
93  18142         0.903011679

and the array I want to use for subsetting is

>>> op_items
array([18692, 18694, 18696, 18706, 18711, 18714, 18716, 18722, 19332,
       19333, 26526, 26527, 26530, 26532, 26533, 26534, 26535, 26536,
       26538, 26541, 14107, 14110, 14120, 14149, 14165, 17984, 18004,
       18005, 18006, 18007, 18008, 18134, 18136, 18139, 18141, 18142,
       19081, 19084, 19086, 20789, 20794, 20796, 20800, 20802, 26784,
       26785, 26786, 26787], dtype=int64) 

Using this as in the toy example above gives

>>> tmp[tmp['ItemID'].isin(op_items)]
Empty DataFrame
Columns: [ItemID, P0]
Index: []

But, manually grabbing some elements from within a list does work:

>>> tmp[tmp['ItemID'].isin(['18692', '18696'])]
    ItemID           P0
236  18696  0.566035305
624  18692   0.60981902

Checking the types suggests they have the same form as in the toy example:

>>> type(op_items)
<class 'numpy.ndarray'>
>>> type(tmp['ItemID'])
<class 'pandas.core.series.Series'>

So, I am uncertain what other mistake I am making and could use a pointer. I realize that in the hardcoded example I passed the values as a list. But the toy example above uses isin with mysubset, which is an array just like op_items.

Thank you. My question differs from "subset pandas dataframe with corresponding numpy array" in that I'm not worried about duplicates.

    tmp[tmp['ItemID'].isin(op_items.astype(str))]? Commented May 26, 2020 at 17:00
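
The suggestion in this comment can be tried on a small stand-in for the real data. The sketch below assumes ItemID is stored as strings (which the quoted-ID lookup in the question suggests); the frame contents and the row with ItemID '99999' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tmp: ItemID holds strings, not integers.
tmp = pd.DataFrame(
    {
        "ItemID": ["26785", "26534", "14107", "99999"],
        "P0": ["0.276844507", "1.4108438640000001", "1.0652574239999999", "0.5"],
    },
    index=[44, 61, 71, 86],
)
op_items = np.array([14107, 26534, 26785], dtype=np.int64)

# Strings compared against integers: nothing matches.
no_match = tmp[tmp["ItemID"].isin(op_items)]

# Converting the lookup values to strings makes the comparison like-for-like.
matched = tmp[tmp["ItemID"].isin(op_items.astype(str))]
print(matched)
```

This leaves the ItemID column untouched and only changes the lookup values, which may be preferable if the IDs need to stay as strings elsewhere.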

1 Answer

op_items is an array of integers, whereas tmp['ItemID'] holds strings, so isin finds no matches. Convert the column to integers first:

tmp['ItemID'] = tmp['ItemID'].astype('Int64')

tmp[tmp['ItemID'].isin(op_items)]
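
A minimal sketch of how the mismatch could be diagnosed and fixed, assuming ItemID really is stored as strings. Note that type() only reports the container (Series, ndarray), whereas .dtype reports what the elements are. The small frame below is a made-up stand-in for tmp, and pd.to_numeric is used here as one alternative way to do the conversion, not necessarily the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tmp, with ItemID stored as strings.
tmp = pd.DataFrame(
    {"ItemID": ["26785", "26534", "12345"], "P0": [0.28, 1.41, 0.9]}
)
op_items = np.array([26534, 26785], dtype=np.int64)

# Inspect the element types rather than the container types.
print(tmp["ItemID"].dtype)
print(op_items.dtype)

# Convert on a copy so the original frame is left as it was.
fixed = tmp.copy()
fixed["ItemID"] = pd.to_numeric(fixed["ItemID"])

# Both sides are now integers, so isin can match them.
result = fixed[fixed["ItemID"].isin(op_items)]
print(result)
```

If some IDs might not be numeric, pd.to_numeric raises an error by default, which is usually a helpful signal that the column needs cleaning first.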