Check whether a value exists in a column of lists in a pandas dataframe

Question

I have two columns in my dataframe:

"p": the ID of the product purchased by the customer

"p_list": the list of product IDs purchased by similar customers

import pandas as pd

df = pd.DataFrame({'p': [12, 4, 5, 6, 7, 7, 6, 5], 'p_list': [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9], [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]]})

I want to check whether "p" exists in "p_list" on the same row, so I applied this code:

df["exist"]= df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)

The problem is that this dataframe has around 50 million rows, so the code takes a very long time to execute.

Is there a more efficient way to compute this column?

Thanks.

jezrael · Accepted Answer · 2017-07-30 07:13:55Z

You can use a list comprehension, then cast the True/False values to int:

df["exist"] = [r[0] in r[1]  for r in zip(df["p"], df["p_list"])]
df["exist"] = df["exist"].astype(int)
print (df)
    p         p_list  exist
0  12     [12, 1, 5]      1
1   4         [3, 1]      0
2   5     [8, 9, 11]      0
3   6      [6, 7, 9]      1
4   7      [7, 1, 2]      1
5   7     [12, 9, 8]      0
6   6     [6, 1, 15]      1
7   5  [6, 8, 9, 11]      0

Or cast to int inside the list comprehension:

df["exist"] = [int(r[0] in r[1]) for r in zip(df["p"], df["p_list"])]
print (df)
    p         p_list  exist
0  12     [12, 1, 5]      1
1   4         [3, 1]      0
2   5     [8, 9, 11]      0
3   6      [6, 7, 9]      1
4   7      [7, 1, 2]      1
5   7     [12, 9, 8]      0
6   6     [6, 1, 15]      1
7   5  [6, 8, 9, 11]      0
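
For reference, here is the same approach as a self-contained script. This is only a sketch: the helper name contains_own_value is made up for illustration, and the comparison with the apply version from the question is there just to confirm that both give identical flags on the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})


def contains_own_value(values, lists):
    """Return 1/0 flags: is each value present in the list on the same row?

    Hypothetical helper, not part of pandas.
    """
    # zip pairs the two columns element by element, so no per-row Series
    # has to be built the way DataFrame.apply(axis=1) does.
    return [int(value in items) for value, items in zip(values, lists)]


df["exist"] = contains_own_value(df["p"], df["p_list"])

# Sanity check against the apply-based version from the question.
expected = df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)
assert df["exist"].tolist() == expected.tolist()

print(df["exist"].tolist())  # [1, 0, 0, 1, 1, 0, 1, 0]
```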

Timings:

#[8000 rows x 2 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
print (df)

In [89]: %%timeit
    ...: df["exist2"] = [r[0] in r[1]  for r in zip(df["p"], df["p_list"])]
    ...: df["exist2"] = df["exist2"].astype(int)
    ...: 
100 loops, best of 3: 6.07 ms per loop

In [90]: %%timeit
    ...: df["exist"] = [1 if r[0] in r[1] else 0  for r in zip(df["p"], df["p_list"])]
    ...: 
100 loops, best of 3: 7.16 ms per loop

In [91]: %%timeit
    ...: df["exist"] = [int(r[0] in r[1])  for r in zip(df["p"], df["p_list"])]
    ...: 
100 loops, best of 3: 9.23 ms per loop

In [92]: %%timeit
    ...: df['exist1'] = df.apply(lambda x: x.p in x.p_list, axis=1).astype(int)
    ...: 
1 loop, best of 3: 370 ms per loop

In [93]: %%timeit
    ...: df["exist"]= df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1)
1 loop, best of 3: 310 ms per loop
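
The %%timeit cells above need IPython. A rough equivalent with the standard timeit module might look like the sketch below; the function names with_comprehension and with_apply are invented here, and the absolute numbers will differ with hardware and pandas version, so none are quoted.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})
# Repeat the 8 sample rows 1000 times, as in the timings above (8000 rows).
df = pd.concat([base] * 1000).reset_index(drop=True)


def with_comprehension():
    # List comprehension over the zipped columns.
    return [int(p in p_list) for p, p_list in zip(df["p"], df["p_list"])]


def with_apply():
    # Row-wise apply, as in the question.
    return df.apply(lambda r: 1 if r["p"] in r["p_list"] else 0, axis=1).tolist()


# Both approaches must agree before their speed is worth comparing.
assert with_comprehension() == with_apply()

repeats = 3  # kept small so the script finishes quickly
for func in (with_comprehension, with_apply):
    seconds = timeit.timeit(func, number=repeats) / repeats
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```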

edited Jul 30, 2017 at 7:13

answered Jul 30, 2017 at 7:05


4 Comments

SethMMorton Over a year ago

Could isin be used for this? Or eval('p in p_list')?

jezrael Over a year ago

@SethMMorton - I don't think so, because the comparison has to be done row by row. eval returns an error for me (I'm not sure how it is meant to be used).

SethMMorton Over a year ago

Sorry, I meant df.eval('p in p_list'). Is that what failed? That is supposed to evaluate row-wise.

jezrael Over a year ago

@SethMMorton - It returns TypeError: unhashable type: 'list'
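
To illustrate the point about row-wise comparison: Series.isin tests every element against one shared collection of values, so on its own it does not pair each "p" with the list on the same row. A small sketch on the question's data follows; the fixed list passed to isin is simply the first row's list, picked arbitrarily for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    "p": [12, 4, 5, 6, 7, 7, 6, 5],
    "p_list": [[12, 1, 5], [3, 1], [8, 9, 11], [6, 7, 9],
               [7, 1, 2], [12, 9, 8], [6, 1, 15], [6, 8, 9, 11]],
})

# isin checks each p against the SAME collection (here: the first row's list).
against_one_list = df["p"].isin([12, 1, 5]).astype(int).tolist()

# The question needs each p checked against the list in its own row.
row_wise = [int(p in p_list) for p, p_list in zip(df["p"], df["p_list"])]

print(against_one_list)  # [1, 0, 1, 0, 0, 0, 0, 1]
print(row_wise)          # [1, 0, 0, 1, 1, 0, 1, 0]
```

The two results differ in rows 2, 3, 4, 6 and 7, which is why the answer loops over the zipped columns.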
