Selecting dataframe rows based on multiple columns, where new functions should be created to handle conditions in some columns

Question

I have a dataframe that consists of multiple columns. I want to select rows based on conditions in multiple columns. Assuming that I have four columns in a dataframe:

import pandas as pd
di={"A":[1,2,3,4,5],
    "B":['Tokyo','Madrid','Professor','helsinki','Tokyo Oliveira'],
    "C":['250','200//250','250//250//200','12','200//300'],
    "D":['Left','Right','Left','Right','Right']}
data=pd.DataFrame(di)

I want to select the rows with Tokyo in column B, a value above 200 in column C, and Left in column D, so that only the first row is selected. Column C needs a function of its own, because when a cell holds a list separated by //, only the first value should be checked.

I assume this can be done as follows:

def check_200(thecolumn):
    thelist=[]
    for i in thecolumn:
        f=i
        if "//" in f:
            #split based on //
            z=f.split("//")
            f=z[0]

        f=float(f)
        if f > 200.00:
            thelist.append(True)
        else:
            thelist.append(False)
    return thelist

Then, I will create the multiple conditions:

selecteddata=data[(data.B.str.contains("Tokyo")) & 
(data.D.str.contains("Left"))&(check_200(data.C))]

Is this the best way to do it, or is there a simpler pandas function that can handle such requirements?

What's your target output?

– Umar.H, Commented Mar 30, 2020 at 12:09

Bruno Mello · Accepted Answer · 2020-03-30 12:39:51Z

2

I don't think there is a more pythonic way to do this, but I think this is what you want:

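bool_idx = (
    data.B.str.contains("Tokyo")
    & data.D.str.contains("Left")
    # Only rows whose C value is a "//"-separated list; drop this term to also match single values such as '250'.
    & data.C.str.contains("//")
    # .str[0] takes the first element of each split list; a plain [0] would pick row 0 instead.
    & (data.C.str.split("//").str[0].astype(float) > 200.00)
)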

selecteddata=data[bool_idx]
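
A note on indexing after `str.split`, since it is easy to get wrong in expressions like this one: on a Series with the default integer index, `[0]` looks up the row labelled 0, while `.str[0]` takes the first item of each row's list. The sketch below uses the question's column C; the variable names are only illustrative.

```python
import pandas as pd

# Column C from the question.
c = pd.Series(['250', '200//250', '250//250//200', '12', '200//300'])

parts = c.str.split("//")      # Series of lists, one list per row

row_zero = parts[0]            # label lookup: the list belonging to row 0 only
first_values = parts.str[0]    # element-wise: first item of every row's list

# Convert the first items to numbers and compare to get a boolean mask.
mask = first_values.astype(float) > 200
```

Only `first_values` has one entry per row, so only it can be turned into a mask that lines up with the dataframe.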

edited Mar 30, 2020 at 12:39

answered Mar 30, 2020 at 12:28


1 Comment

Bruno Mello Over a year ago

If by "the best way" you mean something without boolean indexing, I really think this is the best way: you have to represent how you want to slice the dataframe somehow, and this looks like the most compact way to express it.

Alan · Accepted Answer · 2020-03-30 12:37:57Z

0

Bruno's answer does the job, and I agree that boolean masking is the way to go. This answer keeps the code a little closer to the requested format.


def col_condition(col):
    # Build a boolean mask: True where the first "//"-separated value exceeds 200.
    return col.apply(lambda x: float(x.split('//')[0]) > 200)

data = data[(data.B.str.contains('Tokyo')) & (data.D.str.contains("Left")) &
             col_condition(data.C)]

The function reads in a Series, and converts each element to True or False, depending on the condition. It then returns this mask.
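
As a self-contained sketch of the same idea on the question's data, each condition can be given its own name before the masks are combined. The helper `first_value_above` and the `mask_*` names are hypothetical, chosen here for illustration, and the threshold is passed as a parameter rather than hard-coded.

```python
import pandas as pd

data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ['Tokyo', 'Madrid', 'Professor', 'helsinki', 'Tokyo Oliveira'],
    "C": ['250', '200//250', '250//250//200', '12', '200//300'],
    "D": ['Left', 'Right', 'Left', 'Right', 'Right'],
})

def first_value_above(col, threshold):
    # Hypothetical helper: True where the first "//"-separated value exceeds threshold.
    return col.apply(lambda x: float(x.split('//')[0]) > threshold)

# One boolean Series per condition, all aligned on the dataframe's index.
mask_b = data.B.str.contains('Tokyo')
mask_d = data.D.str.contains('Left')
mask_c = first_value_above(data.C, 200)

# Combine element-wise with & and use the result to select rows.
selected = data[mask_b & mask_d & mask_c]
```

On this data, `selected` contains only the first row (A equal to 1), which matches the outcome described in the question.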

answered Mar 30, 2020 at 12:37


1 Comment

Bruno Mello Over a year ago

You also have to check whether '//' is in x: for a value like '300', split('//') returns just ['300'], so the first element is '300' and the condition evaluates to True.
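
The behaviour described in this comment can be checked with a small sketch. The sample values and the names `lenient` and `strict` are made up for illustration; which variant is wanted depends on whether a single value such as '300' should count as a match.

```python
import pandas as pd

# Splitting a string that lacks the separator returns a one-element list.
single = '300'.split('//')

# Hypothetical sample values: one without "//", two with it.
c = pd.Series(['300', '250//100', '100//300'])

# Lenient: a single value is treated as its own first element.
lenient = c.apply(lambda x: float(x.split('//')[0]) > 200)

# Strict: additionally require the separator to be present.
strict = c.apply(lambda x: '//' in x and float(x.split('//')[0]) > 200)
```

Here `lenient` accepts '300' while `strict` rejects it; both accept '250//100' and reject '100//300'.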
