I need help with multiple if-else conditions in pandas. I have a test dataset (Titanic) as follows:

ID  Survived  Pclass  Name                    Sex     Age
1   0         3       Braund                  male    22
2   1         1       Cumings, Mrs.           female  38
3   1         3       Heikkinen, Miss. Laina  female  26
4   1         1       Futrelle, Mrs.          female  35
5   0         3       Allen, Mr.              male    35
6   0         3       Moran, Mr.              male
7   0         1       McCarthy, Mr.           male    54
8   0         3       Palsson, Master         male    2

where ID is the passenger ID. I want to create a new flag variable in this data frame according to the following rule:

if Sex=="female" or (Pclass==1 and Age <18) then 1 else 0. 

I tried a few approaches. This was my first attempt:

import pandas as pd

df = pd.read_csv('data.csv')
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == "female" or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        df['Prediction'] = 1
    else:
        df['Prediction'] = 0

The problem with the above code is that it creates a Prediction column in df, but with every value set to 0.

However, if I use the same code but write the result to a dictionary instead, it gives the right answer, as shown below:

prediction = {}
df = pd.read_csv('data.csv')
for passenger_index, passenger in df.iterrows():
    if passenger['Sex'] == "female" or (passenger['Pclass'] == 1 and passenger['Age'] < 18):
        prediction[passenger['ID']] = 1
    else:
        prediction[passenger['ID']] = 0

This gives a dict prediction with keys as ID and values as 1 or 0 based on the above logic.

So why does the df version give the wrong result? I also tried defining a function first and then calling it, and that gave the same answer as the first attempt.

So, how can we do this in pandas?

Secondly, I guess the same can be done with some chained if-else expression. I know about np.where, but it does not seem to allow an 'and' condition. Here is what I tried:

df['Prediction'] = np.where(df['Sex']=="female", 1, np.where((df['Pclass']==1 and df['Age']<18), 1, 0))

The above raised an error for the 'and' keyword inside np.where.

Can someone help? Solutions using multiple approaches, such as np.where (like a simple if-else), some function (applymap etc.), or modifications to what I wrote earlier, would be really appreciated.

Also, how do we do the same using the applymap or apply/map methods of df?

1 Answer

Instead of looping through the rows using df.iterrows (which is relatively slow), you can assign the desired values to the Prediction column in one assignment:

In [27]: df['Prediction'] = ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18))).astype('int')

In [29]: df['Prediction']
Out[29]: 
0    0
1    1
2    1
3    1
4    0
5    0
6    0
7    0
Name: Prediction, dtype: int32

For your first approach, remember that df['Prediction'] represents an entire column of df, so df['Prediction']=1 assigns the value 1 to each row in that column. Since df['Prediction']=0 was the last assignment, the entire column ended up being filled with zeros.
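To make that concrete, here is a minimal sketch on a hypothetical three-row frame (the data and the names all_ones and per_row are made up for illustration). It also shows one way to set a single cell inside the loop, using df.at, although the vectorized assignment above remains the simpler choice:

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Pclass": [3, 1, 1],
    "Age": [22.0, 38.0, 10.0],
})

# Assigning a scalar to a column label fills every row of that column.
df["Prediction"] = 1
all_ones = df["Prediction"].tolist()

# To set one cell per iteration, address the row explicitly with df.at.
for idx, passenger in df.iterrows():
    flag = passenger["Sex"] == "female" or (
        passenger["Pclass"] == 1 and passenger["Age"] < 18
    )
    df.at[idx, "Prediction"] = int(flag)

per_row = df["Prediction"].tolist()
```

Here all_ones holds the same value for every row, while per_row reflects the rule for each passenger individually.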

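For your second approach, note that you need to use & rather than and to perform an elementwise logical-and operation on two NumPy arrays or pandas Series. The and keyword calls bool() on each operand, which raises a ValueError for a Series with more than one element. Because & binds more tightly than comparison operators such as == and <, each comparison must also be wrapped in parentheses. Thus, you could use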

In [32]: np.where(df['Sex']=='female', 1, np.where((df['Pclass']==1)&(df['Age']<18), 1, 0))
Out[32]: array([0, 1, 1, 1, 0, 0, 0, 0])

though I think it is much simpler to just use | for logical-or and & for logical-and:

In [34]: ((df['Sex']=='female') | ((df['Pclass']==1) & (df['Age']<18)))
Out[34]: 
0    False
1     True
2     True
3     True
4    False
5    False
6    False
7    False
dtype: bool
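Since the question also asks about apply, here is a minimal sketch of a row-wise version. It assumes the column names from the question; the frame is rebuilt by hand from the sample rows, and flag_passenger is a hypothetical helper name. It calls a Python function once per row, so I would expect it to be slower than the boolean-mask version on large frames:

```python
import numpy as np
import pandas as pd

# Sample rows from the question, rebuilt by hand (Name column omitted).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "male"],
    "Age": [22, 38, 26, 35, 35, np.nan, 54, 2],
})

def flag_passenger(row):
    """Return 1 if the row satisfies the rule, else 0 (hypothetical helper)."""
    # Plain `or`/`and` are fine here because each value is a scalar.
    # A missing Age (NaN) compares as False, so it never triggers the rule.
    return int(row["Sex"] == "female" or (row["Pclass"] == 1 and row["Age"] < 18))

# axis=1 passes each row to the function as a Series.
df["Prediction"] = df.apply(flag_passenger, axis=1)
```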

13 Comments

Hi @unutbu. That helps a lot. One question: when you said that, in the first approach, df['Prediction']=1 adds 1 to all rows, I am not sure I understand the reason. I used: for passenger_index,passenger in df.iterrows(): if passenger['Sex']=="female" or (passenger['Pclass']==1 and passenger['Age']<18): df['Prediction']=1. So essentially it picks one record at a time, and when the condition is true it adds 1 for that variable, else 0. If there were no conditional statement, or if all records were used at once, I would understand the reason. But why here?
In tools like SAS or SPSS, if I write the same condition and say Prediction=1 else Prediction=0, it creates a new variable Prediction that is 1 where the condition is true and 0 otherwise. I guess this works because SAS and SPSS read in one record at a time and apply the code to that record before writing the output. So essentially we did the same here by calling df.iterrows() to fetch one record at a time and then applying the conditional to it. Why doesn't it work? Thanks for your help!
Even though your for-loop is iterating through rows of df, your assignment is referencing the entire column, not just one row. Consider, for example, if you were to print df['Prediction'] you get the entire column, not just a row. df['Prediction'] is not somehow contextualized to refer to one row just because it is in a for-loop. That's not how Python works.
And just to be clear, df['Prediction'] = 1 assigns 1 to each value in the column, it does not add 1 to each value in the column. To add 1 to each value, you'd use df['Prediction'] += 1.
In your for-loop, if you replace df['Prediction']=1 with passenger['Prediction']=1, then you would be adding a new "column" to the passenger row but this would not affect df since passenger is a copy of a row of df, not a view into the underlying data in df. So again, this is just not the right way to accomplish your goal.
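A small sketch of that last point, on a hypothetical two-row frame (made-up data, with columns_after_loop as an illustrative name):

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female"],
    "Pclass": [3, 1],
    "Age": [22.0, 38.0],
})

for idx, passenger in df.iterrows():
    # passenger is a new Series built from the row's values,
    # so adding a key to it does not touch df.
    passenger["Prediction"] = 1

# df still has only its original columns after the loop.
columns_after_loop = list(df.columns)
```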