Pandas Compare Two String Columns to Create a Third Column

Question

I have a dataframe that contains two columns of classes: diamond, gold and silver.

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver']})

I want to create a new column that shows whether the class was upgraded or downgraded.

What I have tried

I wrote the function below to set the rules:

def status_desc(class_pd, old_class, new_class):
    if ((class_pd['old_class'] == 'gold') & (class_pd['new_class'] == 'diamond') or \
       (class_pd['old_class'] == 'silver') & (class_pd['new_class'] == 'diamond') or \
       (class_pd['old_class'] == 'silver') & (class_pd['new_class'] == 'gold')):
        val = 'Upgrade'
    elif ((class_pd['old_class'] == 'diamond') & (class_pd['new_class'] == 'gold') or \
       (class_pd['old_class'] == 'diamond') & (class_pd['new_class'] == 'silver') or \
       (class_pd['old_class'] == 'gold') & (class_pd['new_class'] == 'silver')):
        val = 'Downgrade'
    else:
         val = 'NA'

Then I tried to apply the function to my dataframe like this:

class_pd['class_desc'] = class_pd.apply(lambda x: status_desc(class_pd['old_class'], class_pd['new_class']), axis=1)

Error

I get this error

TypeError: status_desc() missing 1 required positional argument: 'new_class'

Desired Output

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver'],
                         'class_desc': ['Upgrade', 'Downgrade', 'NA']})

Josh Friedlander · Accepted Answer · 2022-10-13 11:16:26Z

3

Another solution, using pd.Categorical, which seems more elegant and more scalable to me:

categories = ['silver', 'gold', 'diamond']
class_pd = class_pd.apply(pd.Categorical, categories=categories, ordered=True)

class_pd['class_desc'] = 'NA'

class_pd.loc[class_pd.old_class > class_pd.new_class, 'class_desc'] = 'Downgrade'
class_pd.loc[class_pd.old_class < class_pd.new_class, 'class_desc'] = 'Upgrade'

We tell Pandas the inherent order, and can then use comparison operators.
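
As a minimal, self-contained sketch of that idea (the names `tier_order`, `old` and `new` are illustrative, not from the question), two ordered categorical Series compare element-wise according to the declared category order rather than alphabetically:

```python
import pandas as pd

# Assumed ordering, lowest to highest; these names are illustrative only.
tier_order = ['silver', 'gold', 'diamond']

old = pd.Series(pd.Categorical(['gold', 'gold', 'silver'],
                               categories=tier_order, ordered=True))
new = pd.Series(pd.Categorical(['diamond', 'silver', 'silver'],
                               categories=tier_order, ordered=True))

# Comparisons follow the category order, not alphabetical order.
upgraded = (old < new).tolist()
downgraded = (old > new).tolist()
print(upgraded)    # [True, False, False]
print(downgraded)  # [False, True, False]
```

With plain strings, 'gold' < 'diamond' would be False because the comparison is alphabetical, which is why the explicit ordering matters here.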

Another way to do the last step (after converting the columns to ordered categoricals), suggested by @jezrael, is numpy.select:

import numpy as np

# Rows where neither condition holds (the classes are equal) get the default.
conditions = [
    class_pd.old_class < class_pd.new_class,
    class_pd.old_class > class_pd.new_class,
]
labels = ["Upgrade", "Downgrade"]
class_pd["class_desc"] = np.select(conditions, labels, default="NA")
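
For reference, here is an end-to-end sketch of the numpy.select route that can be run on its own. It is a variant rather than the exact code above: it assumes the same silver < gold < diamond ordering, leaves the original string columns untouched by comparing temporary categorical copies, and passes a string `default` so the fallback label has the same type as the other labels. The names `df`, `tier`, `old` and `new` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                   'new_class': ['diamond', 'silver', 'silver']})

# Assumed ordering, lowest to highest.
tier = pd.CategoricalDtype(['silver', 'gold', 'diamond'], ordered=True)
old = df['old_class'].astype(tier)
new = df['new_class'].astype(tier)

# Rows matching neither condition (equal classes) fall through to the default.
df['class_desc'] = np.select([old < new, old > new],
                             ['Upgrade', 'Downgrade'],
                             default='NA')
print(df)
```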

edited Oct 13, 2022 at 11:16

answered Oct 13, 2022 at 11:09

6 Comments

jezrael Over a year ago

Working for same solution, super!

Josh Friedlander Over a year ago

thanks, it means a lot to hear that from you! :)

jezrael Over a year ago

Maybe np.select should be alternative ;)

Tom McLean Over a year ago

I was working on the same answer, but this turned out a bit nicer :) (I was struggling on getting less than, greater or equal, but setting the default as "NA" is how to do it nicely)

jezrael Over a year ago

@JoshFriedlander - No, feel free to add it to the answer.

Adam J · Accepted Answer · 2022-10-13 11:09:53Z

1

Your function status_desc takes three arguments (class_pd, old_class, new_class), but you are passing only two: class_pd['old_class'] and class_pd['new_class']. You need to pass the first argument, class_pd, as well. You're also missing a few things:

You need to return the values, not just assign them to val. So return "Upgrade", "Downgrade" and "NA".
In your .apply you need to pass the x of the lambda function; if you pass class_pd, you pass the whole dataframe. x contains a single row of the df, so you're looping through the rows, and for each row the function looks at the old_class and new_class values to apply the logic.
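
To make the second point concrete, here is a small sketch of what apply with axis=1 hands to the function on each call. `describe_row` is a hypothetical helper written only for this illustration.

```python
import pandas as pd

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver']})

# Hypothetical helper, only to show what apply(axis=1) passes in.
def describe_row(row):
    # `row` is a Series holding one row; its values are plain strings here.
    return f"{type(row).__name__}: {row['old_class']} -> {row['new_class']}"

seen = class_pd.apply(describe_row, axis=1).tolist()
print(seen)
# ['Series: gold -> diamond', 'Series: gold -> silver', 'Series: silver -> silver']
```

Because each call sees one row, comparisons such as row['old_class'] == 'gold' produce a single True or False, which is what an if statement needs.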

However, a simpler approach is to take only one argument (the row), since you're not even using old_class and new_class inside your function. Define it like this:

def status_desc(row):
    # `row` is one row of the dataframe, so every comparison below yields a single bool.
    if ((row['old_class'] == 'gold' and row['new_class'] == 'diamond') or
            (row['old_class'] == 'silver' and row['new_class'] == 'diamond') or
            (row['old_class'] == 'silver' and row['new_class'] == 'gold')):
        return 'Upgrade'
    elif ((row['old_class'] == 'diamond' and row['new_class'] == 'gold') or
            (row['old_class'] == 'diamond' and row['new_class'] == 'silver') or
            (row['old_class'] == 'gold' and row['new_class'] == 'silver')):
        return 'Downgrade'
    else:
        return 'NA'

Then call it using:

class_pd['class_desc'] = class_pd.apply(lambda x: status_desc(x), axis=1)

Output using this code:

  old_class new_class class_desc
0      gold   diamond    Upgrade
1      gold    silver  Downgrade
2    silver    silver         NA

edited Oct 13, 2022 at 11:09

answered Oct 13, 2022 at 11:02

1 Comment

Adam J Over a year ago

No worries, glad I could help! You can read more about this way of using apply and lambda to achieve this + other techniques to achieve the same result here: datascienceparichay.com/article/… @Leena

R. Baraiya · Accepted Answer · 2022-10-13 11:09:58Z

1

The main idea is to define a rank list in which position reflects importance, then compare the positions of the old and new classes with an if/else. Code:

rank = ['silver', 'gold', 'diamond']  # position: silver = 0, gold = 1, diamond = 2
class_pd['class_desc'] = class_pd.apply(
    lambda x: 'NA' if x.old_class == x.new_class
    else ('Upgrade' if rank.index(x.old_class) < rank.index(x.new_class) else 'Downgrade'),
    axis=1)
class_pd

Output:

  old_class new_class class_desc
0      gold   diamond    Upgrade
1      gold    silver  Downgrade
2    silver    silver         NA
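
The same rank-by-position idea can also be sketched without a row-wise apply. This is an alternative formulation rather than the code above: it assumes the same `rank` list and builds a lookup dict (`position`, an illustrative name) so that each column is mapped to numbers in one step.

```python
import pandas as pd

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver']})

rank = ['silver', 'gold', 'diamond']
# Map each class name to its position in `rank`.
position = {name: i for i, name in enumerate(rank)}

# Positive difference = moved up, negative = moved down, zero = unchanged.
diff = class_pd['new_class'].map(position) - class_pd['old_class'].map(position)

class_pd['class_desc'] = 'NA'
class_pd.loc[diff > 0, 'class_desc'] = 'Upgrade'
class_pd.loc[diff < 0, 'class_desc'] = 'Downgrade'
print(class_pd)
```

One behavioural difference to be aware of: a class name that is not in `rank` maps to NaN here and the row stays 'NA', whereas rank.index raises a ValueError for an unknown name.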

edited Oct 13, 2022 at 11:09

answered Oct 13, 2022 at 11:04

Gokhan · Accepted Answer · 2022-10-13 11:10:28Z

0

First, you need to pass one more argument, class_pd, to your function. You also need to index into the columns: for instance, instead of class_pd['old_class'] == 'gold', write class_pd['old_class'][0] == 'gold'.
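
A short sketch of the distinction behind this suggestion, using the question's data (the names `mask`, `ambiguous` and `first_is_gold` are illustrative): comparing a whole column gives one boolean per row, which an if statement cannot evaluate, while selecting a single element gives a scalar.

```python
import pandas as pd

class_pd = pd.DataFrame({'old_class': ['gold', 'gold', 'silver'],
                         'new_class': ['diamond', 'silver', 'silver']})

# Comparing a whole column yields one boolean per row, not a single True/False.
mask = class_pd['old_class'] == 'gold'
print(mask.tolist())  # [True, True, False]

# Using that Series directly in `if` raises ValueError (ambiguous truth value).
try:
    if mask:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# Selecting one element by its index label gives a scalar that `if` can test.
first_is_gold = class_pd['old_class'][0] == 'gold'
print(ambiguous, first_is_gold)
```

Note that indexing with [0] only inspects the first row, so labelling every row still needs a row-wise apply or a vectorized comparison.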

answered Oct 13, 2022 at 11:10
