Pandas compare strings in two columns within the same dataframe with conditional output to new column

Question

I have two columns within a data frame containing strings. For example,

import pandas as pd
import numpy as np

data = [['Oct-2019', 'Oranges + Grapes + Pears', 'Grapes + Pears'],
       ['Nov-2019', 'Oranges + Grapes + Pears', 'Oranges + Grapes + Pears']]

df = pd.DataFrame(data, columns =['Date', 'Previous shopping list', 'Recent shopping list'])
print(df)

Fish = ['Salmon', 'Trout']
Fruit = ['Oranges', 'Grapes', 'Pears']

     Date     PSL                 RSL
0  Oct-2019   Oranges + Grapes    Grapes + Pears
              + Pears + Salmon                     

1  Nov-2019   Oranges + Grapes    Oranges + Grapes
              + Pears + Trout     + Pears

I want to compare the strings in both columns and have a text output to a new column that says what has changed between the two lists. Such as, creating a column that will check for the strings related to "Fruit" and output what fruit has been dropped from the recent shopping when compared to the previous list previous shopping list. See Desired output below:

     Date     PSL                 RSL               Fruit lost   Fish Lost
0  Oct-2019   Oranges + Grapes    Grapes + Pears    Oranges      Salmon
              + Pears + Salmon                     

1  Nov-2019   Oranges + Grapes    Oranges + Grapes               Trout
              + Pears + Trout     + Pears

How would I be able to achieve this in using pandas! Apologies if this was not clear the first time!

Thank you for any suggestion/help!

Do you only want to do these for Oranges, or for any difference between Previous and recent? — Erfan
– Erfan, Commented Jan 22, 2020 at 20:54
@Erfan, apologies this was not clear the first time, have made edits to hopefully make it clearer. I am looking for a solution that can compare the two lists and output based on categorisation what item is missing from the previous shopping list compared to the new one. For example, if oranges are missing from the recent shopping list, a column would be created placing oranges as the missing item of fruit. Same goes for if Salmon was missing from the list, it would make a new column stating salmon was missing from the recent shopping list. — Amen_90
– Amen_90, Commented Jan 23, 2020 at 9:36

moo · Accepted Answer · 2020-01-22 21:03:41Z

1

The exact function that you use to process the data depends on your exact output that you require for each combination. Hopefully below will give you enough to create a solution for your problem:

# process data so each row contains a list of elements
df['PSL_processed'] = df['Previous shopping list'].str.split('+')
df['RSL_processed'] = df['Recent shopping list'].str.split('+')

def compare_items(x):
    if set(x.PSL_processed) == set(x.RSL_processed):
        return 'No change'
    elif set(x.PSL_processed) - set(x.CSL_processed) > 0:
        return 'Lost'
    # add in conditional logic here, to meet specification

df.apply(compare_items, axis=1)

The official documentation for pd.apply() is well written.

answered Jan 22, 2020 at 21:03

moo

2,3462 gold badges23 silver badges37 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Amen_90 Over a year ago

thank you for the reply! Apologies this was not clear the first time, have made edits to hopefully make it clearer. I am looking for a solution that can compare the two lists and output based on categorisation what item is missing from the previous shopping list compared to the new one. For example, if oranges are missing from the recent shopping list, a column would be created placing oranges as the missing item of fruit. Same goes for if Salmon was missing from the list, it would make a new column stating salmon was missing from the recent shopping list.

Ioanna · Accepted Answer · 2020-01-22 21:07:53Z

1

To check if the string "Oranges" exists in "Recent shopping list" and create a new column "Oranges Lost" based on the outcome:

df['Oranges Lost'] = np.where(df['Recent shopping list'].str.contains('Oranges'), 'No Change', 'Lost')```

answered Jan 22, 2020 at 21:07

Ioanna

3761 silver badge5 bronze badges

Comments

Amen_90 · Accepted Answer · 2020-01-23 14:08:12Z

So Mark's solution works well to grab the difference between the lists

# process data so each row contains a list of elements
df['PSL_processed'] = df['Previous shopping list'].str.split()
df['RSL_processed'] = df['Recent shopping list'].str.split()

def compare_items(x):
    return set(x.PSL_processed) - set(x.RSL_processed)
    # add in conditional logic here, to meet specification
df['Products_lost'] = df.apply(compare_items, axis=1)

print(df)

On top to that to find the products that = fruit and the products = fish I used the following:

for idx, row in df.iterrows():
    for c in Fruit:
        if c in row['Products_lost']:
            df.ix[idx, 'Fruit lost'] = c
            for c in Fish:
                if c in row['Products_lost']:
                    df.ix[idx, 'Fish lost'] = c

Seems to work well!

Collectives™ on Stack Overflow

Pandas compare strings in two columns within the same dataframe with conditional output to new column

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related