Remove strings in a column based on strings in another column

Question

I have this DataFrame in pandas:

    text1       text2
0   sunny       This is a sunny day
1   rainy day   No this day is a rainy day

and I want to transform it to this:

    text1       text2
0   sunny       This is a day
1   rainy day   No this day is a

In other words, I want to remove from text2 whatever text appears in text1 of the same row.

I did this:

df = df.apply(lambda x: x['text2'].str.replace(x['text1'], ''))

but I was getting an error:

AttributeError: ("'str' object has no attribute 'str'", 'occurred at index 0')

which may be related to this: https://stackoverflow.com/a/53986135/9024698.

What is the most efficient way to do what I want to do?

jezrael · Accepted Answer · 2019-06-20 13:20:02Z

4

A fast but slightly ugly solution, if you need to replace per row using the other column, is str.replace. Note that it can leave multiple consecutive whitespaces behind:

df['text2'] = df.apply(lambda x: x['text2'].replace(x['text1'], ''), axis=1)
print (df)
       text1              text2
0      sunny     This is a  day
1  rainy day  No this day is a
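The double space in the first row of the output above can be tidied up afterwards. A minimal sketch, assuming that collapsing every run of whitespace into a single space and trimming the ends is acceptable for your data (the variable name removed is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"text1": ["sunny", "rainy day"],
                   "text2": ["This is a sunny day", "No this day is a rainy day"]})

# Remove text1 from text2 row by row (plain substring replacement, no regex).
removed = df.apply(lambda x: x["text2"].replace(x["text1"], ""), axis=1)

# Collapse runs of whitespace left behind by the removal and trim both ends.
df["text2"] = removed.str.replace(r"\s+", " ", regex=True).str.strip()
print(df)
```

This also normalises whitespace that was already in text2, which may or may not be what you want.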

A solution that splits both columns into words and drops every word of text2 that also appears in text1:

df['text2'] = df.apply(lambda x: ' '.join(y for y in x['text2'].split() 
                                          if y not in set(x['text1'].split())), axis=1)
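Because the split-based approach works word by word, it removes every occurrence of each word, not only the exact phrase. A small sketch of the same idea, written as a helper so the set is built once per row (drop_words is a hypothetical name, not part of pandas):

```python
import pandas as pd

df = pd.DataFrame({"text1": ["sunny", "rainy day"],
                   "text2": ["This is a sunny day", "No this day is a rainy day"]})

def drop_words(text, unwanted):
    """Return text without any whitespace-separated word found in unwanted."""
    unwanted_words = set(unwanted.split())  # built once, not once per word
    return " ".join(w for w in text.split() if w not in unwanted_words)

result = df.apply(lambda x: drop_words(x["text2"], x["text1"]), axis=1)
print(result.tolist())
```

For the second row this drops both occurrences of "day", so it differs from the phrase-level replace shown earlier.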

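If you need to remove all values of the other column from every row, it is better to use the solution by @Erfan (recent pandas versions need regex=True for the | alternation to be treated as a pattern):

df['text2'].str.replace('|'.join(df['text1']), '', regex=True)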
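When the values of text1 are joined into one regular expression, any regex metacharacters they contain (such as . or +) would be interpreted as pattern syntax. A hedged sketch that escapes each value first with re.escape and passes regex=True explicitly; the sample data is the one from the question:

```python
import re
import pandas as pd

df = pd.DataFrame({"text1": ["sunny", "rainy day"],
                   "text2": ["This is a sunny day", "No this day is a rainy day"]})

# Escape each value so it is matched literally, then join into one alternation.
pattern = "|".join(re.escape(s) for s in df["text1"])

# regex=True makes the "|" act as alternation rather than a literal character.
result = df["text2"].str.replace(pattern, "", regex=True)
print(result.tolist())
```

Note that this removes every text1 value from every row, which is different from the per-row removal asked for in the question.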

edited Jun 20, 2019 at 13:20

answered Jun 20, 2019 at 12:24

jezrael


4 Comments

Outcast Over a year ago

Thank you (upvote). By the way, I am looking for "No this day is a", which is not what your second solution gives.

Erfan Over a year ago

Can't you simply use df['text2'].str.replace('|'.join(df['text1']), '')?

Outcast Over a year ago

@Erfan, I do not want to replace by all values of another column

Erfan Over a year ago

I see, then it makes sense :) @PoeteMaudit

Alexandre B. · Accepted Answer · 2019-06-20 12:30:56Z

This is because you are applying your function over columns instead of rows. Also, x['text2'] is already a string, so there is no need to call .str. With these modifications, you will have:

print(df.apply(lambda x: x['text2'].replace(x['text1'], ''), axis=1))
# 0       This is a  day
# 1    No this day is a

As you can see, this returns only the processed text2 column (as a Series), not the whole DataFrame.
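To check what apply actually passes to the function, a small sketch like the following may help. It only inspects the name and type of what the lambda receives and uses the sample data from the question; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"text1": ["sunny", "rainy day"],
                   "text2": ["This is a sunny day", "No this day is a rainy day"]})

# Default axis=0: the function receives each column as a Series.
col_names = df.apply(lambda s: s.name)

# axis=1: the function receives each row as a Series, named by its row label.
row_names = df.apply(lambda s: s.name, axis=1)

# Inside a row, row["text2"] is a plain Python str, so it has no .str accessor.
value_types = df.apply(lambda row: type(row["text2"]).__name__, axis=1)

print(col_names.tolist(), row_names.tolist(), value_types.tolist())
```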

Here is an example that returns the whole processed DataFrame:

# Import module
import pandas as pd

df = pd.DataFrame({"text1": ["sunny", "rainy day"],
                   "text2": ["This is a sunny day", "No this day is a rainy day"]})
print(df)
#        text1                       text2
# 0      sunny         This is a sunny day
# 1  rainy day  No this day is a rainy day

# Function to apply
def remove_word(row):
    row['text2'] = row.text2.replace(row['text1'], '')
    return row

# Apply the function on each row (axis = 1)
df = df.apply(remove_word, axis=1)
print(df)
#        text1              text2
# 0      sunny     This is a  day
# 1  rainy day  No this day is a
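Since the question asks about efficiency, one alternative to apply with axis=1 is a plain list comprehension over the two columns. This is only a sketch: it is often reported to be faster on larger frames because it avoids building a Series per row, but that is worth timing on your own data.

```python
import pandas as pd

df = pd.DataFrame({"text1": ["sunny", "rainy day"],
                   "text2": ["This is a sunny day", "No this day is a rainy day"]})

# Pair up the two columns row by row and use the built-in str.replace.
df["text2"] = [t2.replace(t1, "") for t1, t2 in zip(df["text1"], df["text2"])]
print(df)
```

The result is the same as the apply-based version, including the leftover whitespace.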

Lawis · Accepted Answer · 2019-06-20 12:38:26Z

0

Simply use the replace method:

df["text2"]=df["text2"].replace(to_replace=df["text1"],value="",regex=True)

EDIT:

As mentioned by @jezrael, this method does not take the surrounding spaces into account (they are not matched by the regex). However, you can tune the regex to remove some of them, for example by adding optional spaces to the pattern:

df["text2"]=df["text2"].replace(to_replace=df["text1"]+" *",value="",regex=True)

edited Jun 20, 2019 at 12:38

answered Jun 20, 2019 at 12:29

Lawis
