I am trying to replace every string in a pandas DataFrame column that contains a certain substring with just the substring itself. Ideally the change would be applied in place, as with 'inplace=True'.

I've tried various regular expressions, but as I'm new to this, nothing I have tried has produced the desired result. I am on Python 3.7.3.

I think the code I need for the replacement within the dataframe is something like

df.replace(to_replace=<regex that matches any string containing the substring>, value='substring', regex=True)

Below is an example of what I'm trying to do.

#original dataframe
import pandas as pd

df = pd.DataFrame({'brand': ['brand1 & brand2', 'brand6', 'brand1/brand3', 'brand9', 'brand4 brand3', 'brand8', 'brand1 and brand6']})
df

    brand
0   brand1 & brand2
1   brand6
2   brand1/brand3
3   brand9
4   brand4 brand3
5   brand8
6   brand1 and brand6

#desired result

df

    brand
0   brand1
1   brand6
2   brand1
3   brand9
4   brand4 brand3
5   brand8
6   brand1

So far, my regular expressions have had no effect. As a note, the real brand names don't actually include the digits 1-9; I used them here to avoid any possible confusion. The actual df I'm manipulating has a little over 10k rows, and within the column 'brands' the strings that contain brand1 make up about 2k of those 10k. I need to replace every string containing brand1 with just 'brand1' alone.

  • Did you add inplace=True? Commented Oct 13, 2019 at 5:47
  • The data you create with pd.DataFrame({'brand':['brand1 & brand2','brand1/brand3','brand4 brand3','brand1 and brand 6']}) and the data you have shown as input don't match. Also, it is not clear what you are trying to replace with what. Commented Oct 13, 2019 at 5:49
  • It should match now. In terms of what I'm trying to replace: for any row containing brand1, I want to replace the whole string with just brand1. So row 0 is originally the literal string 'brand1 & brand2' and I want to replace it with just 'brand1', and so on for the other rows. Commented Oct 13, 2019 at 5:53
  • So, what is going on with row 4? Why doesn't it just become brand4? Commented Oct 13, 2019 at 5:54
  • I need to leave that row as is. Basically, all rows that don't have brand1 somewhere in the string need to be left alone. Only rows with brand1 should be changed. Commented Oct 13, 2019 at 5:56

1 Answer

Use:

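import numpy as np

# Where the string contains 'brand1', use 'brand1'; otherwise keep the original value
df['brand'] = np.where(df['brand'].str.contains('brand1'), 'brand1', df['brand'])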

Input

    brand
0   brand1 & brand2
1   brand6
2   brand1/brand3
3   brand9
4   brand4 brand3
5   brand1 and brand 6

Output

    brand
0   brand1
1   brand6
2   brand1
3   brand9
4   brand4 brand3
5   brand1
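
For comparison, here is a minimal sketch of two alternatives, assuming the sample data from the question. The first is the regex route the question was aiming for, using Series.replace with a pattern that matches the whole string whenever 'brand1' appears in it. The second uses a boolean mask with .loc, which modifies the column in place. The names regex_result and mask are only illustrative. Keep in mind that a plain 'brand1' test would also match a longer name such as 'brand10', so the pattern may need tightening for real data.

```python
import pandas as pd

df = pd.DataFrame({'brand': ['brand1 & brand2', 'brand6', 'brand1/brand3',
                             'brand9', 'brand4 brand3', 'brand8',
                             'brand1 and brand6']})

# Regex route: the pattern spans the whole string, so the entire value
# is replaced whenever 'brand1' occurs anywhere in it.
regex_result = df['brand'].replace(r'^.*brand1.*$', 'brand1', regex=True)

# Mask route: build a boolean Series, then assign only to the matching rows.
# regex=False treats 'brand1' as a literal substring.
mask = df['brand'].str.contains('brand1', regex=False)
df.loc[mask, 'brand'] = 'brand1'

print(df)
```

Both approaches leave rows without 'brand1' untouched; the .loc assignment is the closer match to an 'inplace=True' style of update.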
5 Comments

That did it. Thanks. I'll accept when it lets me. I'm a bit surprised I didn't need regex...I'll have to read up on np.where. Thank you.
Glad to help. It is always better to show input & expected output & explain what you are trying to achieve. This helps others to suggest various ways of achieving it & some of those methods may be more efficient than the one we have in mind :-)
Could you explain some more about how this works? What's special in particular about brand1 in the code, seeing as the output produces other brand<num> values?
There is nothing special about brand1. np.where lets you choose one value where a condition is true and another where it is false. What I have done is say: when the string contains brand1, set the cell to brand1; otherwise keep the existing content of the brand column.
If the solution helped you, consider upvoting/accepting the answer.
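
The np.where behaviour described in the comments can be sketched on a small example. This is only an illustration; the names brands, condition and result are made up for it.

```python
import numpy as np
import pandas as pd

brands = pd.Series(['brand1 & brand2', 'brand6', 'brand4 brand3'])

# Boolean Series: True where the string contains 'brand1'
condition = brands.str.contains('brand1')

# Element by element: take 'brand1' where condition is True,
# otherwise take the corresponding value from brands
result = np.where(condition, 'brand1', brands)

print(condition.tolist())  # [True, False, False]
print(list(result))        # ['brand1', 'brand6', 'brand4 brand3']
```

np.where returns a NumPy array rather than a Series, which is why the answer assigns it back to df['brand'].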
