1

I need to remove substrings from the values of a column in a DataFrame.

Example: I have this column 'code' in a DataFrame (in reality, the DataFrame is very large):

3816R(motor) #I need '3816R'
97224(Eletro)
502812(Defletor)
97252(Defletor)
97525(Eletro)
5725 ( 56)

And I have this list to replace the values:

list = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']

I've tried a lot of methods, like:

df['code'] = df['code'].str.replace(list, '')

I also tried regex=True, but none of these methods removed the substrings.

How can I do that?

1
  • Can there be cases where the parentheses contain something you need to keep? If not, handling the generic case would be more efficient. Commented Feb 2, 2023 at 17:31

4 Answers

2

You can use a regex replacement with an alternation ("or") pattern: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/

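l = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']
# Escape the parentheses so they match literally; raw strings avoid invalid-escape warnings.
l = [s.replace('(', r'\(').replace(')', r'\)') for s in l]
# Join the escaped strings into a single alternation ("or") pattern.
regex_str = f"({'|'.join(l)})"
df['code'] = df['code'].str.replace(regex_str, '', regex=True)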

regex_str ends up as:

"(\(motor\)|\(Eletro\)|\(Defletor\)|\( 56\))"
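As a variation on the snippet above, `re.escape` from the standard library can do the escaping, which covers every regex metacharacter in the list rather than only parentheses. This is a minimal sketch: the sample DataFrame is built from the question's values, and the names `to_remove` and `pattern` are made up for illustration.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"code": ["3816R(motor)", "97224(Eletro)", "502812(Defletor)",
                            "97252(Defletor)", "97525(Eletro)", "5725 ( 56)"]})

to_remove = ["(motor)", "(Eletro)", "(Defletor)", "( 56)"]

# re.escape backslash-escapes regex metacharacters such as ( and ),
# so each list entry is matched literally.
pattern = "|".join(re.escape(s) for s in to_remove)

df["code"] = df["code"].str.replace(pattern, "", regex=True)
print(df["code"].tolist())
# ['3816R', '97224', '502812', '97252', '97525', '5725 ']
```

The last value keeps the space that preceded the parenthesis; chaining `.str.strip()` would remove it if that matters.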

10 Comments

I need to pass the list as an argument. How can I do that?
Do you want to remove only the strings in the list, or any string within parentheses?
Only the strings in the list.
Why do you need to pass the list as an argument? Can you explain?
@JoãoFelipeHolanda in that case, you can build the regex string with an "or" condition from the list and use that for the replacement.
0

If you are certain that every row follows the format shown, you could use a lambda function:

df['code_clean'] = df['code'].apply(lambda x: x.split('(')[0])
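For illustration, this is what the lambda does on a few of the question's sample values plus one made-up value without parentheses. Two assumptions are worth noting: every value must be a string (a NaN would raise AttributeError, since floats have no `split`), and any whitespace before the parenthesis is kept.

```python
import pandas as pd

# Values from the question, plus "12345" as a hypothetical row without parentheses.
df = pd.DataFrame({"code": ["3816R(motor)", "97224(Eletro)", "5725 ( 56)", "12345"]})

# Keep everything before the first "("; values without "(" come back unchanged.
df["code_clean"] = df["code"].apply(lambda x: x.split("(")[0])
print(df["code_clean"].tolist())
# ['3816R', '97224', '5725 ', '12345']
```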

0

You can try the regular expression match method: https://docs.python.org/3/library/re.html#re.Pattern.match

import re
df['code'] = df['code'].apply(lambda x: re.match(r'^(\w+)\s*\(.*\)', x).group(1))

The first part of the regular expression, ^(\w+), creates a capturing group of the word characters (letters, digits, or underscores) at the start of the string, before the parenthesis. group(1) then extracts that text.
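The same capturing-group idea can also be expressed with the vectorized `Series.str.extract`, which avoids calling a Python function per row and yields NaN for rows that do not match instead of raising. A minimal sketch on sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({"code": ["3816R(motor)", "97224(Eletro)", "5725 ( 56)"]})

# Capture the leading run of word characters; expand=False returns a Series
# rather than a one-column DataFrame.
df["code"] = df["code"].str.extract(r"^(\w+)", expand=False)
print(df["code"].tolist())
# ['3816R', '97224', '5725']
```

Because the pattern stops at the first non-word character, the space in `5725 ( 56)` is dropped as well.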

0

str.replace works with a single string, not a list of strings. You could loop through the list:

rmlist = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']
for repl in rmlist:
    # regex=False treats each entry as a literal string, so the parentheses need no escaping
    df['code'] = df['code'].str.replace(repl, '', regex=False)
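Two details about this loop may help. It iterates over the list entries, not over the rows, and each `str.replace` call is applied to the whole column at once. Whether the pattern is read as a regex also matters: to my knowledge the default of the `regex` parameter changed to False in pandas 2.0, so passing it explicitly is the safer choice. The sketch below contrasts the two modes on sample values; `as_regex` and `cleaned` are illustrative names.

```python
import pandas as pd

s = pd.Series(["3816R(motor)", "97224(Eletro)", "5725 ( 56)"])
rmlist = ["(motor)", "(Eletro)", "( 56)"]

# Read as a regex, "(motor)" is a capturing group that matches only "motor",
# so the literal parentheses are left behind.
as_regex = s.str.replace("(motor)", "", regex=True)
print(as_regex.tolist())
# ['3816R()', '97224(Eletro)', '5725 ( 56)']

# With regex=False every entry is matched literally; one vectorized pass per entry.
cleaned = s
for repl in rmlist:
    cleaned = cleaned.str.replace(repl, "", regex=False)
print(cleaned.tolist())
# ['3816R', '97224', '5725 ']
```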

Alternatively, if the parenthesized substring is always at the end, split at "(" and discard the extra column that is generated. This is likely to be faster:

df["code"] = df["code"].str.split(pat="(", n=1, expand=True)[0]

str.split is reasonably fast
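As a small worked example of the split approach, the sketch below runs it on two values from the question plus a made-up value without parentheses. The trailing `.str.strip()` is an addition of mine, not part of the answer; it removes the space left behind in values such as `5725 ( 56)`.

```python
import pandas as pd

# "12345" is a hypothetical row without parentheses.
df = pd.DataFrame({"code": ["3816R(motor)", "5725 ( 56)", "12345"]})

# n=1 splits at the first "(" only; with expand=True the result is a DataFrame
# whose column 0 holds the text before the parenthesis.
df["code"] = df["code"].str.split(pat="(", n=1, expand=True)[0].str.strip()
print(df["code"].tolist())
# ['3816R', '5725', '12345']
```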

4 Comments

The DataFrame is too big to loop over, so I'm looking for a method or a function.
str.replace is a vectorised implementation and considerably faster than any other option. Alternatively, why not just split at "("? Whatever comes after the opening parenthesis can be ignored.
You can use apply with a lambda function, but that will be very heavy for a big DataFrame.
Also, regex is at least 10x slower than plain string replacement, so it is not a good fit for large DataFrames. Avoid regex where possible if you need speed.
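The speed claims in these comments are easy to check on your own data. The following is a rough sketch using `timeit` on synthetic data built from the question's values; the function names and the row count are arbitrary, and the timings will vary with the pandas version, the data, and the machine, so no numbers are shown here.

```python
import timeit

import pandas as pd

values = ["3816R(motor)", "97224(Eletro)", "502812(Defletor)", "5725 ( 56)"]
s = pd.Series(values * 5_000)  # 20,000 rows of synthetic data


def with_regex():
    # Remove a parenthesized group using a regular expression.
    return s.str.replace(r"\(.*\)", "", regex=True)


def with_split():
    # Keep only the text before the first "(".
    return s.str.split(pat="(", n=1, expand=True)[0]


# Both approaches should give the same values on this data.
same_result = with_regex().tolist() == with_split().tolist()
print("same result:", same_result)

for fn in (with_regex, with_split):
    print(fn.__name__, timeit.timeit(fn, number=5))
```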
