I have a dataframe, df, with a column, school_name, that holds different school names. I want to remove certain words from those names, and wonder what the best way to go about this might be.
For example, I want to remove ‘male’ and ‘female’ from strings like:
‘gps hafiz shahmale p’
‘gpps mogal malep’
‘government primary school chak femalep’
‘govt girls high school syebadadfemale p’
‘ghs male p’
…
There are many other strings besides ‘male’ and ‘female’ that I want to remove, and they have similar complexities. For example, I also want to remove ‘sbcombined’ from strings like:
'government girls high school chak no120sbcombinedp',
'govt boys elementary school chak no119sbcombined t',
'govt boys elementary school chak no 37 sbcombined p'
…
All I could think of so far is to write a separate function for each word, e.g. to remove ‘male’:
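    names = df.school_name.tolist()
    for name in names:
        # 'male' at the very end, or 'male' followed by one trailing character
        # and not preceded by 'fe' (i.e. not part of 'female')
        if (name[-4:] == 'male') or (name[-5:-1] == 'male' and name[-7:-5] != 'fe'):
            cleaned = name.replace('male', '')
            df.loc[df.school_name == name, 'school_name'] = cleaned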
Is there a better, more efficient way to go about this?
Edit: I would also like to know how to deal with the fact that 'male' is part of 'female', which I want to remove as well. When I use re.search to remove 'male', strings that contain 'female' lose only the 'male' part, so 'fe' is left behind. That is something I want to avoid.
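
A minimal sketch of the overlap problem described above, plus one possible way around it: put all the words into a single regex alternation with the longest words first, so that 'female' is tried before 'male'. The `words` list, the `strip_words` helper and the sample `names` are assumptions made up for this illustration, taken from the examples in the question.

```python
import re

# Problem: removing 'male' on its own leaves the 'fe' of 'female' behind
naive = re.sub("male", "", "government primary school chak femalep")
# naive is 'government primary school chak fep'

# Hypothetical list of words to remove; extend as needed
words = ["male", "female", "sbcombined"]

# Longest words first, so 'female' is matched before its substring 'male'
ordered = sorted(words, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(w) for w in ordered))

def strip_words(name):
    """Remove every occurrence of the listed words from one school name."""
    return pattern.sub("", name)

# Sample names copied from the question
names = [
    "gps hafiz shahmale p",
    "government primary school chak femalep",
    "govt girls high school syebadadfemale p",
    "govt boys elementary school chak no119sbcombined t",
]
cleaned = [strip_words(n) for n in names]
```

On the dataframe itself, the same pattern could presumably be applied in one vectorised call, e.g. df['school_name'] = df['school_name'].str.replace(pattern, '', regex=True), instead of looping over the rows. Note that removing a standalone word (as in 'ghs male p') leaves a double space behind, which may need a separate clean-up step.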