I have a dataframe with a column 'purpose' that contains many free-text string values. I want to standardize them by matching a substring and replacing the whole value with a single label.
For instance, these very similar values all describe the same thing: car purchase, buying a second-hand car, buying my own car, cars, second-hand car purchase, car, to own a car, purchase of a car, to buy a car.
I used the following to make this change:
#replace anything to do with buying a car with "Vehicle"
credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*car.*$)','Vehicle')
It worked great: all of those values were replaced with 'Vehicle'.
I have a number of other similar strings in this column for other categories, such as education: supplementary education, education, getting an education, to get a supplementary education, university education, etc.
So I looked up regex syntax and came up with the following:
#replace anything to do with education with "Education"
credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*education|university|educated.*$)','Education')
The results for this look similar to the above: everything says Education now. Yay!
Which brings me to my question. I've gone wrong somewhere in applying this to some of my other strings. For instance, I used a similar method for real estate:
#replace anything to do with real estate with "Real Estate"
credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*real estate|housing|house|property.*$)','Real Estate')
My results here are different. I started with values like purchase my own house, building a house, purchase of a property, etc., and all the method seems to have done is replace just the substring I identified, instead of replacing the entire value with the replacement string.
So instead of having a bunch of entries that say "Real Estate", I have a bunch of entries that say purchase my own Real Estate, building a Real Estate, purchase of a Real Estate, etc.
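A minimal reproduction of the behavior described above, as a sketch: the three sample values come from this question, while the standalone Series named purpose is only a stand-in for credit_data['purpose']. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Stand-in for credit_data['purpose'] (hypothetical name, sample values from the question)
purpose = pd.Series([
    "purchase my own house",
    "building a house",
    "purchase of a property",
])

# Same pattern as in the question
pattern = r'(^.*real estate|housing|house|property.*$)'
result = purpose.str.replace(pattern, 'Real Estate', regex=True)

# Only the matched word is replaced, not the whole value
print(result.tolist())
```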
I'm not sure where I've gone wrong?
Thanks in advance.
Edited to add the requested series from the dataframe:
Df = [purchase of the house, car purchase, supplementary education, to have a wedding, housing, transactions, education, having a wedding, purchase of the house for my family, buy real estate, buy commercial real estate, buy residential real estate, construction of own property, property, building a property, buying a second-hand car, buying my own car, transactions with commercial real estate, building a real estate, housing, transactions with my real estate, cars, to become educated, second-hand car purchase, getting an education, car, wedding ceremony, to get a supplementary education, purchase of my own house, real estate transactions, getting higher education, to own a car, purchase of a car, profile education, university education, buying property for renting out, to buy a car, housing renovation, going to university]
df.to_dict()
For example, applying replace(r'(^.car.$)','Vehicle') to "buying a second-hand car" doesn't work because the regex explicitly looks for "car" as the 2nd, 3rd and 4th characters of a 5-character string. That's what the "^." and ".$" do.
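One likely explanation for the real estate result, offered as a sketch rather than a definitive answer: in a regex, "|" has the lowest precedence, so the pattern in the question reads as "^.*real estate" OR "housing" OR "house" OR "property.*$". Only the first and last alternatives carry the ".*" that swallows the rest of the string. Wrapping the alternatives in a non-capturing group makes "^.*" and ".*$" apply to every alternative. The Series name purpose below is a hypothetical stand-in for credit_data['purpose'], and regex=True is passed explicitly because recent pandas versions default to literal matching.

```python
import pandas as pd

# Stand-in for credit_data['purpose'] (hypothetical name, sample values from the question)
purpose = pd.Series([
    "purchase my own house",
    "buy commercial real estate",
    "housing renovation",
    "building a property",
    "car purchase",
])

# (?:...) groups the alternatives so ^.* and .*$ wrap all of them,
# not just the first and last one.
pattern = r'^.*(?:real estate|housing|house|property).*$'
cleaned = purpose.str.replace(pattern, 'Real Estate', regex=True)

print(cleaned.tolist())
```

The same reading would explain why the education pattern appeared to work: most of those values end in "education", so the first alternative "^.*education" happens to consume the whole string.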