I have a dataframe with the column 'purpose' that has a lot of string values that I want to standardize by finding a string and replacing it.

For instance, some very similar values are car purchase, buying a second-hand car, buying my own car, cars, second-hand car purchase, car, to own a car, purchase of a car, to buy a car

I used the following to make this change:

#replace anything to do with buying a car with "Vehicle"

credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*car.*$)', 'Vehicle', regex=True)

It worked great: all of those values were replaced with 'Vehicle'.

I have a number of other similar strings in this column for other categories, such as education: supplementary education, education, getting an education, to get a supplementary education, university education, etc.

So I looked up regex syntax and came up with the following:

#replace anything to do with education with "Education"

credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*education|university|educated.*$)', 'Education', regex=True)

The results for this look similar to the above: everything says Education now.

Which brings me to my question. I've gone wrong somewhere in applying this to some of my other strings. For instance, I used a similar method for real estate:

#replace anything to do with real estate with "Real Estate"

credit_data['purpose'] = credit_data.purpose.str.replace(r'(^.*real estate|housing|house|property.*$)', 'Real Estate', regex=True)

My results here are different. I started with values like purchase my own house, building a house, purchase of a property, etc., and the method seems to have replaced only the keyword I listed rather than the entire string.

So instead of a bunch of entries that say "Real Estate", I have entries that say purchase my own Real Estate, building a Real Estate, purchase of a Real Estate, etc.

Where have I gone wrong?

Thanks in advance.

Edited to add the requested series from the dataframe:

Df = ['purchase of the house', 'car purchase', 'supplementary education', 'to have a wedding', 'housing', 'transactions',
      'education', 'having a wedding', 'purchase of the house for my family', 'buy real estate',
      'buy commercial real estate', 'buy residential real estate', 'construction of own property', 'property',
      'building a property', 'buying a second-hand car', 'buying my own car', 'transactions with commercial real estate',
      'building a real estate', 'housing', 'transactions with my real estate', 'cars', 'to become educated',
      'second-hand car purchase', 'getting an education', 'car', 'wedding ceremony', 'to get a supplementary education',
      'purchase of my own house', 'real estate transactions', 'getting higher education', 'to own a car',
      'purchase of a car', 'profile education', 'university education', 'buying property for renting out',
      'to buy a car', 'housing renovation', 'going to university']

  • Can you provide a sample dataframe for us to see/test? You can code format the contents of df.to_dict(), for example. Commented Nov 14, 2020 at 3:19
  • you may find regex101.com helpful if you haven't already seen it. You can use it for fast feedback on regex match/replacement operations Commented Nov 14, 2020 at 3:21
  • Applying replace(r'(^.car.$)','Vehicle') to "buying a second-hand car" doesn't work because the regex explicitly looks for "car" as the 2nd, 3rd and 4th values of a 5 character string - that's what the "^." and ".$" do. Commented Nov 14, 2020 at 3:25

1 Answer

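The alternation | has the lowest precedence in a regular expression, so in your pattern ^.* applies only to the first alternative and .*$ only to the last; the alternatives in between match just the keyword itself. Put the alternatives in their own group so that whatever surrounds the group applies to all of them. You can also use \b to match a word boundary and IGNORECASE to cover case issues. So for example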

import re

credit_data.purpose.str.replace(r'\b(real estate|housing|house|property)\b',
    'Real Estate', regex=True, flags=re.IGNORECASE)
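
A quick way to compare the question's pattern with the word-boundary version is to run both on a few values. The sketch below is only illustrative: the Series `s` is a made-up sample (values copied from the question's list), and `ungrouped` and `bounded` are names chosen here, not part of the original code.

```python
import pandas as pd

# Hypothetical sample; the values come from the list posted in the question.
s = pd.Series(["purchase of my own house", "buy real estate", "building a property"])

# Question's pattern: the alternatives are '^.*real estate', 'housing',
# 'house' and 'property.*$', so only the first one reaches back to the start.
ungrouped = s.str.replace(
    r'(^.*real estate|housing|house|property.*$)', 'Real Estate', regex=True)

# Word-boundary pattern: replaces only the keyword, wherever it appears.
bounded = s.str.replace(
    r'\b(real estate|housing|house|property)\b', 'Real Estate', regex=True)

print(ungrouped.tolist())
# ['purchase of my own Real Estate', 'Real Estate', 'building a Real Estate']
print(bounded.tolist())
# ['purchase of my own Real Estate', 'buy Real Estate', 'building a Real Estate']
```

Only "buy real estate" collapses to "Real Estate" under the question's pattern, because it is the only value that the `^.*real estate` alternative can match from the start of the string.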

If you want to replace the entire string, put .* on both sides of the group so that the match covers the whole value.

credit_data.purpose.str.replace(r'.*(real estate|housing|house|property).*',
    'Real Estate', regex=True, flags=re.IGNORECASE)
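
If several categories need the same treatment, one option is to keep the patterns in a dictionary and apply them in a loop. This is a rough sketch rather than a definitive recipe: the Series `purpose`, the `patterns` dictionary, and the keyword lists (including `\bcars?\b` for vehicles) are assumptions made for illustration and would need adjusting to the real data.

```python
import re
import pandas as pd

# Hypothetical sample; the values come from the list posted in the question.
purpose = pd.Series([
    "purchase of my own house",
    "to buy a car",
    "going to university",
    "wedding ceremony",
])

# Assumed keyword groups. The non-capturing group (?:...) keeps '^.*' and
# '.*$' attached to every alternative instead of only the first and last.
patterns = {
    "Vehicle": r"^.*\bcars?\b.*$",
    "Education": r"^.*(?:education|university|educated).*$",
    "Real Estate": r"^.*(?:real estate|housing|house|property).*$",
}

# Apply each pattern in turn; values that match nothing are left unchanged.
for label, pattern in patterns.items():
    purpose = purpose.str.replace(pattern, label, regex=True, flags=re.IGNORECASE)

print(purpose.tolist())
# ['Real Estate', 'Vehicle', 'Education', 'wedding ceremony']
```

Values that match no pattern, such as "wedding ceremony", pass through untouched, so further categories can be added to the dictionary later.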

2 Comments

Thanks for the information! That seems to work better, but it still leaves me with results like "purchase a real estate for my family" when I was hoping to drop the rest of the characters and end up with simply "real estate". I'll read more on regex101, but do you know if there is a way to drop the rest of the string?
Oh, you want to replace the whole thing? Then you may need the end terminators after all. Let me add an example.