I have job description text stored in a data frame. I have already done some cleaning/preprocessing: lowercasing and removing extra whitespace, URLs, punctuation, stopwords, and HTML tags.
Next, I want to remove particular combinations of substrings. Here is an example:
If the text reads "required certification license registration", I want to remove the substring combination "certification license" so that the output reads "required registration". However, I want to keep the substrings if they appear separately in the text, like this: "required certification include" [...] "with a license in".
I also want to remove these substrings if they show up in a different order, e.g. "required license certification registration". Is there a solution other than using str.replace?
How do I tackle this? Should I tokenize words first?
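
To make the intended behaviour concrete, here is a minimal sketch of a token-based approach. It assumes the text is already cleaned and whitespace-separated, as described above. The helper name `remove_adjacent_pair` is made up for illustration and is not part of pandas or any other library:

```python
def remove_adjacent_pair(text, first, second):
    """Drop `first` and `second` only where they sit next to each other,
    in either order; leave them alone when they appear separately."""
    pair = {first, second}
    tokens = text.split()  # assumption: tokens are separated by whitespace
    kept = []
    i = 0
    while i < len(tokens):
        # Compare the current token and its right neighbour as an unordered pair.
        if i + 1 < len(tokens) and {tokens[i], tokens[i + 1]} == pair:
            i += 2  # skip both words of the combination
        else:
            kept.append(tokens[i])
            i += 1
    return " ".join(kept)


# The three cases from the question
same_order = remove_adjacent_pair(
    "required certification license registration", "certification", "license")
other_order = remove_adjacent_pair(
    "required license certification registration", "certification", "license")
separate = remove_adjacent_pair(
    "required certification include with a license in", "certification", "license")
```

On a data frame, something like `df["text"].apply(remove_adjacent_pair, args=("certification", "license"))` could apply this row by row, where `df` and the `text` column are placeholder names for my actual data.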