Removing substring combinations in Python

Question

I have job description text data stored in a data frame. I cleaned/preprocessed the data a little, including lowercasing and removing extra whitespace, URLs, punctuation, stopwords, and HTML tags.

Next, I want to remove particular combinations of substrings. Here is an example:

If the text reads "required certification license registration", I want to remove the substring combination "certification license" so that the output reads "required registration". However, I want to keep the substrings if they appear separately in the text, like this: "required certification include" [...] "with a license in".

I also want to remove these substrings if they show up in a different order: "required license certification registration". Is there a different solution than using str.replace?

How do I tackle this? Should I tokenize words first?

What do you mean by "substring combinations"? Are they words separated by spaces?
– AziMez, Commented Oct 21, 2022 at 13:20
Wouldn't string.replace("certification license", "") do the job (notwithstanding the double space it leaves behind)?
– Swifty, Commented Oct 21, 2022 at 13:21
Thanks @Swifty. That was my first thought, but I also want to remove these substrings if they show up in reverse order (license certification). I have several of these 'word combinations', so I was hoping for a different, cleaner, simpler strategy. But you are right, it gets the job done.
– isa_r, Commented Oct 21, 2022 at 13:33

AziMez · Accepted Answer · 2022-10-21 13:55:08Z

You can put the combinations in a list and loop through it. For each combination, also build its reversed form and replace both; for a two-word combination, that covers both possible orders. Here is an example:

NB: regular combinations are marked by ****, while reversed ones are marked by ++++. To remove the matches instead of marking them, replace with an empty string.

combinations = ["certification license", "particular combinations"]

text = '''I want to remove particular combinations of substrings.
if the text reads "required certification license registration"
should also remove if they in a different order:
"required license certification registration".
'''

for combination in combinations:
    # Build the reversed word order, e.g. "license certification"
    words = combination.split()
    reverse_combination = ' '.join(reversed(words))
    # Mark matches in the original order, then in the reversed order
    text = text.replace(combination, '***********')
    text = text.replace(reverse_combination, '++++++++++++')

print(text)

Output:

I want to remove *********** of substrings.
if the text reads "required *********** registration"
should also remove if they in a different order:
"required ++++++++++++ registration".
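
As a possible alternative to chained str.replace calls, here is a minimal sketch that uses a regular expression instead. It is not part of the answer above; the helper names build_pattern and remove_combinations are made up for illustration. The sketch assumes the combinations are plain words separated by whitespace, generates every word order with itertools.permutations, uses word boundaries so that a word like "licensed" is not matched, and collapses the leftover double spaces.

```python
import re
from itertools import permutations


def build_pattern(combinations):
    """Compile one regex matching every word order of each combination."""
    alternatives = []
    for combination in combinations:
        words = combination.split()
        for ordering in permutations(words):
            # \s+ tolerates any run of whitespace between the words
            alternatives.append(r"\s+".join(re.escape(w) for w in ordering))
    # \b keeps matches on whole words only
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


def remove_combinations(text, pattern):
    """Delete matched combinations, then collapse leftover whitespace."""
    stripped = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", stripped).strip()


combinations = ["certification license"]
pattern = build_pattern(combinations)

print(remove_combinations("required certification license registration", pattern))
# required registration
print(remove_combinations("required license certification registration", pattern))
# required registration
print(remove_combinations("required certification include with a license in", pattern))
# required certification include with a license in
```

If the text lives in a data frame column (say a hypothetical df["description"]), something like df["description"].apply(lambda s: remove_combinations(s, pattern)) should apply the same cleanup row by row. Note that the number of permutations grows quickly with the number of words per combination, so this is best suited to short combinations.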
