I have job description text stored in a data frame. I have already preprocessed it a little: lowercasing and removing extra whitespace, URLs, punctuation, stopwords, and HTML tags.

Next, I want to remove particular combinations of substrings. Here is an example:

If the text reads "required certification license registration", I want to remove the substring combination "certification license" so that the output reads "required registration". However, I want to keep the substrings if they appear separately in the text, like this: "required certification include" [...] "with a license in".

I also want to remove these substrings if they show up in a different order, as in "required license certification registration". Is there a different solution than using str.replace?

How do I tackle this? Should I tokenize words first?

  • What do you mean by substring combinations? Are they words separated by spaces? Commented Oct 21, 2022 at 13:20
  • Wouldn't string.replace("certification license", "") do the job (notwithstanding the double space it leaves behind) ? Commented Oct 21, 2022 at 13:21
  • Thanks @Swifty. That was my first thought but I also want to remove these substrings if they show up in reverse order (license certification). I have several of these 'word combinations', so I was hoping for a different, cleaner, simpler strategy. But you are right, it gets the job done. Commented Oct 21, 2022 at 13:33

1 Answer

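You can put the combinations in a list and loop through it, building the reversed form of each combination inside the loop. Here is an example:

NB: to make the matches visible, regular combinations are replaced with asterisks (***********) and reversed ones with plus signs (++++++++++++). Use an empty string as the replacement to remove them instead.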

# Word combinations to look for, each written in its "regular" order
combinations = ["certification license", "particular combinations"]

text = '''I want to remove particular combinations of substrings.
if the text reads "required certification license registration"
should also remove if they in a different order:
"required license certification registration".
'''

for combination in combinations:
    # Build the reversed form, e.g. "license certification"
    words = combination.split()
    reverse_combination = ' '.join(reversed(words))
    # Mark regular matches with asterisks and reversed matches with plus signs
    text = text.replace(combination, '***********')
    text = text.replace(reverse_combination, '++++++++++++')

print(text)

Output:

I want to remove *********** of substrings.
if the text reads "required *********** registration"
should also remove if they in a different order:
"required ++++++++++++ registration".
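
If you have several combinations and want them removed rather than marked, a regular expression may be a cleaner route than chained str.replace calls. The following is only a sketch under a few assumptions: the text is already lowercased and punctuation-free (as described in the question), words within a combination are separated by whitespace, and a combination counts only when its words are directly adjacent. The helper names build_pattern and remove_combinations are made up for this example.

```python
import re
from itertools import permutations


def build_pattern(combinations):
    """Compile one regex that matches each combination in any word order."""
    alternatives = []
    for combination in combinations:
        words = combination.split()
        for ordering in permutations(words):
            # \s+ tolerates any run of whitespace between adjacent words
            alternatives.append(r"\s+".join(re.escape(w) for w in ordering))
    # \b keeps the match from starting or ending inside a longer word
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


def remove_combinations(text, pattern):
    """Delete every match, then collapse the whitespace left behind."""
    stripped = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", stripped).strip()


pattern = build_pattern(["certification license"])

print(remove_combinations("required certification license registration", pattern))
print(remove_combinations("required license certification registration", pattern))
print(remove_combinations("required certification include with a license in", pattern))
```

The first two calls print "required registration"; the third prints its input unchanged, because the two words never appear next to each other. Because of the word boundaries, "certification licensed" is also left alone. To run this over a data frame, something like df["text"].apply(lambda t: remove_combinations(t, pattern)) should work, where "text" is a placeholder for your actual column name. Note that permutations grows factorially, so this is meant for combinations of two or three words.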