I have the C# code below to remove stop words from a string:

public static string RemoveStopWords(string Parameter)
{
      Parameter = Regex.Replace(Parameter, @"(?<=(\A|\s|\.|,|!|\?))($|_|0|1|2|3|4|5|6|7|8|9|A|about|after|all|also|an|and|another|any|are|as|at|B|be|because|been|before|being|between|both|but|by|C|came|can|come|could|D|did|do|does|E|each|else|F|for|from|G|get|got|H|had|has|have|he|her|here|him|himself|his|how|I|if|in|into|is|it|its|J|just|K|L|like|M|make|many|me|might|more|most|much|must|my|N|never|no|not|now|O|of|on|only|or|other|our|out|over|P|Q|R|re|S|said|same|see|should|since|so|some|still|such|T|take|than|that|the|their|them|then|there|these|they|this|those|through|to|too|U|under|up|use|V|very|W|want|was|way|we|well|were|what|when|where|which|while|who|will|with|would|X|Y|you|your|Z)(?=(\s|\z|,|!|\?))([^.])", " ", RegexOptions.IgnoreCase);
      return Parameter.Trim();
}

But when I run it, it only works when the stop word is not at the end of the string. For example:

"about this book" outputs "book"

"manager only" outputs "manager only"

"only manager" outputs "manager"

Can anyone please guide me?

  • ([^.]) I think that part at the end might be your problem. What happens with the input "manager only "? (Notice the space at the end.) Commented Jan 31, 2021 at 12:40
  • When there is a space at the end, e.g. "manager only ", it is replaced with "manager ". Commented Jan 31, 2021 at 12:57
  • The character class at the end, [^.], expects a single character to be present. But you use a positive lookahead to assert that what is directly to the right is either ! ? , a whitespace char, or the end of the string. So this part, ([^.]), can only match what the lookahead asserted before it; you could omit the lookahead and just match it instead. You can also shorten the pattern a bit by using character classes instead of | to list all the alternatives for the single characters. Commented Jan 31, 2021 at 13:08
  • For example (?<=(?:\A|[\s.,!?]))(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=(?:\s|\z|[,!?])) Commented Jan 31, 2021 at 13:30
  • Thank you so much @the-fourth-bird, it's working perfectly :) Commented Jan 31, 2021 at 14:34

1 Answer

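The capture group at the end of the pattern, ([^.]), requires a single character other than a dot. The lookahead preceding it, (?=(\s|\z|,|!|\?)), limits that character to one of the listed alternatives (it cannot be a dot, as the lookahead already excludes that). At the end of the string the lookahead succeeds through \z, but there is no character left for ([^.]) to consume, so a trailing stop word is never matched.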
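The effect can be reproduced with a reduced pattern. The sketch below is illustrative only: the class name, constants, and the single stop word "only" are assumptions made for the demonstration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingClassDemo
{
    // Reduced form of the question's pattern: one stop word, the lookahead,
    // and then the trailing ([^.]) that must consume one more character.
    public const string WithTrailingClass = @"(?<=\A|\s)only(?=\s|\z)([^.])";

    // The same reduced pattern without the trailing ([^.]).
    public const string LookaheadOnly = @"(?<=\A|\s)only(?=\s|\z)";

    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);

    public static void Main()
    {
        // A space follows "only", so ([^.]) has a character to consume.
        Console.WriteLine(Matches("only manager", WithTrailingClass)); // True

        // "only" is at the end: \z satisfies the lookahead, but ([^.]) finds nothing.
        Console.WriteLine(Matches("manager only", WithTrailingClass)); // False

        // Without ([^.]) the trailing stop word is matched.
        Console.WriteLine(Matches("manager only", LookaheadOnly));     // True
    }
}
```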

If you want to keep that capture group, you could omit the lookahead and just match what you want to allow, like ([\s,!?]|\z), but it would still require at least one of the listed alternatives.

What you could do instead is use only the positive lookahead, and update it to (?=[\s,!?]|\z):

(?<=\A|[\s.,!?])(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=[\s,!?]|\z)
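A minimal sketch of how the question's method might look with this pattern. The class name, the field name, and the choice to build the Regex once in a static field are assumptions. The pattern string is split across several lines only for readability.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StopWords
{
    // The lookahead-only pattern, concatenated from several verbatim strings.
    private static readonly Regex StopWordRegex = new Regex(
        @"(?<=\A|[\s.,!?])(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|" +
        @"been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|" +
        @"get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|" +
        @"me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|" +
        @"see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|" +
        @"those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|" +
        @"who|will|with|would|your?)(?=[\s,!?]|\z)",
        RegexOptions.IgnoreCase);

    // Replaces every stop word with a single space, then trims both ends,
    // as the original method does.
    public static string RemoveStopWords(string parameter)
    {
        return StopWordRegex.Replace(parameter, " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveStopWords("about this book")); // book
        Console.WriteLine(RemoveStopWords("manager only"));    // manager
        Console.WriteLine(RemoveStopWords("only manager"));    // manager
    }
}
```

Because each stop word is replaced with a space rather than removed, runs of several spaces can remain between the surviving words. Only the leading and trailing ones are trimmed.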


A few notes about the pattern

  • To shorten the alternation, you can, for example, use a character class such as a[ts] to match either at or as, or make a character optional, as in and? to match either an or and
  • Inside the lookarounds, you don't have to add another grouping mechanism, so you can use (?=[\s,!?]|\z) instead of (?=(?:[\s,!?]|\z))
  • If you don't need the values of the capture groups (), you can make them non-capturing (?:)
  • The digits 1|2|3 and the letters A|B|C can be shortened to a character class, [A-Z0-9_] when the underscore is included; you might even shorten it to \w
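The notes above can be checked with a short sketch. The helper names FullMatch and CaptureGroupCount are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternNotes
{
    // True when the whole input is matched by the given pattern fragment.
    public static bool FullMatch(string fragment, string input) =>
        Regex.IsMatch(input, @"\A(?:" + fragment + @")\z");

    // Number of capture groups in a pattern, not counting group 0 (the whole match).
    public static int CaptureGroupCount(string pattern) =>
        new Regex(pattern).GetGroupNumbers().Length - 1;

    public static void Main()
    {
        // a[ts] covers the same words as at|as.
        Console.WriteLine(FullMatch("a[ts]", "at")); // True
        Console.WriteLine(FullMatch("a[ts]", "as")); // True
        Console.WriteLine(FullMatch("a[ts]", "an")); // False

        // and? covers the same words as an|and.
        Console.WriteLine(FullMatch("and?", "an"));  // True
        Console.WriteLine(FullMatch("and?", "and")); // True

        // A lookahead written with () captures; with (?:) or no group it does not.
        Console.WriteLine(CaptureGroupCount(@"(?=(\s|\z|,|!|\?))")); // 1
        Console.WriteLine(CaptureGroupCount(@"(?=(?:[\s,!?]|\z))")); // 0
        Console.WriteLine(CaptureGroupCount(@"(?=[\s,!?]|\z)"));     // 0
    }
}
```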