C# replace by regular expression

Question

I have the C# code below to remove stop words from a string:

public static string RemoveStopWords(string Parameter)
{
      Parameter = Regex.Replace(Parameter, @"(?<=(\A|\s|\.|,|!|\?))($|_|0|1|2|3|4|5|6|7|8|9|A|about|after|all|also|an|and|another|any|are|as|at|B|be|because|been|before|being|between|both|but|by|C|came|can|come|could|D|did|do|does|E|each|else|F|for|from|G|get|got|H|had|has|have|he|her|here|him|himself|his|how|I|if|in|into|is|it|its|J|just|K|L|like|M|make|many|me|might|more|most|much|must|my|N|never|no|not|now|O|of|on|only|or|other|our|out|over|P|Q|R|re|S|said|same|see|should|since|so|some|still|such|T|take|than|that|the|their|them|then|there|these|they|this|those|through|to|too|U|under|up|use|V|very|W|want|was|way|we|well|were|what|when|where|which|while|who|will|with|would|X|Y|you|your|Z)(?=(\s|\z|,|!|\?))([^.])", " ", RegexOptions.IgnoreCase);
      return Parameter.Trim();
}

But when I run it, it only works when the stop word is not at the end of the string. For example:

"about this book" outputs "book"

"manager only" outputs "manager only"

"only manager" outputs "manager"

Can anyone please guide me?

([^.]) I think that part at the end might be your problem. What happens with the input "manager only "? (Notice the space at the end.)
– Franz Gleichmann, Commented Jan 31, 2021 at 12:40
When we have a space at the end, e.g. "manager only ", it's replaced with "manager ".
– Naveed Ahmed, Commented Jan 31, 2021 at 12:57
The character class at the end, [^.], expects a single character to be present. But you use a positive lookahead to assert that what is directly to the right is either ! ? , a whitespace char or the end of the string. So this part, ([^.]), can only match what is asserted before it; you could omit the lookahead and just match it instead. You can also shorten the pattern a bit by using character classes instead of | to list all the single-character alternatives.
– The fourth bird, Commented Jan 31, 2021 at 13:08
For example (?<=(?:\A|[\s.,!?]))(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=(?:\s|\z|[,!?]))
– The fourth bird, Commented Jan 31, 2021 at 13:30
Thank you so much @the-fourth-bird, it's working perfectly :)
– Naveed Ahmed, Commented Jan 31, 2021 at 14:34

The fourth bird · Accepted Answer · 2021-01-31 15:17:26Z

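The capture group at the end of the pattern, ([^.]), requires a single character other than a dot. The lookahead preceding it, (?=(\s|\z|,|!|\?)), limits that character to one of the listed alternatives (it cannot be a dot anyway, as the lookahead already excludes it). When the stop word is at the end of the string, the lookahead succeeds through \z, but ([^.]) then has no character left to match, so the whole pattern fails.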

If you want to keep that, you could omit the lookahead and just match what you want to allow, like ([\s,!?]|\z), but it would still require at least one of the listed alternatives.
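To see the failure in isolation, here is a minimal sketch of the question's pattern reduced to a single stop word. The class name `TrailingCharDemo`, the `Matches` helper and the shortening to just `only` are assumptions made for illustration; the lookbehind, lookahead and trailing `([^.])` are copied from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingCharDemo
{
    // The question's pattern with the long alternation reduced to one
    // stop word ("only"). The reduction is an illustrative assumption.
    public const string Original =
        @"(?<=(\A|\s|\.|,|!|\?))(only)(?=(\s|\z|,|!|\?))([^.])";

    // Hypothetical helper: does the pattern match anywhere in the input?
    public static bool Matches(string input) =>
        Regex.IsMatch(input, Original, RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Nothing follows "only": \z satisfies the lookahead,
        // but ([^.]) has no character to consume.
        Console.WriteLine(Matches("manager only"));   // False

        // The trailing space satisfies both the lookahead and ([^.]).
        Console.WriteLine(Matches("manager only "));  // True
    }
}
```

This matches the behaviour reported in the comments: the stop word is only removed when at least one character follows it.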

What you could do is use only the positive lookahead, and update it to (?=[\s,!?]|\z):

(?<=\A|[\s.,!?])(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=[\s,!?]|\z)

.NET regex demo
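Dropped into the method from the question, the updated pattern could look like the sketch below. The pattern is the one given above; the class name `StopWords`, the `Pattern` constant and the `Main` driver are assumptions added so the example runs on its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StopWords
{
    // Pattern from the answer: lookbehind, shortened alternation, and a
    // lookahead only (no trailing ([^.]) that would demand a character).
    private const string Pattern = @"(?<=\A|[\s.,!?])(?:$|[A-Z0-9_]|about|after|all|also|and?|another|any|are|a[ts]|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|ha[ds]|have|her?|here|him|himself|his|how|i[nf]|into|i[st]|its|just|like|make|many|me|might|more|most|much|must|my|never|not?|now|o[fnr]|only|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|tha[tn]|the[nm]?|their|there|these|they|this|those|through|too?|under|up|use|very|want|wa[ys]|we|well|were|what|when|where|which|while|who|will|with|would|your?)(?=[\s,!?]|\z)";

    public static string RemoveStopWords(string parameter)
    {
        // Each stop word is replaced by a single space; Trim() removes
        // the leading and trailing spaces this leaves behind.
        string result = Regex.Replace(parameter, Pattern, " ", RegexOptions.IgnoreCase);
        return result.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveStopWords("about this book")); // book
        Console.WriteLine(RemoveStopWords("manager only"));    // manager
        Console.WriteLine(RemoveStopWords("only manager"));    // manager
    }
}
```

Note that `Trim()` only cleans the ends of the string: a stop word removed from the middle leaves its replacement space next to the spaces that already surrounded it, so you may want to collapse runs of whitespace separately if that matters.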

A few notes about the pattern

To shorten the alternation, you can, for example, use a character class a[ts] to match either at or as, or make a character optional with and? to match either an or and.
Inside the lookarounds, you don't have to add another grouping mechanism, so you can use (?=[\s,!?]|\z) instead of (?=(?:[\s,!?]|\z)).
If you don't need the values of the capture groups (), you can make them non-capturing (?:).
The digits 1|2|3 and the letters A|B|C can be shortened to [A-Z0-9], or [A-Z0-9_] to also match the underscore; you might even shorten it to \w.
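The notes above can be checked with a few small matches. This is only a sketch: the class name `PatternNotesDemo` and the `FullMatch` helper are hypothetical, and the sample inputs are chosen for illustration. The last two lines also show one thing to keep in mind before switching to \w: in .NET it matches Unicode word characters by default, so it accepts more than [A-Z0-9_] does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternNotesDemo
{
    // Hypothetical helper: true when the whole input matches the pattern.
    // The pattern is wrapped in \A(?:...)\z so partial matches don't count.
    public static bool FullMatch(string input, string pattern) =>
        Regex.IsMatch(input, @"\A(?:" + pattern + @")\z", RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Character class: a[ts] covers both "at" and "as".
        Console.WriteLine(FullMatch("at", "a[ts]"));  // True
        Console.WriteLine(FullMatch("as", "a[ts]"));  // True

        // Optional character: and? covers "an" and "and", nothing else.
        Console.WriteLine(FullMatch("an", "and?"));   // True
        Console.WriteLine(FullMatch("and", "and?"));  // True
        Console.WriteLine(FullMatch("ant", "and?"));  // False

        // Groups[0] is always the whole match, so one capture group
        // gives a count of 2 and a non-capturing group a count of 1.
        Console.WriteLine(Regex.Match("at", "(a[ts])").Groups.Count);    // 2
        Console.WriteLine(Regex.Match("at", "(?:a[ts])").Groups.Count);  // 1

        // \w is broader than [A-Z0-9_]: it also accepts accented letters.
        Console.WriteLine(FullMatch("\u00e9", @"\w"));         // True
        Console.WriteLine(FullMatch("\u00e9", "[A-Z0-9_]"));   // False
    }
}
```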
