8

I'm wondering how I can replace (remove) multiple words (like 500+) from a string. I know I can use the replace function to do this for a single word, but what if I want to replace 500+ words? I'm interested in removing all generic keywords from an article (such as "and", "I", "you" etc).

Here is the code for 1 replacement.. I'm looking to do 500+..

        string a = "why and you it";
        string b = a.Replace("why", "");
        MessageBox.Show(b);

Thanks

@ Sergey Kucher Text size will vary between a few hundred words to a few thousand. I am replacing these words from random articles.

6
  • What is the size of the text you are replacing in? Commented Aug 4, 2013 at 6:37
  • Does my answer help? If you need something more complex, please let me know. Commented Aug 4, 2013 at 6:44
  • Is this for a stop word list? Commented Aug 4, 2013 at 7:19
  • @Tomer W - yes it is for a stop word list (such as I, you, go, etc all common English words). Commented Aug 4, 2013 at 7:34
  • @user1926567 How long are the texts you are indexing? books? articles? messages? comments? Commented Aug 4, 2013 at 9:21

6 Answers 6

8

I would normally do something like:

// If you want the search/replace to be case sensitive, remove the 
// StringComparer.OrdinalIgnoreCase
Dictionary<string, string> replaces = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) { 
    // The format is word to be searched, word that should replace it
    // or String.Empty to simply remove the offending word
    { "why", "xxx" }, 
    { "you", "yyy" },
};

void Main()
{
    string a = "why and you it and You it";

    // This will search for blocks of letters and numbers (abc/abcd/ab1234)
    // and pass it to the replacer
    string b = Regex.Replace(a, @"\w+", Replacer);
}

string Replacer(Match m)
{
    string found = m.ToString();

    string replace;

    // If the word found is in the dictionary then it's placed in the 
    // replace variable by the TryGetValue
    if (!replaces.TryGetValue(found, out replace))
    {
        // otherwise replace the word with the same word (so do nothing)
        replace = found;
    }
    else
    {
        // The word is in the dictionary. replace now contains the
        // word that will substitute it.

        // At this point you could add some code to maintain upper/lower 
        // case between the words (so that if you -> xxx then You becomes Xxx
        // and YOU becomes XXX)
    }

    return replace;
}
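The case-preserving step mentioned in the comments of the else branch could be sketched roughly as follows. This is only an illustration: CaseHelper and MatchCase are hypothetical names that are not part of the answer above, and the rules (all caps, first letter capitalised, otherwise unchanged) are an assumption about what "maintaining case" should mean.

```csharp
using System;

static class CaseHelper
{
    // Hypothetical helper: shapes 'replacement' so it follows the casing of 'found'.
    public static string MatchCase(string found, string replacement)
    {
        if (string.IsNullOrEmpty(found) || string.IsNullOrEmpty(replacement))
            return replacement;

        // YOU -> XXX. Requires more than one character and at least one letter,
        // so a single "I" is handled by the capitalised-word rule below.
        if (found.Length > 1
            && found == found.ToUpperInvariant()
            && found != found.ToLowerInvariant())
        {
            return replacement.ToUpperInvariant();
        }

        // You -> Xxx
        if (char.IsUpper(found[0]))
            return char.ToUpperInvariant(replacement[0]) + replacement.Substring(1);

        // you -> xxx (the replacement is returned as written in the dictionary)
        return replacement;
    }
}
```

Inside the else branch of Replacer it would be called as replace = CaseHelper.MatchCase(found, replace);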

As someone else wrote, but without problems with substrings (the ass principle... You don't want to remove asses from classes :-) ), and working only if you only need to remove words:

var escapedStrings = yourReplaces.Select(Regex.Escape);
string result = Regex.Replace(yourInput, @"\b(" + string.Join("|", escapedStrings) + @")\b", string.Empty);

I use the \b word boundary anchor. It matches the position between a word character (\w) and a non-word character (or the start/end of the string), so the pattern only matches whole words :-)
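To show the effect of \b on whole words versus substrings, here is a minimal sketch that wraps the same pattern in a reusable object. StopWordRemover is a hypothetical name, and the choice of RegexOptions.IgnoreCase and RegexOptions.Compiled is an assumption, not something the snippet above requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

sealed class StopWordRemover
{
    private readonly Regex _pattern;

    public StopWordRemover(IEnumerable<string> stopWords)
    {
        // Escape each word so characters such as '.' or '+' are matched literally
        string alternation = string.Join("|", stopWords.Select(Regex.Escape));

        // \b on both sides means only whole words match, never parts of longer words
        _pattern = new Regex(@"\b(?:" + alternation + @")\b",
                             RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    public string Remove(string input)
    {
        return _pattern.Replace(input, string.Empty);
    }
}
```

With the word list { "ass" }, the input "classes ass" becomes "classes " because only the standalone word is matched. Note that the spaces around a removed word are left in place, so you may want to collapse repeated whitespace afterwards.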


2 Comments

It's more optimal to create a Regex instance and reuse it, if replacing must be done on several inputs.
@SargeBorsch This is a quick-n-dirty example. There is even a Main method :-)
0

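Create a list of all the words you want to replace and load it into a collection. You can do this fairly simply or make it very complex. A trivial example would be:

var sentence = "mysentence hi";
// One word per line in the file; empty lines are skipped
var words = File.ReadAllText("pathtowordlist.txt")
    .Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
foreach (var word in words)
    // Replace returns a new string, so assign the result back
    sentence = sentence.Replace(word, "x");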

You could create two lists if you wanted a dual mapping scheme.
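The dual mapping idea could be sketched with two parallel lists, where the word at index i in the first list is replaced by the entry at index i in the second. ParallelListReplacer and its parameter names are hypothetical and only serve as an illustration.

```csharp
using System;
using System.Collections.Generic;

static class ParallelListReplacer
{
    public static string ReplaceAll(string sentence,
                                    IList<string> searchWords,
                                    IList<string> replacements)
    {
        if (searchWords.Count != replacements.Count)
            throw new ArgumentException("Both lists must have the same number of entries.");

        for (int i = 0; i < searchWords.Count; i++)
        {
            // string.Replace returns a new string; the result must be assigned back.
            // It throws ArgumentException if the search word is empty.
            sentence = sentence.Replace(searchWords[i], replacements[i]);
        }

        return sentence;
    }
}
```

Like any plain string.Replace approach, this also matches inside longer words, so it is only suitable when that is acceptable.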

4 Comments

Inefficient: with string length M and word count N this is O(N*M), and it can be done in O(M).
It is indeed inefficient, but it will get the job done depending on the requirements. If you have a better solution, I'd be glad to see it!
At first I thought I had one and started to write it, but I found out it is O(NM) as well, maybe with a bit less overhead. So I take my words back; this is the easiest.
-1. 1. What's the point of the foreach loop when the word iteration variable is never used inside the loop's body? 2. Why do you require that the word replacement list must go in a file? Why not start with something simpler, more obvious; i.e. an in-memory collection? Your file suggestion goes a little too far IMHO. (I'm not saying that this approach is invalid, or won't work. It's just more than is perhaps required, and makes your answer more complex than it really needs to be.)
0

Try this:

string text = "word1 word2 you it";
List<string> words = new System.Collections.Generic.List<string>();
words.Add("word1");
words.Add("word2");
words.ForEach(w => text = text.Replace(w, ""));

Edit

If you want to replace text with another text, you can create class Word:

 public class Word
 {
     public string SearchWord { get; set; }
     public string ReplaceWord { get; set; }
 }

And change above code to this:

string text = "word1 word2 you it";
List<Word> words = new System.Collections.Generic.List<Word>();
words.Add(new Word() { SearchWord = "word1", ReplaceWord = "replaced" });
words.Add(new Word() { SearchWord = "word2", ReplaceWord = "replaced" });
words.ForEach(w => text = text.Replace(w.SearchWord, w.ReplaceWord));


0

If you are talking about a single string, the solution is to remove the words with the simple Replace method. As its documentation says:

"Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string".

You may need to replace several words, so you can make a list of these words:

List<string> wordsToRemove = new List<string>();
wordsToRemove.Add("why");
wordsToRemove.Add("how");

and so on

and then remove them from the string

foreach(string curr in wordsToRemove)
   a = a.ToLower().Replace(curr, "");

Important

If you want to keep your string as it was, without lowercasing it and without struggling with upper and lower case, use:

foreach (string curr in wordsToRemove)
{
    // You can reuse this object.
    // Regex.Escape makes sure the word is matched literally.
    Regex regex = new Regex(Regex.Escape(curr), RegexOptions.IgnoreCase);
    myString = regex.Replace(myString, "");
}


0

It depends on the situation, of course, but if your text is long, you have many words, and you want to optimize performance, you should build a trie from the words and search the trie for a match.

It won't lower the order of complexity, which is still O(nm), but for large groups of words it can check multiple words against each character instead of one by one.
I assume a couple of hundred words should be enough for this to be faster.

This is the fastest method in my opinion, and I have written a function for you to start with:

public struct FindRecord
    {
        public int WordIndex;
        public int PositionInString;
    }

    public static FindRecord[] FindAll(string input, string[] words)
    {
        LinkedList<FindRecord> result = new LinkedList<FindRecord>();
        int[] matchs = new int[words.Length];

        for (int i = 0; i < input.Length; i++)
        {
            for (int j = 0; j < words.Length; j++)
            {
                if (input[i] == words[j][matchs[j]])
                {
                    matchs[j]++;
                    if(matchs[j] == words[j].Length)
                    {
                        FindRecord findRecord = new FindRecord {WordIndex = j, PositionInString = i - matchs[j] + 1};
                        result.AddLast(findRecord);
                        matchs[j] = 0;
                    }

                }
                else
                    matchs[j] = 0;
            }
        }
        return result.ToArray();
    }
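The trie described at the start of this answer could be sketched as below. This is a rough illustration rather than a tuned implementation: WordTrie and its members are hypothetical names, matching is case sensitive, and, like FindAll above, it matches substrings rather than whole words. At each position it removes the longest stored word that starts there.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

sealed class WordTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    // Adds one word to the trie, creating child nodes as needed.
    public void Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Length of the longest stored word starting at input[start], or 0 if none.
    private int LongestMatch(string input, int start)
    {
        Node node = _root;
        int best = 0;
        for (int i = start; i < input.Length; i++)
        {
            if (!node.Children.TryGetValue(input[i], out node))
                break;
            if (node.IsWord)
                best = i - start + 1;
        }
        return best;
    }

    // Returns the input with every match removed.
    public string RemoveAll(string input)
    {
        var sb = new StringBuilder(input.Length);
        int pos = 0;
        while (pos < input.Length)
        {
            int len = LongestMatch(input, pos);
            if (len > 0)
                pos += len;               // skip the matched word
            else
                sb.Append(input[pos++]);  // keep this character
        }
        return sb.ToString();
    }
}
```

All words are checked together while walking the text once, which is the benefit the answer describes; the work per character is bounded by the length of the longest word rather than by the number of words.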

Another option:
This might be the rare case where a regex is faster than hand-written code.

Try using

public static string ReplaceAll(string input, string[] words)
    {
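        // Escape each word so regex metacharacters are matched literally (needs System.Linq)
        string wordlist = string.Join("|", words.Select(Regex.Escape));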
        Regex rx = new Regex(wordlist, RegexOptions.Compiled);
        return rx.Replace(input, m => "");
    }


0

Regex can do this better, you just need all the replace words in a list, and then:

var escapedStrings = yourReplaces.Select(PadAndEscape);
// Replace each padded match with a single space so the neighbouring words stay separated
string result = Regex.Replace(yourInput, string.Join("|", escapedStrings), " ");

This requires a function that space-pads the strings before escaping them:

public string PadAndEscape(string s)
{
    return Regex.Escape(" " + s + " ");
}

2 Comments

This suffers from the ass problem (codinghorror.com/blog/2008/10/…)... ass will replace class
@xanatos Whoops, fixed.
