
I have a list of 200+ words that are not allowed on a website. The string.Replace method below takes ~80ms. If I increase s < 1000 by a factor of 10 to s < 10,000, this delay goes to ~834ms, a ~10.4× increase. I am worried about the scalability of this function, especially if the list grows. I was told strings are immutable and text.Replace() is creating 200 new strings in memory. Is there something similar to a StringBuilder for this?

List<string> FilteredWords = new List<string>();
FilteredWords.Add("RED");
FilteredWords.Add("GREEN");
FilteredWords.Add("BLACK");
for (int i = 1; i < 200; i++)
{ FilteredWords.Add("STRING " + i.ToString()); }

string text = "";

//simulate a large dynamically generated html page
for (int s = 1; s < 1000; s++)
{ text += @"Lorem ipsum dolor sit amet, minim BLACK cetero cu nam.
            No vix platonem sententiae, pro wisi congue graecis id, GREEN assum interesset in vix.
            Eum tamquam RED pertinacia ex."; }

// This is the function I seek to optimize
foreach (string s in FilteredWords)
{ text = text.Replace(s, "[REMOVED]"); }
  • Why are these words not allowed? There are many ways of representing words that may be blocked without being filtered. Commented Oct 19, 2013 at 5:52
  • Colors are placeholders, obviously. Key terms, profanity, html tags, scripts, etc. need to be visibly scrubbed. We have a list. Please elaborate on "blocked without being filtered". Commented Oct 19, 2013 at 5:54
  • I would try to use regular expressions. They also get slower with long expressions, but it is worth a try. The other option would be to write your own Replace method: go through the whole string and try to look up each word in your list of blocked words Commented Oct 19, 2013 at 6:03

4 Answers


Use StringBuilder.Replace and try to do it as a batch operation. That is to say you should try to only create the StringBuilder once as it has some overhead. It won't necessarily be a lot faster but it will be much more memory efficient.

You should also probably only do this sanitization once instead of every time the data is requested. If you're reading the data from a database, consider sanitizing it once when the data is inserted, so there is less work to do when reading and displaying it on the page.
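For instance, a minimal sketch of that batch operation (the class and method names here are illustrative, not from the question):

```csharp
using System.Collections.Generic;
using System.Text;

public static class Sanitizer
{
    // One StringBuilder, created once; each Replace mutates the buffer
    // in place instead of allocating a new string per filtered word.
    public static string Sanitize(string text, List<string> filteredWords)
    {
        var sb = new StringBuilder(text);
        foreach (string word in filteredWords)
            sb.Replace(word, "[REMOVED]");
        return sb.ToString();
    }
}
```

With the question's list, `Sanitizer.Sanitize(text, FilteredWords)` replaces all 200+ words while allocating only the one builder and the final result string.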


2 Comments

Thank you for your input. you should consider sanitizing it once when the data is inserted into the database - I would like to be able to see what the filtered words are on my end, otherwise, very good suggestion.
@Zerkey - "I would like to be able to see what the filtered words are on my end" - then store both sanitized and unsanitized, if they are different. If they're the same, which I guess they probably will be most of the time, then only store once to save space.

If you expect most of the text to be relatively clean, then scanning the whole text first for matching words could be the better approach. You can also normalize the text at the same time to catch some standard substitutions.

I.e. scan the string by matching individual words (e.g. with a regular expression like "\w+"), then for each detected word look up the (potentially normalized) value in a dictionary of words to replace.

You can either scan first to get the list of "words to replace" and then replace each word later, or scan and build the resulting string at the same time (using a StringBuilder or StreamWriter, obviously not String.Concat / +).

Note: Unicode provides a large number of look-alike characters, so don't expect your effort to be completely successful. E.g. try to find "cool" in the following text: "you are сооl" (those are Cyrillic "о"s).

Sample code (relying on Regex.Replace for tokenization and building the string, and a HashSet for matches):

var toFind = FilteredWords.Aggregate(
      new HashSet<string>(), (c, i) => { c.Add(i); return c; });

text = new Regex(@"\w+")
   .Replace(text, m => toFind.Contains(m.Value) ? "[REMOVED]" : m.Value);
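For the normalization mentioned above, one cheap option is case folding by building the set with a comparer. A hedged sketch (the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaseInsensitiveFilter
{
    public static string Scrub(string text, IEnumerable<string> filteredWords)
    {
        // Normalize by ignoring case: "Red", "RED" and "red" all match.
        var toFind = new HashSet<string>(filteredWords, StringComparer.OrdinalIgnoreCase);
        return new Regex(@"\w+").Replace(
            text, m => toFind.Contains(m.Value) ? "[REMOVED]" : m.Value);
    }
}
```

This only handles case variants; catching Unicode look-alikes would need a separate character-mapping pass.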

2 Comments

Okay, I tried this. It was a horrible idea. s < 1000 took 2,031ms to complete and s < 10,000 took almost two full minutes to complete. It is much more efficient to check a string for the presence of a word than it is to check each "\w+" match against a list/dictionary/hash.
@Zerkey - I've added sample of code that I meant... It shows 4x improvement over your code. I'm not sure why your version is so much slower (hard to reason without code).

There may be a better way, but this is how I would go about solving the problem.

You will need to create a tree structure that contains your dictionary of words to be replaced. The class may be something like:

public class Node 
{
    public Dictionary<char, Node> Children;
    public bool IsWord;
}

Using a dictionary for the Children may not be the best choice, but it provides the easiest example here. Also, you will need a constructor to initialize the Children field. The IsWord field is used to deal with the possibility that a redacted "word" may be the prefix of another redacted "word". For example, if you want to remove both "red" and "redress".

You will build the tree from each character in each of the replacement words. For example:

public void AddWord ( string word ) 
{
    // NOTE: this assumes word is non-null and contains at least one character...

    Node currentNode = Root;

    for (int iIndex = 0; iIndex < word.Length; iIndex++)
    {
        if (currentNode.Children.ContainsKey(word[iIndex]))
        {
            currentNode = currentNode.Children[word[iIndex]];
            continue;
        }

        Node newNode = new Node();
        currentNode.Children.Add(word[iIndex], newNode);
        currentNode = newNode;
    }

    // finished, mark the last node as being a complete word..
    currentNode.IsWord = true;
}

You'll need to deal with case sensitivity somewhere in there. Also, you only need to build the tree once, afterwards you can use it from any number of threads without worrying about locking because you will be only reading from it. (Basically, I'm saying: store it in a static somewhere.)

Now, when you are ready to remove words from your string you will need to do the following:

  • Create a StringBuilder instance to store the result
  • Parse through your source string, looking for the start and stop of a "word". How you define "word" will matter. For simplicity I would suggest starting with Char.IsWhitespace as defining word separators.
  • Once you have determined that a range of characters is a "word", start from the root of the tree and locate the child node associated with the first character of the "word".
  • If you do not find a child node, the entire word is added to the StringBuilder
  • If you find a child node, you continue with the next character matching against Children of the current node, until you either run out of characters or out of nodes.
  • If you reach the end of the "word", check the last node's IsWord field. If true the word is excluded, do not add it to the StringBuilder. If IsWord is false, the word is not replaced and you add it to the StringBuilder
  • Repeat until you have exhausted the input string.

You will also need to add word separators to the StringBuilder, hopefully that will be obvious as you parse the input string. If you are careful to only use the start and stop indices within the input string, you should be able to parse the entire string without creating any garbage strings.

When all of this is done, use StringBuilder.ToString() to get your final result.
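The steps above can be sketched as follows. This is a rough sketch, not a drop-in implementation: the Node class is repeated (with an inline initializer) so the snippet compiles on its own, and BuildTree is a compact stand-in for calling AddWord per word.

```csharp
using System.Collections.Generic;
using System.Text;

public class Node
{
    public Dictionary<char, Node> Children = new Dictionary<char, Node>();
    public bool IsWord;
}

public static class Redactor
{
    // Build the tree once; afterwards it is read-only and thread-safe.
    public static Node BuildTree(IEnumerable<string> words)
    {
        var root = new Node();
        foreach (var word in words)
        {
            var node = root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out var next))
                    node.Children[c] = next = new Node();
                node = next;
            }
            node.IsWord = true; // handles "red" being a prefix of "redress"
        }
        return root;
    }

    public static string RemoveWords(string text, Node root)
    {
        var sb = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            // Copy word separators straight through.
            if (char.IsWhiteSpace(text[i])) { sb.Append(text[i++]); continue; }

            // Find the end of the current "word".
            int start = i;
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;

            // Walk the tree using indices only -- no substring allocations.
            Node node = root;
            for (int j = start; j < i && node != null; j++)
                node = node.Children.TryGetValue(text[j], out var next) ? next : null;

            if (node != null && node.IsWord)
                sb.Append("[REMOVED]");
            else
                sb.Append(text, start, i - start);
        }
        return sb.ToString();
    }
}
```

Note this simple version treats punctuation as part of a word ("RED," would not match), which is exactly the "how you define a word" caveat above.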

You may also need to consider Unicode surrogate codepoints, but you can probably get away without worrying about it.

Beware, I typed this code here directly, so syntax errors, typos and other accidental misdirections are probably included.

5 Comments

Why build a custom data structure? A basic HashSet would likely be significantly faster and have much lower memory usage. If there were many similar words then a radix tree might make sense, but with only a few hundred I doubt it.
Sure, a HashSet could be used to store the word list. The OP seems worried about the creation of garbage strings though. Each "word" would have to be allocated into a new string to be used as a key in the HashSet, which would result in more garbage creation.
Thank you for such an in-depth answer.
I think you have misinterpreted that concern - he's talking about garbage strings during replacement which as you correctly point out a StringBuilder will solve. But creating a custom tree as the word lookup will use far more memory than a single HashSet instance that is reused, and be far slower, while adding significant unneeded complexity.
The OP's problem is quadratic complexity and this search tree solves that. +1 A HashSet cannot solve it because it can only match full words, not substrings.

The real regular expression solution would be:

var filteredWord = new Regex(@"\b(?:" + string.Join("|", FilteredWords.Select(Regex.Escape)) + @")\b", RegexOptions.Compiled);
text = filteredWord.Replace(text, "[REMOVED]");

I don’t know whether this is faster (but note that it also only replaces whole words).
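To illustrate the whole-word behavior, here is a small sketch wrapping the same pattern (the WordFilter name is illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Compile once and reuse; Regex.Escape guards any metacharacters
    // that happen to appear in the word list.
    public static Regex Build(IEnumerable<string> filteredWords) =>
        new Regex(@"\b(?:" + string.Join("|", filteredWords.Select(Regex.Escape)) + @")\b",
                  RegexOptions.Compiled);
}
```

Because of the \b anchors, `WordFilter.Build(new[] { "RED" }).Replace("RED car, REDress", "[REMOVED]")` replaces the standalone "RED" but leaves "REDress" intact.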

