1

What is faster having a String and then calling Substring for this String where the first Substring will be the initial String and every new Substring a smaller portion from within the initial substring until its end? Or using a StringBuilder and cutting off more and more from the beginning and then always using ToString from the shrinking string?

This link describes that the StringBuilder ToString method only allocates new space if a new thread accesses the StringBuilder or the returned String is much smaller than the currently allocated space. This sounds faster than using Substring. I can see this logic within the implementation up to .Net Framework 3.5. With version 4.0 there is an immediate call to FastAllocateString (more or less like with String Substring). Is it the same logic as in previous framework versions, only hidden now?

Why I need it:

Edit: Updated explanation: I have a string and a set of user given regexes and a context. The context tells me which regexes would be currently interesting to try to match with the start of the string. After a successful match the start of the string is now after where the last match ended. And now there may be a different context. And additionally I want to know which Regex matched with the start the last time. Since this is done multiple times the performance should be as good as possible.

In order to avoid searching the whole (remaining) string, I would use the start-anchor (^), so the search can be abborted as soon as possible (after knowing that the start does not match). If I do this I have the problem that I cannot use the start index parameter of the Match method with anything other than 0. Otherwise there would never be a match thanks to the anchor.

So I think I have two possibilities now:

1) Always using the Substring method on the input string and keeping and index for the current position.

2) Using a StringBuilder and always removing the start of the input when matched successfuly. Then calling the ToString method to see if the new start matches with the next Regex.

Is the performance of one or both of the above approaches acceptable, or is there another way which is faster?

Edit2: Regarding StringBuilder and String: As described here the implementation of StringBuilder now immediately allocates the space whereas previously this was delayed until the StringBuilder string was changed (which would have been the case all the time for my requirements). So I guess both Substring and ToString are very slow compared to the Regex-only solution from Kendall Frey.

5
  • You might consider posting the RegEx as it might be able to be tweaked to perform your need on its own. +1 for the good question though. Commented Jul 25, 2012 at 18:04
  • 1
    Regex can do this automatically. \G matches at the end of the previous match. Example: refiddle.com/2qf Commented Jul 25, 2012 at 18:06
  • @Kendall Frey This looks interesting I'll take a look! Commented Jul 25, 2012 at 18:11
  • I posted a more complete answer. Commented Jul 25, 2012 at 18:13
  • @Mike It could be any regex and also not necessarily the same regex during one execution. Commented Jul 25, 2012 at 18:22

2 Answers 2

3

You don't need to do this. The .NET regex engine includes a feature that lets you match a string immediately following the previous match, using the \G anchor.

Example regex:

\Gcat

Example string:

catcatcatdogcat

Matches:

(cat)(cat)(cat)dogcat

Edit: I don't think the regex lets you remember the \G anchor position between multiple regexes. Instead, you can use Regex.Match(string, int) overload to begin your string at a certain point, and use the \G anchor to match the starting position.

Sign up to request clarification or add additional context in comments.

9 Comments

I think this is helpful when I only have one regex that I'm using repeatedly but I have more.
What if you just added \G to the beginning of the regex dynamically? Where are they coming from? The user?
Yes from the user. I could add \G. But as I understand it I cannot continue with a new Regex from the last match, or am I wrong? I think this would only work in Perl
Oh, you need multiple regexes for one string? What's the point of splitting them then?
I don't quite get what you mean with "splitting them". I have a string and a set of regexes and a context. The context tells me which regexes would be currently interesting to match with the start of the string. After a successful match the start of the string is now after where the last match ended. And now there may be a different context. And additionally I want to know which Regex matched. I think I'll update this explanation in my initial question because it may clarify things.
|
0

A lot of string manipulation methods have versions with begin/end or begin/length pairs.

I.e. Regex.Match can be used to match substrings which is likely faster than cutting string in parts or inserting sentinel characters.

public Match Match(string input, int beginning, int length)

Note: as with any performance questions you need to try variants and measure yourself for your data.

1 Comment

The problem is that I don't know the length. It could be everything from one character up to the rest of the string.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.