Regex.match in C# performance problem

Question

Hi all i use Regex.match in C# to phase a text file line by line. I find it will spend more time(about 2-4 secs) when the line cant match the patten. But spend less time(less than 1 sec) when match. Who can tell me how can i improve the performance?

This is the regex I'm using:

^.*?\t.*?\t(?<npk>\d+)\t(?<bol>\w+)\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)\t.*?\t.*?\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})\t.*?\t.*?\t.*?\t.*?\t.*?\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t.*?\t(?<material>[\w\-]+)\tIV$

Can you post the regex so we can have a look? The problems are most likely due to the regex itself so we'll need to see it. — MrKWatkins
– MrKWatkins, Commented Feb 18, 2011 at 9:21
That's perfectly normal; it's almost always the failed match attempts that kill performance. Show us the regex and I'm sure we can help you fine-tune it. Show us the code that applies the regex, and a sample of the data, and we'll be able to help even more. (And I mean, edit your question and add the info to it; don't put it in a comment.) — Alan Moore
– Alan Moore, Commented Feb 18, 2011 at 9:32
(minutiae - if it is delimited by tab, it isn't really csv - it is tsv; and your manager is wrong, frankly) — Marc Gravell
– Marc Gravell, Commented Feb 18, 2011 at 10:21

Tim Pietzcker · Accepted Answer · 2011-02-18 10:19:38Z

8

Performance problems that only show up when a regex can't match are very often due to catastrophic backtracking. This happens when a regex allows a lot of possible combinations to match the subject text, all of which have to be tried by the regex engine until it may declare failure.

In your case, the reason for failure is obvious:

First, what you're doing shouldn't really be done with a regex, but rather with a CSV parser (or TSV parser in your case).

If you're stuck with regex, though, you still need to change something. Your problem is that the delimiter \t can also be matched by the dot (.), so unless the entire string matches, the regex engine has to try oodles of permutations, like I outlined above.

A big step forward, therefore, would be to change all the .*? into [^\t]* where applicable, and to condense repeats using the {m,n} operator:

^(?:[^\t]*\t){2}(?<npk>\d+)\t(?<bol>\w+)(?:\t[^\t]*){9}\t\s*(?<netValue>[\d\.,]+)(?:\t[^\t]*){2}\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})(?:\t[^\t]*){5}\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t[^\t]*\t(?<material>[\w\-]+)\tIV$

I hope I haven't miscounted :)

Just for illustration:

Matching this text

1   2   3   4   5   6   7   8   9   0

with this excerpt from your regex above

.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)

takes the regex engine 39 steps.

When you feed it this text, though:

1   2   3   4   5   6   7   8   9   X

it takes the regex engine 4602 steps to figure out that it can't match.

If you use

(?:[^\t]*\t){9}\s*(?<netValue>[\d\.,]+)

instead, the engine needs 30 steps for a successful match and only 39 for a failed attempt.

edited Feb 18, 2011 at 10:19

answered Feb 18, 2011 at 9:25

Tim Pietzcker

337k59 gold badges520 silver badges572 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

user622851 Over a year ago

Thanks Tim, Here's my regexp.It is used to phase a CSV file (splited by \t).^.*?\t.*?\t(?<npk>\d+)\t(?<bol>\w+)\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)\t.*?\t.*?\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})\t.*?\t.*?\t.*?\t.*?\t.*?\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t.*?\t(?<material>[\w\-]+)\tIV$

user622851 Over a year ago

Thank you very much indeed,Tim.I try your solution and the program accelerates so mush. I wonder how telent and amasing the guys in stackoverflow are! Thanks others who help me,too.Also sorry for my poor english. Best regards!

Mywiki Witwiki Over a year ago

Illuminati confirmed!

Inigo Over a year ago

@TimPietzcker, how'd you come up with the step counts, particularly 4602? Did you figure out the simple case and use mathematical induction or something to come up with a formula for sample strings of your example's form, or do you have a tool that calculates it for you? Such a tool would be pretty handy!

Tim Pietzcker Over a year ago

@Inigo: That's what the debugger of RegexBuddy outputs :) (regexbuddy.com)

Marc Gravell · Accepted Answer · 2011-02-18 10:21:06Z

4

Pre-compiling it usually helps:

private static readonly Regex re = new Regex(pattern, RegexOptions.Compiled);

however, I wonder if in this specific case it is tied to the regex - maybe some expensive back-reference. Regex isn't always the tool to use, for example...

Edit now that this appears to be delimited data:

Instead of regex, delimited data can use a parsing approach much more effectively. You might even get away with just var parts=line.Split('\t') (and access parts by index), but if that fails - this csv reader has options to control the delimiters etc.

edited Feb 18, 2011 at 10:21

answered Feb 18, 2011 at 9:21

Marc Gravell

1.1m273 gold badges2.6k silver badges3k bronze badges

3 Comments

user622851 Over a year ago

Thanks Marc. I think you're right. The regex is not the best solution for my issue.

Alan Moore Over a year ago

Since the regex appears to have no provision for quoted fields, splitting on \t probably would work. You'd have to access the fields by index rather than by name, but the regex is so vague, that would probably be more accurate anyway.

user622851 Over a year ago

Thak you Marc. You help me a lot. Best regards!

Collectives™ on Stack Overflow

Regex.match in C# performance problem

2 Answers 2

5 Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related