4

Hi all i use Regex.match in C# to phase a text file line by line. I find it will spend more time(about 2-4 secs) when the line cant match the patten. But spend less time(less than 1 sec) when match. Who can tell me how can i improve the performance?

This is the regex I'm using:

^.*?\t.*?\t(?<npk>\d+)\t(?<bol>\w+)\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)\t.*?\t.*?\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})\t.*?\t.*?\t.*?\t.*?\t.*?\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t.*?\t(?<material>[\w\-]+)\tIV$
9
  • 1
    can you show your regex? Commented Feb 18, 2011 at 9:21
  • 1
    Can you post the regex so we can have a look? The problems are most likely due to the regex itself so we'll need to see it. Commented Feb 18, 2011 at 9:21
  • 1
    That's perfectly normal; it's almost always the failed match attempts that kill performance. Show us the regex and I'm sure we can help you fine-tune it. Show us the code that applies the regex, and a sample of the data, and we'll be able to help even more. (And I mean, edit your question and add the info to it; don't put it in a comment.) Commented Feb 18, 2011 at 9:32
  • (see also my edit now you've posted the format) Commented Feb 18, 2011 at 10:19
  • (minutiae - if it is delimited by tab, it isn't really csv - it is tsv; and your manager is wrong, frankly) Commented Feb 18, 2011 at 10:21

2 Answers 2

8

Performance problems that only show up when a regex can't match are very often due to catastrophic backtracking. This happens when a regex allows a lot of possible combinations to match the subject text, all of which have to be tried by the regex engine until it may declare failure.

In your case, the reason for failure is obvious:

First, what you're doing shouldn't really be done with a regex, but rather with a CSV parser (or TSV parser in your case).

If you're stuck with regex, though, you still need to change something. Your problem is that the delimiter \t can also be matched by the dot (.), so unless the entire string matches, the regex engine has to try oodles of permutations, like I outlined above.

A big step forward, therefore, would be to change all the .*? into [^\t]* where applicable, and to condense repeats using the {m,n} operator:

^(?:[^\t]*\t){2}(?<npk>\d+)\t(?<bol>\w+)(?:\t[^\t]*){9}\t\s*(?<netValue>[\d\.,]+)(?:\t[^\t]*){2}\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})(?:\t[^\t]*){5}\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t[^\t]*\t(?<material>[\w\-]+)\tIV$

I hope I haven't miscounted :)


Just for illustration:

Matching this text

1   2   3   4   5   6   7   8   9   0

with this excerpt from your regex above

.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)

takes the regex engine 39 steps.

When you feed it this text, though:

1   2   3   4   5   6   7   8   9   X

it takes the regex engine 4602 steps to figure out that it can't match.

If you use

(?:[^\t]*\t){9}\s*(?<netValue>[\d\.,]+)

instead, the engine needs 30 steps for a successful match and only 39 for a failed attempt.

Sign up to request clarification or add additional context in comments.

5 Comments

Thanks Tim, Here's my regexp.It is used to phase a CSV file (splited by \t).^.*?\t.*?\t(?<npk>\d+)\t(?<bol>\w+)\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t.*?\t\s*(?<netValue>[\d\.,]+)\t.*?\t.*?\t(?<item>\d{6})\t(?<salesDoc>\d+)\t(?<acGiDate>[\d\.]{10})\t.*?\t.*?\t.*?\t.*?\t.*?\t(?<delivery>\d+)\t\s*(?<billQuantity>\d+)\t.*?\t(?<material>[\w\-]+)\tIV$
Thank you very much indeed,Tim.I try your solution and the program accelerates so mush. I wonder how telent and amasing the guys in stackoverflow are! Thanks others who help me,too.Also sorry for my poor english. Best regards!
Illuminati confirmed!
@TimPietzcker, how'd you come up with the step counts, particularly 4602? Did you figure out the simple case and use mathematical induction or something to come up with a formula for sample strings of your example's form, or do you have a tool that calculates it for you? Such a tool would be pretty handy!
@Inigo: That's what the debugger of RegexBuddy outputs :) (regexbuddy.com)
4

Pre-compiling it usually helps:

private static readonly Regex re = new Regex(pattern, RegexOptions.Compiled);

however, I wonder if in this specific case it is tied to the regex - maybe some expensive back-reference. Regex isn't always the tool to use, for example...


Edit now that this appears to be delimited data:

Instead of regex, delimited data can use a parsing approach much more effectively. You might even get away with just var parts=line.Split('\t') (and access parts by index), but if that fails - this csv reader has options to control the delimiters etc.

3 Comments

Thanks Marc. I think you're right. The regex is not the best solution for my issue.
Since the regex appears to have no provision for quoted fields, splitting on \t probably would work. You'd have to access the fields by index rather than by name, but the regex is so vague, that would probably be more accurate anyway.
Thak you Marc. You help me a lot. Best regards!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.