
I asked this question a long time ago; I wish I had read the answers to "When not to use Regex in C# (or Java, C++ etc)" first!

I wish to use Regex (regular expressions) to get a list of all strings in my C# source code, including strings that have double quotes embedded in them.

This should not be hard; however, before I spend time trying to build the regex up, has anyone got a "pre-canned" one already?

This is not as easy as it seems at first, due to cases such as:

  • "av\"d"
  • @"ab""cd"
  • @"ab"""
  • @"""ab"
  • etc.
  • 1
    You might find that this gets a little hairier than you expect (depending upon what you need). For example: what about commented code? If you want to ignore strings within comments, you have to watch for "//" and "/*..*/" blocks (unless the comment delimiters themselves are within a string literal), etc. Commented Jun 8, 2009 at 17:12
  • There are also multi-line @-strings none of the "answers" handle. Regular expressions cannot handle this, you have to use a parser. Commented Jun 8, 2009 at 17:56
  • 7
    As jwz famously said, "now you've got two problems". -- regex.info/blog/2006-09-15/247 -- I would not be using a regular expression to solve this problem in the first place; I would be writing a tokenizer. Commented Jun 8, 2009 at 18:17
  • 3
    Sigh. Don't try to use regular expressions to parse non-regular languages. The examples you showed are just some of the things that will get in your way. Programming languages are, generally speaking, "nested", or "bumpy", and by using regexes, you try to treat them as "flat". Metaphorically speaking, you can try to navigate around the bumps, but it is unclear whether such an algorithm really always reaches its goal. Commented Jun 8, 2009 at 19:01
  • 1
    Therefore, "This should not be hard" is plainly wrong. With plain, or "real" regular expressions, this cannot be done. Modern "regex" engines have extensions to reach out for parts of non-regular domains, but their use is not trivial. So, "this should be very hard", indeed. Commented Jun 8, 2009 at 19:05

4 Answers

8

I am posting this as my answer so it stands out to others reading the question.

As has been pointed out in the helpful comments to my question, it is clear that regex is not a good tool for finding strings in C# code. I could have written a simple “parser” in the time I spent reminding myself of the regex syntax. (“Parser” is an overstatement: there are no " characters in comments etc., as it is my own source code I am dealing with.)

This seems to sum it up well:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

However, until it breaks on my code I will use the regular expression Blixt has posted; if it gives me problems, I will not spend much time trying to fix it before writing my own parser. As a C# string it is:

@"@Q(?:[^Q]+|QQ)*Q|Q(?:[^Q\\]+|\\.)*Q".Replace('Q', '\"')

Update: the above regex had problems, so I just wrote my own parser. Including unit tests, it took about 2 hours to write. That's a lot less time than I spent just trying to find (and test) a pre-canned regex on the web.
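For readers wondering what such a hand-written scanner might look like, here is a minimal sketch in Python. It is not the parser mentioned in this answer (that one was never posted); the function name `find_string_literals` is made up for illustration. It assumes only regular strings, verbatim strings, char literals and the two comment styles need handling, and it deliberately ignores interpolated and raw string literals.

```python
def find_string_literals(source):
    """Return the C# string literals in source, skipping comments.

    Hypothetical sketch: handles "..." and @"..." strings, // and /* */
    comments, and char literals. Interpolated and raw string literals
    are not handled.
    """
    literals = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if source.startswith('//', i):
            # Single-line comment: skip to the end of the line.
            end = source.find('\n', i)
            i = n if end == -1 else end
        elif source.startswith('/*', i):
            # Block comment: skip past the closing */.
            end = source.find('*/', i + 2)
            i = n if end == -1 else end + 2
        elif source.startswith('@"', i):
            # Verbatim string: "" is an embedded quote, backslash is literal.
            j = i + 2
            while j < n:
                if source[j] == '"':
                    if source.startswith('""', j):
                        j += 2
                        continue
                    break
                j += 1
            literals.append(source[i:j + 1])
            i = j + 1
        elif ch == '"' or ch == "'":
            # Regular string or char literal: backslash escapes the next char.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == '\\' else 1
            if ch == '"':
                literals.append(source[i:j + 1])
            i = j + 1
        else:
            i += 1
    return literals


sample = r'''var a = "av\"d"; // "skip me"
/* "skip me too" */ var b = @"ab""cd";
char q = '"'; var c = @"ab""";'''

for literal in find_string_literals(sample):
    print(literal)
```

The point of the design is that each branch consumes one complete token, so a quote inside a comment or a char literal can never be mistaken for the start of a string. That is exactly the context a single regular expression struggles to track.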

The problem I seem to have is that I tend to avoid Regex and just write the string-handling code myself, and then a lot of people claim I am wasting the client’s money by not using Regex. However, whenever I try to use Regex, what seems like a simple match pattern quickly becomes much harder. (None of the online articles on using Regex in .NET that I have read have good instructions that make it clear when NOT to use Regex. Likewise with its MSDN documentation.)

Let's see if we can help solve this problem: I have just created a Stack Overflow question, “When not to use Regex”.

2 Comments

You should share your parser! I'm aware of at least one other question where someone is doing something similar.
@Josh, I was paid to write it so I can't post it. Also, it is very likely to fail on any code base other than the one it was used on. (It was used to find strings that may need translating.)
7

The regular expression for finding C-style strings is:

"(?:[^"\\]+|\\.)*"

This will not take comments into consideration, so your best bet would be to remove all comments first, using the following regular expression:

/\*(?s:(?!\*/).)*\*/|//.*
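As a rough illustration, the comment pattern can be tried out in Python, whose `re` module accepts the same scoped `(?s:...)` syntax (Python 3.6 or later). The helper name `strip_comments` is invented for this sketch. The last line shows a limitation raised in the comments on the question: the pattern knows nothing about strings, so a `//` inside a string literal is treated as a comment.

```python
import re

# Block comments (DOTALL scoped to that branch only) or single-line comments.
COMMENT = re.compile(r'/\*(?s:(?!\*/).)*\*/|//.*')


def strip_comments(source):
    """Remove everything the comment pattern matches (hypothetical helper)."""
    return COMMENT.sub('', source)


code = 'int x = 1; // set x\n/* multi\nline */ int y = 2;'
print(strip_comments(code))

# Caveat: a // inside a string literal is stripped as if it were a comment.
print(strip_comments('var u = "http://x";'))
```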

Note that if you put the above regular expressions in a string, you'll need to double all backslashes and escape any quotation marks.

Update: Changed regular expression for comments to use DOTALL flag for multi-line comments.

Also, you may want to support literal strings, so use this instead of the other string regex:

@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*"
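A quick way to check this combined pattern against the examples from the question is sketched below in Python; the pattern itself is unchanged, and a .NET `Regex` should treat it the same way. The names `CSHARP_STRING` and `find_strings` are made up for the example, and the sample input is assumed to contain no comments.

```python
import re

# First alternative: verbatim (@"...") strings, where "" is an embedded quote.
# Second alternative: regular strings, where \ escapes the next character.
CSHARP_STRING = re.compile(r'@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*"')


def find_strings(source):
    """Return every string literal the pattern finds, in source order."""
    return CSHARP_STRING.findall(source)


sample = r'''var a = "av\"d";
var b = @"ab""cd";
var c = @"ab""";
var d = @"""ab";'''

for literal in find_strings(sample):
    print(literal)
```

The verbatim alternative has to come first: otherwise the regular-string branch would match the quoted part of an @-string and read its contents with backslash-escape rules.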

And a reminder: Don't use DOTALL as a global flag for any of these regular expressions, as it would break the single-line comments and single-line strings (normal strings are single-line, while literal strings can span multiple lines.)

3 Comments

This regular expression doesn't take @"" type string literals into consideration though.
@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*" is not a valid C# string, and as it does not have a @ within the regex I don't see how it is taking @"" type string literals into consideration.
Ah, but you see, it's not a C# string, it's just the regular expression, as stated above. As a C# string it would be "@\"(?:[^\"]+|\"\")*\"|\"(?:[^\"\\\\]+|\\\\.)*\""
0

Via www.regular-expressions.info:

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.
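To make the difference between the two quoted patterns concrete, here is a small Python sketch. The constant names `SINGLE_LINE` and `MULTI_LINE` are invented for the example, and the patterns are copied as-is into raw strings.

```python
import re

# "Unrolled" form: plain characters, then (escape + plain characters) repeated.
SINGLE_LINE = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')
MULTI_LINE = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# Escaped quotes stay inside the match.
print(SINGLE_LINE.findall(r'say("a \"quoted\" word");'))

# A string containing a line break is only matched by the multi-line variant.
text = '"line one\nline two"'
print(SINGLE_LINE.findall(text))
print(MULTI_LINE.findall(text))
```

Because every character can be consumed by only one part of the pattern, a stray quote makes the match fail quickly instead of triggering repeated backtracking.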

1 Comment

This regular expression doesn't take @"" type string literals into consideration though.
0

My 5 cents: the expressions I use in my own C# parser.

normal string:

"((\\")|[^"\\]|\\)*"

verbatim string:

@("[^"]*")+
