C# Regex Expression Issue

Question

I am trying to parse the following line:

"\#" TEST #comment hello world

In my input, the #comment always comes at the end of the line. There may or may not be a comment, but if there is, it's always at the end of the line.

I used the following Regex to parse it:

(\#.+)?

I have RegexOptions.RightToLeft on. I expected it to pull #comment hello world, but instead it is pulling #" TEST #comment hello world

Why is my regex not pulling the right thing, and what regex do I need to make it pull the comment correctly?

You have to parse the entire string, character escapes and all... FYI, it's a lot harder than it looks. — user541686
– user541686, Commented Jul 9, 2011 at 17:24
Imagine a line "\#" TEST #" TEST #comment hello world - presumably, the comment starts at the second # - but how would you distinguish that? — Damien_The_Unbeliever
– Damien_The_Unbeliever, Commented Jul 9, 2011 at 17:26
Also, what if there was no comment and the line was simply: "\#" TEST You really need something that's able to determine if you're inside a pair of quotes. This may be possible with balanced matching, but it's gonna get really complex. — Steve Wortham
– Steve Wortham, Commented Jul 9, 2011 at 17:32
@icemanind - Exactly. This situation is one that is far more complex than the answers so far give it credit for. If it were me, I'd write some procedural code to accomplish this. — Steve Wortham
– Steve Wortham, Commented Jul 9, 2011 at 17:40
What, exactly, is your input? If it has multi-line strings (e.g. C#'s @"...", Python's r"""...""" or PHP's '...') or comments (e.g. /*...*/), then you'll need to parse the whole document starting from the beginning to do it right. — ridgerunner
– ridgerunner, Commented Jul 9, 2011 at 17:43

Heinzi · Accepted Answer · 2011-07-09 17:29:53Z

1

The important question is: How do you see the difference between the # at the beginning of the line and the # that starts the comment? Let's assume for simplicity that the last # starts a comment.

In that case, what you want to match is

one #
an arbitrary sequence of text not containing #
until the end of the line

So let's put that into a regex: #[^#]*$. You don't need RightToLeft for it. As far as I know, you also don't need to escape # in C# regular expressions.

Of course, if you provide information on how to see the difference between a "valid" # and a "comment-starting" #, a more elegant solution could be found that allows for # within comments.
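As a minimal sketch of the pattern above in C#: the class and method names (LastHashComment, Extract) are made up for illustration, and the code relies on the same simplifying assumption that the last # on the line starts the comment. The # is left unescaped, which is fine as long as RegexOptions.IgnorePatternWhitespace is not set.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastHashComment
{
    // One '#', then any run of non-'#' characters, up to the end of the input.
    private static readonly Regex CommentPattern = new Regex(@"#[^#]*$");

    // Returns the trailing comment, or null when the line contains no '#'.
    // Assumption: the last '#' on the line is the one that starts the comment.
    public static string Extract(string line)
    {
        Match m = CommentPattern.Match(line);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // The line from the question: "\#" TEST #comment hello world
        Console.WriteLine(Extract("\"\\#\" TEST #comment hello world")); // #comment hello world

        // No comment, but a '#' inside the quoted string: the assumption breaks.
        Console.WriteLine(Extract("\"\\#\" TEST")); // #" TEST
    }
}
```

The second call shows the cost of the assumption: on a line that has no comment but does have a # inside the quotes, the quoted # is reported as the start of a comment.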

edited Jul 9, 2011 at 17:29

answered Jul 9, 2011 at 17:24

Heinzi


4 Comments

user541686 Over a year ago

The whole point is that the # at the beginning messes it up.

user541686 Over a year ago

Removed the -1, but I'd bet the "simplicity" assumption is the killer for the OP here. (Edit: Apparently not.)

Heinzi Over a year ago

@Mehrdad: Since he did not specify the difference between a comment-starting and a non-comment-starting #, that's all you can do given the specification. (Edit: In fact, his last comment on the question states that this is exactly what he wants.)

Icemanind Over a year ago

@Heinzi - a "comment-starting" # will always be at the end of the input line.

Steve Wortham · Accepted Answer · 2011-07-09 19:39:11Z

I think you'll find too many edge cases when trying to pull this off with regular expressions. Dealing with the quotes is what really complicates things, not to mention escape characters.

A procedural solution is not complicated, and will be faster and easier to modify as needs dictate. Note that I don't know what the escape characters should be in your example, but you could certainly add that to the algorithm...

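using System;
using System.Text;

// Resource1.CodeSnippet holds the input text (the lines shown under BEFORE below).
string CodeSnippet = Resource1.CodeSnippet;
StringBuilder CleanCodeSnippet = new StringBuilder();
bool InsideQuotes = false;   // true between an opening and a closing double quote
bool InsideComment = false;  // true from an unquoted # up to the end of the line

Console.WriteLine("BEFORE");
Console.WriteLine(CodeSnippet);
Console.WriteLine("");

for (int i = 0; i < CodeSnippet.Length; i++)
{
    switch (CodeSnippet[i])
    {
        case '"':
            // Quotes inside a comment do not open or close a string.
            if (!InsideComment) InsideQuotes = !InsideQuotes;
            break;
        case '#':
            // A # outside quotes starts a comment; inside quotes it is ordinary text.
            if (!InsideQuotes) InsideComment = true;
            break;
        case '\n':
            // A comment ends at the end of its line.
            InsideComment = false;
            break;
    }

    // Copy everything that is not part of a comment, including the newline itself.
    if (!InsideComment)
    {
        CleanCodeSnippet.Append(CodeSnippet[i]);
    }
}

Console.WriteLine("AFTER");
Console.WriteLine(CleanCodeSnippet.ToString());
Console.WriteLine("");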

This example strips the comments away from the CodeSnippet. I assumed that's what you were after.

Here's the output:

BEFORE
"\#" TEST #comment hello world
"ab" TEST #comment hello world
"ab" TEST #comment "hello world
"ab" + "ca" + TEST #comment
"\#" TEST
"ab" TEST

AFTER
"\#" TEST
"ab" TEST
"ab" TEST
"ab" + "ca" + TEST
"\#" TEST
"ab" TEST

As I said, you'll probably need to add escape characters to the algorithm. But this is a good starting point.
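One way the escape handling could be added is sketched below. The escape rule is an assumption, since the question never specifies one: inside quotes, a backslash makes the next character literal, so \" does not close the string. The names CommentStripper and StripComments are illustrative only.

```csharp
using System;
using System.Text;

public static class CommentStripper
{
    // Removes everything from an unquoted '#' up to (not including) the newline.
    // Assumed escape rule: inside quotes, '\' makes the next character literal.
    public static string StripComments(string source)
    {
        var clean = new StringBuilder(source.Length);
        bool insideQuotes = false;
        bool insideComment = false;

        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];

            if (insideComment)
            {
                // Skip comment text; the newline ends the comment and is kept.
                if (c == '\n')
                {
                    insideComment = false;
                    clean.Append(c);
                }
                continue;
            }

            if (insideQuotes && c == '\\' && i + 1 < source.Length)
            {
                // Copy the backslash and the escaped character without interpreting them.
                clean.Append(c).Append(source[++i]);
                continue;
            }

            if (c == '"')
            {
                insideQuotes = !insideQuotes;
            }
            else if (c == '#' && !insideQuotes)
            {
                insideComment = true;
                continue;
            }

            clean.Append(c);
        }

        return clean.ToString();
    }

    public static void Main()
    {
        // Input: "a\"#b" TEST #comment   (escaped quote inside the string)
        Console.WriteLine(StripComments("\"a\\\"#b\" TEST #comment"));
    }
}
```

As in the loop above, the space before the # is kept, so stripped lines end with a trailing space unless the result is trimmed.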

Andomar · Accepted Answer · 2011-07-09 17:25:19Z

0

The + operator tries to match as many times as it can. To match as few times as possible, use its lazy equivalent, +?:

(#.+?)
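The difference between the greedy and lazy forms can be checked with a short sketch on the line from the question. The helper name FirstMatch is made up for illustration; Regex.Match and RegexOptions.RightToLeft are the standard .NET members.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Returns the text of the first match, or an empty string when nothing matches.
    public static string FirstMatch(string line, string pattern,
                                    RegexOptions options = RegexOptions.None)
    {
        return Regex.Match(line, pattern, options).Value;
    }

    public static void Main()
    {
        // The line from the question: "\#" TEST #comment hello world
        string line = "\"\\#\" TEST #comment hello world";

        // Greedy, left to right: starts at the first # and runs to the end.
        Console.WriteLine(FirstMatch(line, "#.+"));  // #" TEST #comment hello world

        // Lazy, left to right: stops after a single character.
        Console.WriteLine(FirstMatch(line, "#.+?")); // #"

        // Lazy, right to left: grows leftward from the end until it meets a #.
        Console.WriteLine(FirstMatch(line, "#.+?", RegexOptions.RightToLeft)); // #comment hello world
    }
}
```

Only the combination of the lazy quantifier and RightToLeft yields the comment here; the lazy form on its own matches just two characters.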

Of course, this would give trouble with comments that contain #:

"\#" TEST #comment #hello #world

edited Jul 9, 2011 at 17:25

answered Jul 9, 2011 at 17:22

Andomar


5 Comments

Howard Over a year ago

Unfortunately that way you can never have a comment like ##### IMPORTANT LINE #####

Andomar Over a year ago

@Steve Wortham: Yeah, and it works. Don't forget to turn on the RightToLeft option, as the question suggests.

Steve Wortham Over a year ago

It doesn't seem to work for me, even with the RightToLeft option. It'll match starting at the first # sign.

Andomar Over a year ago

@Steve Wortham: Here's a nice website where you can test it, derekslager.com/blog/posts/2007/09/… Works for me

Steve Wortham Over a year ago

Well, #" TEST is still being matched. Granted, it's being matched in a group separately from the rest, but how are you to know if the first match is code and second is a comment? It seems this problem is more complex than anyone is willing to admit.

MrFox · Accepted Answer · 2011-07-09 17:31:35Z

0

Use " #.+". I left the \ out of my test because \# is not a recognized escape sequence. I left out the (, ) and ? because they were not needed.

Regex regex = new Regex(" #.+");
Console.WriteLine(regex.Match("#\" TEST #comment hello world"));

answered Jul 9, 2011 at 17:31

MrFox


Mrchief · Accepted Answer · 2011-07-09 17:34:02Z

0

For the test string you've given, this regex pulls the comment correctly (with right to left option): /((?: #).+)$/

Disclaimer:

It also pulls the whitespace just before the '#', so you may need to do a trim.
The comment cannot contain the sequence ' #'.
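A rough sketch of how this pattern and the suggested trim might be wired up in C#, with the surrounding slashes dropped because .NET patterns are plain strings. The names SpaceHashComment and Extract are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceHashComment
{
    // The pattern from the answer, with the right-to-left option it mentions.
    private static readonly Regex Pattern =
        new Regex(@"((?: #).+)$", RegexOptions.RightToLeft);

    // Returns the comment with the leading space trimmed, or null if no " #" is found.
    public static string Extract(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Value.Trim() : null;
    }

    public static void Main()
    {
        // The line from the question: "\#" TEST #comment hello world
        Console.WriteLine(Extract("\"\\#\" TEST #comment hello world")); // #comment hello world

        // A " #" inside the quoted string is picked up as the comment start.
        Console.WriteLine(Extract("\"a #b\" TEST #c")); // #b" TEST #c
    }
}
```

The second call points at one more limitation: the pattern has no notion of quotes, so a space followed by # inside a string literal is treated as the start of the comment.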

answered Jul 9, 2011 at 17:34

Mrchief


Eduardo Cobuci · Accepted Answer · 2011-07-09 17:35:02Z

0

This will match "#" and everything after it, which is the expected behavior :)

var reg = new Regex("#(.)*");

Hope this helps

answered Jul 9, 2011 at 17:35

Eduardo Cobuci


Steve Morgan · Accepted Answer · 2011-07-09 17:41:13Z

0

Right, I've tested this one and it seems to do the necessary.

\#.+(\#.+)$

Specifically, it skips past the first #, then captures everything from the second # to the end of the line, returning

#comment hello world
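For reference, a small sketch of reading that capture group in C#. The wrapper names (SecondHashComment, Extract) are hypothetical; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondHashComment
{
    // A first '#', at least one character, then the captured '#...' up to the end.
    private static readonly Regex Pattern = new Regex(@"\#.+(\#.+)$");

    // Returns capture group 1 (the comment), or null when the pattern does not match.
    public static string Extract(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // The line from the question: "\#" TEST #comment hello world
        Console.WriteLine(Extract("\"\\#\" TEST #comment hello world")); // #comment hello world

        // Only one '#' on the line: the pattern needs two, so nothing matches.
        Console.WriteLine(Extract("\"ab\" TEST #comment hello world") == null); // True
    }
}
```

Because the pattern requires two # characters, it finds nothing on a line whose quoted part contains no #, so it only fits input shaped like the sample line.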

answered Jul 9, 2011 at 17:41

Steve Morgan
