3

I am trying to parse the following line:

"\#" TEST #comment hello world

In my input, the #comment always comes at the end of the line. There may or may not be a comment, but if there is one, it's always at the end of the line.

I used the following Regex to parse it:

(\#.+)?

I have RegexOptions.RightToLeft turned on. I expected it to pull #comment hello world, but instead it pulls: #" TEST #comment hello world

Why is my regular expression not pulling the right thing, and what expression do I need to make it pull the comment correctly?

9
  • You have to parse the entire string, character escapes and all... FYI, it's a lot harder than it looks. Commented Jul 9, 2011 at 17:24
  • Imagine a line "\#" TEST #" TEST #comment hello world - presumably, the comment starts at the second # - but how would you distinguish that? Commented Jul 9, 2011 at 17:26
  • Also, what if there was no comment and the line was simply: "\#" TEST You really need something that's able to determine if you're inside a pair of quotes. This may be possible with balanced matching, but it's gonna get really complex. Commented Jul 9, 2011 at 17:32
  • @icemanind - Exactly. This situation is one that is far more complex than the answers so far give it credit for. If it were me, I'd write some procedural code to accomplish this. Commented Jul 9, 2011 at 17:40
  • What, exactly, is your input? If it has multi-line strings (e.g. C#'s @"...", Python's r"""...""" or PHP's '...') or comments (e.g. /*...*/), then you'll need to parse the whole document starting from the beginning to do it right. Commented Jul 9, 2011 at 17:43

7 Answers

1

The important question is: how do you tell the difference between the # at the beginning of the line and the # that starts the comment? Let's assume for simplicity that the last # starts the comment.

In that case, what you want to match is

  • one #
  • an arbitrary sequence of text not containing #
  • until the end of the line

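So let's put that into a regex: #[^#]*$. You don't need RightToLeft for it. As far as I know, you also don't need to escape # in C# regular expressions (the exception being RegexOptions.IgnorePatternWhitespace, where an unescaped # starts a comment inside the pattern).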
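A minimal sketch of how that pattern could be used, assuming the line is already in a string. The class and method names (LastHashComment, Extract) are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastHashComment
{
    // Returns the text from the last '#' to the end of the line,
    // or an empty string when the line contains no '#' at all.
    public static string Extract(string line)
    {
        Match m = Regex.Match(line, "#[^#]*$");
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        // The line from the question: "\#" TEST #comment hello world
        Console.WriteLine(Extract("\"\\#\" TEST #comment hello world")); // #comment hello world

        // A line without any '#': nothing is extracted.
        Console.WriteLine(Extract("\"ab\" TEST").Length); // 0
    }
}
```

Keep in mind that this only implements the "last # starts the comment" assumption. A line such as "\#" TEST, with a quoted # and no comment, would still yield #" TEST.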

Of course, if you provide information on how to see the difference between a "valid" # and a "comment-starting" #, a more elegant solution could be found that allows for # within comments.

4 Comments

The whole point is that the # at the beginning messes it up.
Removed the -1, but I'd bet the "simplicity" assumption is the killer for the OP here. (Edit: Apparently not.)
@Mehrdad: Since he did not specify the difference between a comment-starting and a non-comment-starting #, that's all you can do given the specification. (Edit: In fact, his last comment on the question states that this is exactly what he wants.)
@Heinzi - a "comment-starting" # will always be at the end of the input line.
0

I think you'll find too many edge cases when trying to pull this off with regular expressions. Dealing with the quotes is what really complicates things, not to mention escape characters.

A procedural solution is not complicated, and will be faster and easier to modify as needs dictate. Note that I don't know what the escape characters should be in your example, but you could certainly add that to the algorithm...

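using System;
using System.Text;

// Resource1.CodeSnippet is a project resource holding the sample input shown under BEFORE below.
string CodeSnippet = Resource1.CodeSnippet;
StringBuilder CleanCodeSnippet = new StringBuilder();
bool InsideQuotes = false;
bool InsideComment = false;

Console.WriteLine("BEFORE");
Console.WriteLine(CodeSnippet);
Console.WriteLine("");

for (int i = 0; i < CodeSnippet.Length; i++)
{
    switch(CodeSnippet[i])
    {
        case '"' :
            // A quote toggles the quoted state, unless it appears inside a comment.
            if (!InsideComment) InsideQuotes = !InsideQuotes;
            break;
        case '#' :
            // A # outside of quotes starts a comment.
            if (!InsideQuotes) InsideComment = true;
            break;
        case '\n' :
            // A comment ends at the end of the line.
            InsideComment = false;
            break;
    }

    // Copy everything that is not part of a comment.
    if (!InsideComment)
    {
        CleanCodeSnippet.Append(CodeSnippet[i]);
    }
}

Console.WriteLine("AFTER");
Console.WriteLine(CleanCodeSnippet.ToString());
Console.WriteLine("");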

This example strips the comments away from the CodeSnippet. I assumed that's what you were after.

Here's the output:

BEFORE
"\#" TEST #comment hello world
"ab" TEST #comment hello world
"ab" TEST #comment "hello world
"ab" + "ca" + TEST #comment
"\#" TEST
"ab" TEST

AFTER
"\#" TEST
"ab" TEST
"ab" TEST
"ab" + "ca" + TEST
"\#" TEST
"ab" TEST

As I said, you'll probably need to add escape characters to the algorithm. But this is a good starting point.
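For what it's worth, here is a rough sketch of how escape handling could be bolted on. It assumes that a backslash inside double quotes escapes the next character, which the question never actually specifies, and the names (CommentSplitter, FindCommentStart) are invented for this example. Instead of stripping, it reports where the comment starts, so you can take either half of the line.

```csharp
using System;

public static class CommentSplitter
{
    // Returns the index of the '#' that starts the comment, or -1 if the line has none.
    // Assumption: inside double quotes, a backslash escapes the next character.
    public static int FindCommentStart(string line)
    {
        bool insideQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (insideQuotes && c == '\\')
            {
                i++; // skip the escaped character, e.g. \" or \#
            }
            else if (c == '"')
            {
                insideQuotes = !insideQuotes;
            }
            else if (c == '#' && !insideQuotes)
            {
                return i; // first '#' outside of quotes
            }
        }
        return -1;
    }

    public static void Main()
    {
        string[] lines =
        {
            "\"\\#\" TEST #comment hello world",        // "\#" TEST #comment hello world
            "\"a\\\"#b\" TEST #comment \"hello world",  // "a\"#b" TEST #comment "hello world
            "\"\\#\" TEST",                             // "\#" TEST
        };

        foreach (string line in lines)
        {
            int start = FindCommentStart(line);
            string code = start < 0 ? line : line.Substring(0, start).TrimEnd();
            string comment = start < 0 ? "" : line.Substring(start);
            Console.WriteLine("code=[" + code + "] comment=[" + comment + "]");
        }
    }
}
```

Working on one line at a time means there is no InsideComment flag to reset: everything from the returned index to the end of the line is the comment.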

0

The + operator tries to match as many times as it can. To match as few times as possible, use its lazy equivalent, +?:

(#.+?)

Of course, this would give trouble with comments that contain #:

"\#" TEST #comment #hello #world
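To see the difference, here is a small comparison on the line from the question. It assumes the RightToLeft option from the question is what makes the lazy version land on the comment; the class and helper names (LazyRightToLeft, FirstMatch) are just for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyRightToLeft
{
    // Returns the text of the first match the engine finds (empty if there is none).
    public static string FirstMatch(string pattern, string input, RegexOptions options)
    {
        return Regex.Match(input, pattern, options).Value;
    }

    public static void Main()
    {
        // The line from the question: "\#" TEST #comment hello world
        string line = "\"\\#\" TEST #comment hello world";

        // Greedy, right to left: .+ extends as far toward the start of the line as it can.
        Console.WriteLine(FirstMatch("#.+", line, RegexOptions.RightToLeft));
        // prints: #" TEST #comment hello world

        // Lazy, right to left: .+? stops at the first # found while scanning leftward.
        Console.WriteLine(FirstMatch("#.+?", line, RegexOptions.RightToLeft));
        // prints: #comment hello world

        // Lazy, left to right (the default): starts at the first # and takes one more character.
        Console.WriteLine(FirstMatch("#.+?", line, RegexOptions.None));
        // prints: #"
    }
}
```

Note that Regex.Match only returns the first match. Regex.Matches would keep going and also report the # inside the quotes as a separate match.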

5 Comments

Unfortunately that way you can never have a comment like ##### IMPORTANT LINE #####
@Steve Wortham: Yeah, and it works. Don't forget to turn on the RightToLeft option, as the question suggests.
It doesn't seem to work for me, even with the RightToLeft option. It'll match starting at the first # sign.
@Steve Wortham: Here's a nice website where you can test it, derekslager.com/blog/posts/2007/09/… Works for me
Well, #" TEST is still being matched. Granted, it's being matched in a group separately from the rest, but how are you to know if the first match is code and second is a comment? It seems this problem is more complex than anyone is willing to admit.
0

Use " #.+". I left the \ out of my test because \# is not a recognized escape sequence. I left out the (, ) and ? because they were not needed.

Regex regex = new Regex(" #.+");
Console.WriteLine(regex.Match("#\" TEST #comment hello world"));

0

For the test string you've given, this regex pulls the comment correctly (with right to left option): /((?: #).+)$/

Disclaimer:

  • It also pulls the whitespace just before the '#', so you may need to do a trim.
  • The comment cannot contain the sequence ' #'.

0

This will match "#" and everything after it, which is the expected behavior :)

var reg = new Regex("#(.)*");

Hope this helps

0

Right, I've tested this one and it seems to do the necessary.

\#.+(\#.+)$

Specifically, it skips past the first #, then captures everything from the second # to the end of the line, returning

#comment hello world 
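A quick sketch of reading that capture group from C#, with names (SecondHashCapture, Extract) invented for the illustration. It also shows a limitation: the pattern needs at least two # characters, so a line whose only # is the comment marker does not match at all.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondHashCapture
{
    // Returns capture group 1, or an empty string when the pattern does not match.
    public static string Extract(string line)
    {
        Match m = Regex.Match(line, @"\#.+(\#.+)$");
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        // Two # characters: the group holds the text from the last one onward.
        Console.WriteLine(Extract("\"\\#\" TEST #comment hello world")); // #comment hello world

        // Only one # character: the pattern fails, so nothing is captured.
        Console.WriteLine(Extract("\"ab\" TEST #comment hello world").Length); // 0
    }
}
```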
