I'm trying to match Pascal string literal input against the following pattern: @"^'([^']|(''))*'$", but it isn't working. What is wrong with the pattern?

public void Run()
{             
    using(StreamReader reader = new StreamReader(String.Empty))
    {
        var LineNumber = 0;
        var LineContent = String.Empty;

        while(null != (LineContent = reader.ReadLine()))
        {
            LineNumber++;

            String[] InputWords = new Regex(@"\(\*(?:\w|\d)*\*\)").Replace(LineContent.TrimStart(' '), @" ").Split(' ');

            foreach(String word in InputWords)
            {
                Scanner.Scan(word);
            }

        }
    }
}

I search each input line for Pascal comments, replace each one with a space, and then split the line on spaces into substrings, which I match against the following:

private void Initialize()
{
    MatchingTable = new Dictionary<TokenUnit.TokenType, Regex>();

    MatchingTable[TokenUnit.TokenType.Identifier] = new Regex
    (
        @"^[_a-zA-Z]\w*$",
        RegexOptions.Compiled | RegexOptions.Singleline
    );
    MatchingTable[TokenUnit.TokenType.NumberLiteral] = new Regex
    (
        @"(?:^\d+$)|(?:^\d+\.\d*$)|(?:^\d*\.\d+$)",
        RegexOptions.Compiled | RegexOptions.Singleline
    );
}
// ... Here it all comes together
public TokenUnit Scan(String input)
{                         
    foreach(KeyValuePair<TokenUnit.TokenType, Regex> node in this.MatchingTable)
    {
        if(node.Value.IsMatch(input))
        {
            return new TokenUnit
            {
                Type = node.Key                        
            };
        }
    }
    return new TokenUnit
    {
        Type = TokenUnit.TokenType.Unsupported
    };
}
  • What is a Pascal-like string literal? This? Commented Dec 20, 2010 at 15:45
  • Could you show some input strings and the expected result? Commented Dec 20, 2010 at 15:56

1 Answer

The pattern appears to be correct, although it could be simplified:

^'(?:[^']+|'')*'$

Explanation:

^      # Match start of string
'      # Match the opening quote
(?:    # Match either...
 [^']+ # one or more characters except the quote character
 |     # or
 ''    # two quote characters (= escaped quote)
)*     # any number of times
'      # Then match the closing quote
$      # Match end of string

This regex will fail if the input you're checking it against contains anything besides a Pascal string (say, surrounding whitespace).

So if you want to use the regex to find Pascal strings within a larger text corpus, then you need to remove the ^ and $ anchors.
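
As a minimal sketch of that idea, the unanchored pattern can be handed to Regex.Matches to pull every single-quoted Pascal string out of a longer line. The PascalStringFinder class and FindPascalStrings method are hypothetical names, not part of the question's code. The sketch also swaps the non-capturing group for an atomic group, (?>...), which should accept the same strings while avoiding heavy backtracking when a quote is never closed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PascalStringFinder
{
    // The pattern from the answer without the ^ and $ anchors.
    // Assumption: (?>...) replaces (?:...) to limit backtracking.
    private static readonly Regex PascalString =
        new Regex(@"'(?>[^']+|'')*'");

    // Returns every single-quoted literal found in the text, in order.
    public static List<string> FindPascalStrings(string text)
    {
        var result = new List<string>();
        foreach (Match m in PascalString.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

Calling PascalStringFinder.FindPascalStrings on a line such as writeln('It''s here', 'x'); should return the two literals 'It''s here' and 'x'. Note that a stray, unbalanced quote elsewhere in the text can still pair up with a later quote, so this is only suitable for input where the quotes are well formed.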

And if you want to allow double quotes, too, then you need to augment the regex:

^(?:'(?:[^']+|'')*'|"(?:[^"]+|"")*")$

In C#:

foundMatch = Regex.IsMatch(subjectString, "^(?:'(?:[^']+|'')*'|\"(?:[^\"]+|\"\")*\")$");

This regex will match strings like

'This matches.'
'This too, even though it ''contains quotes''.'
"Mixed quotes aren't a problem."
''

It won't match strings like

'The quotes aren't balanced or escaped.'
There is something 'before or after' the quotes.
    "Even whitespace is a problem."
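
To check samples like these yourself, here is a small sketch that wraps the same Regex.IsMatch call. The QuoteCheck class and IsQuotedString method are hypothetical names; the pattern is the one shown above, only written as a verbatim string so that the double quotes are escaped by doubling instead of with backslashes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class QuoteCheck
{
    // Same anchored pattern as in the answer. Inside @"...", a double
    // quote is written as "" rather than \".
    private const string Pattern =
        @"^(?:'(?:[^']+|'')*'|""(?:[^""]+|"""")*"")$";

    // True when the whole input is one single- or double-quoted string.
    public static bool IsQuotedString(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }
}
```

For example, QuoteCheck.IsQuotedString("''") should be true, while an input with leading whitespace or an unescaped quote in the middle should be false.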

3 Comments

I split the input on whitespace and match each resulting string against some lexeme classes. That's why I added the anchors. So, as far as I understand, this isn't an appropriate sequence in Pascal: '''pascal-like'' string'. Am I right?
If you split your input by whitespace, then it will also split inside the strings, won't it? I'll provide some examples of what this regex will and won't match - you might want to provide some samples of your actual input (edit your question and paste some samples).
So you are right about string literals. If I split the line on whitespace, then I won't be able to match it my way even with a working pattern. So what should I do? Thanks in advance!
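
One possible way around the splitting problem, sketched under assumptions rather than taken from the thread: instead of calling Split(' '), let a regex pick out the words, treating a quoted string as one unit. The LineTokenizer class and Tokenize method are hypothetical names, and the sketch assumes comments have already been removed from the line, as in the question's Run method.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LineTokenizer
{
    // At each position, try a quoted Pascal string first; otherwise
    // take a run of characters that are neither whitespace nor a quote.
    private static readonly Regex Word =
        new Regex(@"'(?>[^']+|'')*'|[^\s']+");

    // Splits a comment-free line into words, keeping each string
    // literal in one piece even when it contains spaces.
    public static List<string> Tokenize(string line)
    {
        var words = new List<string>();
        foreach (Match m in Word.Matches(line))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Each returned word could then be passed to Scanner.Scan as before, with the anchored string pattern added to the matching table. A quote that is never closed is silently skipped by this sketch, so a real scanner would probably want to report it as an error instead.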
