2

I'm trying to use this but can't make it work. I want to check the syntax of expressions like this: (1+2)*(3+4)

I have integers, +, * and brackets. That's it, but it can be nested to any depth.

In BNF syntax the expr can be described like this:

expr
<sum>
sum
<product>{+<product>}
product
<atom>{*<atom>}
atom
<number>|(<expr>)
number
<digit>{<digit>}

I tried to translate this to Perl like this:

$number = '\d+';
$atom = "($number|\\((?R)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^$expr$/)
{
    print "OK";
}

But it doesn't match! What am I doing wrong?

1
  • @toolic That's to avoid collapsing escape, right? But I have doubled the \ as you can see so... Commented Oct 4, 2019 at 11:52

2 Answers 2

5

When you recurse, the ^ at the start of the pattern will fail to match.

Use (?(DEFINE)...) to define the rules instead of using (?R).

'(1+2)*(3+4)' =~ /
   ^ (?&expr) \z

   (?(DEFINE)
      # Rules.
      (?<expr>    (?&sum)                            )
      (?<sum>     (?&product) (?: \+ (?&product) )*+ )
      (?<product> (?&atom)    (?: \* (?&atom)    )*+ )
      (?<atom>    (?&NUMBER) | \( (?&expr) \)        )

      # Tokens.
      (?<NUMBER> \d++ )
   )
/x
   or die("Doesn't match.\n");

which simplifies to

'(1+2)*(3+4)' =~ /
   ^ (?&expr) \z

   (?(DEFINE)
      # Rules.
      (?<expr>      (?&binary_op)                  )
      (?<binary_op> (?&atom) (?: [+*] (?&atom) )*+ )
      (?<atom>      (?&NUMBER) | \( (?&expr) \)    )

      # Tokens.
      (?<NUMBER> \d++ )
   )
/x
   or die("Doesn't match.\n");

That's assuming you're only trying to check for validity rather than trying to parse the string. If you need to parse the string, you can build a parser using Parse::RecDescent or Marpa::R2.

Sign up to request clarification or add additional context in comments.

9 Comments

Thanks for this valuable contribution. I didn't know about this DEFINE stuff. I've got some questions: Does the regex end with /x and incorporate the whole DEFINE section? Does /x mean extension? Is white space not significant? What does \z mean? You have a + at the end of <sum> and <product>, what does it mean? You can't combine + and * like that because it violates precedence: 1*2+3*4 will then be parsed at the first * which is wrong.
You're probably right about the ^ at the start of my regex and $ at the end, but if I drop them, how do I match the whole string and not just a part of it? My string matches if I drop them but then even A(1+2)*(3+4)B matches!
Re "Does the regex end with /x and incorporate the whole DEFINE section? Does /x mean extension? Is white space not significant?" /x makes whitespace insignificant in the entire pattern except in character classes ([]) and when escaped. See perlre.
Re "What does \z mean?", End of string. ($ does not mean end of string.) See perlre.
Re "You have a + at the end of <sum> and <product>", An optimization. Prevents the * is modifies from attempting to match fewer times during backtracking. See perlre.
|
1

ikegami's workaround above with the DEFINE stuff is beautiful, but it doesn't answer the question how to do it my way. A minimal change of my code to make it work? ikegami is right, the cause of no match is the ^ in /^$expr$/ . When the parser reenters the regex recursively it again checks for beginning of string, which fails. So I cannot have ^ and $ in the regex it seems. Without them my string matches. But then some invalid strings match too, like A(1+2)*(3+4)B . In the absence of ^ and $ it doesn't necessarily match the whole string. Problem.

ikegami suggested a solution to this in a comment above. I'll just write it out. I have tested it and it works:

$number = '\d+';
$atom = "($number|\\((?1)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^($expr)$/)
{
    print "OK";
}

Notice that I now have (?1) instead of (?R) and that I have enclosed $expr in brackets. (?1) refers to the first capture group, which is ($expr). So the recursion reenters this subex instead of the whole regex. ^ is not met again. That solves it.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.