I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:

my $regex = qr/
(                   #save $1
    sp\s+           #start Soar production
    (               #save $2
        \{          #opening brace
        [^{}]*      #anything but braces
        \}          #closing brace  
        | (?1)      #or nested braces
    )+              #0 or more
)
/x;

I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?

2 Answers

6

There are numerous problems. The recursive bit should be:

(
   (?: \{ (?-1) \}
   |   [^{}]++
   )*
)

All together:

my $regex = qr/
   sp\s+
   \{
      (
         (?: \{ (?-1) \}
         |   [^{}]++
         )*
      )
   \}
/x;

print "$1\n" if 'sp { { word } }' =~ /($regex)/;
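To see how the pattern behaves on more than one production, here is a small sketch that scans a string with /g. The sample text and the names $text and @found are made up for illustration. Because (?-1) is a relative reference, the pattern should still recurse into its own group when wrapped in an extra pair of capturing parentheses.

```perl
use strict;
use warnings;

# Same pattern as above: (?-1) recurses into the group that holds it,
# and [^{}]++ is possessive so failed matches give up quickly.
my $regex = qr/
   sp\s+
   \{
      (
         (?: \{ (?-1) \}
         |   [^{}]++
         )*
      )
   \}
/x;

# Hypothetical input: two balanced productions, then an unbalanced one.
my $text = 'sp { a { b } } junk sp { c } sp { d { e }';

my @found;
while ($text =~ /($regex)/g) {
    push @found, $1;    # $1 is the whole production, braces included
}

print "$_\n" for @found;
```

The last production is missing a closing brace, so it is skipped rather than partially matched.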

6 Comments

As near as I can tell, the regex doesn't allow spaces around the braces (sorry for the rhyme) so the test case should fail. What's up with that?
Hmmm... This ends up taking forever for some partial matches, like this: sp {word{(aaaaaaaaaaaaaaaaaaaaaaaaaaaaa)}.
@NateGlenn see my solution, also what do you want to happen on that partial match, is that a failure or would you have it return the remainder of the string?
@JoelBerger that one should not match. The rest of the matches should be returned.
Speed issue fixed by changing [^{}]+ to [^{}]++.
6

This is a case for the underused Text::Balanced, a very handy core module for this kind of thing. It relies on the pos of the string already pointing at the start of the delimited sequence, so I typically invoke it like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Text::Balanced 'extract_bracketed';

sub get_bracketed {
  my $str = shift;

  # seek to beginning of bracket
  return undef unless $str =~ /(sp\s+)(?=\{)/gc;

  # store the prefix
  my $prefix = $1;

  # get everything from the start brace to the matching end brace
  my ($bracketed) = extract_bracketed( $str, '{}');

  # no closing brace found
  return undef unless $bracketed;

  # return the whole match
  return $prefix . $bracketed;
}

my $str = 'sp { { word } }';

print get_bracketed $str;

Matching with the g modifier in scalar context records where the match ended in pos($str) (the c modifier keeps that position even if a later match fails), and extract_bracketed starts extracting from that position.
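The same idea extends to pulling out every production in a string. This is only a sketch: the sample text and the names $text and @found are invented for illustration, and it assumes, as the Text::Balanced documentation describes for list context, that extract_bracketed advances pos past the extracted text so the next //gc match resumes from there.

```perl
use strict;
use warnings;

use Text::Balanced 'extract_bracketed';

# Hypothetical input: two balanced productions, then an unbalanced one.
my $text = 'sp { a { b } } junk sp { c } sp { d { e }';

my @found;

# Each //gc match leaves pos($text) just before an opening brace.
while ($text =~ /(sp\s+)(?=\{)/gc) {
    my $prefix = $1;

    # Extract from pos($text) up to the matching closing brace.
    my ($bracketed) = extract_bracketed($text, '{}');

    # Stop at the first unbalanced production (a deliberately cautious choice).
    last unless defined $bracketed;

    push @found, $prefix . $bracketed;
}

print "$_\n" for @found;
```

Stopping with last keeps the loop simple; whether to skip an unbalanced production and carry on instead depends on what you want from malformed input.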

2 Comments

I really need to read up on this module. It comes up a lot, but I always prefer regex because I've already invested so much time in learning it, it's fun to learn more of, and seems more compact. Thanks for the answer!
@NateGlenn, it really is complementary to regexes, and especially to the //gc (parser) functionality. This is why it uses the pos of the string: you are expected to intermingle calls to the Text::Balanced functions with //gc matches.
