7

I came up with a regex string that parses the given text into 3 categories:

  • in parentheses
  • in brackets
  • neither.

Like this:

\[.+?\]|\(.+?\)|[\w+ ?]+

My intention is to use the outermost operator only. So, given a(b[c]d)e, the split is going to be:

a || (b[c]d) || e

It works fine given parentheses inside brackets, or brackets inside parentheses, but breaks down when there are brackets inside brackets and parentheses inside parentheses. For example, a[b[c]d]e is split as

a || [b[c] || d || ] || e.

Is there any way to handle this using regex alone, not resorting to using code to count number of open/closed parentheses? Thanks!

8
  • Why can't you create your own parser? Commented Jun 29, 2013 at 20:43
  • 2
    Which language are you using for this? Regular expressions can (in theory) not parse nested structures. If you are using .NET or Perl/PCRE, you might be lucky though, because they have some advanced features that can. Commented Jun 29, 2013 at 20:44
  • The language of nested parentheses is not regular. Hence 'regular' expressions (in the mathematical sense of the term) are not up to the job. period. Commented Jun 29, 2013 at 20:45
  • Rohit - I am trying to create a parser for a context free grammar (JSGF), for which I need to split the text into the three categories. Then my script goes through them recursively one nested level at a time. M, I am using python. Which I guess puts me in the unfortunate category. Commented Jun 29, 2013 at 20:48
  • @MinasAbovyan just split your string into the matches of something like [[\]()]|[^[\]()]+ (the brackets in question and anything else). then walk the matches, incrementing the relevant depth counters when encountering each bracket type. Commented Jun 29, 2013 at 20:50

2 Answers 2

12

Standard1 regular expressions are not sophisticated enough to match nested structures like that. The best way to approach this is probably to traverse the string and keep track of opening / closing bracket pairs.


1 I said standard, but not all regular expression engines are indeed standard. You might be able to this with Perl, for instance, by using recursive regular expressions. For example:

$str = "[hello [world]] abc [123] [xyz jkl]";

my @matches = $str =~ /[^\[\]\s]+ | \[ (?: (?R) | [^\[\]]+ )+ \] /gx;

foreach (@matches) {
    print "$_\n";
}
[hello [world]]
abc
[123]
[xyz jkl]

EDIT: I see you're using Python; check out pyparsing.

Sign up to request clarification or add additional context in comments.

Comments

2

Well, once you abandon the idea that parsing nested expressions should work at unlimited depth, one can use regular expressions just fine by specifying a maximum depth in advance. Here is how:

def nested_matcher (n):
    # poor man's matched paren scanning, gives up after n+1 levels.
    # Matches any string with balanced parens or brackets inside; add
    # the outer parens yourself if needed.  Nongreedy.  Does not
    # distinguish parens and brackets as that would cause the
    # expression to grow exponentially rather than linearly in size.
    return "[^][()]*?(?:[([]"*n+"[^][()]*?"+"[])][^][()]*?)*?"*n

import re

p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print p.findall('a(b[c]d)e')
print p.findall('a[b[c]d]e')
print p.findall('[hello [world]] abc [123] [xyz jkl]')

This will output

['a', '(b[c]d)', 'e']
['a', '[b[c]d]', 'e']
['[hello [world]]', ' abc ', '[123]', ' ', '[xyz jkl]']

1 Comment

This search actually deleted all of my input EXCEPT for parentheses themsel\/es

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.