C# can match balanced text with a regular expression, although .NET does it with balancing groups rather than the (?1)-style recursion used below. The only problem is that, I think, it retains the outer match as a whole. Parsing the inner contents (between the parentheses) further needs a recursive function call, picking off the tokens each time.
I agree with @dasblinkenlight, though, about needing a decent parser. As he says, the complexity can quickly become considerable.
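The regex below is written for Perl. .NET has no (?1) recursion and no (?| branch reset, so a port would have to use balancing groups and explicitly numbered or named groups instead, but the general approach should carry over.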
As you can see, the regex is like a sieve in that the general form is adhered to, but
only commas and digits are handled between Math tokens, allowing the rest to fall through.
But if this is the only thing you care about, then it should work. You'll notice that even though you can parse the text into a data structure (as below), using that structure internally requires yet another recursive "parse" of the data structure (albeit an easier one). If it is only for display or statistical purposes, then it's not a problem.
The expanded regex:
{
   (                              #1 - Recursion group 1
      \b(\w+)\s*                  #2 - Math token
      \(                          #  - Open parenthesis
      (                           #3 - Capture between parentheses
         (?: (?> (?: (?!\b\w+\s*\(|\)) . )+ )   # - Get all up to next math token or close parenthesis
           | (?1)                 #  - OR, recurse group 1
         )*                       #  - Optionally do many times
      )                           #  - End capture 3
      \)                          #  - Close parenthesis
   )                              #  - End recursion group 1
   \s*(\,?)                       #4 - Capture optional comma ','
 |                                # OR,
                                  # (Here, it is only getting commas and digits, ignoring the rest.
                                  #  Commas ',' are escaped to make them stand out)
   \s*
   (?|                            #  - Start branch reset
        (\d+)\s*(\,?)             #5,6 - Digits then optional comma ','
      | (?<=\,)()\s*(\,|\s*$)     #5,6 - Comma behind. No digit then, comma or end
   )                              #  - End branch reset
}xs; # Options: expanded, single-line
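The recursion construct is easier to see in isolation. Here is a minimal sketch, assuming Perl 5.10 or later; the pattern and the sample string are made up for illustration and are not part of the regex above:

```perl
use strict;
use warnings;

# Group 1 matches one balanced (...) span; (?1) re-enters group 1
# whenever a nested span starts.
my $balanced = qr{
    (                   # 1 - recursion group
        \(
        (?: [^()]++     #   a run of non-parenthesis characters
          | (?1)        #   or a nested balanced span
        )*
        \)
    )
}x;

my $text = 'Sum(3, Multiply(2, 4), 5) trailing )';
my ($span) = $text =~ $balanced;
print "$span\n";   # prints "(3, Multiply(2, 4), 5)"
```

Only the outermost span comes back as a capture, which is why the inner contents have to be fed through the regex again by a recursive function.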
Here is a rapid prototype in Perl (easier than C#):
use Data::Dumper;
#//
my $regex = qr{(\b(\w+)\s*\(((?:(?>(?:(?!\b\w+\s*\(|\)).)+)|(?1))*)\))\s*(\,?)|\s*(?|(\d+)\s*(\,?)|(?<=\,)()\s*(\,|\s*$))}s;
#//
my $sample = ', asdf Multiply(9, 4, 3, hello, _Sum(3,5,4,) , Division(4, Sum(3,5,4), 5), ,, Subtract(7,8,9))';
print_math_toks( 0, $sample );
my @array;
store_math_toks( 0, $sample, \@array );
print Dumper(\@array);
#//
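sub print_math_toks
{
    my ($cnt, $segment) = @_;
    while ($segment =~ /$regex/g )
    {
        if (defined $5) {
            # Digit (or empty) item: ignore anything outside a math token
            next if $cnt < 1;
            print "\t"x($cnt+1), "$5$6\n";
        }
        else {
            # Math token: print its name, then recurse into its contents
            ++$cnt;
            print "\t"x$cnt, "$2(\n";
            my $post = $4;    # copy the trailing comma before the recursive call matches again
            $cnt = print_math_toks( $cnt, $3 );
            print "\t"x$cnt, ")$post\n";
            --$cnt;
        }
    }
    return $cnt;
}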
sub store_math_toks
{
    my ($cnt, $segment, $ary) = @_;
    while ($segment =~ /$regex/g )
    {
        if (defined $5) {
            # Digit (or empty) item: ignore anything outside a math token
            next if $cnt < 1;
            if (length $5) {
                push (@$ary, $5);
            }
            else {
                push (@$ary, '');    # empty slot between commas
            }
        }
        else {
            # Math token: store { name => [ items ] } and fill the list recursively
            ++$cnt;
            my %hash;
            $hash{$2} = [];
            push (@$ary, \%hash);
            $cnt = store_math_toks( $cnt, $3, $hash{$2} );
            --$cnt;
        }
    }
    return $cnt;
}
Output:
    Multiply(
        9,
        4,
        3,
        _Sum(
            3,
            5,
            4,
        ),
        Division(
            4,
            Sum(
                3,
                5,
                4
            ),
            5
        ),
        ,
        ,
        Subtract(
            7,
            8,
            9
        )
    )
$VAR1 = [
  {
    'Multiply' => [
      '9',
      '4',
      '3',
      {
        '_Sum' => [
          '3',
          '5',
          '4',
          ''
        ]
      },
      {
        'Division' => [
          '4',
          {
            'Sum' => [
              '3',
              '5',
              '4'
            ]
          },
          '5'
        ]
      },
      '',
      '',
      {
        'Subtract' => [
          '7',
          '8',
          '9'
        ]
      }
    ]
  }
];
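Using the stored structure still takes one more recursive walk over it. Below is a rough sketch of such a walk, assuming the tokens compute what their names suggest (nothing above actually defines them). The %ops table, eval_node, and the small sample tree are all hypothetical; the tree only mimics the shape that store_math_toks builds.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Assumed semantics for a few token names (hypothetical, for illustration only).
my %ops = (
    Sum      => sub { sum0(@_) },
    Multiply => sub { my $p = 1; $p *= $_ for @_; $p },
    Subtract => sub { my ($first, @rest) = @_; $first - sum0(@rest) },
);

sub eval_node {
    my ($node) = @_;
    return $node unless ref($node) eq 'HASH';   # leaf: a digit string
    my ($name) = keys %$node;                   # each hash holds exactly one token
    my $op = $ops{$name} or die "Unknown math token: $name\n";
    # Drop the '' placeholders left by empty slots, then evaluate the rest.
    my @args = map  { eval_node($_) }
               grep { ref($_) || length($_) } @{ $node->{$name} };
    return $op->(@args);
}

# Same shape as one element of the dumped array: { name => [ items ] }
my $tree = {
    Multiply => [ '2', { Sum => [ '3', '5', '4' ] }, '', { Subtract => [ '9', '3', '1' ] } ],
};
print eval_node($tree), "\n";   # prints 120
```

Skipping the empty strings is a design choice made for this sketch; a stricter evaluator might treat an empty slot as an error instead.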