
I have a string that looks something like:

" 'a 'b '(d f g (1 2)) '(3 4) (a d) d "

And what I am trying to do is match so I get this output:

'a, 'b, '(d f g (1 2)), '(3 4), (a d), d

I am currently using:

"'\(.*\)|\(\.*\)|'\w+|\w+"

But there is a problem I've run into using this. For example, if I write

'(a b c) (d f)

it will return

'(a b c) (d f)

instead of

'(a b c), (d f)

So my question is whether there is a way to solve this with regex, or do I have to solve it another way?

  • Since it can't be parsed by regex alone, do you have a preferred language for an alternate solution? Commented Apr 1, 2012 at 22:21
  • @cbuckley , I am currently writing the program in java. Commented Apr 1, 2012 at 22:24
  • @warbio, I updated my answer with a proposed algorithm. It's a common approach for working with bracket structures. Commented Apr 1, 2012 at 22:32

3 Answers

4

The answer is no.

The language you are trying to parse is not regular, it's context-free. So you are not able to parse it with regex.

If you're interested, here is the grammar:

 S->SS|e;
 S->'(A);
 A-> AA|(A)|w+;

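It's not regular because you can't build a finite-state machine to recognize it: once bracket structures can be nested recursively to arbitrary depth, a fixed number of states is not enough to keep track of how many brackets are still open.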

Well, whatever. Let's answer the question "How?". Traverse the string from the first character. Once you find an apostrophe, start counting brackets. An opening bracket counts as +1, a closing bracket as -1. Once you hit a closing bracket that brings the counter back to zero, insert a comma after that bracket. Problem solved:

 'a 'b '(d f g (1 2)) '(3 4) (a d) d
        |      |   ||
        |      |   |+-- counter = 0 on closing bracket, insert comma
        |      |   +--- counter = 1
        |      +------- counter = 2
        +-------------- start counting, counter = 1

etc.
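
Since the asker is working in Java, here is a minimal sketch of the counting idea. It is not the answerer's exact procedure: as an assumption, it also treats whitespace at depth 0 as a separator, so that bare atoms such as 'a and d are split too. The class name TopLevelTerms and the methods split and withCommas are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original answer.
final class TopLevelTerms {

    // Splits the input on whitespace that lies outside all parentheses.
    static List<String> split(String input) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of parentheses currently open

        for (char c : input.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    throw new IllegalArgumentException("Unbalanced ')' in: " + input);
                }
            }
            if (Character.isWhitespace(c) && depth == 0) {
                // Whitespace at depth 0 ends the current term, if there is one.
                if (current.length() > 0) {
                    terms.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Everything else, including spaces inside brackets, is kept.
                current.append(c);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unbalanced '(' in: " + input);
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Joins the top-level terms with ", " as in the desired output.
    static String withCommas(String input) {
        return String.join(", ", split(input));
    }
}
```

With this sketch, TopLevelTerms.withCommas("'(a b c) (d f)") gives '(a b c), (d f). Note that it relies on whitespace between terms, so adjacent groups written without a space, such as (a)(b), would stay together as one term.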


5 Comments

Alright, I guess I have to do it another way. Thanks for the quick answer.
Most regex flavors are not regular. Matching context-free languages is no problem for PCRE. Example stackoverflow.com/questions/7434272/…
@Qtax in fact, it's rather a bad practice to call something a "regex" that is not in fact a "regex" :( Although I agree that my answer may be not precise in the real world, it's absolutely right in academic context.
But @sudd, this is not an academic context. In fact, we call them "regexes" to emphasize that we're not talking about theory-pure regular expressions. Many regex flavors can even handle nested structures like the one in this question, though Java doesn't happen to be one of them.
@AlanMoore As I already said, I agree with this point of view.
0

If you are using PCRE or the like, you could use an expression like:

'?(?:\w+|(\((?:[^()]+|(?1))*\)))


0

If I understand correctly, you want to add a comma before every space, except in parentheses. Is that right?

If so, there might be a way to do it in regex using lookaheads and lookbehinds, but it's going to get messy fast. Better to split up all the terms first and then add commas just after the ones you want.
