Recursively capture patterns in regex - Python

Question

Given the solution in "How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?", I was able to capture the prefix and the values of the desired pattern, which consists of a CAPITALIZED.PREFIX followed by values within angle brackets: < "value1" , "value2", ... >

"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

However, I run into problems when I have strings like the one above. The desired output would be:

('ORTH.FOO', ['cali.ber,kl','calf','done'])
('ORHT2BAR', ['what so ever >', 'this that mess < up'])
('JOKE', ['whatthe ', 'what'])

I have tried the following, but it only gives me the first tuple. How do I get all possible tuples, as in the desired output?

import re
intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()
names = re.findall('[\'"](.*?)["\']', v)
print(f, names)

Regular expressions cannot capture information recursively. You'll have to parse the content twice instead.
– Martijn Pieters, Commented Aug 12, 2013 at 9:18
So I have to parse until I reach the character index of the first capture, then reparse from that index to the end of the string, and repeat that until my groups() returns None?
– alvas, Commented Aug 12, 2013 at 9:20
As Martijn said, your input isn't a regular language, so you can't use regular expressions. Just write a small state machine for parsing the input; it shouldn't be more than 20-something lines...
– l4mpi, Commented Aug 12, 2013 at 9:21
I'm not sure why re.findall is not capturing everything on my machine, but this regex is working on regex101. Otherwise, re.findall is extracting the first two parts of your desired output on my machine.
– Jerry, Commented Aug 12, 2013 at 10:35

Jerry · Accepted Answer · 2013-08-12 11:57:01Z

1

Huh silly me. Somehow, I wasn't testing the whole string on my machine ^^;

Anyway, I used this regex and it works; you get the results you were looking for in a list, which I guess is okay. I'm not too good at Python and don't know how to transform this list into an array or tuple:

>>> import re
>>> intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""
>>> results = re.findall('\\n .*?([A-Z0-9\.]*) < *((?:[^>\n]|>")*) *>.*?(?:\\n|$)', intext)
>>> print(results)
[('ORTH.FOO', '"cali.ber,kl", \'calf\' , "done" '), ('ORHT2BAR', '"what so ever >", "this that mess < up"'), ('JOKE', '\'whatthe \', "what" ')]

The parentheses indicate the first level elements and the single quotes the second level elements.
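For the last step (turning each captured string into a list of values), here is a minimal sketch. It assumes the pairs look like the list printed above; `QUOTED` and `split_values` are names made up for this illustration. The pattern requires the closing quote to be the same character as the opening one, so an apostrophe inside a double-quoted value does not end the value early.

```python
import re

# Pairs shaped like the re.findall output above (copied here so the sketch runs on its own)
results = [
    ('ORTH.FOO', '"cali.ber,kl", \'calf\' , "done" '),
    ('ORHT2BAR', '"what so ever >", "this that mess < up"'),
    ('JOKE', '\'whatthe \', "what" '),
]

# An opening quote, the shortest run of text, then the same quote again (\1)
QUOTED = re.compile(r"""(["'])(.*?)\1""")


def split_values(raw):
    """Return the quoted items found in raw, without their quotes."""
    return [value for _quote, value in QUOTED.findall(raw)]


pairs = [(prefix, split_values(raw)) for prefix, raw in results]
for pair in pairs:
    print(pair)
```

Each printed line is then a (prefix, list) tuple in the shape the question asks for.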

answered Aug 12, 2013 at 11:57

Jerry


5 Comments

alvas Over a year ago

Interesting that you don't need the re.DOTALL flag, because you put the \n into the regex.

Jerry Over a year ago

@2er0 Well, it seemed that \n is not inserting newlines in the intext, so I matched the literal \n instead. And the \n was actually when I was testing some stuff out and forgot to remove it, oops! Not that it hinders the regex in any way though. I was trying the regex on an intext with \n as true new lines when I put the \n there.

Jerry Over a year ago

The regex without the extra \n: \\n .*?([A-Z0-9\.]*) *< *((?:[^>]|>")*) *>.*?(?:\\n|$) and the demo. Also, maybe worth noting that I'm explicitly allowing >" within the < ... > part. Not sure if this might cause a problem, but the pattern seems to be that the 'true' > is followed shortly by a comma (with an optional space in between).

alvas Over a year ago

Yeah, the "true" > is signalled by a comma (with or without a space). Care to explain the part with the 'non-capturing group', .*?(?:\\n|$) ?

Jerry Over a year ago

@2er0 Sure. You can remove it on regex101 (the link in my previous comment named 'demo') and see what happens. It is basically there to ensure that the stuff being matched is between \n (or at the end of the string, since there would be no \n there). You'll observe that without it, <20>",\nLOOSE.SCREW "> is considered as one match (note that ([A-Z0-9\.]*) can match nothing, so the prefix can be absent too). Actually, I just found a shortcoming which might or might not cause problems. Do the things you are matching alternate? See this edited regex.

Martijn Pieters · Accepted Answer · 2013-08-12 09:28:46Z

1

Regular expressions do not support 'recursive' parsing. Process the group between the < and > characters after capturing it with a regular expression.

The shlex module would do nicely here to parse your quoted strings:

import shlex
import re

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
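# match() returns only the first PREFIX < ... > group
f, v = pattern.match(intext).groups()

# Split the captured text on whitespace and commas, honouring quotes
parser = shlex.shlex(v, posix=True)
parser.whitespace += ','
names = list(parser)

print(f, names)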

output:

ORTH.FOO ['cali.ber,kl', 'calf', 'done']
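To get every tuple rather than only the first, one option is to swap `pattern.match` for `re.finditer` and keep `shlex` for the inner split. The following is a sketch under assumptions, not a general parser: `TOKEN` and `extract_lists` are names made up here, prefixes are assumed to consist of capital letters, digits and dots, and quoted strings are assumed to contain no escaped quotes. The pattern consumes quoted strings as whole units, so a `<` or `>` inside quotes neither starts nor ends a list.

```python
import re
import shlex

# One alternation that tokenizes the input left to right:
#   quoted      -> a quoted string outside any list, consumed so its < or > is ignored
#   key + body  -> PREFIX < ... >, where > inside a quoted item does not end the list
TOKEN = re.compile(r"""
    (?P<quoted> "[^"]*" | '[^']*' )
  | (?P<key> [A-Z0-9.]+ ) \s* <
    (?P<body> (?: "[^"]*" | '[^']*' | [^<>"'] )* )
    >
""", re.VERBOSE)


def extract_lists(text):
    """Return [(prefix, [values...]), ...] for every PREFIX < ... > group in text."""
    pairs = []
    for m in TOKEN.finditer(text):
        if m.group("key") is None:
            continue  # a stray quoted string such as "<20>"; nothing to extract
        parser = shlex.shlex(m.group("body"), posix=True)
        parser.whitespace += ","  # treat commas between items as separators
        pairs.append((m.group("key"), list(parser)))
    return pairs


intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

for pair in extract_lists(intext):
    print(pair)
```

On the longer sample string from the question, this prints the ORTH.FOO, ORHT2BAR and JOKE tuples and skips `"<20>"` and `">20 but <30"`, because those are consumed as plain quoted strings.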

answered Aug 12, 2013 at 9:28

Martijn Pieters


1 Comment

Andy Hayden Over a year ago

You can use the recursive pattern in the regex module (it's not supported in re), see stackoverflow.com/q/26385984/1240268... though I'm not sure if it helps in this (confusing) example of splitting.
