Using the solution from "How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?", I was able to capture the prefix and the values of the desired pattern, which consists of a CAPITALIZED.PREFIX followed by values within angle brackets: < "value1" , "value2", ... >

"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

However, I run into problems when I have strings like the one above. The desired output would be:

('ORTH.FOO', ['cali.ber,kl','calf','done'])
('ORHT2BAR', ['what so ever >', 'this that mess < up'])
('JOKE', ['whatthe ', 'what'])

I have tried the following, but it only gives me the first tuple. How do I get all possible tuples, as in the desired output?

import re
intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()
names = re.findall('[\'"](.*?)["\']', v)
print f, names
  • Regular expressions cannot capture information recursively. You'll have to parse the content twice instead. Commented Aug 12, 2013 at 9:18
  • So I have to parse until I reach the character index of the first capture, then reparse from that index to the end of the string, and do that recursively until groups() returns None? Commented Aug 12, 2013 at 9:20
  • As Marijn said, your input isn't a regular language, so you can't use regular expressions. Just write a small state machine for parsing the input; it shouldn't be more than 20-something lines... Commented Aug 12, 2013 at 9:21
  • I'm not sure why re.findall is not capturing everything on my machine, but this regex is working on regex101. Otherwise, re.findall is extracting the first two parts of your desired output on my machine. Commented Aug 12, 2013 at 10:35

2 Answers

Huh silly me. Somehow, I wasn't testing the whole string on my machine ^^;

Anyway, I used this regex and it works; you get the results you were looking for in a list, which I guess is okay. I'm not too good at Python and don't know how to transform this list into an array or tuple:

>>> import re
>>> intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""
>>> results = re.findall('\\n .*?([A-Z0-9\.]*) < *((?:[^>\n]|>")*) *>.*?(?:\\n|$)', intext)
>>> print results
[('ORTH.FOO', '"cali.ber,kl", \'calf\', "done"'), ('ORHT2BAR', '"what so ever>", "this that mess < up"'), ('JOKE', '\'whatthe \', "what" ')]

The parentheses indicate the first level elements and the single quotes the second level elements.
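For the remaining step of turning each captured string into a list, here is a minimal sketch. It assumes the findall output has the shape reported above (the `results` literal is copied from that output rather than recomputed) and borrows the standard `shlex` module to split the quoted values; `split_quoted` is a hypothetical helper name.

```python
import shlex

def split_quoted(body):
    """Split a string of comma-separated quoted values into a list of strings."""
    lexer = shlex.shlex(body, posix=True)  # posix=True strips the quotes
    lexer.whitespace += ','                # treat commas between values as separators
    return list(lexer)

# Copied from the findall output reported above (not recomputed here).
results = [
    ('ORTH.FOO', '"cali.ber,kl", \'calf\', "done"'),
    ('ORHT2BAR', '"what so ever>", "this that mess < up"'),
    ('JOKE', '\'whatthe \', "what" '),
]

# Pair each prefix with its parsed list of values.
pairs = [(prefix, split_quoted(body)) for prefix, body in results]
for pair in pairs:
    print(pair)
```

Commas and angle brackets inside the quotes survive because `shlex` only treats them as separators outside quoted text.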

5 Comments

Interesting that you don't need the re.DOTALL flag, because you put the \n into the regex.
@2er0 Well, it seemed that \n is not inserting newlines in the intext, so I matched the literal \n instead. And the \n was left over from when I was testing some stuff out and forgot to remove it, oops! Not that it hinders the regex in any way, though. I was trying the regex on an intext with \n as true newlines when I put the \n there.
The regex without the extra \n: \\n .*?([A-Z0-9\.]*) *< *((?:[^>]|>")*) *>.*?(?:\\n|$) and the demo. Also, maybe worth noting that I'm explicitly allowing >" within the < ... > part. Not sure if this might cause a problem, but the pattern seems to be that the 'true' > is followed shortly by a comma (with optional space in between).
Yeah, the "true" > is signalled by a comma (with or without a space). Care to explain the part with the non-capturing group, .*?(?:\\n|$) ?
@2er0 Sure. You can remove it on regex101 (the link in my previous comment named 'demo') and see what happens. It is basically there to ensure that the stuff being matched is between \n (or at the end of the string, since there would be no \n there). You'll observe that without it, <20>",\nLOOSE.SCREW "> is considered as one match (note that ([A-Z0-9\.]*) can be absent too). Actually, I just found a shortcoming which might or might not cause problems. Do the things you are matching alternate? See this edited regex.
Python's re module does not support 'recursive' parsing. Capture the group between the < and > characters with a regular expression first, then process that group separately.

The shlex module would do nicely here to parse your quoted strings:

import shlex
import re

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up">\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30" ]."""
pattern = re.compile(r'.*?([A-Z0-9\.]*) < ([^>]*) >.*', flags=re.DOTALL)
f, v = pattern.match(intext).groups()

parser = shlex.shlex(v, posix=True)
parser.whitespace += ','
names = list(parser)

print f, names

output:

ORTH.FOO ['cali.ber,kl', 'calf', 'done']
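The snippet above stops at the first match. To collect every tuple, one possible sketch is to iterate with `re.finditer` over a pattern that only accepts quoted strings between the angle brackets, so a `<` or `>` inside quotes is not mistaken for a delimiter. This is an illustration under assumptions, not the method used above: the names `QUOTED`, `BLOCK`, `ITEM` and `extract_lists` are hypothetical, values are assumed to contain no escaped quotes, and a prefix-like pattern inside a quoted string elsewhere in the input could still produce a false match.

```python
import re

# One quoted item: "..." or '...'; angle brackets inside the quotes are plain text.
QUOTED = r"\"[^\"]*\"|'[^']*'"

# PREFIX < item , item , ... >  where every item must be a quoted string.
BLOCK = re.compile(r"([A-Z0-9.]+)\s*<\s*((?:(?:%s)\s*,?\s*)+)>" % QUOTED)

# Captures the text inside either kind of quote.
ITEM = re.compile(r"\"([^\"]*)\"|'([^']*)'")

def extract_lists(text):
    """Return a (prefix, [values]) tuple for every PREFIX < ... > block in text."""
    found = []
    for match in BLOCK.finditer(text):
        prefix, body = match.groups()
        # Exactly one of the two groups is filled for each quoted item.
        values = [dq or sq for dq, sq in ITEM.findall(body)]
        found.append((prefix, values))
    return found

intext = """calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",\n ORHT2BAR <"what so ever >", "this that mess < up"> ,\n LKEYS.KEYREL.CARG "<20>",\nLOOSE.SCREW ">20 but <30"\n JOKE <'whatthe ', "what" >,\n THIS + ]."""

for pair in extract_lists(intext):
    print(pair)
```

On the sample string from the question, this should yield the ORTH.FOO, ORHT2BAR and JOKE tuples while skipping "<20>" and ">20 but <30", since neither is preceded by a prefix and an opening bracket.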

1 Comment

You can use the recursive pattern in the regex module (it's not supported in re), see stackoverflow.com/q/26385984/1240268... though I'm not sure if it helps in this (confusing) example of splitting.
