I am trying to write a regex that splits a string into what I call 'terms' (e.g. words, numbers, and the spaces around them) and 'logical operators': AND (or &), OR (or |), NOT (or - or ~), and the grouping characters (, {, [, ), }, ]. For this question, we can ignore the alternative symbols for AND, OR, and NOT, and assume grouping uses only '(' and ')'.
For example:
Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)
should be split into this Python list:
["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]
My code:
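import re
text = "Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)"
pattern = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
t = re.split(pattern, text)
# filter(None, ...) drops the None and empty-string entries in re.split's result
raw_terms = list(filter(None, t))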
The pattern works for the example above and for others, such as:
NOT Frank is a good boy AND Joe
raw_terms = ['NOT', 'Frank is a good boy', 'AND', 'Joe']
but not these:
NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms = ['NOT Frank is a good boy']
I have tried changing the two \s+ to \s*, but then not all of the test cases pass. I am not a regex expert; this is the most complicated pattern I have attempted.
I am hoping someone can help me understand why these two test cases fail, and how to fix the regex so all the test cases pass.
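For comparison, a simpler direction might be to split on the operators alone and treat whatever lies between them as a term. The following is only a minimal sketch under the simplifications stated above: operators are the uppercase whole words AND, OR, and NOT, and grouping uses only '(' and ')'. The names OPERATOR and split_terms are made up for illustration and are not part of the pattern in my code.

```python
import re

# Hypothetical helper pattern: one capture group holding a single operator or
# parenthesis. Because the delimiter is captured, re.split keeps it in the
# result. The \s* on both sides consumes whitespace next to the operator.
# \b keeps words such as "NOTE" from matching, and the match is
# case-sensitive, so lowercase "and" stays inside a term.
OPERATOR = re.compile(r"\s*(\bAND\b|\bOR\b|\bNOT\b|[()])\s*")


def split_terms(text):
    """Split text into terms and operators (illustrative sketch only)."""
    parts = OPERATOR.split(text)
    # Adjacent operators leave empty strings between them; drop those.
    return [p.strip() for p in parts if p.strip()]


print(split_terms("NOT Frank is a good boy"))
# ['NOT', 'Frank is a good boy']
```

Under these assumptions, 'NOT Frank' would come out as ['NOT', 'Frank'], since the pattern does not require an AND or OR to be present before anything is split.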
Thanks,
Mark