I am trying to write a regex to split a string into what I call 'terms' (e.g. words, numbers, and surrounding spaces) and 'logical operators' (e.g. <AND, &>, <OR, |>, <NOT,-,~>, <(,{,[,),},]>). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and grouping is just with '(' and ')'.

For example:

Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)

should be split into this Python list:

["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]

My code:

import re

pattern = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
t = re.split(pattern, text)
raw_terms = list(filter(None, t))

The pattern works for the example above and for other test cases, such as this one,

NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']

but not these:

NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']

I have tried changing the two \s+ to \s*, but then not all test cases passed. I am not a regex expert (this is the most complicated regex I have tried).

I am hoping someone can help me understand why these two test cases fail, and how to fix the regex so all the test cases pass.

Thanks,

Mark

1 Answer

Use

re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)

See regex proof.

Explanation

--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      AND                      'AND'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      OR                       'OR'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      NOT                      'NOT'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [()]                     any character of: '(', ')'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
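
The breakdown shows a capturing group wrapped around a non-capturing one. The reason this matters for re.split is that text matched by a capturing group is kept in the result list, while a separator matched without any capturing group is dropped. A minimal sketch of the difference, using a made-up input string and variable names chosen for illustration:

```python
import re

text = 'a AND b'  # hypothetical input, not from the question

# Outer group is capturing: the matched operator is kept in the result.
with_capture = re.split(r'\s*(\b(?:AND|OR|NOT)\b)\s*', text)

# No capturing group at all: the operator only acts as a separator and is dropped.
without_capture = re.split(r'\s*\b(?:AND|OR|NOT)\b\s*', text)

print(with_capture)     # ['a', 'AND', 'b']
print(without_capture)  # ['a', 'b']
```

The surrounding \s* sits outside the capturing group, which is why the whitespace around each operator is consumed but never returned.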

Python code:

import re
string = 'Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)'
output = re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
output = list(filter(None, output))
print(output)

Results: ['Frank and Bob are nice', 'AND', 'NOT', '(', 'Henry is good', 'OR', 'Sam is 102 years old', ')']
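
As a quick check, the same pattern can be run against the other test cases from the question, including the two that failed with the original regex. This is only a sketch: the helper name split_terms is an assumption introduced here, not something from the question.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*')

def split_terms(text):
    """Split text into terms and operators, dropping empty strings."""
    return [piece for piece in PATTERN.split(text) if piece]

print(split_terms('NOT Frank'))                        # ['NOT', 'Frank']
print(split_terms('NOT Frank is a good boy'))          # ['NOT', 'Frank is a good boy']
print(split_terms('NOT Frank is a good boy AND Joe'))  # ['NOT', 'Frank is a good boy', 'AND', 'Joe']
```

The match is case-sensitive, so the lowercase "and" in "Frank and Bob are nice" stays inside the term.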

5 Comments

Great solution! I don't completely understand how it works. If you group but don't capture the AND/OR/NOT/(), how does it end up in the output list? Regexes make my brain hurt....
@user1045680 (\b(?:AND|OR|NOT)\b|[()]) is a capturing group, and re.split includes the text matched by capturing groups in the result list.
I tried adding back in the alternate logical operators (&, |, ~, -) and alternate grouping characters ([]{}), and all the tests with the alternate characters failed. I tried \s*(\b(?:AND|OR|NOT|&|\||-|~)\b|[(){}[]])\s*. What am I missing?
I played around with your solution and was able to add back the alternate characters. The new regex is \s*(\b(?:AND|OR|NOT)\b|[()&\~\-\|{}\[\]])\s*
@user1045680 Use \s*(\b(?:AND|OR|NOT)\b|[][()&~|{}-])\s*, see proof.
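
The extended pattern from the last comment can be tried the same way. The attempt with the symbols inside \b...\b fails because \b needs a word character on one side, and there is none between a space and a symbol such as &. In addition, [(){}[]] is read as the class [(){}[] followed by a literal ]. A minimal sketch, assuming a made-up test string and a hypothetical helper name split_terms:

```python
import re

# Pattern copied from the last comment: word operators keep their \b guards,
# single-character symbols go into one character class.
PATTERN = re.compile(r'\s*(\b(?:AND|OR|NOT)\b|[][()&~|{}-])\s*')

def split_terms(text):
    """Split text into terms and operators, dropping empty strings."""
    return [piece for piece in PATTERN.split(text) if piece]

# \b cannot match between a space and '&', so this search finds nothing.
print(re.search(r'\b&\b', 'a & b'))  # None

print(split_terms('~Frank & [Joe | {Sam AND Bob}]'))
# ['~', 'Frank', '&', '[', 'Joe', '|', '{', 'Sam', 'AND', 'Bob', '}', ']']
```

One consequence of putting - in the character class is that a hyphen inside a term (for example "well-known") is also split out as an operator, so terms containing hyphens would need separate handling.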
