
I need to match patterns like the following: AAXX#

Where:
* AA is from a set (ie. a list) of 1-3 char alpha prefixes,
* XX is from a different list of pre-defined strings, and
* any single-digit numeral follows.

AA strings: ['bo','h','fr','sam','pe']

XX strings: cl + ['x','n','r','nr','eaner'] //OR ELSE JUST// ro

Desired Result: bool indicating whether any of the possible combos match the provided string.

Sample Test Strings:
item = "boro1" - i.e. bo + ro + 1
item = "samcl2" - i.e. sam + cl + 2
item = "hcln3" - i.e. h + cln + 3

The best I can figure is to use a loop, but I am having trouble with the essential regex. It works for the single-letter optionals cln, clx, clr, but not for the longer ones clnr, cleaner.

Code:

import re

item = "hclnr2" #h + clnr + 2
out = False
arr = ['bo','h','fr','sam','pe']
for mnrl in arr:
    myrx = re.escape(mnrl) + r'cl[x|n|r|nr|eaner]\d'
    thisone = bool(re.search(myrx, item))
    print('mnrl: '+mnrl+' - ', thisone)
    if thisone: out = True

##########################################################################
# SKIP THIS - INCLUDED IN CASE S/O HAS A BETTER SOLUTION THAN A SECOND LOOP
# THE ABOVE FOR-LOOP handled THE CL[opts] TESTS, THIS LOOP DOES THE RO TESTS
##########################################################################
#if not out: #If not found a match amongst the "cl__" options, test for "ro"
#    for mnrl in arr:
#        myrx = re.escape(mnrl) + r'ro\d'
#        thisone = bool(re.search(myrx, item))
#        print('mnrl: '+mnrl+' - ', thisone)
#    if thisone: out = True
##########################################################################

print('result: ', out)

PRINTS:

mnrl: bo - False
mnrl: h - False <======
mnrl: fr - False
mnrl: sam - False
mnrl: pe - False

However, changing item to:

item = "hcln2" #h + cln + 2

PRINTS:
mnrl: bo - False
mnrl: h - True <========
mnrl: fr - False
mnrl: sam - False
mnrl: pe - False

And ditto for item = hclr5 or item = hclx9 BUT NOT hcleaner9

  • Per the explanation, samcl2 shouldn't be matching (should return false). Or can XX strings be cl alone (without anything following it)? Commented Oct 22, 2018 at 18:42
  • Sorry, my bad. The answer is Yes, cl alone is enough to make the output True. So, samcl2 should be True. Merely cl by itself, is False, as is cl# without one of the desired prefixes. Commented Oct 22, 2018 at 18:49

2 Answers


My approach would be

import re

words = ['boro1', 'samcl2', 'hcln3', 'boro1+unwantedstuff']

p = r'(bo|h|fr|sam|pe)(cl(x|n|r|nr|eaner|)|ro)\d$'

for w in words:
    print(re.match(p, w))

Result:

<_sre.SRE_Match object; span=(0, 5), match='boro1'>
<_sre.SRE_Match object; span=(0, 6), match='samcl2'>    
<_sre.SRE_Match object; span=(0, 5), match='hcln3'>
None

For your desired boolean output, simply pass the result to bool(): a match object is always truthy, while None (no match) is falsy.
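Since the question starts from Python lists, here is a minimal sketch of how the same pattern could be built from those lists and wrapped in a boolean check. The names PREFIXES, CL_SUFFIXES and matches_code are hypothetical, chosen only for illustration, and re.fullmatch is used here in place of re.match with a trailing $.

```python
import re

# Hypothetical names; the values are the lists from the question.
PREFIXES = ['bo', 'h', 'fr', 'sam', 'pe']
CL_SUFFIXES = ['x', 'n', 'r', 'nr', 'eaner', '']  # '' allows a bare "cl"

# Join each list into an alternation, escaping every entry in case
# one of them ever contains a regex metacharacter.
prefix_alt = '|'.join(re.escape(p) for p in PREFIXES)
suffix_alt = '|'.join(re.escape(s) for s in CL_SUFFIXES)

# prefix, then ("cl" + optional suffix) or "ro", then one digit
PATTERN = re.compile(rf'(?:{prefix_alt})(?:cl(?:{suffix_alt})|ro)\d')

def matches_code(item):
    """Return True if the whole string is prefix + (cl.. | ro) + digit."""
    return bool(PATTERN.fullmatch(item))

for item in ['boro1', 'samcl2', 'hclnr2', 'hcleaner9', 'cl2', 'boro1+unwantedstuff']:
    print(item, matches_code(item))
```

Because the whole string has to match, the order of the alternatives does not matter here: the regex engine backtracks from n to nr when a digit does not follow.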



One of the misconceptions in your code concerns character classes (syntax: [ ... ]). A character class matches exactly one character out of the set it lists, and inside it | is a literal character rather than the alternation operator (the main exceptions are ^ and -, which have a special meaning when placed in specific positions). This means that:

[x|n|r|nr|eaner]

Will match any one character among: x, |, n, r, e, a (duplicated characters are essentially being discarded)
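To make the difference concrete, here is a small sketch comparing the character class from the question with a non-capturing group that uses real alternation. The variable names are only illustrative.

```python
import re

# The character class from the question: matches ONE character
# out of x, |, n, r, e, a.
char_class = re.compile(r'cl[x|n|r|nr|eaner]\d')

# A group with alternation: matches one of the listed strings.
group = re.compile(r'cl(?:x|n|r|nr|eaner)\d')

samples = ['hcln2', 'hclnr2', 'hcleaner9', 'hcl|2', 'hcla2']
results = {
    s: (bool(char_class.search(s)), bool(group.search(s)))
    for s in samples
}

for s, (with_class, with_group) in results.items():
    print(f'{s:10} class={with_class!s:5} group={with_group}')
```

The character class accepts hcl|2 and hcla2, which were never intended, and rejects hclnr2 and hcleaner9, because only one character is allowed between cl and the digit.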

I'm not entirely sure why you are doing all those intricate things like re.escape in your code, but I trust you can understand the snippet below and adapt it to your situation:

import re

def matchPattern(item, extract=False):
    result = re.match(r"(bo|h|fr|sam|pe)((?:cl(?:nr|eaner|[xnr]|))|ro)([0-9])$", item)
    if result:
        if extract:
            return (result.group(1), result.group(2), result.group(3))
        else:
            return True
    else:
        if extract:
            return ('','','')
        else:
            return False

I wrote the function so that you get a boolean if you call, for example, matchPattern("boro1"). If you want the substring components instead, call matchPattern("boro1", True) and you will get ('bo', 'ro', '1') as the result (or ('', '', '') if the string doesn't match).

As for the regex itself, you can test it here (regex101.com).

You need to use groups if you want to limit what the | regex operator applies to. In the regex I use above,

  • (bo|h|fr|sam|pe) means either one of bo, h, fr, sam or pe
  • ((?:cl(?:nr|eaner|[xnr]|))|ro) means either (?:cl(?:nr|eaner|[xnr]|)) (this means cl followed by either nr, eaner, x, n, r or nothing) or ro
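  • ([0-9]) means a single digit (I prefer this to \d for minor additional performance; note also that in Python 3, \d in a str pattern matches non-ASCII Unicode digits too, while [0-9] matches ASCII digits only)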
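As a quick check of the breakdown above, the following sketch prints what each of the three capturing groups holds for a few inputs, and shows one behavioural difference between [0-9] and \d on Python 3 str patterns. The variable names are illustrative only.

```python
import re

# Same regex as in the answer above.
PATTERN = re.compile(
    r"(bo|h|fr|sam|pe)((?:cl(?:nr|eaner|[xnr]|))|ro)([0-9])$"
)

samples = ['boro1', 'samcl2', 'hclnr2', 'hcleaner9', 'cl2']
parts = {}
for s in samples:
    m = PATTERN.match(s)
    # groups() returns (prefix, middle, digit) when there is a match
    parts[s] = m.groups() if m else None
    print(s, '->', parts[s])

# On str patterns, \d also accepts non-ASCII decimal digits such as
# U+0663 (ARABIC-INDIC DIGIT THREE); [0-9] does not.
arabic_three = '\u0663'
d_matches = bool(re.fullmatch(r'\d', arabic_three))
range_matches = bool(re.fullmatch(r'[0-9]', arabic_three))
print(d_matches, range_matches)
```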
