How to match this pattern using regex in Python

Question

I have a list of names written in different notations, for example:

myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01', 'AB2000_2', 'ABC2000_02', 'AB2000_A1']

The standardized versions of those notations are, for example:

'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'

What I tried is to split the string into its components using re.compile and findall.

input:

compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")

output:

characters = ['AB', '2000', '2000', 'A', '1']

Then applying:

characters = list(set(characters))

The plan is then to match the values of that list against the main components of the string: an alphabetic part, followed by a numeric part, followed by an alphanumeric part.

But as you can see in the previous output, I can't match 'A1' as a single token using \W+. My desired output is:

characters = ['AB', '2000', '2000', 'A1']

Any idea how to fix that?

Or any better idea to solve my problem in general? Thank you in advance.

It's not clear to me what the possible inputs are or what the desired output is in all cases. Perhaps ^([A-Za-z]+)(\d+)(_([A-Za-z]*)(\d+))?$ will match the groups you want? Using group matching seems more straightforward than the tokenization you're attempting.
– chash, Commented Jul 7, 2020 at 15:48

Valdi_Bo · Accepted Answer · 2020-07-07 16:14:29Z

Use the following pattern with optional groups and capturing groups:

r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?'

together with the re.I flag (case-insensitive matching).

Note that (?:_([A-Z\d]+))? must be written twice so that the third and fourth groups are captured separately. If you instead wrote it once and repeated it with "*", the capturing group would keep only its last match, so for a name with two suffixes the first suffix would be lost.
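To make the difference concrete, here is a small sketch comparing the two spellings on the longest name from the test list. The variable names (starred, doubled) are only illustrative.

```python
import re

# One capturing group repeated with "*": it keeps only its last match.
starred = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))*', re.I)
# The optional group written twice: each suffix gets its own group.
doubled = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)

name = 'AB2000_2000_A1'
starred_groups = starred.match(name).groups()
doubled_groups = doubled.match(name).groups()

print(starred_groups)  # ('AB', '2000', 'A1')
print(doubled_groups)  # ('AB', '2000', '2000', 'A1')
```

With "*", the middle suffix '2000' is matched but not kept, because each repetition overwrites what the group captured before.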

To test it, I ran the following test:

import re

myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01',
    'AB2000_2', 'ABC2000_02', 'AB2000_A1', 'AB2000_2000_A1']
pat = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
for tt in myList:
    print(f'{tt:16} ', end=' ')
    mtch = pat.match(tt)
    if mtch:
        for it in mtch.groups():
            if it is not None:
                print(f'{it:5}', end=' ')
    print()

getting:

ab2000            ab    2000  
abc2000_2000      abc   2000  2000  
AB2000            AB    2000  
ab2000_1          ab    2000  1     
ABC2000_01        ABC   2000  01    
AB2000_2          AB    2000  2     
ABC2000_02        ABC   2000  02    
AB2000_A1         AB    2000  A1    
AB2000_2000_A1    AB    2000  2000  A1
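Once the groups are captured, building the standardized name could look roughly like the sketch below. It rests on assumptions read off the three examples in the question, not on a stated rule: the letter prefix is always replaced by 'ABC', a digits-only suffix is zero-padded to two digits, and any other suffix is upper-cased. The helper name standardize and its prefix parameter are hypothetical, and fullmatch is used here (instead of match) so that names with trailing junk are rejected.

```python
import re

PAT = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)

def standardize(name, prefix='ABC'):
    """Hypothetical helper: rebuild a name in the 'ABC2000_01' style."""
    m = PAT.fullmatch(name)
    if m is None:
        return None  # the name does not fit the pattern at all
    _letters, number, *suffixes = m.groups()
    # Assumption: every letter prefix maps to the same standard prefix.
    parts = [prefix + number]
    for s in suffixes:
        if s is None:
            continue  # optional group did not take part in the match
        # Assumption: digits-only suffixes are padded to two digits.
        parts.append(s.zfill(2) if s.isdigit() else s.upper())
    return '_'.join(parts)

for raw in ['ab2000', 'ab2000_1', 'AB2000_A1']:
    print(raw, '->', standardize(raw))
```

If the real rules differ (for example, several possible prefixes), the prefix mapping is the part to replace, e.g. with a dictionary lookup.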
