
I have a list of names written in different notations, for example:

myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01', 'AB2000_2', 'ABC2000_02', 'AB2000_A1']

The standardized versions of those notations are, for example:

'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'

What I tried was to split the string into its component tokens using a compiled regular expression (re.compile).

input:

compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")

output:

characters = ['AB', '2000', '2000', 'A', '1']

Then applying:

characters = list(set(characters))

The goal is then to match the values in that list against the main components of the string: an alphabetic part, followed by a numeric part, followed by an alphanumeric part.

But as you can see in the previous output, I can't get 'A1' as a single token using \W+. My desired output is:

characters = ['AB', '2000', '2000', 'A1']

Any idea how to fix that?

Or any better idea for solving my problem in general? Thank you in advance.

    It's not clear to me what the possible inputs are or what the desired output is in all cases. Perhaps ^([A-Za-z]+)(\d+)(_([A-Za-z]*)(\d+))?$ will match the groups you want? Using group matching seems more straightforward than the tokenization you're attempting. Commented Jul 7, 2020 at 15:48

1 Answer


Use the following pattern with optional groups and capturing groups:

r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?'

together with the re.I flag (case-insensitive matching).

Note that (?:_([A-Z\d]+))? must be written out twice in order to capture both the third and the fourth group. If you instead wrote it once and repeated it with "*", the capturing group would keep only the last repetition, so the third group would be lost.
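To illustrate the point about repetition, here is a small sketch comparing the two variants on one input. The variable names (s, starred, repeated) are mine and only serve the illustration:

```python
import re

s = 'AB2000_2000_A1'

# Written once and repeated with "*": the capturing group is overwritten on
# every repetition, so only the last suffix survives.
starred = re.match(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))*', s, re.I)
print(starred.groups())   # ('AB', '2000', 'A1')

# Written out twice: each suffix gets its own capturing group.
repeated = re.match(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', s, re.I)
print(repeated.groups())  # ('AB', '2000', '2000', 'A1')
```

With the starred version the middle '2000' is matched but not kept, which is why the pattern spells the optional group out once per expected suffix.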

To test it, I ran the following test:

import re

myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01',
    'AB2000_2', 'ABC2000_02', 'AB2000_A1', 'AB2000_2000_A1']
pat = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
for tt in myList:
    print(f'{tt:16} ', end=' ')
    mtch = pat.match(tt)
    if mtch:
        for it in mtch.groups():
            if it is not None:
                print(f'{it:5}', end=' ')
    print()

getting:

ab2000            ab    2000  
abc2000_2000      abc   2000  2000  
AB2000            AB    2000  
ab2000_1          ab    2000  1     
ABC2000_01        ABC   2000  01    
AB2000_2          AB    2000  2     
ABC2000_02        ABC   2000  02    
AB2000_A1         AB    2000  A1    
AB2000_2000_A1    AB    2000  2000  A1   
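Once the groups are captured, building the standardized name could look roughly like the sketch below. It rests on assumptions inferred only from the three examples in the question: that 'AB' is an alias for 'ABC' (the PREFIX_ALIASES table is hypothetical), that purely numeric suffixes are zero-padded to two digits, and that everything is upper-cased. The function name standardize is also mine. It uses fullmatch rather than match so that names with trailing junk are rejected instead of partially matched.

```python
import re

PAT = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)

# Hypothetical alias table: the question only shows 'AB' becoming 'ABC'.
PREFIX_ALIASES = {'AB': 'ABC'}

def standardize(name):
    """Return the standardized form of name, or None if it does not match."""
    m = PAT.fullmatch(name)
    if m is None:
        return None
    prefix, number, *suffixes = m.groups()
    prefix = prefix.upper()
    prefix = PREFIX_ALIASES.get(prefix, prefix)
    parts = [prefix + number]
    for suffix in suffixes:
        if suffix is None:
            continue  # optional group that did not take part in the match
        # Assumption: numeric suffixes are padded to two digits ('1' -> '01').
        parts.append(suffix.zfill(2) if suffix.isdigit() else suffix.upper())
    return '_'.join(parts)

for name in ['ab2000', 'ab2000_1', 'AB2000_A1']:
    print(name, '->', standardize(name))
```

For the three examples from the question this yields 'ABC2000', 'ABC2000_01' and 'ABC2000_A1'. Adjust the alias table and the padding rule to whatever your real naming convention is.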