I have a list of names with different notations: for example:
myList = [ab2000, abc2000_2000, AB2000, ab2000_1, ABC2000_01, AB2000_2, ABC2000_02, AB2000_A1]
the standarized version for those different notations are, for example:
'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'
What I tried is to separate the different characters of the string using compile.
input:
compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")
output:
characters = ['AB', '2000', '2000', 'A', '1']
Then applying:
characters = list(set(characters))
To finally try to match the values of that list with the main components of the string: an alpha format followed by a digit format followed by an alphanumeric format.
But as you can see in the previous output I can't match 'A1' into a single character using \W+. My desired output is:
characters = ['AB', '2000', '2000', 'A1']
any idea to fix that?
o any better idea to solve my problem in general. Thank you, in advance.
^([A-Za-z]+)(\d+)(_([A-Za-z]*)(\d+))?$will match the groups you want? Using group matching seems more straightforward than the tokenization you're attempting.