Python RegEx: Extract all possible types of chars from string and automatically create a RegEx pattern based on sample

Question

I want to automatically analyse a string for all the types of characters it contains, and also create a RegEx pattern based on a column with a sample template,
so that later any string related to this pattern can be cleaned down to the allowed characters only and then aligned with the pattern.

For example, samples could be:

"A111AA1" - allowed chars: only letters and digits; the pattern should be: 1 letter, then 3 digits, followed by 2 letters and 1 digit.

"11AA-111A" - allowed chars: letters, digits, hyphen/dash; the pattern should be: 2 digits, 2 letters, dash, 3 digits, 1 letter.

Is this possible without hardcoding if-else branches manually? There could be more than 1000 unique patterns.

Thanks.

Update

Regarding extracting all possible chars from a string, I've created the following function. It builds a RegEx character class from the chars that exist in (and are therefore allowed by) the pattern.
If you know a better method, let me know.

import re

def extractCharsFromPattern(pattern: str) -> str:
    allowedChars = []

    # Reduce the string to its distinct chars (sorted so the result is stable)
    pattern = ''.join(sorted(set(pattern)))

    # Letters
    if re.search(r"[a-zA-Z]", pattern):
        allowedChars.append("a-zA-Z")
        pattern = re.sub(r"[a-zA-Z]", "", pattern)
    # Digits
    if re.search(r"[0-9]", pattern):
        allowedChars.append("0-9")
        pattern = re.sub(r"[0-9]", "", pattern)
    # Special chars: escaped so that '-', ']' or '^' stay literal inside the class
    allowedChars.append(re.escape(pattern))

    # Prepare in regex format
    allowedChars = "[" + "".join(allowedChars) + "]"

    return allowedChars

I wrote the regex manually for each template, like '[A-Z]{3}[0-9]{4}[0-9]{2}', but that's not feasible for this many patterns.
– Alex_Y, Commented Apr 14, 2021 at 6:23

David542 · Accepted Answer · 2021-04-15 19:14:40Z

If your patterns are that simple, you can map each character of the sample to a regex token to get a regex, for example:

patterns = ["A111AA1", "11AA-111A"]
for pattern in patterns:
    re_pattern = ''.join([r'\d' if c.isdigit() else r'[a-zA-Z]' if c.isalpha() else r'-' if c=='-' else '???' for c in pattern])
    print (pattern, '-->', re_pattern)

A111AA1   --> [a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]\d
11AA-111A --> \d\d[a-zA-Z][a-zA-Z]-\d\d\d[a-zA-Z]
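The question describes the pattern in terms of counts ("3 digits, followed by 2 letters"), so the per-character mapping above could also collapse repeated tokens into quantifiers. The following is a minimal sketch of that idea, not part of the original answer. The names `char_token` and `sample_to_regex` are hypothetical, and it assumes ASCII letters and digits should be generalised while every other character is matched literally:

```python
import re
from itertools import groupby


def char_token(c: str) -> str:
    """Map one sample character to a regex token (hypothetical helper)."""
    # str.isdigit()/isalpha() also accept non-ASCII characters, so restrict
    # them to ASCII to stay consistent with [a-zA-Z] and 0-9.
    if c.isascii() and c.isdigit():
        return r"\d"
    if c.isascii() and c.isalpha():
        return "[a-zA-Z]"
    return re.escape(c)  # anything else (dash, dot, space, ...) is literal


def sample_to_regex(sample: str) -> str:
    """Build a regex from a sample, collapsing runs of identical tokens."""
    parts = []
    for token, run in groupby(char_token(c) for c in sample):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


for sample in ["A111AA1", "11AA-111A"]:
    print(sample, "-->", sample_to_regex(sample))
```

The result is not anchored, so `re.fullmatch(sample_to_regex(sample), candidate)` is probably the safest way to check that a whole string follows the template. Because the fallback branch escapes unknown characters, there is no need for a `'???'` placeholder or a hand-written branch per special character.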

From your comments, if you just want a character class, you'd chain it all together. Here is an example one-liner, but based on your requirements you'd put it in a function:

>>> import re
>>> s="AA-22"
>>> r = ('['                                   # start of character class
  +  ('a-z' if re.search(r'[a-z]', s) else '') # have a lowercase?
  + ('A-Z' if re.search(r'[A-Z]', s) else '')  # have an uppercase?
  + ('0-9' if re.search(r'[0-9]', s) else '')  # have a number?
  + ('-' if re.search(r'-', s) else '')        # have a dash
  + ']'                                        # end of character class
  +  '{' + str(len(s)) + '}'                   # enforce a length?
)
# '[A-Z0-9-]{5}'
>>> re.search(r, "BB-44").group(0)
# 'BB-44'
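Wrapped in a function, the same idea might look like the sketch below. It is an illustration rather than the answer's own code: `allowed_class` and `clean` are hypothetical names, the sample is assumed to be non-empty, and any character that is not an ASCII letter or digit is assumed to be allowed literally.

```python
import re


def allowed_class(sample: str) -> str:
    """Return a character class covering the kinds of chars in `sample`."""
    parts = []
    if re.search(r"[a-z]", sample):
        parts.append("a-z")
    if re.search(r"[A-Z]", sample):
        parts.append("A-Z")
    if re.search(r"[0-9]", sample):
        parts.append("0-9")
    # Every remaining character is kept as a literal and escaped, so that
    # '-', ']' or '^' cannot change the meaning of the class.
    others = sorted(set(re.sub(r"[a-zA-Z0-9]", "", sample)))
    parts.extend(re.escape(c) for c in others)
    return "[" + "".join(parts) + "]"


def clean(text: str, sample: str) -> str:
    """Remove every character of `text` that the sample does not allow."""
    negated = "[^" + allowed_class(sample)[1:]  # turn [..] into [^..]
    return re.sub(negated, "", text)


print(allowed_class("AA-22"))      # [A-Z0-9\-]
print(clean("BB 4/4-x", "AA-22"))  # BB44-
```

Unlike the one-liner, this version does not hardcode the dash, so samples containing dots, spaces or other punctuation need no extra branches. A length quantifier such as `'{' + str(len(sample)) + '}'` can still be appended to the class if the overall length should be enforced.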

edited Apr 15, 2021 at 19:14

answered Apr 14, 2021 at 6:03


6 Comments

Alex_Y Over a year ago

Unfortunately it's not that simple, but I got your point, thanks! Any ideas how I can extract all types of chars from a string?

David542 Over a year ago

@Oleksii what do you mean by "all types of characters"? You only have a number, dash, and letter in your question.

Alex_Y Over a year ago

Yes, but it's just a sample; there could be patterns with dots, spaces, etc. What I mean is extracting all possible/allowed chars according to the pattern. Check my update to the question. For example: pattern = "A1-AA-1A" means that the string should contain only digits, letters and dashes, with all other chars removed ( re.sub(r'[^a-zA-Z0-9-]',"",string) )

David542 Over a year ago

@Oleksii sorry I don't follow: could you give an example or so of input/output?

David542 Over a year ago

@Oleksii see update, but I think your function in the question already does that. If you want to add lengths it'd need more logic.
