I want to automatically analyse a string for all the types of chars present in it, and also create a RegEx pattern based on a column containing a sample template.
Later, any string related to this pattern could then be cleaned by keeping only the allowed chars and aligned with the pattern.
For example, samples could be:
"A111AA1" - allowed chars: letters and digits only; pattern: 1 letter, then 3 digits, followed by 2 letters and 1 digit.
"11AA-111A" - allowed chars: letters, digits, hyphen/dash; pattern: 2 digits, 2 letters, dash, 3 digits, 1 letter.
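To make the target concrete, these are the kind of patterns I would expect for the two samples if I wrote them by hand. This is only an illustration: the name `EXPECTED_PATTERNS` is made up, and treating "letter" as any ASCII letter in either case is an assumption.

```python
import re

# Hand-written target patterns for the two samples above (illustrative only).
# Assumption: "letter" means any ASCII letter, upper or lower case.
EXPECTED_PATTERNS = {
    "A111AA1": r"[a-zA-Z][0-9]{3}[a-zA-Z]{2}[0-9]",
    "11AA-111A": r"[0-9]{2}[a-zA-Z]{2}-[0-9]{3}[a-zA-Z]",
}

# Each sample should match its own pattern in full.
for sample, pattern in EXPECTED_PATTERNS.items():
    assert re.fullmatch(pattern, sample) is not None
```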
Is this possible without hardcoding if-else branches manually? There could be more than 1000 unique patterns.
Thanks.
Update
To extract all possible chars from a string, I've created the following function. It builds a RegEx character class from the chars that exist in (and are therefore allowed by) the pattern.
If you know a better method, let me know.
import re

def extractCharsFromPattern(pattern: str) -> str:
    allowedChars = []
    # Letters
    if re.search(r"[a-zA-Z]", pattern):
        allowedChars.append("a-zA-Z")
    # Digits
    if re.search(r"[0-9]", pattern):
        allowedChars.append("0-9")
    # Special chars: deduplicate and sort them for a stable result, then
    # escape each one so "-", "]", "^" and "\" stay literal inside the class
    special = sorted(set(re.sub(r"[a-zA-Z0-9]", "", pattern)))
    allowedChars.append("".join(re.escape(ch) for ch in special))
    # Prepare in regex format
    return "[" + "".join(allowedChars) + "]"
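
For the pattern itself, one possible way to avoid a branch per template is to classify each char of the sample and collapse runs of the same class with `itertools.groupby`. The following is only a sketch under assumptions: the names `char_class`, `pattern_from_sample` and `clean_and_check` are made up for illustration, letters and digits are limited to ASCII, and any other char is treated as a literal that must appear exactly as in the sample.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one char to the regex fragment that should match it."""
    if ch.isascii() and ch.isalpha():
        return "[a-zA-Z]"
    if ch.isascii() and ch.isdigit():
        return "[0-9]"
    # Anything else is kept as a literal, escaped so it has no special meaning
    return re.escape(ch)


def pattern_from_sample(sample: str) -> str:
    """Build a regex from a sample by collapsing runs of the same class."""
    parts = []
    for fragment, run in groupby(sample, key=char_class):
        count = len(list(run))
        parts.append(fragment if count == 1 else f"{fragment}{{{count}}}")
    return "".join(parts)


def clean_and_check(value: str, sample: str) -> tuple:
    """Drop chars the sample does not allow, then test against its pattern."""
    allowed = {char_class(ch) for ch in sample}
    cleaned = "".join(ch for ch in value if char_class(ch) in allowed)
    matches = re.fullmatch(pattern_from_sample(sample), cleaned) is not None
    return cleaned, matches


print(pattern_from_sample("A111AA1"))    # [a-zA-Z][0-9]{3}[a-zA-Z]{2}[0-9]
print(clean_and_check("12 ab-345/c", "11AA-111A"))  # ('12ab-345c', True)
```

This only cleans and validates a string; it does not try to repair one whose chars are in the wrong order or missing, so the "align with pattern" step would still need its own logic.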