
I want to automatically analyse a string for all the types of chars it contains, and also create a RegEx pattern from a column of sample templates.
Later, any string related to such a pattern could then be cleaned so that only the allowed chars remain, and aligned with the pattern.

For example, samples could be:

"A111AA1" - means that the possible chars are letters and digits only; the pattern should be: 1 letter, then 3 digits, followed by 2 letters and 1 digit.

"11AA-111A" - means that possible chars: letters, digits, hyphen/dash; pattern: 2 digits, 2 letters, dash, 3 digits, 1 letter.

Is it possible without manual if-else hardcoding? There could be more than 1000 unique patterns.

Thanks.

Update

To extract all possible chars in a string, I've created the following function. It builds a RegEx character class from the chars present (allowed) in the pattern.
If you know a better method, let me know.

import re

def extractCharsFromPattern(pattern: str) -> str:
    allowedChars = []

    # Reduce the string to its distinct chars (sorted, so the result is deterministic)
    pattern = ''.join(sorted(set(pattern)))

    # Letters
    if re.search(r"[a-zA-Z]", pattern):
        allowedChars.append("a-zA-Z")
        pattern = re.sub(r"[a-zA-Z]", "", pattern)
    # Digits
    if re.search(r"[0-9]", pattern):
        allowedChars.append("0-9")
        pattern = re.sub(r"[0-9]", "", pattern)
    # Special chars: escape them so '-', ']' and '\' stay literal inside the class
    allowedChars.append(re.escape(pattern))

    # Prepare in regex format
    allowedChars = "[" + "".join(allowedChars) + "]"

    return allowedChars
  • Please show us how you tried to solve the problem. Commented Apr 14, 2021 at 6:14
  • I wrote a regex manually for each template, like '[A-Z]{3}[0-9]{4}[0-9]{2}', but that's not feasible for this many patterns. Commented Apr 14, 2021 at 6:23

1 Answer

If your patterns are that simplistic then of course you can match on that to get a regex, for example:

patterns = ["A111AA1", "11AA-111A"]
for pattern in patterns:
    re_pattern = ''.join([r'\d' if c.isdigit() else r'[a-zA-Z]' if c.isalpha() else r'-' if c=='-' else '???' for c in pattern])
    print (pattern, '-->', re_pattern)

A111AA1   --> [a-zA-Z]\d\d\d[a-zA-Z][a-zA-Z]\d
11AA-111A --> \d\d[a-zA-Z][a-zA-Z]-\d\d\d[a-zA-Z]
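
With more than 1000 templates, the same idea may be easier to reuse as a function that also collapses repeated tokens into `{n}` quantifiers, closer to the hand-written `[A-Z]{3}[0-9]{4}` style mentioned in the comments. The following is only a sketch: the names `char_token` and `template_to_regex` are made up for illustration, it assumes ASCII letters and digits, and it treats every other character as a literal.

```python
import re
import string
from itertools import groupby


def char_token(c: str) -> str:
    """Map one template character to a regex token (hypothetical helper)."""
    if c in string.digits:
        return r"\d"
    if c in string.ascii_letters:
        return "[a-zA-Z]"
    # Assumption: anything else (dash, dot, space, ...) must match literally.
    return re.escape(c)


def template_to_regex(template: str) -> str:
    """Build a regex from a sample template, collapsing runs into {n}."""
    parts = []
    for token, run in groupby(char_token(c) for c in template):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


print(template_to_regex("A111AA1"))  # [a-zA-Z]\d{3}[a-zA-Z]{2}\d
```

Using `re.fullmatch(template_to_regex(template), value)` then checks that a whole string follows the template, rather than just containing a matching substring.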

From your comments, if you just want a character class, you'd chain it all together. Here is an example one-liner, but based on your requirements you'd put it in a function:

>>> import re
>>> s="AA-22"
>>> r = ('['                                   # start of character class
  +  ('a-z' if re.search(r'[a-z]', s) else '') # have a lowercase?
  + ('A-Z' if re.search(r'[A-Z]', s) else '')  # have an uppercase?
  + ('0-9' if re.search(r'[0-9]', s) else '')  # have a number?
  + ('-' if re.search(r'-', s) else '')        # have a dash
  + ']'                                        # end of character class
  +  '{' + str(len(s)) + '}'                   # enforce a length?
)
# '[A-Z0-9-]{5}'
>>> re.search(r, "BB-44").group(0)
# 'BB-44'
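
If the templates can also contain dots, spaces or other characters, the per-character checks above could be generalised roughly as follows. This is a sketch under assumptions, not a definitive implementation: `allowed_class` and `clean` are hypothetical names, the template is assumed to be non-empty, and only ASCII letters and digits are grouped into ranges while every other character is escaped and listed individually.

```python
import re
import string

ALNUM = set(string.ascii_letters + string.digits)


def allowed_class(template: str) -> str:
    """Character class covering every kind of character in the template."""
    parts = []
    if any(c in string.ascii_letters for c in template):
        parts.append("a-zA-Z")
    if any(c in string.digits for c in template):
        parts.append("0-9")
    # Escape the remaining characters so '-', ']' or '\' cannot break the class.
    specials = sorted(set(template) - ALNUM)
    parts.extend(re.escape(c) for c in specials)
    return "[" + "".join(parts) + "]"


def clean(value: str, template: str) -> str:
    """Remove every character that the template does not allow."""
    negated = "[^" + allowed_class(template)[1:]
    return re.sub(negated, "", value)


print(clean("A1_AA-1A!", "A1-AA-1A"))  # A1AA-1A
```

Escaping each special character individually avoids accidental ranges such as ` -.` (space to dot) when several specials end up next to each other inside the brackets.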

6 Comments

Unfortunately not that simplistic, but I got your point, thanks! Any ideas how I can extract all types of chars from string?
@Oleksii what do you mean by "all types of characters"? You only have numbers, dashes, and letters in your question.
Yes, but it's just a sample; there could be patterns with dots, spaces, etc. What I mean is extracting all possible/allowed chars according to the pattern. Check my update of the question. For example: pattern = "A1-AA-1A" means that the string should contain only digits, letters and dashes, with all other chars removed ( re.sub(r'[^a-zA-Z0-9-]',"",string) )
@Oleksii sorry I don't follow: could you give an example or so of input/output?
@Oleksii see update, but I think your function in the question already does that. If you want to add lengths it'd need more logic.