convert string to regex pattern

Question

I want to derive a regular expression pattern from a character string. My goal is to reuse this pattern to find strings with the same structure in another context.

From the string "1example4whatitry2do", I want to derive a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}

so that I can reuse this pattern to find this other example string: 2eytmpxe8wsdtmdry1uo

I can loop over each character, but I hope there is a faster way.

Thanks for your help!

So you want to generate a regex pattern from a single instance of what you're trying to match, then use that regex to search for additional matches?
– Jethro Cao, Commented Oct 12, 2019 at 7:43
And it can be assumed the strings will only contain alphanumeric characters?
– Jethro Cao, Commented Oct 12, 2019 at 8:36
Yes, I clean the text beforehand (removing extra characters) and convert it to lowercase. But if you have a solution that handles that too :)
– Manu64, Commented Oct 12, 2019 at 9:00
What have you tried so far? What specifically can we help you with?
– MisterMiyagi, Commented Oct 13, 2019 at 8:06

Patrick Artner · Accepted Answer · 2019-10-13 08:17:22Z

You can puzzle this out:

Go over your string character by character:
- if the character is a lowercase letter, add a 't' to a list
- if the character is a digit, add a 'd' to the list
- if the character is anything else, add the character itself to the list

Use itertools.groupby to group consecutive identical list entries. Then build the pattern from each group's key and its length with an f-string.
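To see the grouping step in isolation, here is a minimal sketch that only produces the (label, run length) pairs. The helper names char_class and runs are made up for this illustration and are not part of the answer's code; the labels follow the same 'd' / 't' convention as above.

```python
from itertools import groupby

def char_class(c):
    """Label a character: 'd' for a digit, 't' for a lowercase ASCII letter, else itself."""
    if c.isdigit():
        return "d"
    if "a" <= c <= "z":
        return "t"
    return c

def runs(s):
    """Return (label, run_length) pairs for consecutive characters sharing a label."""
    return [(label, sum(1 for _ in group))
            for label, group in groupby(s, key=char_class)]

print(runs("1example4whatitry2do"))
# [('d', 1), ('t', 7), ('d', 1), ('t', 8), ('d', 1), ('t', 2)]
```

Passing key=char_class lets groupby do the labelling and the grouping in one pass, so no intermediate list is needed.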

Code:

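from itertools import groupby
from string import ascii_lowercase

lower_case = set(ascii_lowercase)  # set for faster lookup

def find_regex(p):
    # Map every character to a class label: "d" for digits, "t" for
    # lowercase letters; any other character stands for itself.
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)

    # Collapse runs of identical labels into (label, run_length) pairs.
    runs = ((what, len(list(group))) for what, group in groupby(cum))
    # Emit "\label{n}" for runs longer than one, plain "\label" otherwise.
    return ''.join(f'\\{what}{{{how_many}}}'
                   if how_many > 1 else f'\\{what}'
                   for what, how_many in runs)

pattern = "1example4...whatit.ry2do"

print(find_regex(pattern))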

Output:

\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}

The conditional expression in the formatting omits the redundant {1} from the pattern.

See:

str.isdigit()

If you now replace '\t' with '[a-z]', your regex should fit. Instead of the isdigit check, you could also test each character with the regex r'\d' or with c in set(string.digits).
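As a rough sketch of the set(string.digits) variant: the version below emits the character classes directly, so no '\t' replacement is needed afterwards. The names to_class and string_to_pattern are hypothetical, and using re.escape for all other characters is my assumption rather than part of the code above. It avoids two edge cases: str.isdigit() also accepts characters such as '²', and prefixing an arbitrary character with a backslash can change its meaning (an uppercase 'A' would become the anchor \A).

```python
import re
from itertools import groupby
from string import ascii_lowercase, digits

DIGITS = set(digits)            # ASCII digits only
LOWER = set(ascii_lowercase)

def to_class(c):
    """Return the regex token that should stand in for character c."""
    if c in DIGITS:
        return "[0-9]"
    if c in LOWER:
        return "[a-z]"
    return re.escape(c)         # literal character, escaped only if needed

def string_to_pattern(s):
    """Build a pattern from runs of characters that map to the same token."""
    parts = []
    for token, group in groupby(s, key=to_class):
        n = sum(1 for _ in group)
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)

print(string_to_pattern("1example4...whatit.ry2do"))
# [0-9][a-z]{7}[0-9]\.{3}[a-z]{6}\.[a-z]{2}[0-9][a-z]{2}
```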

pattern = "1example4...whatit.ry2do"

pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
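To check the result against the question, the sketch below hardcodes the pattern that find_regex returns for "1example4whatitry2do" after the '\t' replacement and applies it to the second string from the question. The surrounding text in the search example is invented for illustration.

```python
import re

# Pattern derived from "1example4whatitry2do" (digit, 7 letters, digit, 8 letters, digit, 2 letters).
pat = re.compile(r"\d[a-z]{7}\d[a-z]{8}\d[a-z]{2}")

# fullmatch requires the whole string to have the same shape.
print(bool(pat.fullmatch("2eytmpxe8wsdtmdry1uo")))  # True
print(bool(pat.fullmatch("2eytmpxe8wsdtmdry1u")))   # False, last run is too short

# search finds the first occurrence inside a longer text.
text = "id: 2eytmpxe8wsdtmdry1uo ok"
m = pat.search(text)
print(m.group(0))  # 2eytmpxe8wsdtmdry1uo
```

Note that search can also match inside a longer run of letters and digits; if that matters, anchoring the pattern or using fullmatch on pre-split tokens is one option.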

See

string module for ascii_lowercase and digits

edited Oct 13, 2019 at 8:17

answered Oct 12, 2019 at 9:33

1 Comment

Patrick Artner Over a year ago

@Manu fixed some errors: I had somehow omitted the step of converting '\t' to a character range, and I now use a set for the character comparison, which makes it faster. If it works, see: stackoverflow.com/help/someone-answers and meta.stackexchange.com/questions/5234/…
