
I want to derive a regular expression pattern from a character string. My goal is to reuse this pattern to find strings with the same structure in another context.

From the string "1example4whatitry2do", I want to derive a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}

I can then reuse this pattern to find another string of the same shape, such as 2eytmpxe8wsdtmdry1uo

I could loop over each character, but I hope there is a faster way.

Thanks for your help!

  • so you want to be able to generate a regex pattern from a single instance of what you're trying to match, then use that regex to search for additional matches? Commented Oct 12, 2019 at 7:43
  • Yes, you're right Commented Oct 12, 2019 at 8:22
  • and it can be assumed the strings will only contain alphanumeric characters? Commented Oct 12, 2019 at 8:36
  • Yes, I'm cleaning the text beforehand (removing extra characters) and converting it to lowercase. But if you have a solution that manages that as well :) Commented Oct 12, 2019 at 9:00
  • What have you tried so far? What specifically can we help you with? Commented Oct 13, 2019 at 8:06

1 Answer


You can puzzle this out:

  • go over your string character by character
    • if the character is a lowercase letter, add a 't' to a list
    • if the character is a digit, add a 'd' to a list
    • if the character is something else, add the character itself to the list

Use itertools.groupby to collect consecutive identical list entries into runs. Then build the pattern from each group key and the length of its run, using f-string formatting.
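To make the grouping step concrete, here is a small sketch of what itertools.groupby does with such a list of class codes. The short list below is made up for illustration and stands for a string like "1ex4..".

```python
from itertools import groupby

# Class codes for a string shaped like "1ex4..":
# digit, two letters, digit, two literal dots.
classes = ["d", "t", "t", "d", ".", "."]

# groupby yields one (key, iterator) pair per run of equal items;
# consuming the iterator gives the length of the run.
runs = [(key, len(list(group))) for key, group in groupby(classes)]
print(runs)  # [('d', 1), ('t', 2), ('d', 1), ('.', 2)]
```

Each (key, length) pair then becomes one piece of the final pattern.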

Code:

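from itertools import groupby
from string import ascii_lowercase

lower_case = set(ascii_lowercase)  # set for faster lookup

def find_regex(p):
    # Map each character to a class code: 'd' for digits, 't' for
    # lowercase letters; any other character stands for itself.
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)

    # Collapse runs of identical codes into (code, run_length) pairs.
    runs = ((what, len(list(group))) for what, group in groupby(cum))

    # Emit \code{n}, leaving out the redundant {1} for runs of length one.
    # Note: the backslash prefix is a valid escape only for punctuation
    # such as '.'; other letters (e.g. uppercase) would need re.escape.
    return ''.join(f'\\{what}{{{how_many}}}' if how_many > 1 else f'\\{what}'
                   for what, how_many in runs)

pattern = "1example4...whatit.ry2do"

print(find_regex(pattern))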

Output:

\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}

The ternary in the formatting leaves out the unneeded {1} from the pattern.


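If you now replace '\t' with '[a-z]', your regex should fit. The replacement is required because, as emitted, \t in a regex matches a tab character. Instead of the isdigit check you could also test each character with the regex r'\d' or with a membership test in set(string.digits).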

pattern = "1example4...whatit.ry2do"

pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
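
As a follow-up, here is a self-contained sketch of a variant that emits the character classes directly and then reuses the derived pattern on the second string from the question. It is not the code above: the names char_class and derive_pattern are hypothetical, digits are checked against string.digits (ASCII only), and any other character is escaped with re.escape so that it is matched literally.

```python
import re
from itertools import groupby
from string import ascii_lowercase, digits

LOWER = set(ascii_lowercase)
DIGITS = set(digits)

def char_class(c):
    """Return the regex fragment that stands for a single character."""
    if c in DIGITS:
        return r"\d"
    if c in LOWER:
        return "[a-z]"
    return re.escape(c)  # any other character is matched literally

def derive_pattern(sample):
    """Hypothetical helper: collapse runs of equal fragments into fragment{n}."""
    parts = []
    for fragment, run in groupby(char_class(c) for c in sample):
        count = len(list(run))
        parts.append(fragment if count == 1 else f"{fragment}{{{count}}}")
    return "".join(parts)

pattern = derive_pattern("1example4whatitry2do")
print(pattern)  # \d[a-z]{7}\d[a-z]{8}\d[a-z]{2}

# fullmatch requires the whole candidate string to fit the pattern.
print(bool(re.fullmatch(pattern, "2eytmpxe8wsdtmdry1uo")))  # True
print(bool(re.fullmatch(pattern, "2eytmpxe8wsdtmdry1u")))   # False
```

re.fullmatch is used here because the goal is to check whole strings; re.search would instead find the pattern anywhere inside a longer text.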



1 Comment

@Manu fixed some errors - I had somehow omitted the step of converting '\t' to a character range, and I now use a set for the character comparison, which makes it faster. If it works, see: stackoverflow.com/help/someone-answers and meta.stackexchange.com/questions/5234/…
