Extract named group regex pattern from a compiled regex in Python

Question

I have a regex in Python that contains several named groups. However, patterns that match one group can be missed if previous groups have matched because overlaps don't seem to be allowed. As an example:

import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

x = re.findall(myRegex,myText)
print(x)

Produces the output:

[('AAA', '')]

The 'long' group does not find a match because 'AAA' was used up in finding a match for the preceding 'short' group.
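A minimal sketch of how to confirm this, using the same text and pattern as above (the snake_case variable names are only illustrative): iterating with finditer and printing lastgroup shows that the single match at that position is credited to 'short', after which scanning resumes past the consumed 'AAA'.

```python
import re

my_text = 'sgasgAAAaoasgosaegnsBBBausgisego'
my_regex = re.compile(r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# Collect (group name, span, matched text) for every match found
matches = [(m.lastgroup, m.span(), m.group()) for m in my_regex.finditer(my_text)]
print(matches)
# [('short', (5, 8), 'AAA')]
```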

I've tried to find a method to allow overlapping but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:

for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***,myText)

Is it possible to extract the regex for each named group?

Ultimately, I'd like to produce a dictionary output (or similar) like:

{'short':'AAA',
 'long':'AAAaoasgosaegnsBBB'}

Any and all suggestions would be gratefully received.

No regex engine allows testing two valid matches at the same position. You may use overlapping groups in the pattern though, like in this demo.
– Wiktor Stribiżew, Commented Feb 19, 2018 at 1:13
Thanks for the info and link - that solution is really interesting. However, the regexes I will be using are being produced by a simple algorithm and will have the structure of each named group being separated by an '|' (OR) character. As a result, nesting the regexes won't be feasible in this instance. But a very useful tip all the same.
– user1718097, Commented Feb 19, 2018 at 1:24
Ok, so you have to run the regexes separately, or discard regex altogether if possible.
– Wiktor Stribiżew, Commented Feb 19, 2018 at 7:42
@Wiktor Stribiżew Yes, I think you're right. I've added a bit of a hack to try to automate the process of running the regexes separately and collating the results in a dictionary.
– user1718097, Commented Feb 19, 2018 at 13:01

Nathan Vērzemnieks · Accepted Answer · 2018-02-19 21:05:21Z

There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of this other answer but somewhat simpler. It will work provided that a) your patterns are always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.

The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed by (?P<, the beginning of a named group; the lookahead keeps that prefix attached to the following subpattern. The function compiles each subpattern and uses the groupindex attribute to extract the name.

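import re

def nameToMatches(pattern, string):
    result = dict()
    # Split on each pipe that is immediately followed by a named group
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        # Each subpattern holds exactly one named group
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result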

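With your given text and pattern, this returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value. Note that pattern here is the pattern string; if you only have the compiled regex, pass its .pattern attribute.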
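To see the splitting step in isolation, here is a small sketch that starts from a compiled regex, as in the question. It assumes the same two-group pattern; the names my_regex, subpatterns and names are only illustrative.

```python
import re

my_regex = re.compile(r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# A compiled regex exposes its source string through .pattern
subpatterns = re.split(r'\|(?=\(\?P<)', my_regex.pattern)
print(subpatterns)
# ['(?P<short>(?:AAA))', '(?P<long>(?:AAA.*BBB))']

# Each piece compiles on its own and carries exactly one group name
names = [list(re.compile(sp).groupindex)[0] for sp in subpatterns]
print(names)
# ['short', 'long']
```

Because the pipe is matched by a lookahead rather than consumed together with the (?P< prefix, each piece remains a complete, compilable pattern.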

If you only want one match per pattern, you can make it a bit simpler still:

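def nameToMatch(pattern, string):
    result = dict()
    # Split on each pipe that is immediately followed by a named group
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            # groupdict() maps the group name to the matched text
            result.update(match.groupdict())
    return result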

This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your inputs. If one of the named groups doesn't match at all, it will be absent from the dict.
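As a quick check of both cases, the sketch below restates the function (renamed name_to_match purely so the snippet runs standalone) and calls it on the original text and on a made-up string, 'xxAAAxx', that contains no 'BBB'.

```python
import re

def name_to_match(pattern, string):
    # Same logic as nameToMatch above, restated so this snippet is self-contained
    result = {}
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

pattern = r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'

both = name_to_match(pattern, 'sgasgAAAaoasgosaegnsBBBausgisego')
print(both)
# {'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}

only_short = name_to_match(pattern, 'xxAAAxx')  # hypothetical input without 'BBB'
print(only_short)
# {'short': 'AAA'}
```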

user1718097 · Accepted Answer · 2018-02-19 02:28:28Z

There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing but basically it splits the original regex into its component parts and runs each group regex separately on the original text.

import re

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr)   # This is actually no longer needed

print("Full regex with multiple groups")
print(myRegexStr)

# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)

print("\nList of separate regexes")
print(mySepRegexesList)

# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)

# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g,r in mySepRegexDict.items():
    m = re.findall(r,myTextStr)
    if m:   # skip groups with no match instead of raising IndexError
        myGroupRegexOutput[g] = m[0]

print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)

The resulting output is:

Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))

List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]

Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}

Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}

This might be useful to someone somewhere.
