I have a regex in Python that contains several named groups. However, patterns that match one group can be missed if previous groups have matched because overlaps don't seem to be allowed. As an example:

import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

x = re.findall(myRegex,myText)
print(x)

Produces the output:

[('AAA', '')]

The 'long' group does not find a match because the 'short' alternative is tried first and consumes 'AAA', so the 'long' alternative is never attempted at that position.
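
A minimal sketch that makes this visible, assuming the same text and pattern as above (the names `text`, `combined` and `hits` are only illustrative). It uses `finditer` and `lastgroup` to report which named group produced each match:

```python
import re

text = 'sgasgAAAaoasgosaegnsBBBausgisego'
combined = re.compile(r'(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')

# Each match object records which alternative succeeded in lastgroup.
hits = [(m.lastgroup, m.span(), m.group()) for m in combined.finditer(text)]
print(hits)
```

Only the 'short' alternative is reported, because the engine takes the first alternative that succeeds at a given position and then resumes scanning after the end of that match.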

I've tried to find a method to allow overlapping but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:

for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***,myText)

Is it possible to extract the regex for each named group?

Ultimately, I'd like to produce a dictionary output (or similar) like:

{'short':'AAA',
 'long':'AAAaoasgosaegnsBBB'}

Any and all suggestions would be gratefully received.

  • No regex engine allows testing two valid matches at the same position. You may use overlapping groups in the pattern though, like in this demo Commented Feb 19, 2018 at 1:13
  • Thanks for the info and link - that solution is really interesting. However, the regexes I will be using are being produced by a simple algorithm and will have the structure of each named group being separated by an '|' (OR) character. As a result, nesting the regexes won't be feasible in this instance. But a very useful tip all the same. Commented Feb 19, 2018 at 1:24
  • Ok, so you have to run the regexes separately, or discard regex altogether if possible. Commented Feb 19, 2018 at 7:42
  • @Wiktor Stribiżew Yes, I think you're right. I've added a bit of a hack to try to automate the process of running the regexes separately and collating the results in a dictionary. Commented Feb 19, 2018 at 13:01
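
For reference, a minimal sketch of what overlapping (nested) groups could look like when the pattern can be written by hand. The nested pattern below is an assumption made for illustration and is not necessarily what the linked demo showed:

```python
import re

text = 'sgasgAAAaoasgosaegnsBBBausgisego'

# 'short' is nested inside 'long', so both groups can cover the same 'AAA'.
nested = re.compile(r'(?P<long>(?P<short>AAA).*BBB)')

m = nested.search(text)
result = m.groupdict() if m else {}
print(result)
```

Note that with this layout 'short' is only reported when 'long' matches as well, which is one reason it does not suit machine-generated patterns of the form group|group|group.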

2 Answers

There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of this other answer but somewhat simpler. It will work provided that a) your patterns will always be formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.

The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed by (?P<, the beginning of a named group. Each subpattern is then compiled, and its groupindex attribute is used to extract the name.

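import re

def nameToMatches(pattern, string):
    result = dict()
    # Split on pipes that are immediately followed by a named group
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        # Each subpattern holds exactly one named group
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result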

With your given text and pattern, this returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value.
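
To see what the split step does on its own, here is a small sketch. The second pattern, with groups `a` and `b`, is a made-up example to show that a pipe inside a group is left alone because it is not followed by (?P<:

```python
import re

# Same split expression as above, written as a raw string.
splitter = r'\|(?=\(\?P<)'

parts = re.split(splitter, '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')
print(parts)

# Hypothetical pattern: the pipe in 'x|y' is not a group separator.
inner_pipe_parts = re.split(splitter, '(?P<a>x|y)|(?P<b>z)')
print(inner_pipe_parts)

# groupindex maps each group name to its group number.
names = [list(re.compile(p).groupindex)[0] for p in parts]
print(names)
```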

If you only want one match per pattern, you can make it a bit simpler still:

def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            # groupdict() holds the single named group of this subpattern
            result.update(match.groupdict())
    return result

This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your example. If one of the named groups doesn't match at all, it will be absent from the dict.
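
A short usage sketch of the single-match variant, repeated here so that it runs on its own. The second input string, 'xxAAAyy', is a made-up example with no 'BBB' in it, to show a group being left out of the result:

```python
import re

def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result

pattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'

both = nameToMatch(pattern, 'sgasgAAAaoasgosaegnsBBBausgisego')
only_short = nameToMatch(pattern, 'xxAAAyy')  # no 'BBB', so 'long' is absent
print(both)
print(only_short)
```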

There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing but basically it splits the original regex into its component parts and runs each group regex separately on the original text.

import re

myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr)   # This is actually no longer needed

print("Full regex with multiple groups")
print(myRegexStr)

# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)

print("\nList of separate regexes")
print(mySepRegexesList)

# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)

# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g,r in mySepRegexDict.items():
    m = re.findall(r,myTextStr)
    if m:   # skip groups that do not match instead of raising IndexError
        myGroupRegexOutput[g] = m[0]

print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)

The resulting output is:

Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))

List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]

Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}

Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}

This might be useful to someone somewhere.
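
One caveat worth checking before relying on this, shown as a small sketch: the splitting regex expects the body of every named group to be wrapped in its own parentheses, as in the example above. The `bare` pattern below is a made-up variant without that inner wrapping:

```python
import re

splitter = re.compile(r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)')

wrapped = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
bare = '(?P<short>AAA)|(?P<long>AAA.*BBB)'  # hypothetical: no inner (?:...)

wrapped_parts = splitter.findall(wrapped)
bare_parts = splitter.findall(bare)  # nothing is extracted here
print(wrapped_parts)
print(bare_parts)
```

If the generated patterns do not always carry the inner parentheses, the splitting regex would need to be adapted accordingly.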
