I have a string like this:

    \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}

I would like to capture all the paths: path1, path2, ..., pathn. I tried Python's re module, but it does not keep every repetition of a repeated capture group. For example, the pattern

    r"\\mypath\{(\{[^\{\}\[\]]*\})*\}"

only returns the last repetition of the group: searching r"\mypath{{path1}{path2}}" with it gives groups() as ('{path2}',).

Then I found an alternative way to do this:

    >>> import re
    >>> gpathRegexPat = r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
    >>> gpathRegexCp = re.compile(gpathRegexPat)
    >>> strpath = gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
    >>> strpath
    '{sadf}{ad}'
    >>> p = re.compile(r'\{([^\{\}\[\]]*)\}')
    >>> p.findall(strpath)
    ['sadf', 'ad']

or:

    >>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
    >>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
    >>> strpath=gpathRegexCp.search(r'\input{{whatever}{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
    >>> strpath
    '\\mypath{{sadf}{ad}}'
    >>> p.findall(strpath)
    ['sadf', 'ad']

At this point I thought: why not just use findall on the original string? I could use:

    gpathRegexPat = r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})"

My reasoning was: if the first (?:\{[^\{\}\[\]]*\})*? matches zero times and the second one matches once, the group captures sadf; if the first matches once and the second zero times, it captures ad. However, findall returns only ['sadf'] with this regex.

Without the extra wrappers, (?:\\mypath\{) and (?:\}), it actually works:

    >>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
    >>> p2.findall(strpath)
    ['sadf', 'ad']
    >>> p2.findall('{adadd}{dfada}{adafadf}')
    ['adadd', 'dfada', 'adafadf']

Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?

2 Answers

2
    re.findall("{([^{}]+)}", text)

should work. It returns

    ['path1', 'path2', 'path3', 'pathn']

finally

    import re

    my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
    # keep only the \mypath part
    my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
    print(re.findall("{([^{}]+)}", my_path2))

or even better

    re.findall(r"{(path\d+)}", text)  # only returns things like path<num> inside {}

5 Comments

To respect Wang's path specification, the regular expression should be r"\{([^{}[\]]*)\}".
@Joran, thanks for the reply, but consider a string buried inside other characters: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.1}{0.2}}. It will also capture the parameters in \input{} and \shape{}. I guess it still needs a two-step capture.
@Joran Beasley, thanks! So it is also a two-step approach. Can you explain a little more why the regex r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})" cannot return all paths with findall() either?
naw with regex you want to keep them short ... long regex's take a long time for me to decode ...
@Wang: to extract the sequence (step 1): L = re.findall(r"\\mypath{({.*?})}", input_text) (assume \mypath occurs multiple times). Step 2: for each item in the list L call re.findall("{([^{}]+)}", item)
1

You are right. With the re module it is not possible to return every repetition of a repeated subgroup; only the last one is kept. To do what you want, you can use one regular expression to capture the whole group and then a second regular expression to capture the repeated subgroups.

In this case that would be something like: \\mypath\{(\{.*?\})\}. Note that the group has to be a capturing one; group 1 will then contain {path1}{path2}{path3}.

Then to find the repeating patterns of {pathn} inside that string, you can simply use \{(.*?)\}. This will match anything within the braces. The .*? is a non-greedy version of .*, meaning it will return the shortest possible match instead of the longest possible match.
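
A minimal sketch of the two-step approach, using only the standard re module. The function name extract_paths and the sample string are illustrative, not from the question, and the inner pattern assumes that the paths themselves never contain braces.

```python
import re

# Step 1: grab the whole brace list that follows \mypath.
# Step 2: pull each {...} item out of that list.
OUTER = re.compile(r"\\mypath\{((?:\{[^{}]*\})*)\}")
INNER = re.compile(r"\{([^{}]*)\}")


def extract_paths(text):
    """Return one list of paths per \\mypath{...} occurrence in text."""
    return [INNER.findall(m.group(1)) for m in OUTER.finditer(text)]


sample = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}}\shape{{0.2}{0.3}}"
print(extract_paths(sample))
```

Using finditer for the first step means the same function also copes with \mypath appearing more than once, at the cost of returning a list of lists.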

4 Comments

note: regex module supports repeated captures. Though they are not necessary in this case
@Sebastian. Thanks! I guess this is not a built-in module.
@Hans Then. Thanks! Can you explain a little more why the regex r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})" cannot return all paths with findall() either?
That is because findall will return only the patterns that match your regular expression. Since you start with \mypath, findall will only return matches that start with that string, of which you have only one.
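
This can be checked directly: findall gives one result per non-overlapping match of the whole pattern, and a pattern anchored on \mypath can match only once. Here is a small sketch that uses finditer to expose the match spans; the variable names are illustrative, and the item pattern is the question's {…} item written without the redundant escapes.

```python
import re

s = r"\mypath{{sadf}{ad}}"
item = r"\{[^{}\[\]]*\}"

# Anchored on \mypath: the whole pattern can match only once, so the single
# capture group yields a single value.
anchored = re.compile(r"\\mypath\{(?:%s)*?\{([^{}\[\]]*)\}(?:%s)*?\}" % (item, item))
anchored_hits = [(m.span(), m.group(1)) for m in anchored.finditer(s)]

# Without the anchor: each {...} item is a separate, non-overlapping match.
bare = re.compile(r"(?:%s)*?\{([^{}\[\]]*)\}(?:%s)*?" % (item, item))
bare_hits = [(m.span(), m.group(1)) for m in bare.finditer(s)]

print(anchored_hits)
print(bare_hits)
```

The anchored pattern produces a single match covering the whole string, while the bare pattern produces one short match per item, which is why findall returns both paths only in the second case.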
