Python: Convert format string to regular expression

Question

The users of my app can configure the layout of certain files via a format string.

For example, the config value the user specifies might be:

layout = '%(group)s/foo-%(locale)s/file.txt'

I now need to find all such files that already exist. This seems easy enough using the glob module:

glob_pattern = layout % {'group': '*', 'locale': '*'}
glob.glob(glob_pattern)

However, now comes the hard part: Given the list of glob results, I need to get all those filename-parts that matched a given placeholder, for example all the different "locale" values.

I thought I would generate a regular expression from the format string and match it against the list of glob results (or possibly skip glob and do all the matching myself).

But I can't find a nice way to build a regex that both has the proper capture groups and escapes the rest of the input.

For example, this might give me a regex that matches the locales:

regex = layout % {'group': '.*', 'locale': '(.*)'}

But to be sure the regex is valid, I need to pass it through re.escape(), which then also escapes the regex syntax I have just inserted. Calling re.escape() first ruins the format string.

I know there's fnmatch.translate(), which would even give me a regex - but not one that returns the proper groups.

Is there a good way to do this, without a hack like replacing the placeholders with a regex-safe unique value etc.?

Is there possibly some way (a third party library perhaps?) that allows dissecting a format string in a more flexible way, for example splitting the string at the placeholder locations?

Duncan · Accepted Answer · 2010-04-16 17:46:56Z

2

Since you are using named placeholders, I'd use named groups. This seems to work:

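import re

# Marker wrapped around every generated group so the groups can be found again.
UNIQ = '_UNIQUE_STRING_'

class MarkPlaceholders(dict):
    # Any key lookup yields a named, non-greedy group fenced by UNIQ.
    def __getitem__(self, key):
        return UNIQ + ('(?P<%s>.*?)' % key) + UNIQ

def format_to_re(format):
    parts = (format % MarkPlaceholders()).split(UNIQ)
    # Even indices are literal text (escape it); odd indices are the groups.
    for i in range(0, len(parts), 2):
        parts[i] = re.escape(parts[i])
    return ''.join(parts)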

and then to test:

>>> layout = '%(group)s/foo-%(locale)s/file.txt'
>>> print format_to_re(layout)
(?P<group>.*?)\/foo\-(?P<locale>.*?)\/file\.txt
>>> pattern = re.compile(format_to_re(layout))
>>> print pattern.match('something/foo-en-gb/file.txt').groupdict()
{'locale': 'en-gb', 'group': 'something'}
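
To get from the compiled pattern back to the original goal (every distinct value of one placeholder among the glob results), something like the following might work. It is only a sketch, written for Python 3, where print is a function. The helper name collect_values and the sample paths are hypothetical; the list stands in for whatever glob.glob() would return.

```python
import re

UNIQ = '_UNIQUE_STRING_'

class MarkPlaceholders(dict):
    def __getitem__(self, key):
        return UNIQ + '(?P<%s>.*?)' % key + UNIQ

def format_to_re(fmt):
    parts = (fmt % MarkPlaceholders()).split(UNIQ)
    for i in range(0, len(parts), 2):
        parts[i] = re.escape(parts[i])  # literal text only
    return ''.join(parts)

def collect_values(layout, paths, name):
    """Hypothetical helper: distinct values matched by placeholder `name`."""
    pattern = re.compile(format_to_re(layout))
    values = set()
    for path in paths:
        # fullmatch() rejects paths with trailing text after the layout.
        m = pattern.fullmatch(path)
        if m:
            values.add(m.group(name))
    return values

layout = '%(group)s/foo-%(locale)s/file.txt'
# Made-up sample data standing in for glob.glob() results.
paths = [
    'news/foo-en-gb/file.txt',
    'news/foo-de/file.txt',
    'blog/foo-de/file.txt',
    'blog/bar-fr/file.txt',   # does not follow the layout, so it is skipped
]
locales = collect_values(layout, paths, 'locale')
print(sorted(locales))
```

Using a set keeps each value once, and paths that do not fit the layout are ignored rather than raising an error.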


3 Comments

miracle2k Over a year ago

I had hoped to find a way other than using a unique identifier, but this is an interesting spin on that approach. In particular, I like that I'll only need a single unique separator, rather than one for every field that needs to match a different regular expression.

Duncan Over a year ago

If the unique separator worries you too much you could always include a number in it and increment the number until you get something that isn't in the string.
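
The numbering idea from this comment could be sketched roughly as follows. The helper name pick_separator is hypothetical, and the sketch only checks the format string itself, on the assumption that the marker contains no % characters.

```python
import itertools

def pick_separator(text, base='_UNIQUE_STRING_'):
    """Hypothetical helper: return a marker that does not occur in `text`."""
    if base not in text:
        return base
    # Append an increasing number until the candidate is absent from the text.
    for n in itertools.count():
        candidate = '%s%d_' % (base, n)
        if candidate not in text:
            return candidate

layout = '%(group)s/foo-%(locale)s/file.txt'
separator = pick_separator(layout)
```

The returned value would then take the place of the fixed UNIQ constant in the answer.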

Jens Timmerman Over a year ago

ok, this works for strings, would it be possible to make this work for more constructs? like parsing "node%(id)03d" to "node(?P<id>\d\d\d)"
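
One possible direction for this, not taken from the answer above, is to scan the format string for placeholders with a regular expression instead of formatting it. The sketch below is an assumption-laden illustration: format_to_re_typed and PLACEHOLDER are hypothetical names, only %(name)s and zero-padded %(name)0Nd are recognised, and a literal %% is not handled. Because %03d pads to at least three digits, it emits \d{3,} rather than exactly \d\d\d.

```python
import re

# Recognises %(name)s and %(name)0Nd only; every other conversion is left
# alone and ends up escaped as literal text.
PLACEHOLDER = re.compile(r'%\((?P<name>\w+)\)(?:(?P<str>s)|0(?P<width>\d+)d)')

def format_to_re_typed(fmt):
    out = []
    pos = 0
    for m in PLACEHOLDER.finditer(fmt):
        # Escape the literal text between the previous placeholder and this one.
        out.append(re.escape(fmt[pos:m.start()]))
        if m.group('str'):
            body = '.*?'
        else:
            # %0Nd yields at least N digits (negative numbers are ignored here).
            body = r'\d{%s,}' % m.group('width')
        out.append('(?P<%s>%s)' % (m.group('name'), body))
        pos = m.end()
    out.append(re.escape(fmt[pos:]))
    return ''.join(out)

node_pattern = re.compile(format_to_re_typed('node%(id)03d'))
node_match = node_pattern.fullmatch('node007')
```

This variant needs no unique separator, at the cost of reimplementing a small part of the % format syntax.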

user97370 · Accepted Answer · 2010-04-16 17:15:54Z

1

You can try this; it works around your escaping problems.

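import re
# Letters only: re.escape() backslash-escaped '_' before Python 3.3, which would make the replace() below miss.
unique = 'UNIQUESTRING'
assert unique not in layout
regexp = re.escape(layout % {'group': unique, 'locale': unique}).replace(unique, '(.*)')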
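
For reference, the groups produced this way are unnamed, so they have to be read by position. A minimal sketch, using a letters-only marker and a hypothetical wrapper function layout_to_regexp:

```python
import re

def layout_to_regexp(layout, marker='UNIQUESTRING'):
    """Hypothetical wrapper around the one-liner; the marker is letters only
    so that re.escape() leaves it untouched."""
    assert marker not in layout
    escaped = re.escape(layout % {'group': marker, 'locale': marker})
    return escaped.replace(marker, '(.*)')

layout = '%(group)s/foo-%(locale)s/file.txt'
regexp = layout_to_regexp(layout)

match = re.match(regexp, 'something/foo-en-gb/file.txt')
# Groups come back in the order the placeholders appear in the layout.
group, locale = match.groups()
```

Note that (.*) is greedy, so when the literal text between placeholders can also occur inside a value, the split may differ from the non-greedy named groups in the other answer.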
