The users of my app can configure the layout of certain files via a format string.

For example, the config value the user specifies might be:

layout = '%(group)s/foo-%(locale)s/file.txt'

I now need to find all such files that already exist. This seems easy enough using the glob module:

glob_pattern = layout % {'group': '*', 'locale': '*'}
glob.glob(glob_pattern)

However, now comes the hard part: Given the list of glob results, I need to get all those filename-parts that matched a given placeholder, for example all the different "locale" values.

I thought I would generate a regular expression from the format string and match it against the list of glob results (or possibly skip glob and do all the matching myself).

But I can't find a nice way to create the regex with both the proper group captures, and escaping the rest of the input.

For example, this might give me a regex that matches the locales:

regex = layout % {'group': '.*', 'locale': '(.*)'}

But to be sure the regex is valid, I need to pass it through re.escape(), which then also escapes the regex syntax I have just inserted. Calling re.escape() first ruins the format string.
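To make the two failure modes concrete, here is a minimal sketch using the layout from above. The sample path and the variable names are only for illustration.

```python
import re

layout = '%(group)s/foo-%(locale)s/file.txt'

# Escaping after substitution also escapes the inserted regex syntax,
# so the pattern only matches a literal "(.*)" and never a real path.
escaped_after = re.escape(layout % {'group': '.*', 'locale': '(.*)'})
matched = re.match(escaped_after, 'something/foo-en-gb/file.txt')  # no match

# Escaping first puts a backslash in front of the parentheses of
# "%(group)s", so the % operator no longer sees valid placeholders.
try:
    re.escape(layout) % {'group': '.*', 'locale': '(.*)'}
    escape_first_error = None
except ValueError as exc:
    escape_first_error = exc
```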

I know there's fnmatch.translate(), which would even give me a regex - but not one that returns the proper groups.

Is there a good way to do this, without a hack like replacing the placeholders with a regex-safe unique value etc.?

Is there possibly some way (a third party library perhaps?) that allows dissecting a format string in a more flexible way, for example splitting the string at the placeholder locations?

2 Answers

Since you are using named placeholders, I'd use named groups. This seems to work:

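import re

# Marker wrapped around every substituted group so the result can be split.
UNIQ = '_UNIQUE_STRING_'

class MarkPlaceholders(dict):
    # Any placeholder name maps to a named, non-greedy group between markers.
    def __getitem__(self, key):
        return UNIQ + ('(?P<%s>.*?)' % key) + UNIQ

def format_to_re(format):
    parts = (format % MarkPlaceholders()).split(UNIQ)
    # Even indices hold literal text and need escaping; odd indices hold the groups.
    for i in range(0, len(parts), 2):
        parts[i] = re.escape(parts[i])
    return ''.join(parts)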

and then to test:

>>> layout = '%(group)s/foo-%(locale)s/file.txt'
>>> print format_to_re(layout)
(?P<group>.*?)\/foo\-(?P<locale>.*?)\/file\.txt
>>> pattern = re.compile(format_to_re(layout))
>>> print pattern.match('something/foo-en-gb/file.txt').groupdict()
{'locale': 'en-gb', 'group': 'something'}
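To get back to the original goal (collecting every distinct "locale" value), the same idea can be applied to a list of paths. The following is a Python 3 sketch; the `paths` list is a made-up stand-in for the glob results, and anchoring the pattern with `\Z` is an extra assumption so that trailing text does not slip through.

```python
import re

UNIQ = '_UNIQUE_STRING_'  # assumed not to occur in the layout itself

class MarkPlaceholders(dict):
    def __getitem__(self, key):
        return UNIQ + '(?P<%s>.*?)' % key + UNIQ

def format_to_re(layout):
    parts = (layout % MarkPlaceholders()).split(UNIQ)
    # Escape only the literal pieces (even indices), not the groups.
    parts[::2] = [re.escape(p) for p in parts[::2]]
    return ''.join(parts)

layout = '%(group)s/foo-%(locale)s/file.txt'
pattern = re.compile(format_to_re(layout) + r'\Z')

# Hypothetical glob results, hard-coded so the example runs anywhere.
paths = [
    'news/foo-en-gb/file.txt',
    'news/foo-de/file.txt',
    'blog/foo-de/file.txt',
    'blog/other/file.txt',   # does not follow the layout, so it is skipped
]

matches = (pattern.match(p) for p in paths)
locales = sorted({m.group('locale') for m in matches if m})
print(locales)
```

Note that a layout using the same placeholder twice would produce a duplicate group name, which `re.compile` rejects; this sketch does not handle that case.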

3 Comments

I had hoped to find a way other than using a unique identifier, but this is an interesting spin on that approach. In particular, I like that I'll only need a single unique separator, rather than one for every field that needs to match a different regular expression.
If the unique separator worries you too much you could always include a number in it and increment the number until you get something that isn't in the string.
OK, this works for strings. Would it be possible to make this work for more constructs, such as turning "node%(id)03d" into "node(?P<id>\d\d\d)"?

You can try this; it works around your escaping problems.

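import re

# The marker must pass through re.escape() unchanged; '_' was escaped before Python 3.3, so use letters only.
unique = 'UNIQUESTRING'
assert unique not in layout
regexp = re.escape(layout % {'group': unique, 'locale': unique}).replace(unique, '(.*)')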
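As a quick check of this approach, here is a self-contained sketch run against one sample path. The letters-only marker value is an assumption, chosen so that `re.escape()` leaves it untouched on any Python version. The groups come back by position, in placeholder order, rather than by name.

```python
import re

layout = '%(group)s/foo-%(locale)s/file.txt'

# Letters-only marker (an assumption): re.escape() never alters it.
unique = 'UNIQUESTRING'
assert unique not in layout

# Substitute the marker, escape everything, then swap the marker for a group.
regexp = re.escape(layout % {'group': unique, 'locale': unique}).replace(unique, '(.*)')

match = re.match(regexp, 'something/foo-en-gb/file.txt')
groups = match.groups()  # positional: first the group, then the locale
print(groups)
```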
