I have a regular expression defined in a YAML configuration file.

To make things easier, I'll use a dictionary here instead:

rule_1 = {
    'kind': 'regex',
    'method': 'match',
    'args': None,
    'kwargs': {
        'pattern': "[a-z_]+",
        'flags': re.X,
        'string': 's_test.log',
    }
}

I want to be able to parse that rule in a function.

If all of these values are known when the rule is defined, I can do something like this.

Importing modules:

import io
import re
from operator import methodcaller
from functools import partial

My first function below is able to adapt to changes in the regex method used:

def rule_parser_re_1(*, kind, method, args=None, kwargs=None):
    if args is None: args = []
    if kwargs is None: kwargs = {}
    mc = methodcaller(method, **kwargs)
    return mc(re)

It works as expected:

>>> rule_parser_re_1(**rule_1)
<re.Match object; span=(0, 6), match='s_test'>

Now, let's say I don't have the string to parse available at the time the configuration dictionary is defined.

For example, say it's a specific line in a file that is only accessible at runtime.

myfile = """
first line
second line
third line
"""

io_myfile = io.StringIO(myfile)

content = io_myfile.readlines()

Here is my second rule, where "line_number" (i.e. an int) replaces "string" (i.e. a str):

rule_2 = {
    'kind': 'regex',
    'method': 'match',
    'args': None,
    'kwargs': {
        'pattern': "[a-z_]+",
        'flags': re.X,
        'line_number': 2,
    }
}

My understanding is that I should be able to solve this by defining a partial rule_parser_re function. Such a function should behave like the original one called with pattern and flags already supplied, but without string.

I've come up with the below function:

def rule_parser_re_2(*, kind, method, args=None, kwargs=None):
    if args is None: args = []
    if kwargs is None: kwargs = {}

    if kind == 'regex' and method == 'match':
        pa = partial(re.match, pattern=kwargs['pattern'], flags=kwargs['flags'])
        return pa

Which also seems to work properly:

>>> r2 = rule_parser_re_2(**rule_2)
>>> r2(string=content[2])
<re.Match object; span=(0, 6), match='second'>

However, I see two maintainability problems with the above implementation:

  1. I'm using that if statement which forces me to amend the function for every re method I want to support;
  2. I need to explicitly specify the arguments, instead of just unpacking "**kwargs"

My aims/doubts:

  • Is there any way to make the above function more dynamic and maintainable?
  • Are functools.partial() and operator.methodcaller() the right tools for the job?
  • If so, can they be combined together?

Thanks!

  • Maybe add another kwarg named lines=None, and pass content in the second case. Inside the function, check whether kwargs contains line_number; if so, pop line_number from kwargs and add a string key with lines[<popped value>]. Commented Jun 6, 2021 at 16:00
  • Do you want your function to return a function, or to return the regex result? Commented Jun 6, 2021 at 16:12
  • @Cyttorak - If I get your approach right, it consists in passing the file content (as a list of lines: i.e. content) to a new kwarg named lines. Then, I use line_number I get from dictionary rule_2 to get the proper item from content. Eventually, I modify rule_2 - or a copy of it - by replacing its key line_number with string and value content[<int>]. At this point, I can use the same approach used for rule_1. Is that correct? Commented Jun 6, 2021 at 21:07
  • @Stuart - thanks a lot for your question. I want to return a function containing the regex logic, along with the line number it's supposed to take from a file. The actual string/line will be made available at runtime only. Multiple instances of the same function, each one with different regex logic/line number, will be stored in a list. Another part of the program will go through multiple files and try to identify them, depending on which instance returns a match. Commented Jun 6, 2021 at 21:16
  • @muxevola correct. Does that work for you? Commented Jun 7, 2021 at 3:36
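
The suggestion from the first comment could look roughly like the sketch below. The function name rule_parser_with_lines and the lines parameter are made up for illustration, and this version returns the match result rather than a function.

```python
import re
from operator import methodcaller


def rule_parser_with_lines(*, kind, method, args=None, kwargs=None, lines=None):
    """Hypothetical helper: turn 'line_number' into 'string' before calling re."""
    call_kwargs = dict(kwargs or {})  # copy, so the rule itself is not modified
    if "line_number" in call_kwargs:
        if lines is None:
            raise ValueError("rule uses 'line_number' but no lines were given")
        call_kwargs["string"] = lines[call_kwargs.pop("line_number")]
    # Same call as in rule_parser_re_1: re.<method>(**call_kwargs)
    return methodcaller(method, **call_kwargs)(re)


rule_2 = {
    "kind": "regex",
    "method": "match",
    "args": None,
    "kwargs": {"pattern": "[a-z_]+", "flags": re.X, "line_number": 2},
}
content = ["\n", "first line\n", "second line\n", "third line\n"]

result = rule_parser_with_lines(**rule_2, lines=content)
print(result.group())  # second
```

Because the kwargs are copied first, the same rule can be reused for several files.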

3 Answers

Since your second schema doesn’t match the signature of re.match (etc.), you need to write your own function. It can use a wrapper function with named arguments to adapt the interface (although this involves fixing a position for your invented line_number argument if you care about args). It can also use getattr, which is equivalent to certain trivial uses of operator.methodcaller:

import re

def rule2(kind, method, args, kwargs):
    return _rule2(getattr(re, method), *(args or ()), **(kwargs or {}))

def _rule2(f, pattern, line_number, flags):
    # Only content stays unbound; everything else comes from the rule.
    return lambda content: f(pattern, content[line_number], flags)

Note that content is the parameter that remains, because knowing only the line number leaves the file contents unknown. Since content is not directly a parameter of the underlying function, partial isn't the right tool here.
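
For comparison, here is a sketch of the same closure idea adapted to the keyword-only schema from the question. make_line_rule and rule_3 are hypothetical names. The sketch assumes every rule's kwargs contains line_number and that the remaining keys match the signature of the chosen re function.

```python
import re


def make_line_rule(*, kind, method, args=None, kwargs=None):
    """Hypothetical helper: bind a rule, leaving only the file content open."""
    options = dict(kwargs or {})              # copy; the rule stays untouched
    line_number = options.pop("line_number")
    func = getattr(re, method)                # e.g. re.match or re.search

    def apply(content):
        # content is the list of lines that only exists at runtime
        return func(string=content[line_number], **options)

    return apply


rule_2 = {
    "kind": "regex",
    "method": "match",
    "args": None,
    "kwargs": {"pattern": "[a-z_]+", "flags": re.X, "line_number": 2},
}
rule_3 = {  # hypothetical second rule using a different re function
    "kind": "regex",
    "method": "search",
    "args": None,
    "kwargs": {"pattern": "line", "flags": 0, "line_number": 1},
}

content = ["\n", "first line\n", "second line\n", "third line\n"]
rules = [make_line_rule(**rule_2), make_line_rule(**rule_3)]
print([rule(content).group() for rule in rules])  # ['second', 'line']
```

No if statement per method is needed, since getattr resolves the function by name and the remaining kwargs are unpacked as they are.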


4 Comments

Thanks for explaining why partial shouldn't be used here. Your solution is the closest to what I'm looking for, since I get a function whose only required argument is content (i.e. the file whose line line_number I want to parse). It's also the most succinct, readable and probably maintainable. I'm still trying to wrap my head around the fact you can "bind" the index within the indexing operator, and leave the list name unbound :)
@muxevola: You’re welcome. Just as a list can be thought of as a function from an index to an element, so can an integer be thought of as a function from a list to an element; cf. the Haskell function first or (more generally) a dual vector space.
I think I'm able to see what you mean by considering a list as "f: index -> element" and int as "f: list -> element." It'll take a little while to sink in, though :) I'm afraid I'm not able to make the connection with linear algebra, and by no means with Haskell. Anyway, thanks for sharing, I really appreciate it. It'll stay in the back of my mind.
This is definitely the solution to my use case. Although, taking into account what you've mentioned about the "schema not matching the signature of the method," and the concerns others have raised about the design, I've been re-thinking this. I will probably end up removing line_number and string from my schema. This way, whoever updates it can just look at the method's signature to be called and know what arguments to pass on. As per line_number, I'll probably define a fixed range of rows to be parsed, in order to identify the file, instead of having this information in the schema.
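
The "integer as a function from a list to an element" idea in the comments above has a direct standard-library counterpart in operator.itemgetter. The following is just an illustration of that remark; the variable names are made up.

```python
from operator import itemgetter

content = ["", "first line", "second line", "third line"]

# The index is fixed now; the list is supplied later.
get_line_2 = itemgetter(2)

print(get_line_2(content))          # second line
print(get_line_2(["a", "b", "c"]))  # c

# Equivalent closure, as used in the answer's lambda:
same_thing = lambda lines: lines[2]
print(same_thing(content))          # second line
```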

Rather than trying a partial or methodcaller, why not call the function directly, using only kwargs, but use the configuration to drive most of the kwargs/args contents? I use a closure for that, where the prepped function "remembers" the configuration.

Notice that my final call does not care that string is the keyword for re.match. I found that your example has a fair bit of coupling to regex-specific details, some of which, like re.X, could not be stored in a YAML file without further manipulation.

Likewise, the partial/methodcaller way of calling the function should not have to care which line number in a file the value comes from; that is too much coupling. If you must, add something else in the config, not under kwargs, that deals with runtime parameter acquisition.

So I changed things around a bit. I believe, though you may disagree, that when calling a parse rule, the calling function should not have to know what the argument is called. That is, unless your rules are only regex in style, in which case you don't need a kind in the config.

This is a quick, imperfect sketch of an alternative approach. Details will depend on how exactly you want to use this.

I also punted on the *args handling, though it could probably be carried out the same way if you had to.

import importlib

rule_1 = {
    'kind': 're',
    'method': 'match',
    'args': None,
    "positional_mapper" : ["string"],
    'kwargs': {
        'pattern': "[a-z_]+",
        # I don't know how this would be stored in a YAML
        # 'flags': re.X,
        'string': 's_test.log',
    }
}

rule_2 = {
    'kind': 're',
    'method': 'match',
    'args': None,
    "positional_mapper" : ["string"],
    'kwargs': {
        'pattern': "[a-z_]+",
    }
}


def prep(config):

    mod = importlib.import_module(config["kind"])
    f = getattr(mod, config["method"])

    pre_args = config.get("args") or []
    pre_kwargs = config.get("kwargs") or {}
    positional_mapper = config["positional_mapper"]

    def prepped(*args, **kwargs):

        kwargs2 = pre_kwargs.copy()

        for value, argname in zip(args, positional_mapper):
            kwargs2[argname] = value
        kwargs2.update(**kwargs)

        return f(**kwargs2)

    return prepped


parsed_rule1 = prep(rule_1)

print ("#1", parsed_rule1("second line"))
print ("#2", parsed_rule1())

parsed_rule2 = prep(rule_2)
print ("#3", parsed_rule2("second line"))
print ("#3.5", parsed_rule2(string="second line"))
print ("#4", parsed_rule2())

As expected, call #4 chokes as it is missing an argument to put into string.

#1 <re.Match object; span=(0, 6), match='second'>
#2 <re.Match object; span=(0, 6), match='s_test'>
#3 <re.Match object; span=(0, 6), match='second'>
#3.5 <re.Match object; span=(0, 6), match='second'>
Traceback (most recent call last):
  File "test_299_dyn.py:57", in <module>
    print ("#4", parsed_rule2())
  File "test_299_dyn.py:44", in prepped
    return f(**kwargs2)
TypeError: match() missing 1 required positional argument: 'string'

2 Comments

I believe that, with PyYAML's complex tags and constructors, it should be possible to store regex flags into a YAML file (i.e. as strings that will be interpreted/cast as/to that specific type in Python). Although, in my case, the YAML file is storing them as a list of uppercase letters (e.g. [X, I]) which are then taken, along with the regex pattern, and fused/tied together into a compiled regex.
@muxevola Possibly. But that may also open up YAML security issues, which is why many people prefer safe_load. Since the YAML is a data file, it may be intentionally manipulated by someone who has access to the file system but not to the executables themselves.
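
As a rough illustration of the "list of uppercase letters" approach mentioned in the first comment, flag names loaded as plain strings (which works with safe_load) can be checked against a whitelist and combined. ALLOWED_FLAGS and build_flags are hypothetical names, and the whitelist itself is an assumption.

```python
import re
from functools import reduce
from operator import or_

# Assumption: only these single-letter flag names are accepted from the config.
ALLOWED_FLAGS = {"A", "I", "M", "S", "X"}


def build_flags(names):
    """Combine flag names such as ["X", "I"] into one re flags value."""
    unknown = set(names) - ALLOWED_FLAGS
    if unknown:
        raise ValueError(f"unsupported regex flags: {sorted(unknown)}")
    # re.X | re.I | ... ; an empty list yields 0 (no flags)
    return reduce(or_, (getattr(re, name) for name in names), 0)


compiled = re.compile("[a-z_]+", build_flags(["X", "I"]))
print(compiled.match("S_Test.log").group())  # S_Test
```

Rejecting unknown names keeps getattr from reaching arbitrary attributes of the re module.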

You don't have to create a partial function. You can compile the pattern first then call the desired method with that:

rule_2 = {
    'kind': 'regex',
    'method': 'match',
    'args': None,
    'kwargs': {
        'pattern': "[a-z_]+",
        'flags': re.X,
        # 'line_number': 2, commented out this line
    }
}

content = ['', 'first line', 'second line', 'third line']

pattern = re.compile(**rule_2['kwargs'])
method = getattr(pattern, rule_2['method'])
>>> method(content[2])
<re.Match object; span=(0, 6), match='second'>

If you want to keep the line number, you can do something like this:

rule_2 = {
    'kind': 'regex',
    'method': 'match',
    'args': None,
    'kwargs': {
        'pattern': "[a-z_]+",
        'flags': re.X,
        'line_number': 2,
    }
}

content = ['', 'first line', 'second line', 'third line']
def rule_parser_re(*, kind, method, args=None, kwargs=None):
    copied_kwargs = dict(kwargs or {})  # copy so the rule is not mutated
    line_number = copied_kwargs.pop('line_number')
    pattern = re.compile(**copied_kwargs)
    bound_method = getattr(pattern, method)
    return bound_method, line_number

parser, line_number = rule_parser_re(**rule_2)
>>> parser(content[line_number])
<re.Match object; span=(0, 6), match='second'>

4 Comments

Thanks a lot for your input. I need to have line_number in the dictionary, though. I've used a dictionary here for simplicity's sake. The actual code relies on a YAML configuration file. There are other non-regex rules in the file, so that's another reason to keep the same structure and have line_number in kwargs, instead of adding it as another key in rule_2. I also need to include the information provided by line_number in the returned function, since it's necessary to properly identify a file.
When I say "it's necessary to properly identify a file" I mean that the returned function is used as a file's parser/identifier. I want to find a match in a specific line (or a group of lines), preferably at the beginning of the file, so I don't have to read its whole content. The line number is part of the parsing/matching logic.
I edited the code to somewhat fit with your case. If you want to return a single function which also contains line_number information, this won't work for you. And I believe there is not an easy way to do that.
that's similar to what @Cyttorak has suggested in the comments above. It might be useful. I'll also take into account your advice about using a compiled regex and getattr() directly, instead of using operator.methodcaller().
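
One possible way to fold line_number into a single returned function, as asked for in the comments above, is to wrap the compiled pattern's method in a closure. This is a sketch only: make_file_matcher is a hypothetical name, and it assumes the rule is applied to an open file-like object so that only the lines up to line_number are read.

```python
import io
import re
from itertools import islice


def make_file_matcher(*, kind, method, args=None, kwargs=None):
    """Hypothetical: one function carrying both the regex and the line number."""
    options = dict(kwargs or {})
    line_number = options.pop("line_number")
    bound_method = getattr(re.compile(**options), method)

    def matcher(file_obj):
        # Stop reading as soon as the requested line has been reached.
        line = next(islice(file_obj, line_number, line_number + 1), None)
        return None if line is None else bound_method(line)

    return matcher


rule_2 = {
    "kind": "regex",
    "method": "match",
    "args": None,
    "kwargs": {"pattern": "[a-z_]+", "flags": re.X, "line_number": 2},
}

matcher = make_file_matcher(**rule_2)
match = matcher(io.StringIO("\nfirst line\nsecond line\nthird line\n"))
print(match.group())  # second
```

A file shorter than line_number + 1 lines simply yields None, the same as a failed match.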
