-3

I'm in a project that computes a lot of metrics with Python code. We need to document how each metric is computed, inside the code, for readability, and outside, in some Wiki, for non-technical people.

These rules can be resumed as:

To compute "number_of_sick_days":
- Read the attendance file (which is located in mycontainer.someprovider.net/attendance.csv)
- Keep only lines where "reason" is "sick day"
- Group by "store_location" and sum the number of days
- Store these metrics inside the database

These kind of rules can be read by non-technical people, and are really useful to keep track of how each metric is computed.

My goal is to have a way of documenting these rules inside the Python code (maybe some top-level documentation inside each module?) and a way of extracting these information and upload it to Confluence.

I want to avoid duplicating and copy/pasting this documentation.

4
  • This is related to the expert system idea. Consider using expert system shells. Neither CLIPSrules nor RefPerSys are coded in Python. Are you allowed to use non-Python open source code? Commented Jan 19, 2022 at 13:24
  • I did not know this "expert system" idea, I'll look into that. We can use whatever tool is available! Commented Jan 19, 2022 at 13:27
  • @BasileStarynkevitch Expert systems are highly irrelevant in the context of documenting business logic. Commented Jan 21, 2022 at 10:03
  • 1
    The example rules you posted don't look like business rules to me - they seem like a description of the implementation. Things like file locations, exact strings to search for in files, name of the field to group by, and storage location for the database are all technical matters. None of them are what I'd expect as an employee if I asked my manager or HR how sick days are counted or what the definition of "number of sick days" is. Commented Jan 21, 2022 at 14:03

1 Answer 1

1

There is no particularly good solution to this problem. You are very right to be concerned about avoiding duplication, because requiring a manual update step means that the various descriptions of the business logic can fall out of sync. This essentially leaves the following options:

  • The code is the source of truth, and a human-readable description is synthesized from it. Example of this strategy: documentation tools like Sphinx-Autodoc.

  • A human-readable description is the source of truth, and the executable code is synthesized from it. Example of this strategy: writing BDD tests in Gherkin/Cucumber.

  • Both descriptions must be updated together, but you have a strict change management process to ensure they stay synchronized.

I'm mentioning change management because doing that is feasible. Not everything has to be done on the level of software, some things can be done on a process-level. But this requires a degree of organizational maturity so that the process is actually followed. So let's focus on the other options instead.

Code as the source of truth

The agile manifesto prioritizes “working code over comprehensive documentation”, which often means “the code is the documentation”. This sometimes has a very negative connotation, but it's possible to write code in a manner so that it is very human-readable even by non-experts. For example, using various helper functions might allow your business rules to be expressed as follows:

def number_of_sick_days():
    data = read_attendance_file("mycontainer.someprovider.net/attendance.csv")
    data = keep_records(data, where={"reason": "sick day"})
    data = group_by(data, "store_location", for_each_group=sum)
    store_metrics_in_database(data)

Here, the function names serve as documentation of the various steps.

But this might not be ideal, since it is still code and that could be off-putting for non-technical folk.

So another approach would be to have separate representations for the code and the human-readable description, but keep them close together. For example:

  • You could put the human-readable description in a docstring. The docstring can be extracted programmatically and then fed into your wiki. Example of how it might look:

    def number_of_sick_days():
        """
        To compute "number_of_sick_days":
        - Read the attendance file (which is located in mycontainer.someprovider.net/attendance.csv)
        - Keep only lines where "reason" is "sick day"
        - Group by "store_location" and sum the number of days
        - Store these metrics inside the database
        """
    
        ...
    

    If you cannot extract the docstrings using an existing documentation tool like Sphinx-Autodoc, you can either import the module and then access the __doc__ attribute via reflection, or you could write a tool that parses your code with the ast module and then yields the docstrings.

  • You could describe the steps in specially-formatted comments and extract them using a simple script (or using the tokenize module if you want to do it correctly). This is probably going to be the easiest solution, even though it is comparatively fragile. For example, your business logic might look as follows:

    def number_of_sick_days():
        # STEP: Read the attendance file (which is located in mycontainer.someprovider.net/attendance.csv)
        data = csv.DictReader(requests.get("https://mycontainer.someprovider.net/attendance.csv").iter_lines())
    
        # STEP: Keep only lines where "reason" is "sick day"
        data = [record for record in data if record["reason"] == "sick day"]
    
        # STEP: Group by "store_location" and sum the number of days
        aggregated = collections.Counter()
        for record in data:
            aggregated[record["store_location"]] += record["days"]
    
        # STEP: Store these metrics inside the database
        conn.executemany(
            r"""
            INSERT INTO sick_days(store, days)
            VALUES(?, ?)
            ON CONFLICT(store) DO UDPATE SET days = excluded.days
            """,
            aggregated,
        )
    
  • You could implement the code in a Jupyter notebook, which would allow you to intermingle code with plain-text description. This is probably overkill for normal business rules, but in can be a very powerful technique for creating reports or automating data science tasks.

Human-readable description as the source of truth

It is foolish to attempt to parse natural language as if it were a programming language. Many attempts have been made to create programming languages that closely match the English language or another domain-specific syntax, but it has rarely worked. No one would say that Fortran, Cobol, or SQL are immediately comprehensible for non-technical folks.

But a sufficiently precise description of steps could be managed without turning this into a fully-fledged language. For example, the idea of Cucumber/Gherkin is to describe steps of a test with natural language, and then match the natural-language description against patterns in order to find a matching function to execute. For example, a basic system for doing this might be implemented as follows:

import re, inspect
from dataclasses import dataclass
from typing import Callable, Any

known_steps = []

@dataclass
class Step:
    pattern: re.Pattern[str]
    function: Callable[..., Any]

def step(pattern: str) -> Callable[[Callable[..., Any]], Step]:
    def decorator(function: Callable[..., Any]) -> Step:
        p = re.compile("^" + pattern + "$")
        sig = inspect.signature(function)
        if p.groups + 1 != len(sig.parameters)
            raise TypeError(f"step pattern captures {p.groups} groups, which doesn't match the function {function.__name__}: {sig}")
        s = Step(p, function)
        known_steps.append(s)
        return s

    return decorator

def interpret_steps(description: list[str]):
   data = None
   for step_description in description:
       data = resolve_step(step_description, data)
   return data

def resolve_step(description: str, data: Any) -> Any:
    for step in known_steps:
        m = step.pattern.matches(description)
        if m is not None:
            return step.function(data, *m.groups())
    raise TypeError(f"could not resolve step for description: {description}")

The steps might then look as follows:

@step(r"Read the attendance file [(]which is located in (.+)[)]")
def read_attendance_file(_, location: str) -> list[dict]:
    ...

@step(r'Keep only lines where "([^"]+)" is "([^"]+)"')
def keep_records(data: list[dict], field: str, value: str) -> list[dict]:
    ...

@step(r'Group by "([^"]+)" and sum the number of (\w+)')
def group_by(data: list[dict], group_by_field: str, aggregate_field: str) -> dict:
    ...

@step(r"Store these metrics inside the database")
def store_metrics_in_database(data: dict) -> None:
    ...

Compare also the “interpreter pattern”.

Now all you'd need is code to extract the steps from your wiki and to feed them into the interpret_steps() function.

This can be quite powerful because of the flexibility it affords to the human-readable representation: you are in control of the syntax. And this can be quite convenient if your entire business logic boils down to a smallish number of steps that are reused across different processes, where each step is a complicated program of its own.

But if those steps are smallish, then I would be worried that there would be a high risk of bugs at the interfaces between steps: the data formats shared between steps MUST match, but it's not apparent in the code how those steps will be combined. Normal QA and development techniques like linters, type checkers, debuggers, and test coverage tools are more difficult to apply as they only see individual steps, not their combination. Syntax errors in the step descriptions might be difficult to debug.

Personal recommendations

If Confluence weren't a hard requirement, I would write the human-readable description into a module docstring and use Sphinx to render this documentation as a website. As the docstrings are in the same file as the code it describes, it will be easier to keep the two descriptions in sync. Nevertheless, some process would be required. For example, a code review checklist might include an item to consider whether the human-readable description would have to be updated.

Since you want to push the description into an external system, I would strongly consider building custom tooling to extract this information from your source code. Whether using docstrings or magic comments, whether the description is extracted by your software itself or by an external tool, you have a large degree of flexibility in finding a representation that allows you to conveniently keep the human-readable description close to the code it relates to. You could update the external site automatically as part of your build/deployment process, assuming it exposes a suitable API.

1
  • Thank you su much for your detailed answer, I'll look into the technical solutions you provided! Commented Jan 21, 2022 at 14:25

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.