1

I wrote code for a data analysis project, but it's becoming unwieldy and I'd like to find a better way of structuring it so I can share it with others.

For the sake of brevity, I have something like the following:

def process_raw_text(txt_file):
    # do stuff
    return token_text

def tag_text(token_text):
    # do stuff
    return tagged

def bio_tag(tagged):
    # do stuff
    return bio_tagged

def restructure(bio_tagged):
    # do stuff
    return(restructured)

print(restructured)

Basically I'd like the program to run through all of the functions sequentially and print the output.

In looking into ways to structure this, I read up on classes like the following:

class Calculator():

    def add(x, y):
        return x + y

    def subtract(x, y):
        return x - y

This seems useful when structuring a project to allow individual functions to be called separately, such as the add function with Calculator.add(x,y), but I'm not sure it's what I want.

Is there something I should be looking into for a sequential run of functions (that are meant to structure the data flow and provide readability)? Ideally, I'd like all functions to be within "something" I could call once, that would in turn run everything within it.

4
  • Why not wrap them in a function like main() that just calls them in order? Commented Aug 21, 2015 at 11:26
  • This is on the vague side. There are all sorts of ways to create a module or program that turns an unstructured assortment of functions into an integrated solution but you have not provided any idea for what sort of problem you are trying to solve. Are you trying to create a library similar to math and you are simply trying to create a single namespace? If not -- then what? Commented Aug 21, 2015 at 11:34
  • I suggest using a workflow management framework like luigi. It lets you declare a full dependency graph of your tasks. It might be a bit more than you need, but it will result in more structured, readable and extensible code. Commented Aug 21, 2015 at 11:35
  • @JohnColeman, I'm attempting to create a solution for a named entity recognizer. It would take in raw text and produce the structured named entities. My thought is that structuring the code into processing, tagging, etc. would help readability, but I'm unable to implement it this way. Does that help? Commented Aug 21, 2015 at 11:41

3 Answers

2

Chain together the output from each function as the input to the next:

def main():
    print(restructure(bio_tag(tag_text(process_raw_text(txt_file)))))

if __name__ == '__main__':
    main()

@SvenMarnach makes a nice suggestion. A more general solution is to realise that this idea of repeatedly using the output as the input for the next in a sequence is exactly what the reduce function does. We want to start with some input txt_file:

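from functools import reduce

def main():
    pipeline = [process_raw_text, tag_text, bio_tag, restructure]
    # Pass each step's output to the next step, starting from txt_file.
    print(reduce(lambda data, step: step(data), pipeline, txt_file))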
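
To make the reduce idea concrete, here is a minimal, self-contained sketch. The function bodies are hypothetical stand-ins for the question's "do stuff" placeholders, the input is a plain string rather than a file, and run_pipeline is a helper name introduced only for this illustration:

```python
from functools import reduce


# Hypothetical stand-ins for the question's functions; the real ones
# would read a file, run a tagger, and so on.
def process_raw_text(text):
    return text.split()


def tag_text(tokens):
    # Toy rule: treat capitalised tokens as proper nouns.
    return [(tok, "NNP" if tok.istitle() else "O") for tok in tokens]


def bio_tag(tagged):
    return [(tok, "B-ENT" if tag == "NNP" else "O") for tok, tag in tagged]


def restructure(bio_tagged):
    return [tok for tok, tag in bio_tagged if tag == "B-ENT"]


def run_pipeline(data, steps):
    # Feed the running result into each step in turn.
    return reduce(lambda acc, step: step(acc), steps, data)


pipeline = [process_raw_text, tag_text, bio_tag, restructure]
result = run_pipeline("Alice met Bob in paris", pipeline)
print(result)  # ['Alice', 'Bob']
```

Adding, removing, or reordering a stage then only means editing the pipeline list.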

1 Comment

To have the function names in left-to-right order, you could also do reduce(lambda data, f: f(data), [process_raw_text, tag_text, bio_tag, restructure], txt_file).
1

There's nothing preventing you from creating a class (or set of classes) that represents what you want to manage, with methods that call the functions you need in sequence.

class DataAnalyzer():
    # ...
    def your_method(self, **kwargs):
        # call sequentially, or use the 'magic' proposed by others
        # but internally to your class and not visible to clients
        pass

The functions themselves, which seem to be implementation details, could remain private within the module.
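
As a rough sketch of this approach: the class name DataAnalyzer comes from the snippet above, while the analyze method and the underscore-prefixed helpers are hypothetical names chosen for illustration, not part of any real library.

```python
# Module-private helpers (hypothetical): the leading underscore marks
# them as implementation details by convention.
def _tokenize(text):
    return text.split()


def _lowercase(tokens):
    return [tok.lower() for tok in tokens]


class DataAnalyzer:
    """Runs a fixed sequence of steps; clients only call analyze()."""

    def __init__(self, steps):
        self._steps = list(steps)

    def analyze(self, data):
        # Apply each step to the output of the previous one.
        for step in self._steps:
            data = step(data)
        return data


analyzer = DataAnalyzer([_tokenize, _lowercase])
result = analyzer.analyze("Hello World")
print(result)  # ['hello', 'world']
```

Callers see a single entry point, and the order of the steps is decided in one place.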


1

You can implement a simple dynamic pipeline using just modules and functions.

my_module.py

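# Names carry a zero-padded step number; identifiers can't start with a digit.
def process_01_raw_text(txt_file):
    # do stuff
    return token_text

def process_02_tag_text(token_text):
    # do stuff
    return tagged

my_runner.py

import re

import my_module

if __name__ == '__main__':
    # Collect the numbered step functions; sorting by name gives the run order.
    funcs = sorted(name for name in vars(my_module)
                   if re.match(r'process_\d+_', name))

    data = initial_data  # placeholder for the pipeline's starting input

    for name in funcs:
        data = getattr(my_module, name)(data)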
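
The discovery step can be tried without creating separate files. The sketch below is an illustration under assumptions: it builds an in-memory module with types.ModuleType as a stand-in for my_module.py, and the step functions and the discover_steps helper are hypothetical names.

```python
import re
import types


def process_01_split(text):
    return text.split()


def process_02_upper(tokens):
    return [tok.upper() for tok in tokens]


def helper_not_a_step(value):
    # Does not match the naming pattern, so it is never run as a step.
    return value


# In-memory stand-in for my_module.py so the sketch is self-contained.
my_module = types.ModuleType("my_module")
for func in (process_02_upper, helper_not_a_step, process_01_split):
    setattr(my_module, func.__name__, func)


def discover_steps(module, pattern=r"process_\d+_"):
    # Sorting by name gives the run order, so step numbers must be
    # zero-padded (01, 02, ..., 10) to sort correctly.
    names = sorted(n for n in vars(module) if re.match(pattern, n))
    return [getattr(module, n) for n in names]


data = "a b c"
for step in discover_steps(my_module):
    data = step(data)
print(data)  # ['A', 'B', 'C']
```

The order here depends entirely on the naming convention, which is convenient but easy to break by accident; an explicit list of functions is the more transparent alternative.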

1 Comment

Note that a Python function name can't start with a digit, so use names like process_01_foo() and so on.
