I'm looking for the name of a procedure in which the output of one function is handled by several others (I'm trying to find better words for my problem). Some pseudocode or actual code would be really helpful.

I have written the following code:

def read_data():
    read data from a file
    create df
    return df

def parse_data():
    sorted_df = read_data()
    count lines
    sort by date
    return sorted_df

def add_new_column(): 
    new_column_df = parse_data()
    add new column
    return new_column_df

def create_plot():
    plot_data = add_new_column()
    create a plot
    display chart

What I'm trying to understand is how to skip a function, e.g. to create the following chain: read_data() -> parse_data() -> create_plot().

As the code looks right now (because of how the return values are passed between functions), skipping a step requires me to change the input data in the last function, create_plot().

I suspect that I'm creating logically incorrect code.

Any thoughts?

Original code:

import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file into a data frame
def read_data():
    raw_data = pd.read_csv('C:/testdata.csv', sep=',', engine='python', encoding='utf-8-sig').replace({'{':'', '}':'', '"':'', ',':' '}, regex=True)
    return raw_data

def parse_data(raw_data):
    ...
    # Convert the CreationDate column to datetime; unparseable values become NaT
    raw_data['CreationDate'] = pd.to_datetime(raw_data['CreationDate'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    # Sort the rows chronologically, in place
    raw_data.sort_values(by=['CreationDate'], inplace=True, ascending=True)
    return raw_data

raw_data = read_data()
parsed = parse_data(raw_data)
  • Use function parameters, e.g. def parse_data(data). Instead of having parse_data call read_data, pass that data from one to the other: parse_data(read_data()). This way each function is independent and you can chain them flexibly. Commented Oct 20, 2019 at 14:05
  • docs.python.org/3/tutorial/controlflow.html#defining-functions Commented Oct 20, 2019 at 14:07

2 Answers


Pass the data in instead of just effectively "nesting" everything. Any data that a function requires should ideally be passed in to the function as a parameter:

def read_data():
    read data from a file
    create df
    return df

def parse_data(sorted_df):
    count lines
    sort by date
    return sorted_df

def add_new_column(new_column_df):
    add new column
    return new_column_df

def create_plot(plot_data):  
    create a plot
    display chart

df = read_data()
parsed = parse_data(df)
added = add_new_column(parsed)
create_plot(added)

Try to make sure functions are only handling what they're directly responsible for. It isn't parse_data's job to know where the data is coming from or to produce the data, so it shouldn't be worrying about that. Let the caller handle that.
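
As a minimal runnable sketch of the same idea, the version below uses pandas with an inline CSV in place of the file. The CSV contents, the Value and Doubled columns, and the summarize() function (a stand-in for create_plot()) are assumptions made purely for illustration. Because every step takes its input as a parameter, skipping a step just means not calling it:

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the file from the question.
CSV_TEXT = """CreationDate,Value
2019-10-20 14:05:00,3
2019-10-18 09:30:00,1
2019-10-19 12:00:00,2
"""

def read_data(source):
    # Produce the data frame; no other function knows where the data comes from.
    return pd.read_csv(source)

def parse_data(df):
    # Work on a copy so the caller's frame is left untouched.
    parsed = df.copy()
    parsed["CreationDate"] = pd.to_datetime(
        parsed["CreationDate"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )
    return parsed.sort_values(by="CreationDate").reset_index(drop=True)

def add_new_column(df):
    # Hypothetical derived column, used only for illustration.
    return df.assign(Doubled=df["Value"] * 2)

def summarize(df):
    # Stand-in for create_plot(): any final step that accepts a data frame.
    return list(df.columns), len(df)

df = read_data(io.StringIO(CSV_TEXT))
full_chain = summarize(add_new_column(parse_data(df)))
skipped_chain = summarize(parse_data(df))  # add_new_column() is simply left out
```

The last function never has to change: it accepts whichever data frame the caller decides to hand it.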

The way I have things set up here is often referred to as "piping" or "threading". Information "flows" from one function into the next. In a language like Clojure, this could be written as:

(-> (read-data)
    (parse-data)
    (add-new-column)
    (create-plot))

This uses the threading macro ->, which frees you from having to pass the data along manually. Unfortunately, Python doesn't have anything built in to do this, although it can be achieved using external modules.
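
If you don't want an external module, a small helper gets close to the threading macro. The sketch below is one possible way to write it with functools.reduce; the name pipe and the toy step functions, which work on a plain list rather than a data frame, are hypothetical and only there to show the flow:

```python
from functools import reduce

def pipe(value, *functions):
    # Feed value through each function in order: pipe(x, f, g) == g(f(x)).
    return reduce(lambda acc, func: func(acc), functions, value)

# Toy stand-ins for the pipeline steps, operating on a list of numbers.
def parse_data(rows):
    return sorted(rows)

def add_new_column(rows):
    return [(row, row * 2) for row in rows]

full = pipe([3, 1, 2], parse_data, add_new_column)
skipped = pipe([3, 1, 2], parse_data)  # leave a step out of the argument list
```

For data frames specifically, pandas also offers DataFrame.pipe, which applies a function to the frame and returns the result, so calls can be chained as df.pipe(parse_data).pipe(add_new_column).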


Also note that since dataframes seem to be mutable, you don't actually need to return the altered ones from the functions. If you're just mutating the argument directly, you could pass the same data frame to each of the functions in order instead of placing it in intermediate variables like parsed and added. The way I'm showing here is a general way to set things up, but it can be altered depending on your exact use case.
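
A rough sketch of that in-place variant, assuming pandas and a made-up RowNumber column, could look like this. Note that it only works when each step really mutates its argument (inplace=True, or assigning a column); a call such as df.sort_values(...) without inplace=True returns a new frame and would leave the caller's frame unchanged:

```python
import pandas as pd

def parse_data(df):
    # Mutates the caller's frame; nothing is returned.
    df.sort_values(by="CreationDate", inplace=True)

def add_new_column(df):
    # Hypothetical column name, for illustration only.
    df["RowNumber"] = list(range(1, len(df) + 1))

df = pd.DataFrame(
    {"CreationDate": pd.to_datetime(["2019-10-20", "2019-10-18", "2019-10-19"])}
)
parse_data(df)
add_new_column(df)  # the same object is passed to every step
```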

8 Comments

I'm trying to redo the code, but I'm running into an issue: local variable sorted_df referenced before assignment. I've marked your answer as the solution; I think I'm just too tired right now.
@ToreDjerberg I would need to see the code causing that error to be able to help with that.
I think the issue is in the parse_data function. When parsing the data I use the following code snippet (pseudo): def parse_data(): df['Column1'] = df['Column1'].convert_to_date. When executing, df is reported as referenced before assignment.
@ToreDjerberg Note that I showed that that data needs to be passed in using a parameter. Review my parse_data function again and note how I'm calling it.
@ToreDjerberg Just a typo. You create a variable called data (data = read_files()), but then you try to refer to it as raw_data instead (parse_data(raw_data)). Just use data: parse_data(data).

Use a class to contain your code:

class DataManipulation:
    def __init__(self, path):
        self.df = pd.DataFrame()
        self.read_data(path)

    @staticmethod
    def new(file_path):
        return DataManipulation(file_path)

    def read_data(self, path):
        read data from a file
        self.df = create df

    def parse_data(self):
        use self.df
        count lines
        sort by date
        return self

    def add_new_column(self):
        use self.df
        add new column
        return self

    def create_plot(self):
        use self.df
        create a plot
        display chart
        return self

And then,

d = DataManipulation.new(filepath).parse_data().add_new_column().create_plot()

2 Comments

Why .new? Just DataManipulation(filepath) will do the same thing. It's not really a great idea to use a class whose methods must be called in a specific order. Without calling parse_data, the rest of the methods won't do anything. Explicitly returning self all the time is… questionable. If you want a fluent interface, it's okay, but this doesn't seem like a good use case for a fluent interface.
There are different design patterns in Python. One of them is the builder pattern, which is quite popular, and this answer is inspired by it. In the actual pattern, calling new() returns a builder object that hides the attributes and adds functionality. The class itself doesn't specify any order; it has functions that return self to implement chaining. If you need an example, pandas itself uses it. @deceze @Tore Djerberg
