
I have a series of steps (functions) that I need to run on a raw dataset to prepare it for modeling. I want to chain all the cleaning steps one after the other, with each step as a function. It is similar to sklearn's Pipeline, except that I don't have any fit or transform functions.

from sklearn.pipeline import Pipeline

xx = [2, 3, 4]

# illustrative only: double and triple stand in for my cleaning steps,
# not real transformers with fit/transform
pipeline = Pipeline([
    ('double', double(xx)),
    ('triple', triple(xx))
])

predicted = pipeline.fit(xx).predict(xx)
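
For comparison, sklearn can wrap plain callables with FunctionTransformer, so a Pipeline does not strictly need custom fit/transform methods. A minimal sketch of what I mean (the doubling and tripling steps are just placeholders):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# each plain function is wrapped so it gains fit/transform for free
pipeline = Pipeline([
    ('double', FunctionTransformer(lambda x: x * 2)),
    ('triple', FunctionTransformer(lambda x: x * 3)),
])

xx = np.array([2, 3, 4])
print(pipeline.fit_transform(xx))  # [12 18 24]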

I tried using reduce from functools with lambda functions:

from functools import reduce
xx = 4
pipeline = [lambda x: x * 3, lambda x: x + 1, lambda x: x / 2]
val = reduce(lambda x, f: f(x), pipeline, xx)
print(val)  # ((4 * 3) + 1) / 2 = 6.5

Is there a better way of accomplishing this, keeping the code modular and making it easy to run for multiple datasets? As of now I work in a Jupyter notebook, and I can always add new functions or modify existing ones without impacting the others. Please suggest.

2 Comments

  • Nothing is really bad in your approach, except that I would operate on named functions and not keep them in a mutable list. Commented Nov 12, 2019 at 10:37
  • I agree, and I do use named functions; the example above is just to illustrate. I would like to pass functions into a pipeline without providing the input parameters to each function while creating the pipeline. What is the best way to achieve that? Commented Nov 13, 2019 at 4:23

1 Answer


You could use plain functions to achieve this; it is less fancy, but powerful nevertheless.

Let's say you have a couple of preprocessing steps: preprocessing_step1, preprocessing_step2, and so on. You can define a function called pipeline that feeds the return value of each step into the next function. A code snippet follows:

def preprocessing_step1(rawdata):
    # do something here
    processed_data = rawdata  # placeholder so the skeleton runs
    return processed_data

def preprocessing_step2(rawdata):
    # do something here
    processed_data = rawdata  # placeholder so the skeleton runs
    return processed_data

def preprocessing_step3(rawdata):
    # do something here
    processed_data = rawdata  # placeholder so the skeleton runs
    return processed_data

def pipeline(rawdata):
    # run the steps sequentially, feeding each output into the next step
    data = preprocessing_step1(rawdata)
    data = preprocessing_step2(data)
    processed_data = preprocessing_step3(data)

    return processed_data

If you find this helpful, I could show you how to iterate over all of your datasets using a generator function in Python.
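
A minimal sketch of that idea, assuming the datasets are CSV files read with pandas (the glob pattern and the file format are assumptions, not from the original question):

import glob
import pandas as pd

def iter_datasets(pattern):
    # lazily yield one raw dataset at a time instead of loading them all
    for path in glob.glob(pattern):
        yield path, pd.read_csv(path)

# run the same pipeline over every dataset
for path, raw in iter_datasets('data/*.csv'):  # placeholder pattern
    processed = pipeline(raw)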


1 Comment

Sounds good. I'm still looking for another way of creating the pipeline if possible; with the suggested pipeline function we need to feed the output of one step as the input to the next by hand. A simple and elegant solution nevertheless, thanks!
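
One way to avoid threading each output by hand is to build the pipeline from a list of named functions with functools.reduce, combining the reduce idea from the question with the named steps from the answer. A sketch (make_pipeline here is a hypothetical helper, not a library function):

from functools import reduce

def make_pipeline(*steps):
    # return a single callable that threads data through each step in order
    def run(data):
        return reduce(lambda d, step: step(d), steps, data)
    return run

# reuses the named steps defined in the answer above
pipeline = make_pipeline(preprocessing_step1,
                         preprocessing_step2,
                         preprocessing_step3)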
