This is a follow-up to My last question. Since then, I think I have improved quite a lot thanks to the large amount of feedback I received on that post. This will be the last time I ask for feedback on this code. I am mainly looking for feedback on best practices and coding style, but honestly any kind of tip, information or feedback is greatly appreciated. This is my first real Python module, so I can imagine there is quite a bit of strange or simply wrong stuff in it.
import math
import os.path
from typing import List, Tuple, Union

import joblib  # sklearn.externals.joblib has been removed from scikit-learn
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def prepare_dataset(path: str) -> List[Union[pd.DataFrame, pd.Series]]:
    df = pd.read_csv(path)
    features = df.drop('days_remaning', axis=1)
    output = df['days_remaning']
    return train_test_split(features, output, test_size=0.3)


class Model:
    def __init__(self, product: str) -> None:
        self.product = product
        self.save_path = f'models/trained_{product}_classifier_model.pkl'
        self.x_train, self.x_test, self.y_train, self.y_test = prepare_dataset(
            f'datasets/{product}_dataset.csv')
        if os.path.isfile(self.save_path):
            print(f'Loading Local Model, product: {product}.')
            self.model = joblib.load(self.save_path)
        else:
            print(f'Creating New Model, product: {product}.')
            self.model = self.create_model()

    def create_model(self) -> ensemble.GradientBoostingRegressor:
        """Create and fit a gradient boosting model using grid-search-optimised parameters."""
        params = dict(
            learning_rate=0.01,
            loss='squared_error',  # named 'ls' in scikit-learn < 1.2
            max_depth=4,
            max_features=0.3,
            min_samples_leaf=9,
            n_estimators=500,
        )
        model = ensemble.GradientBoostingRegressor(**params)
        model.fit(self.x_train, self.y_train)
        return model

    def use_model(self, data: List[List[int]]) -> np.ndarray:
        return self.model.predict(data)

    def print_model_output(self, data: List[List[int]]) -> None:
        for output in self.use_model(data):
            print(f'In about {math.floor(output)} to {math.ceil(output)} '
                  f'days you will need to buy new {self.product}.')

    def save_model(self) -> None:
        joblib.dump(self.model, self.save_path)

    def calculate_error_rates(self) -> Tuple[float, float]:
        train_error = mean_absolute_error(
            self.y_train, self.model.predict(self.x_train))
        test_error = mean_absolute_error(
            self.y_test, self.model.predict(self.x_test))
        return train_error, test_error

    def print_error_rates(self) -> None:
        train_error, test_error = self.calculate_error_rates()
        print(f'Training Set Mean Absolute Error: {train_error:.4f}')
        print(f'Test Set Mean Absolute Error: {test_error:.4f}')

    def print_features_importance(self) -> None:
        features_labels = np.array(['days_after_restock', 'left_content'])
        importance = self.model.feature_importances_
        features_indexes_by_importance = importance.argsort()
        for feature in features_indexes_by_importance:
            print(f'{features_labels[feature]} - {importance[feature] * 100.0:.2f}%')


if __name__ == '__main__':
    EGGS = Model('eggs')
    EGGS.print_error_rates()
    EGGS.print_features_importance()
    EGGS.print_model_output([[2, 7], [7, 2]])
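
For reference, the parameters hard-coded in create_model came out of a GridSearchCV run that is not part of this module. A minimal sketch of that kind of search follows; the grid values and CV settings below are illustrative placeholders, not the exact ones I used:

from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

x_train, x_test, y_train, y_test = prepare_dataset('datasets/eggs_dataset.csv')

# Illustrative grid only; the ranges actually searched may have differed.
param_grid = {
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [2, 4, 6],
    'max_features': [1.0, 0.3, 0.1],
    'min_samples_leaf': [3, 9, 17],
    'n_estimators': [100, 500],
}

search = GridSearchCV(
    ensemble.GradientBoostingRegressor(),
    param_grid,
    scoring='neg_mean_absolute_error',  # matches the MAE metric reported above
    cv=5,
)
search.fit(x_train, y_train)
print(search.best_params_)  # the winning combination is what create_model hard-codes

The best combination found is then hard-coded into create_model so the search does not have to be repeated on every run.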