165

I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas DataFrame. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want.

from sklearn.preprocessing import StandardScaler

features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)

A "solution" I found online is:

features = features.apply(lambda x: autoscaler.fit_transform(x))

It appears to work, but leads to a DeprecationWarning:

/usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

I therefore tried:

features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))

But this gives:

Traceback (most recent call last):
  File "./analyse.py", line 91, in <module>
    features = features.apply(lambda x: autoscaler.fit_transform(x.reshape(-1, 1)))
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 3972, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 4081, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 226, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 363, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5163, in _arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "/usr/lib/python3.5/site-packages/pandas/core/frame.py", line 5477, in _homogenize
    raise_cast_failure=False)
  File "/usr/lib/python3.5/site-packages/pandas/core/series.py", line 2885, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.


12 Answers

139

You could convert the DataFrame to a numpy array using as_matrix(). Example on a random dataset:

Edit: Changed as_matrix() to values (it doesn't change the result), per the last sentence of the as_matrix() docs:

Generally, it is recommended to use ‘.values’.

import pandas as pd
import numpy as np #for the random integer example
df = pd.DataFrame(np.random.randint(0.0, 100.0, size=(10, 4)),
                  index=range(10, 20),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  dtype='float64')

Note, indices are 10-19:

In [14]: df.head(3)
Out[14]:
    col1  col2  col3  col4
10     3    38    86    65
11    98     3    66    68
12    88    46    35    68

Now fit_transform the DataFrame to get the scaled_features array:

from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(df.values)

In [15]: scaled_features[:3,:] #lost the indices
Out[15]:
array([[-1.89007341,  0.05636005,  1.74514417,  0.46669562],
       [ 1.26558518, -1.35264122,  0.82178747,  0.59282958],
       [ 0.93341059,  0.37841748, -0.60941542,  0.59282958]])

Assign the scaled data to a DataFrame (note: use the index and columns keyword arguments to keep your original indices and column names):

scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)

In [17]: scaled_features_df.head(3)
Out[17]:
        col1      col2      col3      col4
10 -1.890073  0.056360  1.745144  0.466696
11  1.265585 -1.352641  0.821787  0.592830
12  0.933411  0.378417 -0.609415  0.592830

Edit 2:

Came across the sklearn-pandas package. It's focused on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of the DataFrame, a common scenario. It's documented, but this is how you'd achieve the transformation we just performed.

from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy(), 4)
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)

5 Comments

Thank you for the answer, but the problem still is that the rows are renumbered when the new dataframe is created from the array. The original dataframe does not contain consecutively numbered rows because some of them have been removed. I suppose I could also add an index=[...] keyword with the old index values. If you update your answer accordingly I can accept it.
I hope the edit helps; I think your intuition about setting the index values from the first df was correct. The numbers I used are consecutive (just wanted to show you can reset them to anything, and range(10,20) was the best I could think of), but it will work with any index on the original df. HTH!
I see that you have the last step as converting the output of the DataFrameMapper to a DataFrame .. so the output is not already a DataFrame ?
@StephenBoesch: Yes, the output is not a DataFrame. If you want to get one directly from the mapper, you have to use the df_out=True option of DataFrameMapper.
@Kevin You'd probably want to use df.to_numpy() these days instead of df.values, as recommended in the docs.
37
import pandas as pd    
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('your file here')
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(df), columns=df.columns)

df_scaled will be the 'same' dataframe, only now with the scaled values.
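If the original dataframe doesn't have a default 0..n-1 index, also pass it along; a minimal tweak of the snippet above (assuming the df and ss defined there):

# carry over the original index as well, not just the column names
df_scaled = pd.DataFrame(ss.fit_transform(df), columns=df.columns, index=df.index)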

4 Comments

But this does not maintain data types
Won't all data types become floats anyway since that is the only output of the scaler? What other outputs do you expect from it? @leokury
In current versions, you must add the parameter index=df.index in order to keep the index from the original data frame.
This is the better answer.
18

Since scikit-learn version 1.2, transformers can return a DataFrame that keeps the column names.

This can be configured per estimator by calling the set_output method, or globally by setting set_config(transform_output="pandas").

Configuring a single estimator

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().set_output(transform="pandas")

Setting a global configuration

from sklearn import set_config
set_config(transform_output="pandas")

(See Release Highlights for scikit-learn 1.2, specifically the section on "Pandas output with set_output API.")
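A minimal sketch of what this gives you (the small df here is made up purely for illustration):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]}, index=[10, 11, 12])
scaler = StandardScaler().set_output(transform="pandas")

scaled = scaler.fit_transform(df)
print(type(scaled))                                   # <class 'pandas.core.frame.DataFrame'>
print(scaled.index.tolist(), list(scaled.columns))    # [10, 11, 12] ['col1', 'col2']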

Comments

16

Reassigning back to df.values preserves both index and columns.

df.values[:] = StandardScaler().fit_transform(df)

2 Comments

Did not work for me in the latest version of pandas.
I just tried it with pandas 1.4.2, (released 2 April 2022) and it works there.
12
features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
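If df is itself a slice of another dataframe, this assignment can trigger the SettingWithCopy warning (or error) mentioned in the comments below. One common workaround is to work on an explicit copy and assign through .loc; a sketch, where original_df and the row filter are purely illustrative names:

from sklearn.preprocessing import StandardScaler

# illustrative: df was carved out of a larger frame, so make it an explicit copy first
df = original_df[original_df["col1"].notna()].copy()

features = ["col1", "col2", "col3", "col4"]
autoscaler = StandardScaler()
df.loc[:, features] = autoscaler.fit_transform(df[features])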

4 Comments

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.
This now throws a: "SettingWithCopyError: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
@Vega how do you deal with this?
This is the reason I came here, but I have not found an answer yet. I asked this new question about it stackoverflow.com/questions/72232036/…
8

This worked with MinMaxScaler for getting the scaled array values back into the original dataframe. It should work with StandardScaler as well.

data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)

where data_scaled is the new dataframe, scaled_features is the array after normalization, and df is the original dataframe whose index and columns we want back.
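Put together as a minimal sketch (assuming df is the original dataframe from the question):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler   # or StandardScaler

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df)       # plain numpy array, names and index lost
data_scaled = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)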

1 Comment

Underrated answer :D
5

Works for me:

from sklearn.preprocessing import StandardScaler

cols = list(train_df_x_num.columns)
scaler = StandardScaler()
train_df_x_num[cols] = scaler.fit_transform(train_df_x_num[cols])

1 Comment

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
2

This is what I did:

X.Column1 = StandardScaler().fit_transform(X.Column1.values.reshape(-1, 1))

1 Comment

Please consider adding explanation to the code for easier understanding.
0

You can mix multiple data types in scikit-learn using Neuraxle:

Option 1: discard the row names and column names

from neuraxle.pipeline import Pipeline
from neuraxle.base import NonFittableMixin, BaseStep

class PandasToNumpy(NonFittableMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs): 
        return data_inputs.values

pipeline = Pipeline([
    PandasToNumpy(),
    StandardScaler(),
])

Then, you proceed as you intended:

features = df[["col1", "col2", "col3", "col4"]]  # ... your df data
pipeline, scaled_features = pipeline.fit_transform(features)

Option 2: to keep the original column names and row names

You could even do this with a wrapper as such:

from neuraxle.pipeline import Pipeline
from neuraxle.base import MetaStepMixin, BaseStep

class PandasValuesChangerOf(MetaStepMixin, BaseStep):
    def transform(self, data_inputs, expected_outputs): 
        new_data_inputs = self.wrapped.transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return new_data_inputs

    def fit_transform(self, data_inputs, expected_outputs): 
        self.wrapped, new_data_inputs = self.wrapped.fit_transform(data_inputs.values)
        new_data_inputs = self._merge(data_inputs, new_data_inputs)
        return self, new_data_inputs

    def _merge(self, data_inputs, new_data_inputs): 
        new_data_inputs = pd.DataFrame(
            new_data_inputs,
            index=data_inputs.index,
            columns=data_inputs.columns
        )
        return new_data_inputs

df_scaler = PandasValuesChangerOf(StandardScaler())

Then, you proceed as you intended:

features = df[["col1", "col2", "col3", "col4"]]  # ... your df data
df_scaler, scaled_features = df_scaler.fit_transform(features)

Comments

0

Check out the official set_output API. It lets you configure transformers to output pandas DataFrames. Quoting their example here:

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head() # gives pd.DataFrame with correct columns!

Old answer below

The path of least resistance, and the most scalable one, is writing your own custom transformer. Here's an example:

# custom transformer

from sklearn.base import BaseEstimator, TransformerMixin

class myWrapper(TransformerMixin, BaseEstimator):
    def __init__(self, *, scikitScaler):
        self.scikitScaler = scikitScaler
        # class attribute and __init__ argument must have the same name,
        # otherwise BaseEstimator throws an error

    def fit(self, df, y=None):
        self.scikitScaler.fit(df)
        return self # scikit API

    def transform(self, df):
        df.loc[:,:] = self.scikitScaler.transform(df)
        return df # scikit API


# example usage

my_wrapper = myWrapper(StandardScaler())
features = ["col1", "col2", "col3", "col4"]
my_wrapper.fit_transform(df[features])

The good thing is, an instance of any scaler, or any transformer for that matter, can become the argument for myWrapper() instantiation. You could also add a self.to_change attribute in fit to conditionally remember the columns you'd like to change, and use it like df.loc[:, self.to_change] in transform.
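A rough sketch of that idea; the class name and the rule used to pick the columns (all numeric dtypes) are illustrative, not part of scikit-learn, and the imports are the same as in the block above:

class myColumnWrapper(TransformerMixin, BaseEstimator):
    def __init__(self, *, scikitScaler):
        self.scikitScaler = scikitScaler

    def fit(self, df, y=None):
        # remember which columns to scale; here simply all numeric ones
        self.to_change = df.select_dtypes(include="number").columns
        self.scikitScaler.fit(df.loc[:, self.to_change])
        return self

    def transform(self, df):
        df.loc[:, self.to_change] = self.scikitScaler.transform(df.loc[:, self.to_change])
        return df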

However, scikit-learn works on np.ndarrays, and pd.DataFrames are just good at pretending to be ndarrays the first time they are fed to scikit-learn transformers. For quick hands-on preprocessing, using this wrapper is fine. If you wanted to build a pipeline, though, you would need to wrap every scikit-learn transformer to preserve the dataframe.

Comments

-1

You can try this code; it will give you a DataFrame with indexes:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston # boston housing dataset

dt = load_boston().data
col = load_boston().feature_names

# Make a dataframe
df = pd.DataFrame(data=dt, columns=col)

# define a method to scale data, looping thru the columns, and passing a scaler
def scale_data(data, columns, scaler):
    for col in columns:
        data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
    return data

# specify a scaler, and call the method on boston data
scaler = StandardScaler()
df_scaled = scale_data(df, col, scaler)

# view first 10 rows of the scaled dataframe
df_scaled[0:10]

1 Comment

Thanks for your answer, but the solutions given in the accepted answer are much better. Also, it can be done with dask-ml: from dask_ml.preprocessing import StandardScaler; StandardScaler().fit_transform(df)
-1

You could directly assign a numpy array to a data frame by using slicing.

from sklearn.preprocessing import StandardScaler
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features[:] = autoscaler.fit_transform(features.values)

Comments
