I am trying to create a new DataFrame (df_new) by applying a specific function (stats.boxcox) to each column of an original DataFrame (df). Note that stats.boxcox returns two values, which are captured by df_new[column] and lam. Although my code does the job, it is very slow. Is there a more efficient way to do this in Python?

import numpy as np
import pandas as pd
from scipy import stats

df_new = pd.DataFrame()
for column in list(df):
    df_new[column], lam = stats.boxcox(df[column])

2 Answers

1

Could you provide a sample Dataframe? I can think of...

df.apply(lambda x: stats.boxcox(x))

But it is hard to find a suitable approach without a sample dataset. (I cannot comment because I don't have enough reputation.) Thank you!
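For reference, here is a minimal sketch of one way to keep both outputs of stats.boxcox for every column. It assumes strictly positive, non-constant columns and uses made-up sample data. The names transformed and lambdas are illustrative only. It does not make the lambda estimation itself any faster; it only builds the result in one step and keeps every lambda instead of overwriting lam on each iteration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up sample data: strictly positive integers, four columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 100, size=(100, 4)), columns=list("ABCD"))

transformed = {}  # column name -> Box-Cox transformed values
lambdas = {}      # column name -> fitted lambda

for column in df.columns:
    # With lmbda=None, stats.boxcox returns (transformed array, fitted lambda).
    values = df[column].to_numpy(dtype=float)
    transformed[column], lambdas[column] = stats.boxcox(values)

# Build the new DataFrame once instead of inserting columns one at a time.
df_new = pd.DataFrame(transformed, index=df.index)
lambdas = pd.Series(lambdas, name="lambda")

print(df_new.head())
print(lambdas)
```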

4 Comments

You can try with the following DataFrame: df = pd.DataFrame(np.random.randint(1,100,size=(100, 4)), columns=list('ABCD'))
What are the dimensions of your dataset? That dimension is irrelevant to the function's speed. The problem is mainly the boxcox function itself; I don't really think it can be vectorised to improve its speed.
Rows: 300,000; columns: 200
After researching the topic, it seems that some papers have been written about the Box-Cox transformation on big data. I cannot access them to provide further information, but if this matters to your company, I would be happy to research the topic further. For now, I will time a few options to see which one is the most efficient. I am sorry!
1

I have profiled the function. I have one solution, but it may only work for data like the sample DataFrame you gave.

Timer unit: 1e-07 s

Total time: 0.629038 s
File: C:\Users\@user@\AppData\Roaming\Python\Python37\site-packages\scipy\stats\morestats.py
Function: boxcox at line 948

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   948                                           def boxcox(x, lmbda=None, alpha=None):
   949                                               r"""
   950                                               Return a dataset transformed by a Box-Cox power transformation.
   951                                           
   952                                               Parameters
   953                                               ----------
   954                                               x : ndarray
   955                                                   Input array.  Must be positive 1-dimensional.  Must not be constant.
   956                                               lmbda : {None, scalar}, optional
   957                                                   If `lmbda` is not None, do the transformation for that value.
   958                                           
   959                                                   If `lmbda` is None, find the lambda that maximizes the log-likelihood
   960                                                   function and return it as the second output argument.
   961                                               alpha : {None, float}, optional
   962                                                   If ``alpha`` is not None, return the ``100 * (1-alpha)%`` confidence
   963                                                   interval for `lmbda` as the third output argument.
   964                                                   Must be between 0.0 and 1.0.
   965                                           
   966                                               Returns
   967                                               -------
   968                                               boxcox : ndarray
   969                                                   Box-Cox power transformed array.
   970                                               maxlog : float, optional
   971                                                   If the `lmbda` parameter is None, the second returned argument is
   972                                                   the lambda that maximizes the log-likelihood function.
   973                                               (min_ci, max_ci) : tuple of float, optional
   974                                                   If `lmbda` parameter is None and ``alpha`` is not None, this returned
   975                                                   tuple of floats represents the minimum and maximum confidence limits
   976                                                   given ``alpha``.
   977                                           
   978                                               See Also
   979                                               --------
   980                                               probplot, boxcox_normplot, boxcox_normmax, boxcox_llf
   981                                           
   982                                               Notes
   983                                               -----
   984                                               The Box-Cox transform is given by::
   985                                           
   986                                                   y = (x**lmbda - 1) / lmbda,  for lmbda > 0
   987                                                       log(x),                  for lmbda = 0
   988                                           
   989                                               `boxcox` requires the input data to be positive.  Sometimes a Box-Cox
   990                                               transformation provides a shift parameter to achieve this; `boxcox` does
   991                                               not.  Such a shift parameter is equivalent to adding a positive constant to
   992                                               `x` before calling `boxcox`.
   993                                           
   994                                               The confidence limits returned when ``alpha`` is provided give the interval
   995                                               where:
   996                                           
   997                                               .. math::
   998                                           
   999                                                   llf(\hat{\lambda}) - llf(\lambda) < \frac{1}{2}\chi^2(1 - \alpha, 1),
  1000                                           
  1001                                               with ``llf`` the log-likelihood function and :math:`\chi^2` the chi-squared
  1002                                               function.
  1003                                           
  1004                                               References
  1005                                               ----------
  1006                                               G.E.P. Box and D.R. Cox, "An Analysis of Transformations", Journal of the
  1007                                               Royal Statistical Society B, 26, 211-252 (1964).
  1008                                           
  1009                                               Examples
  1010                                               --------
  1011                                               >>> from scipy import stats
  1012                                               >>> import matplotlib.pyplot as plt
  1013                                           
  1014                                               We generate some random variates from a non-normal distribution and make a
  1015                                               probability plot for it, to show it is non-normal in the tails:
  1016                                           
  1017                                               >>> fig = plt.figure()
  1018                                               >>> ax1 = fig.add_subplot(211)
  1019                                               >>> x = stats.loggamma.rvs(5, size=500) + 5
  1020                                               >>> prob = stats.probplot(x, dist=stats.norm, plot=ax1)
  1021                                               >>> ax1.set_xlabel('')
  1022                                               >>> ax1.set_title('Probplot against normal distribution')
  1023                                           
  1024                                               We now use `boxcox` to transform the data so it's closest to normal:
  1025                                           
  1026                                               >>> ax2 = fig.add_subplot(212)
  1027                                               >>> xt, _ = stats.boxcox(x)
  1028                                               >>> prob = stats.probplot(xt, dist=stats.norm, plot=ax2)
  1029                                               >>> ax2.set_title('Probplot after Box-Cox transformation')
  1030                                           
  1031                                               >>> plt.show()
  1032                                           
  1033                                               """
  1034         2        153.0     76.5      0.0      x = np.asarray(x)
  1035         2         55.0     27.5      0.0      if x.ndim != 1:
  1036                                                   raise ValueError("Data must be 1-dimensional.")
  1037                                           
  1038         2         41.0     20.5      0.0      if x.size == 0:
  1039                                                   return x
  1040                                           
  1041         2     168219.0  84109.5      2.7      if np.all(x == x[0]):
  1042                                                   raise ValueError("Data must not be constant.")
  1043                                           
  1044         2      67990.0  33995.0      1.1      if any(x <= 0):
  1045                                                   raise ValueError("Data must be positive.")
  1046                                           
  1047         2         47.0     23.5      0.0      if lmbda is not None:  # single transformation
  1048         1     161912.0 161912.0      2.6          return special.boxcox(x, lmbda)
  1049                                           
  1050                                               # If lmbda=None, find the lmbda that maximizes the log-likelihood function.
  1051         1    5891911.0 5891911.0     93.7      lmax = boxcox_normmax(x, method='mle')
  1052         1         29.0     29.0      0.0      y = boxcox(x, lmax)
  1053                                           
  1054         1         15.0     15.0      0.0      if alpha is None:
  1055         1          8.0      8.0      0.0          return y, lmax
  1056                                               else:
  1057                                                   # Find confidence interval
  1058                                                   interval = _boxcox_conf_interval(x, lmax, alpha)
  1059                                                   return y, lmax, interval

As you can see, most of the time (93.7%) is spent in boxcox_normmax, which estimates lambda. Making that estimation more efficient is surely a research topic in itself, but here is a quick fix if your data resembles my sample DataFrame.
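A rough way to reproduce that split without a line profiler is to time the two steps separately. This is only a sketch on one synthetic column (the size and the random data are assumptions, not the asker's data). It calls stats.boxcox_normmax and special.boxcox directly, the same two functions that appear in the profile above.

```python
import time

import numpy as np
from scipy import special, stats

# One synthetic, strictly positive column (assumed size, not the real data).
rng = np.random.default_rng(0)
x = rng.integers(1, 100, size=100_000).astype(float)

t0 = time.perf_counter()
lam = stats.boxcox_normmax(x, method="mle")  # lambda estimation (optimisation)
t1 = time.perf_counter()
y = special.boxcox(x, lam)                   # the transformation itself
t2 = time.perf_counter()

print(f"estimate lambda: {t1 - t0:.4f} s")
print(f"apply transform: {t2 - t1:.4f} s")
```

Timings will vary by machine and SciPy version, so treat the printed numbers as indicative only.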

In:

%%timeit
for column in df.columns:
    stats.boxcox(df[column])

Out:

1min 59s ± 2.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you print the results, all the lambdas in my example turn out to be close to the same value, 0.71...:

(array([24.3917785 , 26.59336098, 28.42778141, ...,  0.89856936,
       20.09087919, 25.34396741]), 0.7191265555780741)
(array([32.21200102, 21.42019555, 17.61088955, ...,  2.37689377,
       11.43847546,  4.80996571]), 0.7186638451956158)
(array([32.04912451,  5.34619315,  2.37388797, ..., 28.25042847,
       21.65944327, 35.68435748]), 0.717094388354133)
(array([24.12034838, 20.22153029, 17.46007125, ...,  9.20987077,
       27.79432177, 28.38850624]), 0.7152672101519897)
(array([24.43175536, 29.97646547, 15.44100467, ..., 29.67889106,
       33.75136616, 18.01618903]), 0.719690932457849)
(array([31.3006977 , 14.24153427,  8.80686258, ..., 27.74442602,
       29.54262716,  5.35448321]), 0.7182204065752503)
(array([ 0.89885059, 33.21042971,  7.41516615, ..., 26.66002733,
       32.05761174,  2.37938055]), 0.719960442157806)
(array([18.75921571, 20.15657425, 32.38744267, ..., 32.09731377,
       34.95687043, 33.82390653]), 0.7203446358711867)
(array([30.61614136, 16.82387108, 23.61599906, ..., 26.74368558,
       26.43727409, 26.43727409]), 0.7171650690241015)
(array([32.03243895, 18.46213843, 15.23999702, ..., 33.70140582,
       34.52403407,  7.82011755]), 0.7141993257439302)
(array([16.39388107, 20.23652878, 25.38777257, ..., 27.81775952,
        3.63937585, 12.98507701]), 0.7155422854115251)
(array([23.84209605, 24.47289056, 32.48229038, ..., 29.90484308,
       11.37093225,  6.8765052 ]), 0.7158092390730108)

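You can take the mean of the lambdas from the first N columns and use it for the rest. After running the whole dataset, no lambda fell outside the mean ± one standard deviation. Pass that mean as the second argument (lmbda) of stats.boxcox; with lmbda given, the function returns only the transformed array.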
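The idea can be sketched as follows. The data are made up, and n_fit (the N above) is a hypothetical choice. Note one deviation from the loop shown in this answer: because scipy.special.boxcox is a ufunc, the sketch applies the fixed lambda to the whole 2-D array in one call instead of column by column. That skips the input checks in stats.boxcox, so it assumes every value is strictly positive.

```python
import numpy as np
import pandas as pd
from scipy import special, stats

# Made-up data: strictly positive integers, 20 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 100, size=(1000, 20)),
                  columns=[f"c{i}" for i in range(20)])

n_fit = 5  # hypothetical N: number of columns used to estimate lambda

# Estimate lambda only on the first n_fit columns.
fitted = np.array([
    stats.boxcox_normmax(df[col].to_numpy(dtype=float), method="mle")
    for col in df.columns[:n_fit]
])
lam_mean, lam_std = fitted.mean(), fitted.std()
print(f"lambda: {lam_mean:.4f} +/- {lam_std:.4f}")

# Apply the shared lambda to every column at once (no per-column fitting).
df_new = pd.DataFrame(
    special.boxcox(df.to_numpy(dtype=float), lam_mean),
    index=df.index,
    columns=df.columns,
)
```

If the fitted lambdas spread widely (a large lam_std), a single shared lambda is probably a poor fit and per-column estimation is the safer choice.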

In:

%%timeit
for column in df.columns:
    stats.boxcox(df[column], 0.715)

Out:

3.67 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As I said, researching this further would be easier with the original dataset or with the backing of a company. Good luck!
