
When I profile the following code with memray, df.stack() allocates 22MB, even though the df itself is only about 5MB.

import numpy as np
import pandas as pd

columns = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame(np.random.randint(0,100,size=(1000, 26*26)), columns=pd.MultiIndex.from_product([columns, columns]))
print(df.memory_usage().sum()) # 5408128, ~5MB
df.stack() # reshape: (1000,26*26) -> (1000*26,26)

Why does DataFrame.stack() consume so much memory? According to the profile, about 30% of the allocations come from dropna and the remaining 70% from re-allocating the underlying array three times during the reshape. Should I switch to a native numpy.reshape, or is there anything I can do to make stack() slimmer?
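For reference, the numpy.reshape route mentioned above could look roughly like the sketch below. It is not how pandas implements stack(); it is a minimal illustration that assumes the columns are a complete, sorted two-level product (as in the example), that all values share one dtype, and that there are no missing values to drop. The helper name stack_last_level_numpy is hypothetical.

```python
import numpy as np
import pandas as pd


def stack_last_level_numpy(df):
    """Hypothetical helper: mimic df.stack() for a complete two-level column product."""
    level0, level1 = df.columns.levels
    # Assumption: the columns are exactly the sorted product level0 x level1.
    if not df.columns.equals(pd.MultiIndex.from_product([level0, level1])):
        raise ValueError("columns must be a complete, sorted product of two levels")
    n0, n1 = len(level0), len(level1)
    values = df.to_numpy()                  # shape (rows, n0 * n1)
    cube = values.reshape(len(df), n0, n1)  # axes: (row, level0, level1)
    # Swapping the last two axes only changes strides; the final reshape
    # is the step that has to copy the data into the new layout.
    stacked = cube.transpose(0, 2, 1).reshape(len(df) * n1, n0)
    index = pd.MultiIndex.from_product([df.index, level1])
    return pd.DataFrame(stacked, index=index, columns=level0)


columns = list('abcdefghijklmnopqrstuvwxyz')
df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 26 * 26)),
                  columns=pd.MultiIndex.from_product([columns, columns]))
out = stack_last_level_numpy(df)
print(out.shape)  # (26000, 26)
```

Whether this actually lowers peak memory depends on the pandas version and on whether the frame's data is stored contiguously, so it is worth profiling with memray again before adopting it.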

  • Why do you perform a stack? How are you going to use the stacked DataFrame afterwards? Commented Jan 3, 2023 at 18:20
  • It is meaningful from a data perspective, i.e. it is a time series of square matrices. In this case the shape is an API requirement: shape=(MultiIndex(time, columns), columns). I agree that the unstacked, flattened version with 26*26 columns is computationally handier. Commented Jan 3, 2023 at 22:23
