
I'm trying to merge 3 files, which are about 3GB, 200KB and 200KB, using pandas. My computer has 32GB of memory, but the merge still ends with a MemoryError. Is there any way to avoid this problem? My merge code is below:

product = pd.read_csv("../data/process_product.csv", header=0)
product["bandID"] = pd.factorize(product.Band)[0]
product = product.drop(columns=["Band", "Info"])

town_state = pd.read_csv("../data/town_state.csv", header=0)
dummies = pd.get_dummies(town_state.State)
town_state = pd.concat([town_state, dummies], axis=1)
town_state["townID"] = pd.factorize(town_state.Town)[0]
town_state = town_state.drop(columns=["State", "Town"])
train = pd.read_csv("../data/train.csv", header=0)

result = pd.merge(train, town_state, on="Agencia_ID", how='left')
result = pd.merge(result, product, on="Producto_ID", how='left')
result.to_csv("../data/train_data.csv")
  • Could you gzip your files, upload them somewhere and post links here? Of course, only if your data isn't sensitive (doesn't contain any customer info, emails, etc.). I could try to optimize it on my notebook, which has 16GB of RAM. Commented Jun 26, 2016 at 9:16

1 Answer


Here is my "micro"-optimization attempt:

You don't use (or need) the Info column from process_product.csv, so there is no need to read it:

cols = [<list of columns, EXCEPT Info column>]
product = pd.read_csv("../data/process_product.csv", usecols=cols)
product['Band'] = pd.factorize(product.Band)[0]
product.rename(columns={'Band':'bandID'}, inplace=True)
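Here is a minimal, runnable sketch of that idea, using made-up in-memory data in place of the real process_product.csv (the column names besides Band are assumptions):

```python
import io
import pandas as pd

# Hypothetical miniature of process_product.csv: by passing usecols,
# the unused Info column is never parsed, so it never takes up memory.
csv = io.StringIO(
    "Producto_ID,Band,Info\n"
    "1,A,ignored\n"
    "2,B,ignored\n"
    "3,A,ignored\n"
)
cols = ["Producto_ID", "Band"]  # every column EXCEPT Info
product = pd.read_csv(csv, usecols=cols)
product["Band"] = pd.factorize(product.Band)[0]  # A -> 0, B -> 1
product = product.rename(columns={"Band": "bandID"})
print(product.columns.tolist())  # ['Producto_ID', 'bandID']
```

Factorizing Band in place and renaming the column also avoids carrying both the string column and its integer copy at the same time.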

We could also try to save some memory on the dumies variable: call get_dummies() on the fly and pass the sparse=True parameter:

town_state = pd.concat([town_state, pd.get_dummies(town_state.State, sparse=True)], axis=1)
del town_state['State']
town_state['Town'] = pd.factorize(town_state.Town)[0]
town_state.rename(columns={'Town':'townID'}, inplace=True)
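The same steps end-to-end, as a small self-contained sketch with invented town/state data (the values are placeholders, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature of town_state.csv
town_state = pd.DataFrame({
    "Agencia_ID": [1, 2, 3, 4],
    "Town": ["Tow1", "Tow2", "Tow1", "Tow3"],
    "State": ["MEX", "MEX", "QRO", "QRO"],
})

# One-hot encode State as sparse columns (cheap for mostly-zero data),
# then drop the original string columns as soon as they are encoded.
town_state = pd.concat(
    [town_state, pd.get_dummies(town_state.State, sparse=True)], axis=1
)
del town_state["State"]
town_state["Town"] = pd.factorize(town_state.Town)[0]
town_state = town_state.rename(columns={"Town": "townID"})
print(town_state.columns.tolist())  # ['Agencia_ID', 'townID', 'MEX', 'QRO']
```

With sparse=True each dummy column stores only its non-zero entries, which matters once the number of states (and therefore dummy columns) grows.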

Try to save on the result DF: merge into train itself instead of a new result variable, and remove town_state from memory as soon as possible:

train = pd.merge(train, town_state, on="Agencia_ID", how='left')
del town_state
train = pd.merge(train, product, on="Producto_ID", how='left')
del product
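A runnable sketch of that pattern with toy lookup tables (the IDs and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical miniatures of the three DataFrames
train = pd.DataFrame({"Agencia_ID": [1, 1, 2], "Producto_ID": [10, 20, 10]})
town_state = pd.DataFrame({"Agencia_ID": [1, 2], "townID": [0, 1]})
product = pd.DataFrame({"Producto_ID": [10, 20], "bandID": [0, 1]})

# Reassign the merge result back to train instead of binding a new
# `result` name, and drop each lookup table right after it is merged in,
# so at most one extra copy is alive at any point.
train = pd.merge(train, town_state, on="Agencia_ID", how="left")
del town_state
train = pd.merge(train, product, on="Producto_ID", how="left")
del product
print(train.townID.tolist())  # [0, 0, 1]
print(train.bandID.tolist())  # [0, 1, 0]
```

The del statements matter because `result = pd.merge(train, ...)` in the original code keeps train, town_state, product and result all alive simultaneously, roughly doubling peak memory.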

P.S. I don't know which file/DF is the biggest one (the 3GB one), so I made the assumption that it's the train DF. If it's the product DF, then I would do it this way:

product = pd.merge(train, product, on="Producto_ID", how='left')
del train
product = pd.merge(product, town_state, on="Agencia_ID", how='left')
del town_state