
I'm trying to merge 3 files, which are about 3GB, 200KB and 200KB, using pandas. My computer has 32GB of memory, but the merge still ends with a MemoryError. Is there any way to avoid this problem? My merge code is below:

product = pd.read_csv("../data/process_product.csv", header=0)
product["bandID"] = pd.factorize(product.Band)[0]
product = product.drop(columns=["Band", "Info"])

town_state = pd.read_csv("../data/town_state.csv", header=0)
dummies = pd.get_dummies(town_state.State)
town_state = pd.concat([town_state, dummies], axis=1)
town_state["townID"] = pd.factorize(town_state.Town)[0]
town_state = town_state.drop(columns=["State", "Town"])
train = pd.read_csv("../data/train.csv", header=0)

result = pd.merge(train, town_state, on="Agencia_ID", how='left')
result = pd.merge(result, product, on="Producto_ID", how='left')
result.to_csv("../data/train_data.csv")
  • Could you gzip your files, upload them somewhere and post links here? Of course, only if your data isn't sensitive (doesn't contain any customer info, emails, etc.). I could try to optimize it on my notebook, which has 16GB of RAM. Commented Jun 26, 2016 at 9:16

1 Answer


Here is my "micro"-optimization attempt:

You don't use (or need) the Info column from process_product.csv, so there is no need to read it:

cols = [<list of columns, EXCEPT Info column>]
product = pd.read_csv("../data/process_product.csv", usecols=cols)
product['Band'] = pd.factorize(product.Band)[0]
product.rename(columns={'Band':'bandID'}, inplace=True)
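Here is a minimal, runnable sketch of that idea, using made-up in-memory data in place of the real process_product.csv (the column names besides Band are assumptions):

```python
import io
import pandas as pd

# Hypothetical miniature of process_product.csv: by passing usecols,
# the unused Info column is never parsed, so it never takes up memory.
csv = io.StringIO(
    "Producto_ID,Band,Info\n"
    "1,A,ignored\n"
    "2,B,ignored\n"
    "3,A,ignored\n"
)
cols = ["Producto_ID", "Band"]  # every column EXCEPT Info
product = pd.read_csv(csv, usecols=cols)
product["Band"] = pd.factorize(product.Band)[0]  # A -> 0, B -> 1
product = product.rename(columns={"Band": "bandID"})
print(product.columns.tolist())  # ['Producto_ID', 'bandID']
```

Factorizing Band in place and renaming the column also avoids carrying both the string column and its integer copy at the same time.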

We could also try to save some memory on the dumies variable: call get_dummies() on the fly and pass the sparse=True parameter:

town_state = pd.concat([town_state, pd.get_dummies(town_state.State, sparse=True)], axis=1)
del town_state['State']
town_state['Town'] = pd.factorize(town_state.Town)[0]
town_state.rename(columns={'Town':'townID'}, inplace=True)
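The same steps end-to-end, as a small self-contained sketch with invented town/state data (the values are placeholders, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature of town_state.csv
town_state = pd.DataFrame({
    "Agencia_ID": [1, 2, 3, 4],
    "Town": ["Tow1", "Tow2", "Tow1", "Tow3"],
    "State": ["MEX", "MEX", "QRO", "QRO"],
})

# One-hot encode State as sparse columns (cheap for mostly-zero data),
# then drop the original string columns as soon as they are encoded.
town_state = pd.concat(
    [town_state, pd.get_dummies(town_state.State, sparse=True)], axis=1
)
del town_state["State"]
town_state["Town"] = pd.factorize(town_state.Town)[0]
town_state = town_state.rename(columns={"Town": "townID"})
print(town_state.columns.tolist())  # ['Agencia_ID', 'townID', 'MEX', 'QRO']
```

With sparse=True each dummy column stores only its non-zero entries, which matters once the number of states (and therefore dummy columns) grows.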

Try to save on the result DF: merge into train itself instead of a new result variable, and remove town_state from memory as soon as possible:

train = pd.merge(train, town_state, on="Agencia_ID", how='left')
del town_state
train = pd.merge(train, product, on="Producto_ID", how='left')
del product
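A runnable sketch of that pattern with toy lookup tables (the IDs and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical miniatures of the three DataFrames
train = pd.DataFrame({"Agencia_ID": [1, 1, 2], "Producto_ID": [10, 20, 10]})
town_state = pd.DataFrame({"Agencia_ID": [1, 2], "townID": [0, 1]})
product = pd.DataFrame({"Producto_ID": [10, 20], "bandID": [0, 1]})

# Reassign the merge result back to train instead of binding a new
# `result` name, and drop each lookup table right after it is merged in,
# so at most one extra copy is alive at any point.
train = pd.merge(train, town_state, on="Agencia_ID", how="left")
del town_state
train = pd.merge(train, product, on="Producto_ID", how="left")
del product
print(train.townID.tolist())  # [0, 0, 1]
print(train.bandID.tolist())  # [0, 1, 0]
```

The del statements matter because `result = pd.merge(train, ...)` in the original code keeps train, town_state, product and result all alive simultaneously, roughly doubling peak memory.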

P.S. I don't know which file/DF is the biggest one (the 3GB one), so I made the assumption that it's the train DF. If it's the product DF, then I would do it this way:

product = pd.merge(train, product, on="Producto_ID", how='left')
del train
product = pd.merge(product, town_state, on="Agencia_ID", how='left')
del town_state