
I have the following packages installed:

python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1

This is the DataFrame info:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
Date_Str        421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)

This is a sample of what the data looks like:

Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE

I load the file and index it as follows:

import pandas as pd

df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']              # keep the original date string
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])              # weekly dates repeat across stores/depts, so this index is non-unique

When I run either of the following operations on the 400K-row file,

df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)

or

df_train['try'] = df_train['Store'] * df_train['Dept']

it causes an error:

Traceback (most recent call last):
  File "rock.py", line 85, in <module>
    rock.pandasTest()
  File "rock.py", line 31, in pandasTest
    df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
    return_indexers=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
    how=how, sort=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError

However, it works fine with a small file.

  • What is the question? Commented May 30, 2014 at 14:01
  • Also how do you load the data? Add the code etc. Commented May 30, 2014 at 14:03

2 Answers


I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
So maybe try upgrading your pandas version: on 0.14 the vectorized way is much faster than the apply (5 s vs >1 min on my machine) and uses less peak memory (200 MB vs 980 MB, measured with %memit).

Using your sample data repeated 50000 times (leading to a df of 450k rows) and the apply_id function from @jsalonen's answer, I get the results below.
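For reference, a minimal sketch of how such a test frame might be built; the construction itself is an assumption, only the "repeated 50000 times" detail comes from the text above:

# Repeat the 9 sample rows from the question 50000 times (9 * 50000 = 450k rows).
import pandas as pd

sample = pd.DataFrame({
    'Store': [1] * 9,
    'Dept':  [1] * 9,
    'Date_Str': ['2010-02-05', '2010-02-12', '2010-02-19', '2010-02-26',
                 '2010-03-05', '2010-03-12', '2010-03-19', '2010-03-26',
                 '2010-04-02'],
})
df_train = pd.concat([sample] * 50000, ignore_index=True)
# Reproduce the non-unique DatetimeIndex from the question
df_train.index = pd.to_datetime(df_train['Date_Str'])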

In [23]: pd.__version__ 
Out[23]: '0.14.0'

In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop

In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop

In [26]: %load_ext memory_profiler

In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB

In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
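If upgrading is not immediately possible, one possible workaround is a sketch based on the assumption (suggested by the _join_non_unique and full_outer_join frames in the traceback) that the non-unique DatetimeIndex is what triggers the bug: do the concatenation on a unique integer index, then restore the DatetimeIndex.

# Move to a unique default index, concatenate, then restore the index.
tmp = df_train.reset_index()
tmp['_id'] = (tmp['Store'].astype(str) + '_' +
              tmp['Dept'].astype(str) + '_' +
              tmp['Date_Str'].astype(str))
df_train = tmp.set_index('Date')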

2 Comments

BTW, I found an alternative way that has performance comparable to astype(str) in terms of memory usage and speed: df_train['Store'].map(str) + '_' + df_train['Dept'].map(str) + '_' + df_train['Date_Str'].map(str)
Thanks for mentioning %memit, I didn't know about it before.
1

Try generating the _id field with a DataFrame.apply call:

def apply_id(x):
    # x is a single row (a Series); build the id from its fields
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

df_train = df_train.apply(apply_id, axis=1)  # axis=1 applies per row

When using apply, the id generation is performed row by row, resulting in minimal memory-allocation overhead.
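If apply turns out to be too slow at this size (see the timings in the other answer), a middle-ground sketch is a plain list comprehension over the zipped columns; it also works row by row, but without building a Series per row and without any index alignment:

# Row-by-row id generation without apply: iterate the three columns in
# lockstep and assign the resulting list, which aligns positionally.
df_train['_id'] = ['{0}_{1}_{2}'.format(s, d, ds)
                   for s, d, ds in zip(df_train['Store'],
                                       df_train['Dept'],
                                       df_train['Date_Str'])]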

10 Comments

Yeah, this way works, but in this thread, stackoverflow.com/questions/23950658/…, it is said that the vectorized function is faster than an apply call, and from my experiments that seems true. The vectorized functions tend to use more memory than the apply call, but the confusing part is that I still have lots of memory left when the MemoryError occurs.
I'm guessing that vectorized functions need to keep the whole vectors in memory while performing the operation, and in your case that's way too much memory. Also, I think you can get a MemoryError even before you actually run out of memory: Python is probably trying to allocate a huge chunk of memory and failing, which shows no increase in memory consumption because it fails instantly.
Indeed, but the strange thing is this should not happen at all for a dataframe of this size (and I also can't reproduce it)
Well note that you are not only appending three values but also converting them to strings. Vectorized string values -> BOOM
Sorry, I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13. So maybe try upgrading your pandas version, as the vectorized way is much faster than the apply (5 s vs >1 min on my machine) and uses less peak memory (200 MB vs 980 MB, with %memit) on 0.14.
