
I have the following packages installed:

python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1

This is the DataFrame info:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Weekly_Sales    421570 non-null float64
IsHoliday       421570 non-null bool
Date_Str        421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)

This is a sample of what the data looks like:

Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE

I load the file and index it as follows:

import pandas as pd

df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']              # keep the original date string
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])              # weekly dates repeat across stores/depts, so this index is non-unique

When I run either of the following operations on the 400K-row file,

df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)

or

df_train['try'] = df_train['Store'] * df_train['Dept']

it causes an error:

Traceback (most recent call last):
  File "rock.py", line 85, in <module>
    rock.pandasTest()
  File "rock.py", line 31, in pandasTest
    df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
    return_indexers=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
    how=how, sort=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError

However, it works fine with a small file.

  • What is the question? Commented May 30, 2014 at 14:01
  • Also how do you load the data? Add the code etc. Commented May 30, 2014 at 14:03

2 Answers


I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
So maybe try upgrading your pandas version: on 0.14 the vectorized way is much faster than the apply (5 s vs >1 min on my machine) and uses less peak memory (200 MB vs 980 MB, measured with %memit).

Using your sample data repeated 50000 times (leading to a df of 450k rows) and the apply_id function from @jsalonen's answer, I get the results below.
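For reference, a minimal sketch of how such a test frame might be built; the construction itself is an assumption, only the "repeated 50000 times" detail comes from the text above:

# Repeat the 9 sample rows from the question 50000 times (9 * 50000 = 450k rows).
import pandas as pd

sample = pd.DataFrame({
    'Store': [1] * 9,
    'Dept':  [1] * 9,
    'Date_Str': ['2010-02-05', '2010-02-12', '2010-02-19', '2010-02-26',
                 '2010-03-05', '2010-03-12', '2010-03-19', '2010-03-26',
                 '2010-04-02'],
})
df_train = pd.concat([sample] * 50000, ignore_index=True)
# Reproduce the non-unique DatetimeIndex from the question
df_train.index = pd.to_datetime(df_train['Date_Str'])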

In [23]: pd.__version__ 
Out[23]: '0.14.0'

In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop

In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop

In [26]: %load_ext memory_profiler

In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB

In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
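If upgrading is not immediately possible, one possible workaround is a sketch based on the assumption (suggested by the _join_non_unique and full_outer_join frames in the traceback) that the non-unique DatetimeIndex is what triggers the bug: do the concatenation on a unique integer index, then restore the DatetimeIndex.

# Move to a unique default index, concatenate, then restore the index.
tmp = df_train.reset_index()
tmp['_id'] = (tmp['Store'].astype(str) + '_' +
              tmp['Dept'].astype(str) + '_' +
              tmp['Date_Str'].astype(str))
df_train = tmp.set_index('Date')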

2 Comments

BTW, I found an alternative way that has performance comparable to astype(str) in terms of memory usage and speed: df_train['Store'].map(str) + '_' + df_train['Dept'].map(str) + '_' + df_train['Date_Str'].map(str)
Thanks for mentioning %memit, I didn't know about it before.
1

Try generating the _id field with a DataFrame.apply call:

def apply_id(x):
    # x is a single row (a Series); build the id from its fields
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

df_train = df_train.apply(apply_id, axis=1)  # axis=1 applies per row

When using apply, the id generation is performed row by row, resulting in minimal memory-allocation overhead.
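If apply turns out to be too slow at this size (see the timings in the other answer), a middle-ground sketch is a plain list comprehension over the zipped columns; it also works row by row, but without building a Series per row and without any index alignment:

# Row-by-row id generation without apply: iterate the three columns in
# lockstep and assign the resulting list, which aligns positionally.
df_train['_id'] = ['{0}_{1}_{2}'.format(s, d, ds)
                   for s, d, ds in zip(df_train['Store'],
                                       df_train['Dept'],
                                       df_train['Date_Str'])]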

10 Comments

Yeah, this way works, but in this thread, stackoverflow.com/questions/23950658/…, it is said that the vectorized function is faster than an apply call, and from my experiments that seems true. The vectorized functions tend to use more memory than the apply call, but the confusing part is that I still have lots of memory left when the MemoryError occurs.
I'm guessing that vectorized functions need to keep the whole vectors in memory while performing the operation, and in your case that's way too much memory. Also, I think you can get a MemoryError even before you actually run out of memory: Python is probably trying to allocate a huge chunk of memory and failing, which shows no increase in memory consumption because it fails instantly.
Indeed, but the strange thing is this should not happen at all for a dataframe of this size (and I also can't reproduce it)
Well note that you are not only appending three values but also converting them to strings. Vectorized string values -> BOOM
Sorry, I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13. So maybe try upgrading your pandas version, as the vectorized way is much faster than the apply (5 s vs >1 min on my machine) and uses less peak memory (200 MB vs 980 MB, with %memit) on 0.14.
