
I'm writing some performance-sensitive code in which I have to add a large number of columns to a pandas DataFrame quickly.

I've managed to get a 3x improvement over naively repeating df[foo] = bar in a loop by collecting the new columns in a dict, building a second DataFrame from it, and concatenating the two:

import pandas as pd

def mkdf1(n):
    # Naive approach: insert each new column one at a time.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    for i in range(n):
        df['col%d' % i] = range(i, 10 + i)
    return df

def mkdf2(n):
    # Faster: collect the new columns in a dict, build a second
    # DataFrame from it, and concatenate the two frames once.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    newcols = {}
    for i in range(n):
        newcols['col%d' % i] = range(i, 10 + i)
    return pd.concat([df, pd.DataFrame(newcols, index=df.index)], axis=1)

The timings show substantial improvement:

%timeit -r 1 mkdf1(100)
100 loops, best of 1: 16.6 ms per loop

%timeit -r 1 mkdf2(100)
100 loops, best of 1: 5.5 ms per loop

Are there any other optimizations I can make here?

EDIT: Also, the concat call is taking much longer in my real code than in this toy example; in particular, the get_result function takes far longer even though the production df has fewer rows, and I can't figure out why. Any advice on how to speed this up would be appreciated.


1 Answer


I'm a little confused about exactly what your dataframe should look like, but it's easy to speed this up a lot with a general technique: for pandas/numpy speed, you want to avoid Python-level for loops and any concat/merge/join/append, if possible.

Your best bet here is most likely to use numpy to create an array that will be the input to a dataframe and then name the columns however you like. Both of those operations should be trivial in terms of computation time.

Here's the numpy part; it looks like you already know how to construct the column names.

%timeit pd.DataFrame(  np.ones([10,100]).cumsum(axis=0) 
                     + np.ones([10,100]).cumsum(axis=1) )
10000 loops, best of 3: 158 µs per loop

I think you are trying to make something like this? (If not, just check out numpy if you aren't familiar with it, it has all sorts of array operations that should make it very easy to do whatever you are trying to do here).

In [63]: df.loc[:5, :10]
Out[63]:
   0   1   2   3   4   5   6   7   8   9   10
0   2   3   4   5   6   7   8   9  10  11  12
1   3   4   5   6   7   8   9  10  11  12  13
2   4   5   6   7   8   9  10  11  12  13  14
3   5   6   7   8   9  10  11  12  13  14  15
4   6   7   8   9  10  11  12  13  14  15  16
5   7   8   9  10  11  12  13  14  15  16  17
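
For what it's worth, here's a minimal sketch of that idea applied to your mkdf functions (mkdf3 is just an illustrative name, and it assumes the col%d layout from the question). A broadcasted numpy addition builds the whole 10 x n block at once, so there's a single concat instead of n column insertions:

import numpy as np
import pandas as pd

def mkdf3(n):
    # Same starting frame as mkdf1/mkdf2.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    # Column i should hold range(i, 10 + i), i.e. row j of column i
    # is i + j. A broadcasted outer sum builds the whole 10 x n
    # block in one vectorized step.
    block = np.arange(n)[None, :] + np.arange(10)[:, None]
    new = pd.DataFrame(block, index=df.index,
                       columns=['col%d' % i for i in range(n)])
    # One concat instead of n separate column insertions.
    return pd.concat([df, new], axis=1)

The numpy addition and the single DataFrame construction should both be cheap next to mkdf1's per-column insertions.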

6 Comments

Huh. For some reason I had it in my head that DataFrames stored data by column, not by block, so the df would have to chop up the numpy array and the construction wouldn't be any faster. Anyway, thanks so much for explaining that!
Two short follow-up questions: 1) if I'm starting with a dataframe df already supplied to me, is it reasonable to construct a numpy array for the columns I want to add, and then concat that to df.values at the numpy level? Or should I create an entirely new numpy array, write df.values into it, and then write my new column into it?
2) if I have a mixed-type dataframe, will this still work (with a numpy array with dtype=object)? Or does Pandas do weird things with mixed types that would cause this to slow down?
Sorry, I don't really have any general advice for (1) or (2) but I also don't see any problems with what you are planning to do. Honestly, I just go for simple and readable solutions first. If those are too slow, then try alternate ways and post specific performance problems here.
Regarding your first comment, note that pandas is built on top of numpy so they generally get along very well. You can often do stuff like a+b where a is a dataframe and b is a numpy array, without even thinking about it or pre-converting.
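
A minimal illustration of that DataFrame/ndarray interop (the values here are made up for the example):

import numpy as np
import pandas as pd

# A DataFrame and a plain numpy array broadcast together directly;
# the 1-D array is aligned against the columns.
a = pd.DataFrame(np.zeros((3, 2)), columns=['x', 'y'])
b = np.array([10.0, 20.0])
print(a + b)   # still a DataFrame; every row gets b added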
