I'm writing some performance-sensitive code in which I have to add a large number of columns to a Pandas dataframe quickly.
I've managed to get a 3x improvement over naively repeating df[foo] = bar by building the new columns in a dict, wrapping it in a second DataFrame, and concatenating that with the original:
import pandas as pd

def mkdf1(n):
    # Naive approach: insert each new column one at a time.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    for i in xrange(n):
        df['col%d' % i] = range(i, 10 + i)
    return df

def mkdf2(n):
    # Faster approach: collect the new columns in a dict, build a second
    # DataFrame from them, and concatenate just once.
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    newcols = {}
    for i in xrange(n):
        newcols['col%d' % i] = range(i, 10 + i)
    return pd.concat([df, pd.DataFrame(newcols, index=df.index)], axis=1)
The timings show substantial improvement:
%timeit -r 1 mkdf1(100)
100 loops, best of 1: 16.6 ms per loop
%timeit -r 1 mkdf2(100)
100 loops, best of 1: 5.5 ms per loop
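One further idea I've sketched but not yet benchmarked is to skip the dict entirely and build all the new columns as a single NumPy array, so concat only has to glue on one homogeneous block; this assumes the new columns are all the same length and numeric (mkdf3 below is just an illustrative variant, not code I'm running in production):

import numpy as np

def mkdf3(n):
    df = pd.DataFrame(index=range(10, 20), columns=list('qwertyuiop'))
    # Build every new column as one 2D array so the concatenated piece is a
    # single homogeneous block rather than n separate columns.
    data = np.column_stack([np.arange(i, 10 + i) for i in xrange(n)])
    newcols = pd.DataFrame(data, index=df.index,
                           columns=['col%d' % i for i in xrange(n)])
    return pd.concat([df, newcols], axis=1)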
Are there any other optimizations I can make here?
EDIT: Also, the concat call takes much longer in my real code than in this toy example; in particular, most of the time seems to be spent in the internal get_result function, even though the production df has fewer rows, and I can't figure out why. Any advice on how to speed this up would be appreciated.
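For reference, this is roughly how I've been looking at where the time goes, using IPython's %prun on the concat step (df_real and newcols_real below are just placeholders for my production frame and dict of new columns):

# Profile only the concat step; df_real / newcols_real stand in for the
# production objects, limit output to the 20 most expensive calls.
%prun -l 20 pd.concat([df_real, pd.DataFrame(newcols_real, index=df_real.index)], axis=1)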