Efficiently reshaping arrays in a pandas dataframe column

Question

I have the following question.

Consider a pandas dataframe like this:

Width  Height  Bitmap

67     56    <1d numpy array with length 67 * 56>
59     71    <1d numpy array with length 59 * 71>
61     73    <1d numpy array with length 61 * 73>
...    ...   ...

Now I would like to apply numpy.reshape() to each array in the Bitmap column, so that the result looks like this:

Width  Height  Bitmap

67     56    <2d numpy array with shape 67x56>
59     71    <2d numpy array with shape 59x71>
61     73    <2d numpy array with shape 61x73>
...    ...   ...

I have a working solution that looks like this:

for idx, bitmap in df['bitmap'].items():  # iteritems() was removed in pandas 2.0
    df['bitmap'][idx] = np.reshape(bitmap, (df['width'][idx], df['height'][idx]))

My dataframe is large (1,200,000 rows), so I would like to apply np.reshape() efficiently. Is that possible?

Divakar · Accepted Answer · 2017-05-12 10:53:23Z

3

I would keep the loop but reduce the work done inside it: extract the width and height values into arrays up front and index those arrays in the loop, which should be faster than indexing the dataframe on every iteration. Also, instead of calling np.reshape() in the loop, we can assign to each array's shape attribute, which reshapes it in place.

Thus, the implementation would be -

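def arr1d_2D(df):
    # Pull the dimensions out once as NumPy arrays
    r = df.width.values
    c = df.height.values
    n = df.shape[0]
    for i in range(n):
        # Column 2 is 'bitmap'; assigning to .shape reshapes the array in place
        df.iloc[i,2].shape = (r[i],c[i])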
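
The shape assignment used in the loop above can be looked at in isolation. The following is a minimal sketch, assuming only NumPy, of how assigning to `.shape` differs from calling `reshape`: `reshape` returns a new array object (a view where possible), while the assignment changes the existing object and raises `ValueError` if the element count does not fit. The variable names are illustrative and not taken from the answer.

```python
import numpy as np

flat = np.arange(6)

# reshape() returns a new array object; the original keeps its 1-D shape.
view = flat.reshape(3, 2)
original_still_flat = flat.ndim == 1
shares_buffer = np.shares_memory(flat, view)

# Assigning to .shape modifies the existing object; no new array is created.
flat.shape = (3, 2)

# A shape whose element count does not match is rejected.
error_message = None
try:
    flat.shape = (4, 2)  # 6 elements cannot fill a 4 x 2 array
except ValueError as exc:
    error_message = str(exc)
```

Because the failed assignment raises instead of silently copying, a bitmap whose length disagrees with its width and height would surface as an exception in the loop.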

We can go all NumPy here and work directly on the underlying object array of the bitmap column, which should be much faster -

def arr1d_2D_allNumPy(df):
    r = df.width.values
    c = df.height.values
    n = df.shape[0]
    b = df['bitmap'].values   # object array referencing the individual bitmaps
    for i in range(n):
        b[i].shape = (r[i],c[i])

Sample run -

In [9]: df
Out[9]: 
   width  height                                bitmap
0      3       2                    [0, 1, 7, 4, 8, 1]
1      2       2                          [7, 3, 8, 6]
2      2       4              [6, 8, 6, 4, 7, 0, 6, 2]
3      4       3  [8, 6, 5, 2, 2, 2, 4, 3, 3, 3, 1, 8]
4      4       3  [3, 8, 4, 8, 6, 4, 2, 3, 8, 7, 7, 4]

In [10]: arr1d_2D_allNumPy(df)

In [11]: df
Out[11]: 
   width  height                                        bitmap
0      3       2                      [[0, 1], [7, 4], [8, 1]]
1      2       2                              [[7, 3], [8, 6]]
2      2       4                  [[6, 8, 6, 4], [7, 0, 6, 2]]
3      4       3  [[8, 6, 5], [2, 2, 2], [4, 3, 3], [3, 1, 8]]
4      4       3  [[3, 8, 4], [8, 6, 4], [2, 3, 8], [7, 7, 4]]
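
The sample run above can be reproduced end to end with a self-contained sketch. It assumes the lowercase column names from the sample run; the function name `reshape_bitmaps_in_place` and the toy data are made up for illustration. It also checks that the dataframe keeps the same array objects, which is why no data is copied.

```python
import numpy as np
import pandas as pd

def reshape_bitmaps_in_place(df):
    """Reshape every array in df['bitmap'] in place to (width, height)."""
    widths = df["width"].to_numpy()
    heights = df["height"].to_numpy()
    bitmaps = df["bitmap"].to_numpy()  # object array of references to the bitmaps
    for i in range(len(df)):
        bitmaps[i].shape = (widths[i], heights[i])

# Toy data (hypothetical): two flat bitmaps of different sizes.
df = pd.DataFrame({"width": [3, 2], "height": [2, 4]})
df["bitmap"] = [np.arange(6), np.arange(8)]

first_before = df["bitmap"].iloc[0]  # keep a reference to the first array
reshape_bitmaps_in_place(df)
first_after = df["bitmap"].iloc[0]

# The column should still hold the same array object, now with a 2-D shape.
same_object = first_after is first_before
shapes = [b.shape for b in df["bitmap"]]
```

Note that the function returns nothing: it mutates the arrays the dataframe already refers to, so any other reference to those arrays sees the new shape as well.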

Runtime test

Approaches -

def org_app(df):   # Original approach
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for idx, bitmap in df['bitmap'].items():
        df['bitmap'][idx] = np.reshape(bitmap, (df['width'][idx],
                                                df['height'][idx]))

Timings -

In [43]: # Setup input dataframe and two copies for testing
    ...: a = np.random.randint(1,5,(1000,2))
    ...: df = pd.DataFrame(a, columns=(('width','height')))
    ...: n = df.shape[0]
    ...: randi = np.random.randint
    ...: df['bitmap'] = [randi(0,9,(df.iloc[i,0]*df.iloc[i,1])) for i in range(n)]
    ...: 
    ...: df_copy1 = df.copy()
    ...: df_copy2 = df.copy()
    ...: df_copy3 = df.copy()
    ...: 

In [44]: %timeit org_app(df_copy1)
1 loops, best of 3: 26 s per loop

In [45]: %timeit arr1d_2D(df_copy2)
10 loops, best of 3: 115 ms per loop

In [46]: %timeit arr1d_2D_allNumPy(df_copy3)
1000 loops, best of 3: 475 µs per loop

In [47]: 26000000/475.0  # Speedup with allNumPy version over original
Out[47]: 54736.84210526316

That is a speedup of over 50,000x, which goes to show how much it matters how you access data, especially array data stored within pandas dataframes.

edited May 12, 2017 at 10:53

answered May 12, 2017 at 10:10

Divakar


4 Comments

bmiselis Over a year ago

Wow, it really is much faster than the solution I presented in the question. Is it safe to reshape arrays without a reshape function, but doing this manually?

Divakar Over a year ago

@bartekm3 That's right. Especially with .iloc we already have access to the underlying array data, so it's less messy and just as efficient.

bmiselis Over a year ago

Thank you for your help then :)

bmiselis Over a year ago

Let's see: the dataframe had 100 examples (I took a small slice from the original dataset, since the original approach is too slow to run on the full data), but I think that's enough to make a comparison:
original approach: 6.05 s ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
your approach: 10.3 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which gives me approx. 2400x boost. Impressive.
Edit: full numpy approach: 58.2 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each). Approx. 27000x boost.

Allen Qin · Answer · 2017-05-12 10:42:35Z

0

Will this work?

b2 = []
Temp = df.apply(lambda x: b2.append(x.Bitmap.reshape(x.Width,x.Height)), axis=1)
df.Bitmap = b2
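
A possibly tidier form of the same row-wise idea, offered as a sketch and not as a verified fix for the errors reported in the comments: it collects the reshaped arrays into a pre-allocated 1-D object array instead of appending inside `apply`. The lowercase column names, the helper name `reshape_rows`, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def reshape_rows(df):
    """Return a 1-D object array whose i-th entry is row i's bitmap in 2-D."""
    out = np.empty(len(df), dtype=object)
    rows = zip(df["bitmap"], df["width"], df["height"])
    for i, (bitmap, width, height) in enumerate(rows):
        # Storing each 2-D array in its own slot keeps the container 1-D.
        out[i] = bitmap.reshape(width, height)
    return out

# Toy data (hypothetical): two bitmaps that happen to share the same shape.
df = pd.DataFrame({"width": [2, 2], "height": [2, 2]})
df["bitmap"] = [np.arange(4), np.arange(4, 8)]

df["bitmap"] = reshape_rows(df)
```

Unlike the in-place approach, this leaves the original flat arrays untouched and replaces the column with new array objects.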

edited May 12, 2017 at 10:42

answered May 12, 2017 at 9:51

Allen Qin


4 Comments

bmiselis Over a year ago

Seems like it tries to apply this to each of the columns: AttributeError: ("'Series' object has no attribute 'bitmap'", 'occurred at index width')

bmiselis Over a year ago

Now the error is Exception: Data must be 1-dimensional. From the traceback it's not clear in which place the exception occurred.

Allen Qin Over a year ago

Ok, I will have another look.

Allen Qin Over a year ago

It's not quite elegant but can you try again?
