
I have the following question:

Let's consider a pandas dataframe like this:

Width  Height  Bitmap

67     56    <1d numpy array with length 67 * 56>
59     71    <1d numpy array with length 59 * 71>
61     73    <1d numpy array with length 61 * 73>
...    ...   ...

Now, I would like to apply the numpy.reshape() function to each row in the Bitmap column. The result should look like this:

Width  Height  Bitmap

67     56    <2d numpy array with shape 67x56 >
59     71    <2d numpy array with shape 59x71 >
61     73    <2d numpy array with shape 61x73>
...    ...   ...

I have a working solution that looks like this:

# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for idx, bitmap in df['bitmap'].items():
    df['bitmap'][idx] = np.reshape(bitmap, (df['width'][idx], df['height'][idx]))

My dataframe of bitmaps is quite large (1,200,000 rows), so I would like to apply np.reshape() efficiently. Is that possible?

2 Answers


I would keep the loop but reduce the work done inside it: store the width and height values in arrays once, up front, and index those arrays inside the loop. Indexing a NumPy array should be faster than indexing a pandas Series on every iteration. Also, we can assign to each array's shape attribute in place instead of calling np.reshape() in the loop.

Thus, the implementation would be -

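def arr1d_2D(df):
    # Pull width/height out as NumPy arrays once, outside the loop
    r = df.width.values
    c = df.height.values
    n = df.shape[0]
    for i in range(n):
        # Column 2 is 'bitmap'; set the shape in place instead of reshaping
        df.iloc[i, 2].shape = (r[i], c[i])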
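To make the difference between assigning to .shape and calling np.reshape() concrete, here is a minimal sketch on a single array. The variable names are made up for the illustration and are not part of the answer's code.

```python
import numpy as np

flat = np.arange(6)                      # 1d array with 6 elements

# np.reshape returns a new array object; `flat` itself stays 1d.
reshaped = np.reshape(flat, (3, 2))
print(flat.shape, reshaped.shape)
print(np.shares_memory(flat, reshaped))  # the data buffer is shared here, not copied

# Assigning to .shape changes the existing array object in place.
flat.shape = (3, 2)
print(flat.shape)

# A shape with the wrong number of elements is rejected with a ValueError.
try:
    flat.shape = (4, 2)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

Because the in-place assignment mutates the array object that the dataframe already holds, nothing has to be written back into the column. As far as I know, the NumPy documentation discourages assigning to .shape in favor of ndarray.reshape, so treat the in-place trick as a speed-oriented choice rather than the recommended style.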

We can also go all NumPy and work directly with the underlying data of the bitmap column, which should be much faster -

def arr1d_2D_allNumPy(df):
    r = df.width.values
    c = df.height.values
    n = df.shape[0]
    # Object array holding references to the bitmap arrays themselves
    b = df['bitmap'].values
    for i in range(n):
        b[i].shape = (r[i], c[i])

Sample run -

In [9]: df
Out[9]: 
   width  height                                bitmap
0      3       2                    [0, 1, 7, 4, 8, 1]
1      2       2                          [7, 3, 8, 6]
2      2       4              [6, 8, 6, 4, 7, 0, 6, 2]
3      4       3  [8, 6, 5, 2, 2, 2, 4, 3, 3, 3, 1, 8]
4      4       3  [3, 8, 4, 8, 6, 4, 2, 3, 8, 7, 7, 4]

In [10]: arr1d_2D_allNumPy(df)

In [11]: df
Out[11]: 
   width  height                                        bitmap
0      3       2                      [[0, 1], [7, 4], [8, 1]]
1      2       2                              [[7, 3], [8, 6]]
2      2       4                  [[6, 8, 6, 4], [7, 0, 6, 2]]
3      4       3  [[8, 6, 5], [2, 2, 2], [4, 3, 3], [3, 1, 8]]
4      4       3  [[3, 8, 4], [8, 6, 4], [2, 3, 8], [7, 7, 4]]
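The session above does not show how df was built, so here is a self-contained sketch that reproduces the first three rows of the sample run outside IPython. The way the frame is constructed is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

def arr1d_2D_allNumPy(df):
    r = df.width.values
    c = df.height.values
    b = df['bitmap'].values          # object array of references to the bitmaps
    for i in range(df.shape[0]):
        b[i].shape = (r[i], c[i])    # in-place shape change on each stored array

# Assumed construction: the first three rows of the sample above.
bitmaps = [np.array([0, 1, 7, 4, 8, 1]),
           np.array([7, 3, 8, 6]),
           np.array([6, 8, 6, 4, 7, 0, 6, 2])]
df = pd.DataFrame({'width': [3, 2, 2], 'height': [2, 2, 4]})
df['bitmap'] = bitmaps               # ragged list, so pandas stores an object column

arr1d_2D_allNumPy(df)
print([a.shape for a in df['bitmap']])
```

The function returns nothing: it mutates the arrays stored in the column, which is why df shows the 2d arrays afterwards without any assignment back into the frame.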

Runtime test

Approaches -

def org_app(df):   # Original approach
    for idx, bitmap in df['bitmap'].items():  # iteritems() on pandas < 2.0
        df['bitmap'][idx] = np.reshape(bitmap, (df['width'][idx],
                                                df['height'][idx]))

Timings -

In [43]: # Setup input dataframe and two copies for testing
    ...: a = np.random.randint(1,5,(1000,2))
    ...: df = pd.DataFrame(a, columns=(('width','height')))
    ...: n = df.shape[0]
    ...: randi = np.random.randint
    ...: df['bitmap'] = [randi(0,9,(df.iloc[i,0]*df.iloc[i,1])) for i in range(n)]
    ...: 
    ...: df_copy1 = df.copy()
    ...: df_copy2 = df.copy()
    ...: df_copy3 = df.copy()
    ...: 

In [44]: %timeit org_app(df_copy1)
1 loops, best of 3: 26 s per loop

In [45]: %timeit arr1d_2D(df_copy2)
10 loops, best of 3: 115 ms per loop

In [46]: %timeit arr1d_2D_allNumPy(df_copy3)
1000 loops, best of 3: 475 µs per loop

In [47]: 26000000/475.0  # Speedup with allNumPy version over original
Out[47]: 54736.84210526316

That is a 50,000x+ speedup, which goes to show how much the way you access data matters, especially array data stored inside pandas dataframes.
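For anyone who wants to repeat the comparison outside IPython, where %timeit is not available, here is a rough sketch using the standard timeit module. The frame size, the seed, and the number of repetitions are arbitrary assumptions, and the absolute numbers will differ from the ones above depending on machine and library versions.

```python
import timeit
import numpy as np
import pandas as pd

def arr1d_2D(df):
    r, c = df.width.values, df.height.values
    for i in range(df.shape[0]):
        df.iloc[i, 2].shape = (r[i], c[i])

def arr1d_2D_allNumPy(df):
    r, c = df.width.values, df.height.values
    b = df['bitmap'].values
    for i in range(df.shape[0]):
        b[i].shape = (r[i], c[i])

rng = np.random.default_rng(0)     # fixed seed, chosen arbitrarily
n = 200                            # small on purpose so the script finishes quickly
dims = rng.integers(1, 5, size=(n, 2))
df = pd.DataFrame(dims, columns=['width', 'height'])

# Build the object column element by element so each cell holds one 1d array.
bitmaps = np.empty(n, dtype=object)
for i in range(n):
    bitmaps[i] = rng.integers(0, 9, size=dims[i, 0] * dims[i, 1])
df['bitmap'] = bitmaps

repeats = 5
t_iloc = timeit.timeit(lambda: arr1d_2D(df), number=repeats) / repeats
t_numpy = timeit.timeit(lambda: arr1d_2D_allNumPy(df), number=repeats) / repeats
print(f"iloc loop:  {t_iloc * 1e3:.3f} ms per call")
print(f"NumPy loop: {t_numpy * 1e3:.3f} ms per call")
```

Both functions only reset the shape attribute, so running them repeatedly on the same frame is harmless: after the first call each array already has its target shape.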


4 Comments

Wow, it really is much faster than the solution I presented in the question. Is it safe to reshape arrays by setting the shape manually instead of calling a reshape function?
@bartekm3 That's right. Especially with .iloc, we already have access to the underlying array data, so it's less messy and just as efficient.
Thank you for your help then :)
Let's see: the dataframe had 100 examples (I took a small slice of the original dataset, since the original approach is too slow to run on the full data), but I think that's enough for a comparison. Original approach: 6.05 s ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Your approach: 10.3 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each), which gives me approx. a 2400x boost. Impressive. Edit: full NumPy approach: 58.2 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each), approx. a 27000x boost.
0

Will this work?

b2 = []
Temp = df.apply(lambda x: b2.append(x.Bitmap.reshape(x.Width,x.Height)), axis=1)
df.Bitmap = b2
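A variation on the same idea, sketched without apply and without appending to a list as a side effect: iterate over the three columns together and collect the reshaped arrays in a 1d object array before assigning it back. The sample data and the helper names (flat, reshaped) are assumptions for the illustration; the capitalized column names follow the question's table.

```python
import numpy as np
import pandas as pd

# Assumed sample data using the capitalized column names from the question's table.
df = pd.DataFrame({'Width': [3, 2], 'Height': [2, 4]})
flat = np.empty(len(df), dtype=object)
flat[0] = np.arange(6)
flat[1] = np.arange(8)
df['Bitmap'] = flat

# Fill a 1d object array element by element so pandas receives one object per row.
reshaped = np.empty(len(df), dtype=object)
for i, (b, w, h) in enumerate(zip(df['Bitmap'], df['Width'], df['Height'])):
    reshaped[i] = b.reshape(w, h)    # reshape returns a new array object
df['Bitmap'] = reshaped

print([a.shape for a in df['Bitmap']])
```

Handing pandas a 1d object array rather than a plain list of 2d arrays is meant to keep it from trying to interpret the data as multi-dimensional, though I have not checked this against every pandas version.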

4 Comments

Seems like it tries to apply this to each of the columns: AttributeError: ("'Series' object has no attribute 'bitmap'", 'occurred at index width')
Now the error is Exception: Data must be 1-dimensional. From the traceback it's not clear in which place the exception occurred.
Ok, I will have another look.
It's not quite elegant but can you try again?
