The initial problem is the following: I have an initial matrix with, let's say, 10 lines and 12 rows. For every line, I want to sum pairs of rows together. At the end I must still have 10 lines, but only 6 rows. Currently, I am doing the following for loop in Python (using initial, which is a pandas DataFrame):

coarse = pd.DataFrame()
for i in range(0, 12, 2):
    # Sum each adjacent pair of columns (i and i+1) into one output column.
    coarse[i] = initial.iloc[:, i:i+2].sum(axis=1)

In fact, I am quite sure that something more efficient is possible. I am thinking of something like a list comprehension, but for a DataFrame or a NumPy array. Does anybody have an idea?
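To be concrete, the list-comprehension style I have in mind would be something like this (untested on my side, so this is only a guess):

# My untested guess: build one summed Series per pair of columns,
# then concatenate them side by side.
pairs = [initial.iloc[:, i] + initial.iloc[:, i + 1] for i in range(0, 12, 2)]
coarse = pd.concat(pairs, axis=1)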

Moreover, I would like to know whether it is better to manipulate large NumPy arrays or pandas DataFrames.

2 Comments
  • Could you add a sample representative input? Commented Mar 31, 2016 at 18:32
  • A dataframe has rows and columns. I assume that your reference to 'lines' above was actually columns because you explicitly mentioned rows. Your sample code above, however, is adding pairs of columns. Commented Mar 31, 2016 at 18:47

1 Answer

Let's create a small sample dataframe to illustrate the solution:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(6, 3))

>>> df
          0         1         2
0  0.548814  0.715189  0.602763
1  0.544883  0.423655  0.645894
2  0.437587  0.891773  0.963663
3  0.383442  0.791725  0.528895
4  0.568045  0.925597  0.071036
5  0.087129  0.020218  0.832620

You can use slice notation to select every other row starting from the first row (::2) and every other row starting from the second row (1::2). iloc is for integer-location based indexing. You then take the values at these locations and add them together. The result is a NumPy array, which you could convert back into a DataFrame if required (a sketch of that follows the next code block).

>>> df.iloc[::2].values + df.iloc[1::2].values
array([[ 1.09369669,  1.13884417,  1.24865749],
       [ 0.82102873,  1.68349804,  1.49255768],
       [ 0.65517386,  0.94581504,  0.9036559 ]])
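If you need a DataFrame rather than an array, you can simply wrap the result; a small sketch (the 0, 1, 2 index below is just the default one pandas assigns):

>>> pd.DataFrame(df.iloc[::2].values + df.iloc[1::2].values)
          0         1         2
0  1.093697  1.138844  1.248657
1  0.821029  1.683498  1.492558
2  0.655174  0.945815  0.903656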

You use values to strip the index, so that pandas does not try to align the two operands by their row labels. This is what happens otherwise:

>>> df.iloc[::2] + df.iloc[1::2].values
          0         1         2
0  1.093697  1.138844  1.248657
2  0.821029  1.683498  1.492558
4  0.655174  0.945815  0.903656

>>> df.iloc[::2].values + df.iloc[1::2]
          0         1         2
1  1.093697  1.138844  1.248657
3  0.821029  1.683498  1.492558
5  0.655174  0.945815  0.903656
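If you would rather stay in pandas, an equivalent alternative is to reset both indices instead of dropping down to arrays, so the operands align row by row:

>>> df.iloc[::2].reset_index(drop=True) + df.iloc[1::2].reset_index(drop=True)
          0         1         2
0  1.093697  1.138844  1.248657
1  0.821029  1.683498  1.492558
2  0.655174  0.945815  0.903656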

For a more general solution:

df = pd.DataFrame(np.random.rand(9, 3))
n = 3  # Number of consecutive rows to group.
# Label each row with its group number: rows 0-2 get 0, rows 3-5 get 1, and so on.
df['group'] = [idx // n for idx in range(len(df.index))]

>>> df.groupby('group').sum()
              0         1         2
group                              
0      1.531284  2.030617  2.212320
1      1.038615  1.737540  1.432551
2      1.695590  1.971413  1.902501
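If the grouping size changes often, a reshape-based alternative parameterizes n directly and skips the temporary group column. This is a minimal sketch (sum_groups is just an illustrative name, and it assumes the number of rows is an exact multiple of n; apply it to the frame before the group column is added):

def sum_groups(frame, n):
    # Reshape to (num_groups, n, num_columns), then sum within each block of n rows.
    a = frame.values
    return pd.DataFrame(a.reshape(-1, n, a.shape[1]).sum(axis=1), columns=frame.columns)

This produces the same totals as the groupby result above, just without the group index.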

3 Comments

Thanks a lot for your answer. The problem is that I have to repeat this operation many times for different sizes of "regrouping sum". For example, my matrix may have 15 rows and 3 columns. Then I first have to compute the sum of rows grouped by 3 (row0+row1+row2, then row3+row4+row5, ... up to the 3 final rows); afterwards I will want to repeat the operation but grouping rows by 5 (row0+...+row4; ...; row10+...+row14). I hope I'm clear enough! Do you know any method faster than the one I proposed, for which I can easily adapt the grouping size?
Thanks a lot Alexander. Just one last question: can you explain to me what "idx // n" means, or point me to a website that explains it?
@orpheu it is integer division, e.g. 5 // 2 = 2
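For reference, here is the group-label expression from the answer evaluated on its own, which shows how // assigns each block of n rows the same label:

>>> n = 3
>>> [idx // n for idx in range(9)]
[0, 0, 0, 1, 1, 1, 2, 2, 2]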
