The initial problem is the following: I have an initial matrix with, let's say, 10 lines and 12 rows. For every line, I want to sum pairs of rows together. At the end I must still have 10 lines, but only 6 rows. Currently, I am doing the following for loop in Python (using initial, which is a pandas DataFrame):

coarse = pd.DataFrame()
for i in range(0, 12, 2):
    # Sum each adjacent pair of columns (i and i+1) into one output column.
    coarse[i] = initial.iloc[:, i:i+2].sum(axis=1)

In fact, I am quite sure that something more efficient is possible. I am thinking of something like a list comprehension, but for a DataFrame or a NumPy array. Does anybody have an idea?
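To be concrete, the list-comprehension style I have in mind would be something like this (untested on my side, so this is only a guess):

# My untested guess: build one summed Series per pair of columns,
# then concatenate them side by side.
pairs = [initial.iloc[:, i] + initial.iloc[:, i + 1] for i in range(0, 12, 2)]
coarse = pd.concat(pairs, axis=1)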

Moreover, I would like to know whether it is better to manipulate large NumPy arrays or pandas DataFrames.

2 Comments
  • Could you add a sample representative input? Commented Mar 31, 2016 at 18:32
  • A dataframe has rows and columns. I assume that your reference to 'lines' above was actually columns because you explicitly mentioned rows. Your sample code above, however, is adding pairs of columns. Commented Mar 31, 2016 at 18:47

1 Answer

Let's create a small sample dataframe to illustrate the solution:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(6, 3))

>>> df
          0         1         2
0  0.548814  0.715189  0.602763
1  0.544883  0.423655  0.645894
2  0.437587  0.891773  0.963663
3  0.383442  0.791725  0.528895
4  0.568045  0.925597  0.071036
5  0.087129  0.020218  0.832620

You can use slice notation to select every other row starting from the first row (::2) and every other row starting from the second row (1::2). iloc is for integer-location based indexing. You then take the values at these locations and add them together. The result is a NumPy array, which you could convert back into a DataFrame if required (a sketch of that follows the next code block).

>>> df.iloc[::2].values + df.iloc[1::2].values
array([[ 1.09369669,  1.13884417,  1.24865749],
       [ 0.82102873,  1.68349804,  1.49255768],
       [ 0.65517386,  0.94581504,  0.9036559 ]])
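If you need a DataFrame rather than an array, you can simply wrap the result; a small sketch (the 0, 1, 2 index below is just the default one pandas assigns):

>>> pd.DataFrame(df.iloc[::2].values + df.iloc[1::2].values)
          0         1         2
0  1.093697  1.138844  1.248657
1  0.821029  1.683498  1.492558
2  0.655174  0.945815  0.903656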

You use values to strip the index, so that pandas does not try to align the two operands by their row labels. This is what happens otherwise:

>>> df.iloc[::2] + df.iloc[1::2].values
          0         1         2
0  1.093697  1.138844  1.248657
2  0.821029  1.683498  1.492558
4  0.655174  0.945815  0.903656

>>> df.iloc[::2].values + df.iloc[1::2]
          0         1         2
1  1.093697  1.138844  1.248657
3  0.821029  1.683498  1.492558
5  0.655174  0.945815  0.903656
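If you would rather stay in pandas, an equivalent alternative is to reset both indices instead of dropping down to arrays, so the operands align row by row:

>>> df.iloc[::2].reset_index(drop=True) + df.iloc[1::2].reset_index(drop=True)
          0         1         2
0  1.093697  1.138844  1.248657
1  0.821029  1.683498  1.492558
2  0.655174  0.945815  0.903656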

For a more general solution:

df = pd.DataFrame(np.random.rand(9, 3))
n = 3  # Number of consecutive rows to group.
# Label each row with its group number: rows 0-2 get 0, rows 3-5 get 1, and so on.
df['group'] = [idx // n for idx in range(len(df.index))]

>>> df.groupby('group').sum()
              0         1         2
group                              
0      1.531284  2.030617  2.212320
1      1.038615  1.737540  1.432551
2      1.695590  1.971413  1.902501
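If the grouping size changes often, a reshape-based alternative parameterizes n directly and skips the temporary group column. This is a minimal sketch (sum_groups is just an illustrative name, and it assumes the number of rows is an exact multiple of n; apply it to the frame before the group column is added):

def sum_groups(frame, n):
    # Reshape to (num_groups, n, num_columns), then sum within each block of n rows.
    a = frame.values
    return pd.DataFrame(a.reshape(-1, n, a.shape[1]).sum(axis=1), columns=frame.columns)

This produces the same totals as the groupby result above, just without the group index.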

3 Comments

Thanks a lot for your answer. The problem is that I have to repeat this operation many times for different sizes of "regrouping sum". For example, my matrix may have 15 rows and 3 columns. Then I first have to compute the sum of rows grouped by 3 (row0+row1+row2, then row3+row4+row5, ... up to the 3 final rows); afterwards I will want to repeat the operation but grouping rows by 5 (row0+...+row4; ...; row10+...+row14). I hope I'm clear enough! Do you know any method faster than the one I proposed, for which I can easily adapt the grouping size?
Thanks a lot Alexander. Just one last question: can you explain to me what "idx // n" means, or point me to a website that explains it?
@orpheu it is integer division, e.g. 5 // 2 = 2
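For reference, here is the group-label expression from the answer evaluated on its own, which shows how // assigns each block of n rows the same label:

>>> n = 3
>>> [idx // n for idx in range(9)]
[0, 0, 0, 1, 1, 1, 2, 2, 2]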
