I have the following pandas DataFrame:

Code      Sum      Quantity
0         -12      0
1          23      0
2         -10      0
3         -12      0
4         100      0
5         102      201
6          34      0
7         -34      0
8         -23      0
9         100      0
10        100      0
11        102      300
12        -23      0
13        -25      0
14        100      123
15        167      167

My desired dataframe is

Code      Sum      Quantity    new_sum
0         -12      0          -12
1          23      0           23
2         -10      0          -10
3         -12      0          -12
4         100      0           0
5         102      201         202 
6          34      0           34
7         -34      0          -34
8         -23      0          -23
9         100      0           0
10        100      0           0
11        102      300         302
12        -23      0          -23
13        -25      0          -25
14        100      123         100 
15        167      167         167

Logic is:

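First, I check for nonzero values in the Quantity column. In the sample data above, the first nonzero Quantity occurs at index 5, where it is 201. Starting from that row, I want to add up the values in the Sum column, moving upward through the preceding rows, until I reach a row with a negative Sum. The total goes into new_sum for that row, and the rows that were added get a new_sum of 0.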
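
As a concrete illustration of this rule, here is a minimal sketch on the first six rows of the sample data. The variable names (`sums`, `quantities`, `absorbed`) are chosen for illustration only, and the sketch handles just the first nonzero Quantity row.

```python
# First six rows of the sample data (Code 0 to 5)
sums = [-12, 23, -10, -12, 100, 102]
quantities = [0, 0, 0, 0, 0, 201]

# Locate the first row with a nonzero Quantity
i = next(k for k, q in enumerate(quantities) if q != 0)

# Walk upward from that row, adding positive Sum values
# until a non-positive one (or the top of the table) is reached
total = sums[i]
absorbed = []  # rows whose new_sum becomes 0
a = i - 1
while a >= 0 and sums[a] > 0:
    total += sums[a]
    absorbed.append(a)
    a -= 1

print(i, total, absorbed)
```

Here row 5 picks up the 100 from row 4 and stops at row 3, whose Sum is -12, which gives the 202 shown in the desired output.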

I have written code that uses nested if statements. However, it takes a long time to run because of the multiple ifs and the row-wise comparison.

current_stock = 0
for i in range(len(test)):
    if(test['Quantity'][i] != 0):
        current_stock = test['Sum'][i]
        if(test['Sum'][i-1] > 0):
            current_stock = current_stock + test['Sum'][i-1]
            test['new_sum'][i-1] = 0
            if(test['Sum'][i-2] > 0):
                current_stock = current_stock + test['Sum'][i-2]
                test['new_sum'][i-2] = 0
                if(test['Sum'][i-3] > 0):
                    current_stock = current_stock + test['Sum'][i-3]
                    test['new_sum'][i-3] = 0
                else:
                    test['new_sum'][i] = current_stock
            else:
                test['new_sum'][i] = current_stock
        else:
            test['new_sum'][i] = current_stock
    else:
        test['new_sum'][i] =  test['Sum'][i]

Is there any better way to do it?

  • Your test code refers to a column called 'stock_volume' but no such column exists. When you say "Then I want to add column sum till I get positive values in the row above index 4", what does that mean? If you add 100 to 0, you get 100 which is positive; not 202. Should they perhaps be greater than the value in 'Quantity'? And in your code, it looks like you only check the 3 previous rows, but nothing about that appears in your logic. Commented Sep 22, 2018 at 19:35
  • @fuglede Edited the question Commented Sep 23, 2018 at 6:32
  • Thanks; I still don't follow why you only check 3 rows in your own example, when that does not appear to be part of the logic, and in the answer below that restriction is removed. Commented Sep 23, 2018 at 9:06
  • It's not the ifs, it's the for. Commented Sep 23, 2018 at 9:43

1 Answer

Let's look at three solutions and provide performance comparisons at the end.

One approach that tries to stay close to pandas would be the following:

def f1(df):
    # Group together the elements of df.Sum that might have to be added:
    # each unbroken run of positive values shares one label, and rows with
    # a non-positive Sum get the label -1
    pos_groups = (df.Sum <= 0).cumsum()
    pos_groups[df.Sum <= 0] = -1
    # Create the new column and populate it with what is in df.Sum
    df['new_sum'] = df.Sum
    # Find the indices of the new column that need to be calculated as a sum
    indices = df[df.Quantity > 0].index
    for i in indices:
        # Find the relevant group of positive integers to be summed, ensuring
        # that we only consider those that come /before/ the one to be calculated
        # (this assumes the default integer index 0, 1, 2, ...)
        group = pos_groups[:i+1] == pos_groups[i]
        rows = group[group].index
        # Zero out all the elements that will be part of the sum; .loc is used
        # instead of chained indexing so that the write reliably reaches df
        df.loc[rows, 'new_sum'] = 0
        # Calculate the actual sum and store that
        df.loc[i, 'new_sum'] = df.loc[rows, 'Sum'].sum()

f1(df)
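
To make the grouping step in f1 easier to follow, here is a small sketch of the labels it produces for the Sum column from the question. The standalone Series and the helper name `nonpos` are used here for illustration only.

```python
import pandas as pd

# Sum column from the question's sample data
sums = pd.Series([-12, 23, -10, -12, 100, 102, 34, -34, -23,
                  100, 100, 102, -23, -25, 100, 167], name="Sum")

nonpos = sums <= 0
# Every non-positive row bumps the counter, so all rows in the same
# unbroken run of positive values end up with the same label
pos_groups = nonpos.cumsum()
# Non-positive rows never take part in a sum; mark them with -1
pos_groups[nonpos] = -1

print(pos_groups.tolist())
# [-1, 1, -1, -1, 3, 3, 3, -1, -1, 5, 5, 5, -1, -1, 7, 7]
```

Rows 4, 5 and 6 share the label 3, but for i = 5 the slice up to i+1 leaves row 6 out, so only rows 4 and 5 are summed.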

One place where there's possibly room for improvement would be in pos_groups[:i+1] == pos_groups[i] which checks all i+1 elements when, depending on what your data looks like, it could probably get away with checking a fraction of those. However, chances are this is still more efficient in practice. If not, you may want to iterate by hand to find the groups:

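import numpy as np

def f2(sums, quantities):
    new_sums = np.copy(sums)
    indices = np.where(quantities > 0)[0]
    for i in indices:
        a = i
        # Walk backwards over the run of positive sums that ends at row i.
        # The a >= 0 check keeps a negative index from wrapping around to
        # the end of the array.
        while a >= 0 and sums[a] > 0:
            s = new_sums[a]
            new_sums[a] = 0
            new_sums[i] += s
            a -= 1
    return new_sums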

df['new_sum'] = f2(df.Sum.values, df.Quantity.values)

Finally, depending once again on what your data looks like, there is a decent chance that the latter approach can be improved using Numba:

from numba import jit
f3 = jit(f2)
df['new_sum'] = f3(df.Sum.values, df.Quantity.values)

For the data provided in the question (which might very well be too small to provide a proper picture) the performance tests look as follows:

In [13]: %timeit f1(df)
5.32 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [14]: %timeit df['new_sum'] = f2(df.Sum.values, df.Quantity.values)
190 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [18]: %timeit df['new_sum'] = f3(df.Sum.values, df.Quantity.values)
178 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Here, most of the time is spent on updating the data frame. If the data were 1000 times larger, the Numba solution would end up being a clear winner:

In [28]: df_large = pd.concat([df]*1000).reset_index()

In [29]: %timeit f1(df_large)
5.82 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit df_large['new_sum'] = f2(df_large.Sum.values, df_large.Quantity.values)
6.27 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [31]: %timeit df_large['new_sum'] = f3(df_large.Sum.values, df_large.Quantity.values)
215 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

4 Comments

@fuglede Thank you so much for your response. Can we include one more condition in the above logic? For example, if Quantity is nonzero for consecutive rows, I do not want to add the numbers together; I want them copied as they are into the new_sum column. I have edited the question.
What would happen if the final -25 were a 25 instead?
Had it been 25 with a nonzero value in the corresponding Quantity column, new_sum would be 25; if Quantity were 0, it would get added to the 100.
Any solution for above scenario?
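
One possible way to handle the follow-up scenario is to adapt the hand-written loop so that it also stops at any earlier row whose Quantity is nonzero. The sketch below is not part of the original answer: the function name `f2_stop_at_quantity` is hypothetical, and it assumes that rows with a nonzero Quantity have a positive Sum, as in the sample data.

```python
import numpy as np

def f2_stop_at_quantity(sums, quantities):
    """Hypothetical variant of f2 that never absorbs rows with nonzero Quantity."""
    new_sums = np.copy(sums)
    for i in np.where(quantities != 0)[0]:
        a = i - 1
        # Absorb preceding rows only while their Sum is positive AND their
        # Quantity is zero; a >= 0 prevents wrap-around indexing
        while a >= 0 and sums[a] > 0 and quantities[a] == 0:
            new_sums[i] += sums[a]
            new_sums[a] = 0
            a -= 1
    return new_sums

# Sample data from the question
sums = np.array([-12, 23, -10, -12, 100, 102, 34, -34, -23,
                 100, 100, 102, -23, -25, 100, 167])
quantities = np.array([0, 0, 0, 0, 0, 201, 0, 0, 0,
                       0, 0, 300, 0, 0, 123, 167])

result = f2_stop_at_quantity(sums, quantities)
print(result.tolist())
# [-12, 23, -10, -12, 0, 202, 34, -34, -23, 0, 0, 302, -23, -25, 100, 167]
```

On the sample data this reproduces the new_sum column from the edited question, including the last two rows (100 and 167), where consecutive nonzero Quantity values keep their own Sum.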
