How to split the values inside the Dataframe using python

Question

My dataframe :

df
Object      quantity
A             3
B             4
C             10
D             11

My desired result:

df
Object      quantity
A             3
B             4
C             4
C             4
C             2
D             4
D             4
D             3

My Goal here is to split the value stored in the column2 "quantity" so that it should be 4 or less than 4.

Which method I can use to solve this problem? any suggestion would be appreciated.

Simon · Accepted Answer · 2018-09-23 10:28:41Z

1

Something like this could work. For each group where quantity is greater than 4, apply a function that does the row splitting and stores into a temp dataframe, then combine everything together to get your desired output:

df = pd.DataFrame({'idx': ['A', 'B', 'C', 'D'],
                   'quantity': [3, 4, 10, 11]})

def split_quant(df):
    quantities = ([4]*(df['quantity'].iat[0] // 4)) + [df['quantity'].iat[0] % 4]

    temp = pd.DataFrame({'idx': df['idx'].iat[0],
                         'quantity': quantities
                         }, index=range(len(quantities)))
    temp = temp[temp['quantity']!=0]

    return temp

df_split = df[df['quantity'] > 4].groupby('idx').apply(split_quant)

output = df[df['quantity'] <= 4].append(df_split).reset_index(drop=True)

writer = pd.ExcelWriter('output.xlsx')
output.to_excel(writer, 'Sheet1', index=False)
writer.save()

The above will give you the following output dataframe:

  idx  quantity
0   A         3
1   B         4
2   C         4
3   C         4
4   C         2
5   D         4
6   D         4
7   D         3

EDIT:

I took the liberty of running some timing tests of the various methods. Using Pandas' groupby and apply saves a lot of time and avoids the nested loops over the input data (although I'm sure theres an even faster way that can avoid the apply as well...)

Mine:

5.49 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Iqbal Basyar:

22.8 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

@sobek

17.7 ms ± 922 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

edited Sep 23, 2018 at 10:28

answered Sep 22, 2018 at 20:54

Simon

10.2k16 gold badges69 silver badges123 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

i1100362 Over a year ago

Hi, Thanks. Your method worked well. Only problem I have is when the number is divisible by 4. ex. 12,16,20etc. I'm getting an extra item with value "0". Is there any possible way to eliminate that while saving it in excel?

Simon Over a year ago

within the function you can eliminate 0 with something like temp = temp[temp['quantity']!=0]. That will just reassign temp so that it doesnt contain any rows that are 0

Simon Over a year ago

saving to excel is pretty straightforward with pandas to_excel method. Docs/example here: pandas.pydata.org/pandas-docs/stable/generated/…

Simon Over a year ago

just updated the answer to handle 0 and to save excel output

Iqbal Basyar · Accepted Answer · 2018-09-22 21:05:45Z

Unfortunately, Pandas did not support this feature. So you have to create a new dataframe based on your old dataframe.

For each item in old dataframe, calculate

old_quantity = n * 4 + rest_quantitity

So in the new dataframe, you will add n item(s) with quantity of 4 , plus one with quantity of rest_quantity (if rest_quantity is not zero)

df = df = pd.DataFrame({'item': ["A","B","C"], 'qty': [3, 8,11]})
new_df = pd.DataFrame({'Item': [], 'qty': []})

for idx, item in df.iterrows():    
  if item['qty'] > 4 :
      n = item['qty'] // 4
      r = item['qty'] % 4 
      for _ in range(n):
          new_df.loc[len(new_df)] = [item['item'], 4]
      if r > 0 :
          new_df.loc[len(new_df)] = [item['item'], r]
  else :
      new_df.loc[len(new_df)] = [item['item'], item['qty']]

df

    item qty
0   A   3
1   B   8
2   C   11

new_df

   Item qty
0   A   3.0
1   B   4.0
2   B   4.0
3   C   4.0
4   C   4.0
5   C   3.0

sobek · Accepted Answer · 2018-09-22 21:38:06Z

1

This works but as far as pandas is concerned, it's not pretty nor is it fast:

df = pd.DataFrame({'idx': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   'quantity': [1., 2., 3., 4., 5., 6., 7.]})

df['factor'] = df.quantity // 4.
df['modulo'] = df.quantity % 4.

res = pd.DataFrame({'idx': [], 'quantity': []})

for idx, row in df.iterrows():
    for idxx in range(int(row.factor)):
        res = res.append({'idx': row.idx, 'quantity': 4.},
                         ignore_index=True)
    if row.modulo > 0:
        res = res.append({'idx': row.idx, 'quantity': row.modulo},
                         ignore_index=True)

In [24]: df
Out[24]: 
  idx  quantity
0   A       1.0
1   B       2.0
2   C       3.0
3   D       4.0
4   E       5.0
5   F       6.0
6   G       7.0

In [22]: res
Out[22]: 
  idx  quantity
0   A       1.0
1   B       2.0
2   C       3.0
3   D       4.0
4   E       4.0
5   E       1.0
6   F       4.0
7   F       2.0
8   G       4.0
9   G       3.0

edited Sep 22, 2018 at 21:38

answered Sep 22, 2018 at 19:15

sobek

1,43612 silver badges32 bronze badges

6 Comments

i1100362 Over a year ago

Hi, I need to split the values of E,F&G and it should be 4 or below 4. that means i have to repeat the items. in this case E=4 and E=1 so that i have two E and both the values are 4 and below 4

sobek Over a year ago

@user10309160 Hmm ok, i really did NOT get that from your initial question, so you want to split each number > 4 into summands of size less than 4.

i1100362 Over a year ago

Performance is not an issue. And Thanks, all the suggestions mentioned here works. But I have a small problem.I am working this in an excel i used

writer = pd.ExcelWriter("Rule.xlsx",engine="openpyxl") excel=openpyxl.load_workbook("Rule.xlsx") writer.book=excel writer.save() writer.close()

but its not overwriting in my excel sheet. Can you tell me why?

sobek Over a year ago

@user10309160 Is there any reason you can't do df.to_excel(writer, 'sheet_name'); writer.save() instead of the second line?

i1100362 Over a year ago

I figured it out. Thanks for the help.

|

Collectives™ on Stack Overflow

How to split the values inside the Dataframe using python

3 Answers 3

4 Comments

Comments

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related