I have a dataframe with stores and their invoice numbers, and I need to find the invoice numbers missing from each store's sequence. For example:

df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
    Store   Invoice
0   A   1
1   A   2
2   A   5
3   A   6
4   A   8
5   B   20
6   B   23
7   B   24
8   B   30
9   C   200
10  C   202
11  C   203
12  D   204
13  D   206

And I want a dataframe like this:

    Store   MissInvoice
0   A   3
1   A   4
2   A   7
3   B   21
4   B   22
5   B   25
6   B   26
7   B   27
8   B   28
9   B   29
10  C   201
11  D   205

Thanks in advance!

  • Note that the DataFrame constructor and the shown data do not match exactly ;) Commented Nov 29, 2022 at 18:48
  • Thanks @Gustavo I updated ;) Commented Nov 29, 2022 at 19:44

2 Answers

4

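You can use groupby.apply to compute, for each store, the set difference between the full range from the minimum to the maximum invoice number and the invoices actually present. Then explode the result into one row per missing number: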

(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: set(range(s.min(), s.max())).difference(s))
    .explode().reset_index()
)

NB: a set is unordered, so to guarantee sorted values use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).

Output:

   Store Invoice
0      A       3
1      A       4
2      A       7
3      B      21
4      B      22
5      B      25
6      B      26
7      B      27
8      B      28
9      B      29
10     C     201
11     D     205
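
For reference, here is a self-contained sketch of the groupby.apply approach above, using the sorted variant. The function name missing_invoices, the rename to MissInvoice, and the dropna step are my own additions, not part of the original answer: I am assuming a store with no gaps should simply produce no rows (an empty list explodes to NaN otherwise).

```python
import pandas as pd

df1 = pd.DataFrame({
    'Store': ['A','A','A','A','A','B','B','B','B','C','C','C','D','D'],
    'Invoice': ['1','2','5','6','8','20','23','24','30',
                '200','202','203','204','206'],
})

def missing_invoices(df):
    """Hypothetical helper: one row per missing invoice number, per store."""
    return (
        df.astype({'Invoice': int})
          .groupby('Store')['Invoice']
          # full range min..max minus the invoices present, sorted per store
          .apply(lambda s: sorted(set(range(s.min(), s.max())).difference(s)))
          .explode()
          .dropna()        # assumption: stores without gaps yield no rows
          .astype(int)     # explode leaves object dtype
          .reset_index(name='MissInvoice')
    )

result = missing_invoices(df1)
print(result)
```

Passing name='MissInvoice' to reset_index gives the column name the question asked for instead of Invoice.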

1

Here's an approach:

import pandas as pd
import numpy as np

df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)

df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store,row in df2.iterrows():
    df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'],row['max']+1), 
                                  df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()

The resulting dataframe df2:

   Store MissInvoice
0      A           3
1      A           4
2      A           7
3      B          21
4      B          22
5      B          25
6      B          26
7      B          27
8      B          28
9      B          29
10     C         201

Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.
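
If the data does include Store D, as in the updated question, the same np.setdiff1d idea can be sketched by iterating over the groups directly instead of filling cells with iterrows. This is a variant of the code above, not a replacement for it; the rows list and the int(...) conversion are my own choices for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Store': ['A','A','A','A','A','B','B','B','B','C','C','C','D','D'],
    'Invoice': ['1','2','5','6','8','20','23','24','30',
                '200','202','203','204','206'],
})
df1['Invoice'] = df1['Invoice'].astype(int)

rows = []
# Iterating a grouped Series yields (store, Series of that store's invoices)
for store, invoices in df1.groupby('Store')['Invoice']:
    full = np.arange(invoices.min(), invoices.max() + 1)
    # setdiff1d returns the sorted values in `full` that are not in `invoices`
    missing = np.setdiff1d(full, invoices.to_numpy())
    rows.extend((store, int(m)) for m in missing)

df2 = pd.DataFrame(rows, columns=['Store', 'MissInvoice'])
print(df2)
```

A store with no gaps contributes no rows here, since np.setdiff1d returns an empty array in that case.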
