How to find missing rows in csv using Pandas?

Question

My CSV file looks something like this:

location  StartDate  EndDate
Austin    10/24/20   10/31/20
Austin    11/28/20   12/05/20
Austin    12/26/20   01/02/21
Austin    10/10/20   10/17/20
Austin    10/03/20   10/10/20
Kansas    10/24/20   10/31/20
Kansas    11/28/20   12/05/20
Kansas    12/26/20   01/02/21
Kansas    10/03/20   10/10/20
Tampa     10/24/20   10/31/20
Tampa     11/28/20   12/05/20
Tampa     10/03/20   10/10/20

As you can see, Kansas is missing 10/10/20 - 10/17/20, and Tampa is missing two records, for 10/10 and 12/26. Is there a way to find these missing records in the file using Pandas and Python?

Perhaps calculate the time delta and, if it is greater than a threshold, flag it as missing. What have you tried / researched so far?
– s3dev, Commented Sep 29, 2020 at 17:30

Quang Hoang · Accepted Answer · 2020-09-29 17:32:08Z

4

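Let's try pivot and stack:

# One row per location, one column per StartDate, EndDate as the values.
# df.pivot(*df) passes the three column names positionally; pandas 2.0 and
# later require keyword arguments, so they are spelled out here.
(df.pivot(index='location', columns='StartDate', values='EndDate')
   .stack(dropna=False)        # keep the empty cells created by the pivot
   .loc[lambda x: x.isna()]    # the (location, StartDate) pairs with no row
)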

Output:

location  StartDate 
Kansas    2020-10-10   NaT
Tampa     2020-10-10   NaT
          2020-12-26   NaT
dtype: datetime64[ns]
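
For anyone who wants to reproduce this end to end, here is a minimal sketch of the same pivot idea on the question's sample data. It is not the answerer's exact code: it assumes the trailing periods in the sample are typos, parses the dates explicitly, and swaps stack(dropna=False) for melt, which keeps empty cells without any dropna handling. The names raw, wide, tall and missing are only illustrative.

```python
import io
import pandas as pd

# Sample data from the question (assumption: trailing periods were typos).
raw = """location StartDate EndDate
Austin 10/24/20 10/31/20
Austin 11/28/20 12/05/20
Austin 12/26/20 01/02/21
Austin 10/10/20 10/17/20
Austin 10/03/20 10/10/20
Kansas 10/24/20 10/31/20
Kansas 11/28/20 12/05/20
Kansas 12/26/20 01/02/21
Kansas 10/03/20 10/10/20
Tampa 10/24/20 10/31/20
Tampa 11/28/20 12/05/20
Tampa 10/03/20 10/10/20
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
for col in ("StartDate", "EndDate"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%y")

# One row per location, one column per StartDate; absent pairs become NaT.
wide = df.pivot(index="location", columns="StartDate", values="EndDate")

# Back to long form. melt keeps the NaT cells, and ignore_index=False
# preserves the location index so it can be restored as a column.
tall = wide.melt(value_name="EndDate", ignore_index=False).reset_index()

# Rows whose EndDate is NaT are the (location, StartDate) pairs with no record.
missing = tall.loc[tall["EndDate"].isna(), ["location", "StartDate"]]
missing = missing.sort_values(["location", "StartDate"]).reset_index(drop=True)
print(missing)
```

On the sample this flags Kansas for 2020-10-10 and Tampa for 2020-10-10 and 2020-12-26, matching the output above.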

answered Sep 29, 2020 at 17:32


3 Comments

RichieV Over a year ago

df.pivot(*df) is a nice trick, but shouldn't we assume that sample dataframes are minimized for the example, and that production data would have a different shape?

G. Anderson Over a year ago

@RichieV I would strongly disagree, and expect a user to provide a representative enough sample of their data to be able to give an acceptable answer. Otherwise we get into a problem with a moving target where the sample keeps getting updated with edge cases

RichieV Over a year ago

@G.Anderson I don't mean to make a big fuss about this, and I agree that it is the questioner's responsibility to clarify how the sample is expected to grow... but I've frequently seen answers being updated precisely because they didn't provide a generalized solution.

RichieV · Answer · 2020-09-29 17:36:04Z

3

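You can use unstack and stack(dropna=False):

# Count rows per (StartDate, EndDate, location), then spread the locations
# into columns; combinations absent from the data become NaN.
df = df.groupby(['StartDate', 'EndDate', 'location']).size().unstack()
# Stack back to long form, keeping the NaN cells instead of dropping them.
df = df.stack(dropna=False).rename('count').reset_index()
missing = df[df['count'].isna()]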

Output

print(missing)
    StartDate   EndDate location  count
4    10/10/20  10/17/20   Kansas    NaN
5    10/10/20  10/17/20    Tampa    NaN
14   12/26/20  01/02/21    Tampa    NaN

Basically, you are building a full grid of every StartDate/EndDate pair against every location. When you unstack, pandas places a NaN wherever a combination of row and column labels is not in the dataframe. When you stack again, pandas drops those NaN values by default, but you can pass dropna=False to keep them, which is exactly what this use case needs.
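
The same grid idea can also be written without unstack/stack at all. The sketch below is a different technique from the code above, offered as an alternative: it builds the expected grid with a cross join and then uses a left merge with indicator=True to mark the combinations that have no matching row. It assumes pandas 1.2 or later (for how="cross") and that the trailing periods in the sample are typos; periods, locations, expected and checked are illustrative names.

```python
import io
import pandas as pd

# Sample data from the question (assumption: trailing periods were typos).
# Dates are left as plain strings here, as in the output shown above.
raw = """location StartDate EndDate
Austin 10/24/20 10/31/20
Austin 11/28/20 12/05/20
Austin 12/26/20 01/02/21
Austin 10/10/20 10/17/20
Austin 10/03/20 10/10/20
Kansas 10/24/20 10/31/20
Kansas 11/28/20 12/05/20
Kansas 12/26/20 01/02/21
Kansas 10/03/20 10/10/20
Tampa 10/24/20 10/31/20
Tampa 11/28/20 12/05/20
Tampa 10/03/20 10/10/20
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Every distinct (StartDate, EndDate) period seen anywhere in the file...
periods = df[["StartDate", "EndDate"]].drop_duplicates()
# ...crossed with every distinct location gives the full expected grid.
locations = df[["location"]].drop_duplicates()
expected = periods.merge(locations, how="cross")

# Left-join the actual rows onto the grid. The _merge column added by
# indicator=True is "left_only" for grid cells with no matching row.
checked = expected.merge(
    df, how="left", on=["StartDate", "EndDate", "location"], indicator=True
)
missing = checked.loc[
    checked["_merge"] == "left_only", ["location", "StartDate", "EndDate"]
]
print(missing)
```

One reason to prefer this form is that it names the key columns explicitly, so it keeps working if the real file has extra columns beyond the three in the sample.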

edited Sep 29, 2020 at 17:36

answered Sep 29, 2020 at 17:30

