2

My CSV file looks something like this

location StartDate EndDate
Austin  10/24/20. 10/31/20
Austin  11/28/20. 12/05/20
Austin  12/26/20. 01/02/21
Austin  10/10/20  10/17/20
Austin  10/03/20. 10/10/20
Kansas  10/24/20. 10/31/20
Kansas  11/28/20. 12/05/20
Kansas  12/26/20. 01/02/21
Kansas  10/03/20. 10/10/20
Tampa   10/24/20. 10/31/20
Tampa   11/28/20. 12/05/20
Tampa   10/03/20. 10/10/20

As you can see Kansas is missing 10/10/20 - 10/17/20 and Tampa is missing 2 records for 10/10 and 12/26. Is there a way to find this missing records from the file using Pandas and python?

1
  • 1
    Perhaps calculate the time delta and if greater than a threshold, flag as missing. What have you tried / researched so far? Commented Sep 29, 2020 at 17:30

2 Answers 2

4

Let's try pivot and unstack:

(df.pivot(*df)
   .stack(dropna=False)
      .loc[lambda x: x.isna()]
)

Output:

location  StartDate 
Kansas    2020-10-10   NaT
Tampa     2020-10-10   NaT
          2020-12-26   NaT
dtype: datetime64[ns]
Sign up to request clarification or add additional context in comments.

3 Comments

df.pivot(*df) is a nice trick but shouldn't we assume that sample dataframes are minimized for example, and production data would have different shape?
@RichieV I would strongly disagree, and expect a user to provide a representative enough sample of their data to be able to give an acceptable answer. Otherwise we get into a problem with a moving target where the sample keeps getting updated with edge cases
@G.Anderson I don't mean to make a big fuzz about this, I agree that it is the questioner's responsibility to clarify how the sample is expected to grow... but I've frequently seen answers being updated precisely because they didn't provide a generalized solution
3

You can use unstack and stack(dropna=False)

df = df.groupby(['StartDate', 'EndDate', 'location']).size().unstack()
df = df.stack(dropna=False).rename('count').reset_index()
missing = df[df['count'].isna()]

Output

print(missing)
    StartDate   EndDate location  count
4    10/10/20  10/17/20   Kansas    NaN
5    10/10/20  10/17/20    Tampa    NaN
14  12/26/20.  01/02/21    Tampa    NaN

Basically you are making a square matrix for all StartDate and all location. When you unstack pandas places a NaN if the combination of row/column labels is not in the dataframe. Then when you stack pandas drops those NaN by default, but you can pass dropna parameter to keep them precisely for this use case.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.