1

I have the below sample dataframe. Each hour has 5 instances. Is there any module or a way to find the missing data in a given column in Python. For instance for hour 2, instance 3 is missing. How can we identify this missing instance in a larger dataset dynamically in Python and print a message that an instance is missing.

Date          Hour  Instance 
2022-10-20     1      1 
2022-10-20     1      2
2022-10-20     1      3
2022-10-20     1      4
2022-10-20     1      5
2022-10-20     2      1
2022-10-20     2      2 
2022-10-20     2      4
2022-10-20     2      5

Thank you.

1
  • 1
    Aggregate to determine the unique value counts of Instance for each hour. Filter for hours whose value counts are not 5. Next, create two lists - unique Dates and unique Hours. Now aggregate over each Instance. Wherever the count is off, that's the Date/Hour with a missing pattern Commented Jan 9, 2023 at 19:13

3 Answers 3

2

Using a crosstab:

df2 = pd.crosstab(df['Hour'], df['Instance'])

out = df2[df2.eq(0)].stack().reset_index()[['Hour', 'Instance']]

Output:

   Hour  Instance
0     2         3
Sign up to request clarification or add additional context in comments.

Comments

1
(df.set_index(["Hour", "Instance"])
   .unstack()
   .isna().where(lambda fr: fr)
   .stack()
   .reset_index()
   [["Hour", "Instance"]])
  • move Hour & Instance pairs to the index
  • then unstack the last one, i.e., Instance
    • this will have a reindexing effect on hours: all of them will have 1 to 5
    • and the missing ones will be marked as NaN
  • now mask the non-NaNs to become NaN, and NaNs to be True...
  • ...because when stack'ed, NaNs will go and only original missings will present
  • move Hour & Instance back to columns and select them to show the desired pairs

to get

   Hour  Instance
0     2         3

Meaning, there was only 1 instance missing, and it was Hour 2's Instance 3.

Comments

0

First, define a function that checks if there is a missing instance.

def is_data_missing(array):
    """Return True when data is missing, ie array is different from range."""
    return list(array) != list(range(1, len(array) + 1))

Then you can apply it to your DataFrame, grouping by hour first.

>>> df.groupby("Hour").apply(lambda x: is_range(x["Instance"].values))
Hour
1    False 
2     True

The resulting DataFrame gives you the hours when data was missing. You can then iterate through the items to print a message.

>>> for hour, is_missing in df_missing.items():
...     if is_missing:
...         print(f"At hour {hour} data is missing")
At hour 2 data is missing

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.