How to check if a pattern of rows is missing in a pandas dataframe in Python

Question

I have the below sample dataframe. Each hour has 5 instances. Is there any module or a way to find the missing data in a given column in Python. For instance for hour 2, instance 3 is missing. How can we identify this missing instance in a larger dataset dynamically in Python and print a message that an instance is missing.

Date          Hour  Instance 
2022-10-20     1      1 
2022-10-20     1      2
2022-10-20     1      3
2022-10-20     1      4
2022-10-20     1      5
2022-10-20     2      1
2022-10-20     2      2 
2022-10-20     2      4
2022-10-20     2      5

Thank you.

Aggregate to determine the unique value counts of Instance for each hour. Filter for hours whose value counts are not 5. Next, create two lists - unique Dates and unique Hours. Now aggregate over each Instance. Wherever the count is off, that's the Date/Hour with a missing pattern — inspectorG4dget
– inspectorG4dget, Commented Jan 9, 2023 at 19:13

mozway · Accepted Answer · 2023-01-09 19:51:50Z

2

Using a crosstab:

df2 = pd.crosstab(df['Hour'], df['Instance'])

out = df2[df2.eq(0)].stack().reset_index()[['Hour', 'Instance']]

Output:

   Hour  Instance
0     2         3

answered Jan 9, 2023 at 19:51

mozway

267k13 gold badges56 silver badges106 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Mustafa Aydın · Accepted Answer · 2023-01-09 19:29:07Z

1

(df.set_index(["Hour", "Instance"])
   .unstack()
   .isna().where(lambda fr: fr)
   .stack()
   .reset_index()
   [["Hour", "Instance"]])

move Hour & Instance pairs to the index
then unstack the last one, i.e., Instance
- this will have a reindexing effect on hours: all of them will have 1 to 5
- and the missing ones will be marked as NaN
now mask the non-NaNs to become NaN, and NaNs to be True...
...because when stack'ed, NaNs will go and only original missings will present
move Hour & Instance back to columns and select them to show the desired pairs

to get

   Hour  Instance
0     2         3

Meaning, there was only 1 instance missing, and it was Hour 2's Instance 3.

answered Jan 9, 2023 at 19:29

Mustafa Aydın

18.4k4 gold badges21 silver badges43 bronze badges

Comments

edd313 · Accepted Answer · 2023-01-09 19:27:28Z

0

First, define a function that checks if there is a missing instance.

def is_data_missing(array):
    """Return True when data is missing, ie array is different from range."""
    return list(array) != list(range(1, len(array) + 1))

Then you can apply it to your DataFrame, grouping by hour first.

>>> df.groupby("Hour").apply(lambda x: is_range(x["Instance"].values))
Hour
1    False 
2     True

The resulting DataFrame gives you the hours when data was missing. You can then iterate through the items to print a message.

>>> for hour, is_missing in df_missing.items():
...     if is_missing:
...         print(f"At hour {hour} data is missing")
At hour 2 data is missing

edited Jan 9, 2023 at 19:27

answered Jan 9, 2023 at 19:20

edd313

1,5792 gold badges12 silver badges25 bronze badges

Collectives™ on Stack Overflow

How to check if a pattern of rows is missing in a pandas dataframe in Python

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related