If you find some counts missing, or you get the error ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read on:
1. Count duplicate rows with NaN entries:
The accepted solution is great and has helped many members. In a recent task, I found it can be fine-tuned to fully count a dataframe with NaN entries. Pandas represents missing or null values as NaN. Let's see the output for this use case when our dataframe contains NaN entries:
Col1 Col2 Col3 Col4
0 ABC 123 XYZ NaN # group #1 of 3
1 ABC 123 XYZ NaN # group #1 of 3
2 ABC 678 PQR def # group #2 of 1
3 MNO 890 EFG abc # group #3 of 4
4 MNO 890 EFG abc # group #3 of 4
5 CDE 234 567 xyz # group #4 of 2
6 ABC 123 XYZ NaN # group #1 of 3
7 CDE 234 567 xyz # group #4 of 2
8 MNO 890 EFG abc # group #3 of 4
9 MNO 890 EFG abc # group #3 of 4
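For reference, the listing above can be reproduced with a short sketch (the DataFrame construction below is an assumption based on the printed values):

```python
import numpy as np
import pandas as pd

# Reconstruct the sample dataframe shown above; NaN entries appear in Col4.
df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'ABC', 'MNO', 'MNO', 'CDE', 'ABC', 'CDE', 'MNO', 'MNO'],
    'Col2': [123, 123, 678, 890, 890, 234, 123, 234, 890, 890],
    'Col3': ['XYZ', 'XYZ', 'PQR', 'EFG', 'EFG', '567', 'XYZ', '567', 'EFG', 'EFG'],
    'Col4': [np.nan, np.nan, 'def', 'abc', 'abc', 'xyz', np.nan, 'xyz', 'abc', 'abc'],
})
print(df)
```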
Applying the code:
df.groupby(df.columns.tolist(), as_index=False).size()
gives:
Col1 Col2 Col3 Col4 size
0 ABC 678 PQR def 1
1 CDE 234 567 xyz 2
2 MNO 890 EFG abc 4
Oh, how come the count of group #1, which has 3 duplicate rows, is missing?!
For some Pandas versions, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements
Solution:
Pass the parameter dropna=False to the .groupby() function, as follows:
df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()
gives:
Col1 Col2 Col3 Col4 size
0 ABC 123 XYZ NaN 3 # <=== count of rows with `NaN`
1 ABC 678 PQR def 1
2 CDE 234 567 xyz 2
3 MNO 890 EFG abc 4
With dropna=False, the count of duplicate rows with NaN entries is now included in the output. This parameter has been supported since Pandas version 1.1.0.
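If you are stuck on a Pandas version older than 1.1.0 without dropna= on .groupby(), one common workaround is to fill NaN with a sentinel before grouping and restore it afterwards. A minimal sketch (the sentinel string '__NA__' is an assumption; pick one that cannot collide with real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'ABC'],
    'Col4': [np.nan, np.nan, 'def'],
})

# Fallback for Pandas < 1.1.0: replace NaN with a sentinel so the group
# is not dropped, count, then restore NaN in the result.
sentinel = '__NA__'  # assumption: must not appear in your actual data
counts = (df.fillna(sentinel)
            .groupby(df.columns.tolist())
            .size()
            .reset_index(name='size')
            .replace(sentinel, np.nan))
print(counts)
```

The `.size().reset_index(name='size')` form is used here because it behaves the same across old and new Pandas versions.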
2. Alternative Solution
Another way to count duplicate rows with NaN entries is as follows:
df.value_counts(dropna=False).reset_index(name='count')
gives:
Col1 Col2 Col3 Col4 count
0 MNO 890 EFG abc 4
1 ABC 123 XYZ NaN 3
2 CDE 234 567 xyz 2
3 ABC 678 PQR def 1
Here, we use the .value_counts() function, also with the parameter dropna=False. However, this parameter has only been supported since Pandas version 1.3.0. If your version is older than this, you'll need to use the .groupby() solution to get complete counts for rows with NaN entries.
You will see that the output is in a different order than the previous result: the counts are sorted in descending order. If you want an unsorted result, you can specify sort=False:
df.value_counts(dropna=False, sort=False).reset_index(name='count')
This gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:
Col1 Col2 Col3 Col4 count
0 ABC 123 XYZ NaN 3
1 ABC 678 PQR def 1
2 CDE 234 567 xyz 2
3 MNO 890 EFG abc 4
Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.
In fact, in the underlying implementation, .value_counts() calls GroupBy.size to get the counts: counts = self.groupby(subset, dropna=dropna).grouper.size()
Hence, for this use case, .value_counts() and the .groupby() approach in the accepted solution are doing the same thing, and either can be used to get the desired counts of duplicate rows equally well.
Using .value_counts() to count duplicate rows has the additional benefit of simpler syntax: use df.value_counts() or df.value_counts(dropna=False) depending on whether your dataframe contains NaN entries. Chain .reset_index() if you want the result as a dataframe instead of a Series.
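As a sketch of the equivalence between the two approaches (the small dataframe below is an assumption for illustration; requires Pandas >= 1.3.0 for value_counts(dropna=False)):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'MNO'],
    'Col4': [np.nan, np.nan, 'abc'],
})

# Route 1: value_counts, unsorted, NaN group kept.
via_value_counts = (df.value_counts(dropna=False, sort=False)
                      .reset_index(name='count'))

# Route 2: groupby with dropna=False, size column renamed to match.
via_groupby = (df.groupby(df.columns.tolist(), as_index=False, dropna=False)
                 .size()
                 .rename(columns={'size': 'count'}))

print(via_value_counts)
print(via_groupby)
```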