Further question: how would I get the mode (i.e. the most common entry) rather than the minimum? In other words, is there a simple way to replace .min in df['min_year'] = s.unstack(level=-1).min(axis=1) so that it returns the most common year rather than the smallest one?
Using Python. I have a DataFrame with three columns:
Author | Title | Review
Each entry under Review contains several years (e.g. '88 '89 '87). I want to sort the rows by the lowest year in each row's Review cell, so that, for example, all the rows whose lowest year is '87 are grouped together.
If I do
df.index = df['Review'].str.extractall(r'(\'\d\d)')
df = df.sort_index(ascending=False).reset_index(drop=True)
I get:
ValueError: Length mismatch: Expected axis has 1005046 elements, new values have 2449016 elements
That is, my original DataFrame has 1005046 rows, but because each row contains about 2.4 years on average, I end up with 2449016 extracted years.
The problem seems to be that extractall creates a new row for each match of the pattern, so the result has about 2.44 times as many rows as the DataFrame I started with and cannot be assigned to df.index.
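To illustrate the mismatch on a small scale, here is a minimal sketch. The three review strings are made-up sample data, not rows from my real DataFrame.

```python
import pandas as pd

# Made-up sample data standing in for df['Review'].
reviews = pd.Series(["'69 '69 '69", "'99 '99", "'97"])

# extractall returns one row per match, with a two-level index:
# the original row label, then the match number within that row.
matches = reviews.str.extractall(r"('\d\d)")

# Six matches come out of only three original rows.
print(len(reviews), len(matches))

# Counting matches per original row shows where the extra rows come from.
per_row = matches.groupby(level=0).size()
print(per_row)
```

Because the result is longer than the original, it cannot be used directly as the new index.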
Here's the output when I call:
print(df['Review'].str.extractall(r'(\'\d\d)').head(10))
output:
           0
  match
0 0      '69
  1      '69
  2      '69
1 0      '99
  1      '99
2 0      '97
3 0      '86
  1      '86
4 0      '96
6 0      '81
That is, row 0 of the original df contained three instances of '69, which become three separate rows after extractall. I need to sort the original rows by each row's smallest year while keeping everything else about the df unchanged.
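One possible approach, sketched below on made-up sample data, is to collapse the extractall result back to one value per original row with groupby(level=0) and then sort on that value. The column names min_year and mode_year are just illustrative, and the sketch assumes that all two-digit years fall in the same century, so they compare correctly as integers.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "Author": ["A", "B", "C", "D"],
    "Title": ["t1", "t2", "t3", "t4"],
    "Review": ["'88 '89 '87", "'99 '99", "no year here", "'87 '90 '87"],
})

# One row per match, indexed by (original row label, match number).
# The capture group keeps only the digits so they can be cast to int.
years = df["Review"].str.extractall(r"'(\d\d)")[0].astype(int)

# Collapse to one value per original row. Assignment aligns on the index,
# so rows without any match end up as NaN.
df["min_year"] = years.groupby(level=0).min()

# Most common year per row. If several years tie, this takes the first
# value that Series.mode() returns.
df["mode_year"] = years.groupby(level=0).agg(lambda s: s.mode().iloc[0])

# A stable sort keeps the original order among rows with equal min_year;
# NaN rows are placed last by default.
result = df.sort_values("min_year", kind="mergesort").reset_index(drop=True)
print(result)
```

Sorting on a helper column rather than on the index leaves Author, Title and Review untouched, and the helper column can be dropped afterwards if it is not needed.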