

Using Python. I have a DataFrame with three columns:

Author | Title | Review

Each entry under Review includes multiple years (e.g. '88 '89 '87). I want to sort by the lowest year in each row's cell, i.e. I want all the rows where '87 is the lowest grouped together.

If I do

df.index = df['Review'].str.extractall(r'(\'\d\d)')
df = df.sort_index(ascending=False).reset_index(drop=True)

I get:

ValueError: Length mismatch: Expected axis has 1005046 elements, new values have 2449016 elements

I.e. my original DataFrame has 1005046 rows, but because each row has about 2.4 years on average, I end up with 2449016 extracted years.

The problem seems to be that extractall creates a new row for each match of the pattern, so I end up with roughly 2.4 times as many rows as I started with.

Here's the output when I call:

print(df['Review'].str.extractall(r'(\'\d\d)').head(10))

output:

               0
    match     
0 0      '69
  1      '69
  2      '69
1 0      '99
  1      '99
2 0      '97
3 0      '86
  1      '86
4 0      '96
6 0      '81

I.e. row 0 in the original df had three instances of '69, which become three separate rows after extractall. I need to sort the original rows by each row's smallest year, keeping everything else about the df intact.

1 Answer

Convert the result of extractall to a series:

s = df['Review'].str.extractall(r'(\'\d\d)').squeeze()

Use the str accessor to convert the values to int:

s = s.str.replace("'", "").astype(int)

Unstack to put extracted values back into rows (with the original index):

s.unstack(level=-1)
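To see what the unstack step produces, here is a minimal sketch on made-up data (the Review strings below are hypothetical stand-ins, not the asker's real rows). It should give one row per original index label and one column per match number, with NaN where a row has fewer matches. Rows with no match at all simply drop out, which is consistent with index 5 being absent from the output in the question.

```python
import pandas as pd

# Hypothetical sample data; the real Review strings are longer.
df = pd.DataFrame({
    "Review": [
        "Ja '95 - Ap '94",         # two years
        "Mr '87",                  # one year
        "no year here",            # no match: this row vanishes after extractall
        "N '69 - D '69 - Ja '70",  # three years
    ]
})

# Long format: one row per match, MultiIndex of (original row, match number)
s = df["Review"].str.extractall(r"(\'\d\d)").squeeze()
s = s.str.replace("'", "").astype(int)

# Wide format: the original row label becomes the index again,
# and the match number becomes the columns
wide = s.unstack(level=-1)
print(wide)
```

Because the wide frame keeps the original index labels, assigning a row-wise aggregate of it back to df lines up by label, and rows without any match end up as NaN.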

Finally, I wouldn't put the lowest year in an index, but rather a column:

df['min_year'] = s.unstack(level=-1).min(axis=1)
df = df.sort_values(by='min_year').drop(['min_year'], axis=1)
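Putting the steps together, a self-contained sketch might look like the following. The Author, Title and Review values are invented for illustration (the first Review string is modelled on the example row given in the comments), and kind="stable" is my own addition so that rows with the same lowest year keep their original relative order.

```python
import pandas as pd

# Hypothetical sample data, loosely modelled on the row format from the comments
df = pd.DataFrame({
    "Author": ["A", "B", "C", "D"],
    "Title": ["t1", "t2", "t3", "t4"],
    "Review": [
        "Intpr- v49 - Ja '95 - pl03 - TT - v51 - Ap '94 - pl97",
        "Ja '88 - F '89 - Mr '87",
        "N '69 - D '69 - Ja '69",
        "Mr '87",
    ],
})

# One row per extracted year, indexed by (original row, match number)
s = df["Review"].str.extractall(r"(\'\d\d)").squeeze()

# Strip the apostrophe and convert to integers
s = s.str.replace("'", "").astype(int)

# Smallest year per original row, aligned back onto df by index label
df["min_year"] = s.unstack(level=-1).min(axis=1)

# Sort by the helper column, then drop it again
result = df.sort_values(by="min_year", kind="stable").drop(columns="min_year")
print(result)
```

With this sample, the two rows whose lowest year is '87 end up next to each other, after the '69 row and before the '94 row.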

6 Comments

The trick is that the years aren't adjacent like that. The rows look like this: Intpr- v49 - Ja '95 - pl03 - [251-500] - TT - v51 - Ap '94 - pl97 - [51-250]
Oh. Can you show the head of df['Review'].str.extractall(r'(\'\d\d)')? First 10 rows maybe.
Yeah. Looks like this: 0 0 '69 1 '69 2 '69 1 0 '99 1 '99 2 0 '97 3 0 '86 1 '86
Except there are three columns: 0 | 0 | '69 is a single column, and so forth. So the extractall function is creating a new row each time there's more than one year in the original row.
Further question: how would I get the mode (i.e. the most common entry) rather than the minimum? In other words, is there a simple way to replace .min in df['min_year'] = s.unstack(level=-1).min(axis=1) to grab the most common year, rather than the smallest?
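One possible way to get the most common year per row, sketched on hypothetical data: skip the unstack and group the long Series by its first index level (the original row label), then take the first value of Series.mode() in each group. This is a different technique from the one in the answer, not a literal one-word swap for .min. When several years tie, mode() returns them sorted, so this picks the smallest of the tied years; the column name mode_year is just an illustrative choice.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Review": [
        "Ja '95 - Ap '94 - Mr '95",  # '95 appears twice
        "Ja '88 - F '89 - Mr '87",   # three-way tie
        "N '69 - D '69",             # '69 appears twice
    ]
})

s = df["Review"].str.extractall(r"(\'\d\d)").squeeze()
s = s.str.replace("'", "").astype(int)

# Group by the original row label (index level 0) and take the first mode;
# ties resolve to the smallest tied year because mode() is sorted
df["mode_year"] = s.groupby(level=0).agg(lambda years: years.mode().iloc[0])
print(df)
```

A closer drop-in for the original line would probably be s.unstack(level=-1).mode(axis=1)[0], since DataFrame.mode also accepts axis=1 and column 0 of its result holds the first mode per row, but I have only sketched the groupby version here.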
