sort DataFrame by substrings in rows

Question

Further question: how would I get the mode (i.e. the most common entry) rather than the minimum? In other words, is there a simple way to replace .min in df['min_year'] = s.unstack(level=-1).min(axis=1) so that it grabs the most common year rather than the smallest?

I'm using Python with pandas. I have a DataFrame with three columns:

Author | Title | Review

Each entry under Review includes multiple years (e.g. '88 '89 '87). I want to sort the rows by the lowest year in each row's Review cell. That is, I want all the rows where '87 is the lowest year grouped together.

If I do

df.index = df['Review'].str.extractall(r'(\'\d\d)')
df = df.sort_index(ascending=False).reset_index(drop=True)

I get:

ValueError: Length mismatch: Expected axis has 1005046 elements, new values have 2449016 elements

That is, my original DataFrame has 1005046 rows, but because each row contains about 2.4 years on average, I end up with 2449016 extracted years.

The problem seems to be that extractall creates a new row for each match of the pattern, so I end up with roughly 2.4 times as many rows as I started with.
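To make the length mismatch concrete, here is a minimal sketch on made-up data (the three Review strings are invented for illustration). It shows that extractall returns one row per match, indexed by the original row label and a match number, so its length generally differs from the length of the original DataFrame:

```python
import pandas as pd

# Toy data, invented for illustration; row 1 contains no year at all.
df = pd.DataFrame(
    {"Review": ["Ja '95 - Ap '94 - Mr '95", "no year here", "Mr '87"]}
)

# One capture group, so the result has a single column named 0.
matches = df["Review"].str.extractall(r"('\d\d)")

# The result has a two-level index: (original row label, match number).
original_rows = matches.index.get_level_values(0).tolist()

print(len(df), len(matches))  # number of rows before and after
print(original_rows)          # which original row each match came from
```

Because the extracted frame has a different number of rows, assigning it to df.index raises the ValueError shown above.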

Here's the output when I call:

print(df['Review'].str.extractall(r'(\'\d\d)').head(10))

output:

               0
    match     
0 0      '69
  1      '69
  2      '69
1 0      '99
  1      '99
2 0      '97
3 0      '86
  1      '86
4 0      '96
6 0      '81

That is, row 0 of the original df had three instances of '69, which become three separate rows after using extractall. I need to sort the original rows by the smallest year in each, keeping everything else about the df intact.

IanS · Accepted Answer · 2017-10-12 13:48:48Z


Convert the result of extractall to a Series:

s = df['Review'].str.extractall(r'(\'\d\d)').squeeze()

Use the str accessor to convert the values to int:

s = s.str.replace("'", "").astype(int)

Unstack to put extracted values back into rows (with the original index):

s.unstack(level=-1)

Finally, I wouldn't put the lowest year in the index, but rather in a column:

df['min_year'] = s.unstack(level=-1).min(axis=1)
df = df.sort_values(by='min_year').drop(['min_year'], axis=1)
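The steps above can be sketched end to end on a toy DataFrame. The data here is made up, and two details are my own substitutions rather than part of the answer: the single result column is selected with [0] instead of squeeze() (squeeze() would return a scalar if there were only one match in total), and regex=False is passed explicitly to str.replace.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Author": ["A", "B", "C"],
    "Title": ["T1", "T2", "T3"],
    "Review": ["v49 - Ja '95 - v51 - Ap '94", "Mr '87 - My '88", "Ja '99"],
})

# One row per match; column 0 holds the matched text such as "'95".
s = df["Review"].str.extractall(r"('\d\d)")[0]

# Strip the apostrophe and convert the two digits to integers.
s = s.str.replace("'", "", regex=False).astype(int)

# Wide layout: one row per original row, one column per match number.
# Rows with fewer matches are padded with NaN.
wide = s.unstack(level=-1)

# min(axis=1) skips NaN by default; assignment aligns on the original index.
df["min_year"] = wide.min(axis=1)
result = df.sort_values(by="min_year").drop(columns="min_year")

print(result)
```

Rows whose Review contains no match at all would get NaN in min_year, and sort_values places those last by default.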

edited Oct 12, 2017 at 13:48

answered Oct 11, 2017 at 17:49

IanS


6 Comments

Slothrop Over a year ago

The trick is that the years aren't adjacent like that. The rows look like this: Intpr- v49 - Ja '95 - pl03 - [251-500] - TT - v51 - Ap '94 - pl97 - [51-250]

IanS Over a year ago

Oh. Can you show the head of df['Review'].str.extractall(r'(\'\d\d)')? First 10 rows maybe.

Slothrop Over a year ago

Yeah. Looks like this: 0 0 '69 1 '69 2 '69 1 0 '99 1 '99 2 0 '97 3 0 '86 1 '86

Slothrop Over a year ago

Except there are three columns: 0 | 0 | '69 is a single column, and so forth. So the extractall function is creating a new row each time there's more than one year in the original row.

Slothrop Over a year ago

Further question: how would I get the mode (ie the most common entry) rather than the minimum? In other words, is there a simple way to replace .min in df['min_year'] = s.unstack(level=-1).min(axis=1) to grab the most common, rather than the smallest number?
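For the mode question in the last comment, one possible approach (a sketch of my own, not something the answer states) is to group the long Series s by the original row label and take the first value returned by Series.mode(). The toy Series below is made up and only mimics the shape of s after the astype(int) step; mode_year is a hypothetical name.

```python
import pandas as pd

# Toy Series shaped like `s` in the answer: integer years indexed by
# (original row label, match number). Values are invented.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)],
    names=[None, "match"],
)
s = pd.Series([69, 69, 70, 99, 98, 97], index=index)

# Series.mode() returns every tied value in ascending order,
# so .iloc[0] picks the smallest of the most common years.
mode_year = s.groupby(level=0).agg(lambda years: years.mode().iloc[0])

print(mode_year)
```

Since mode_year is indexed by the original row labels, it can be assigned to a column (for example df['mode_year'] = mode_year) and used with sort_values in the same way as min_year.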
