Filter pandas DataFrame by string length within group

Question

Let's say I have the following data:

import pandas as pd
df = pd.DataFrame(data=[[1, 'a'], [1, 'aaa'], [1, 'aa'], 
                        [2, 'bb'], [2, 'bbb'], 
                        [3, 'cc']], 
                  columns=['key', 'text'])

   key text
0    1    a
1    1  aaa
2    1   aa
3    2   bb
4    2  bbb
5    3   cc

I would like to group by key, sort the data within each group by the length of text, and end up with a single Series of index values that I can use to reindex the DataFrame. I thought I could just do something like this:

df.groupby('key').text.str.len().sort_values(ascending=False).index

But it said I need to use apply, so I tried this:

df.groupby('key').apply(lambda x: x.text.str.len().sort_values(ascending=False).index, axis=1)

But that told me that lambda got an unexpected keyword: axis.

I'm relatively new to pandas, so I'm not sure how to go about this. My actual goal is simply to deduplicate the data so that, for each key, I keep the row with the longest text. The expected output is:

   key text
1    1  aaa
4    2  bbb
5    3   cc

If there's an easier way to do this than what I'm attempting, I'm open to that as well.

Woody Pride · Accepted Answer · 2017-06-20 18:12:41Z

5

No need for the intermediate step. You can get a series with the string lengths like this:

df['text'].str.len()

Now just group by key and, for each group, return the value at the index where the string length is largest, using idxmax():

In [33]: df.groupby('key').agg(lambda x: x.loc[x.str.len().idxmax()])
Out[33]:
    text
key
1    aaa
2    bbb
3     cc
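
The result above uses key as the index and drops the original row labels. If the original index should survive, as in the expected output in the question, one possible variation (a sketch, not part of the original answer) is to compute the lengths once, take idxmax() per key, and pass those labels to df.loc. The names lengths, longest_idx and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 'a'], [1, 'aaa'], [1, 'aa'],
                        [2, 'bb'], [2, 'bbb'],
                        [3, 'cc']],
                  columns=['key', 'text'])

# Length of each string, aligned with df's index
lengths = df['text'].str.len()

# For each key, the index label of the first longest string
longest_idx = lengths.groupby(df['key']).idxmax()

# Select those rows; original index labels and both columns are kept
result = df.loc[longest_idx]
print(result)
```

Because idxmax() returns the first label at which the maximum occurs, this keeps exactly one row per key even when several strings tie for the longest.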

Scott Boston · Accepted Answer · 2017-06-20 18:07:36Z

3

df.groupby('key', as_index=False).apply(lambda x: x[x.text.str.len() == x.text.str.len().max()])

Output:

     key text
0 1    1  aaa
1 4    2  bbb
2 5    3   cc
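
Unlike idxmax(), this filter keeps every row that ties for the longest text within a key, and the result carries a two-level index produced by apply. If a flat, original index is preferred (an assumption about what is wanted, not something the original answer states), a similar filter can be sketched with transform('max'), which broadcasts each group's maximum length back onto the rows. The names lengths, group_max and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 'a'], [1, 'aaa'], [1, 'aa'],
                        [2, 'bb'], [2, 'bbb'],
                        [3, 'cc']],
                  columns=['key', 'text'])

# Length of each string, aligned with df's index
lengths = df['text'].str.len()

# Maximum length within each key, repeated for every row of that key
group_max = lengths.groupby(df['key']).transform('max')

# Boolean mask: keep rows whose length equals their group's maximum
# (all tied rows are kept, as in the apply-based version)
result = df[lengths == group_max]
print(result)
```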

Riley Hun · Accepted Answer · 2017-06-20 18:12:22Z

1

def get_longest_string(texts):
    # texts holds the 'text' values for one key; compute the max length once
    longest = texts.str.len().max()
    return [t for t in texts if len(t) == longest]

res = df.groupby('key')['text'].apply(get_longest_string).reset_index()

Output:

   key   text
0    1  [aaa]
1    2  [bbb]
2    3   [cc]
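
This approach returns a list per key rather than rows of the original DataFrame. For the sort-then-deduplicate route that the question first describes, a minimal sketch (not taken from any of the answers) is to sort by a temporary length column and keep the first row per key with drop_duplicates. The helper column name _len and the variable deduped are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 'a'], [1, 'aaa'], [1, 'aa'],
                        [2, 'bb'], [2, 'bbb'],
                        [3, 'cc']],
                  columns=['key', 'text'])

deduped = (
    df.assign(_len=df['text'].str.len())        # temporary length column
      .sort_values('_len', ascending=False,
                   kind='mergesort')            # stable sort: ties keep original order
      .drop_duplicates(subset='key', keep='first')  # longest row per key
      .drop(columns='_len')                     # remove the helper column
      .sort_index()                             # restore original row order
)
print(deduped)
```

The stable sort means that when two strings in a key have the same length, the one that appears first in the original data is kept.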
