Python pandas sort_values() with nested list

Question

I want to sort a nested dict in Python via pandas.

import pandas as pd 

# Data structure (nested list):
# {
#   category_name: [[rank, id], ...],
#   ...
# }

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]]
}

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank'])
df.sort_values(['Rank'], ascending=True, inplace=True) # this only sorts the list of lists

Can anyone tell me how to reach my goal? With pandas it's possible to call sort_values() on the second column, but I can't figure out how to sort the nested dict/list inside it.

I want to sort ascending by the rank, not the id.

I changed it; I see why you were confused. I meant to sort by the rank, not the id, based on the data structure sample.
– Patrick, Commented Jun 13, 2021 at 16:10

tdy · Accepted Answer · 2021-06-14 16:19:24Z

5

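The fastest option is to apply list.sort(). It sorts each list in place and returns None, so don't assign the result back to df.Rank in this case. Each [rank, id] pair is compared element by element, so the rank takes priority: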

df.Rank.apply(list.sort)

Or apply sorted() with a custom key and assign back to df.Rank:

df.Rank = df.Rank.apply(lambda row: sorted(row, key=lambda x: x[0]))

Output in either case:

>>> df
         Category                                  Rank
0  category_name1  [[1, 32512], [2, 12345], [3, 32382]]
1  category_name2  [[1, 24623], [3, 12345], [9, 25318]]
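The difference between the two variants can be checked with a small self-contained sketch. It rebuilds the question's data; the helper name make_df is only for illustration and is not part of the answer.

```python
import pandas as pd

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]],
}

def make_df():
    # Hypothetical helper: copy the inner lists so each frame owns its data.
    data = {k: [pair[:] for pair in v] for k, v in all_categories.items()}
    return pd.DataFrame(list(data.items()), columns=["Category", "Rank"])

# In-place variant: list.sort mutates each cell and returns None,
# so the Series returned by apply() holds only None values.
df_inplace = make_df()
returned = df_inplace["Rank"].apply(list.sort)

# Copying variant: sorted() builds a new list, so assign it back.
df_sorted = make_df()
df_sorted["Rank"] = df_sorted["Rank"].apply(
    lambda row: sorted(row, key=lambda x: x[0])
)

print(returned.isna().all())
print(df_inplace["Rank"].tolist() == df_sorted["Rank"].tolist())
```

Assigning `returned` back to the column would replace every list with None, which is why the in-place form is used without assignment.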

This is the perfplot of sort() vs sorted() vs explode():

import perfplot

def explode(df):
    df = df.explode('Rank')
    df['rank_num'] = df.Rank.str[0]
    df = df.sort_values(['Category', 'rank_num']).groupby('Category', as_index=False).agg(list)
    return df

def apply_sort(df):
    df.Rank.apply(list.sort)
    return df

def apply_sorted(df):
    df.Rank = df.Rank.apply(lambda row: sorted(row, key=lambda x: x[0]))
    return df

perfplot.show(
    setup=lambda n: pd.concat([df] * n),
    n_range=[2 ** k for k in range(25)],
    kernels=[explode, apply_sort, apply_sorted],
    equality_check=None,
)

To filter rows by list length, mask the rows with str.len() and loc[]:

mask = df.Rank.str.len().ge(10)
df.loc[mask, 'Rank'].apply(list.sort)
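As a rough illustration of what the mask does, here is a sketch on made-up data with a smaller threshold. The categories "short" and "long" and the name min_len are assumptions for the example, not part of the original data.

```python
import pandas as pd

# Hypothetical data: one list with 2 entries, one with 3.
data = {
    "short": [[2, 10], [1, 20]],
    "long": [[3, 30], [1, 40], [2, 50]],
}
df = pd.DataFrame(list(data.items()), columns=["Category", "Rank"])

min_len = 3  # hypothetical threshold; the answer uses 10
mask = df["Rank"].str.len().ge(min_len)

# Sort only the masked rows. The other rows stay in df, unsorted.
df.loc[mask, "Rank"].apply(list.sort)

# To drop the short rows as well, keep only the masked rows.
filtered = df.loc[mask]

print(df)
print(filtered)
```

Sorting through the mask leaves the row count unchanged; only the `filtered` frame excludes the short lists.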

edited Jun 14, 2021 at 16:19

answered Jun 14, 2021 at 3:57

tdy


4 Comments

Patrick Over a year ago

Thank you very much, also for the insights and the perfplot. I'm just curious: do you know how to ignore all Rank entries with fewer than N=10 entries?

tdy Over a year ago

@Patrick you're welcome. to filter lists by length, you can mask those rows with str.len() and loc[] (answer updated)

Patrick Over a year ago

Thank you again, but the code doesn't count or show only entries with n>10, it's the same output from before. I mean the len(all_categories['Rank']) > 10

tdy Over a year ago

@Patrick correct, the current code just limits the sorting to the masked rows but still retains all the rows. if you want those other rows to be removed, you can do something like this instead: df = df.loc[mask]; df.Rank.apply(list.sort)

he xiao · Accepted Answer · 2021-06-14 03:31:04Z

1

Try

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank']).explode('Rank')
# Order the exploded [rank, id] rows by rank (sorted(x) would only reorder each pair)
df = df.sort_values('Rank', key=lambda s: s.str[0])

df = df.groupby('Category').agg(list).reset_index()

To convert back to a dict:

dict(df.agg(list, axis=1).values)
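A possibly simpler way to do the same conversion, offered as a sketch rather than as part of the answer above, is to zip the two columns. It assumes the frame already holds one row per category with a sorted Rank list.

```python
import pandas as pd

# One row per category, Rank already sorted (values from the question).
df = pd.DataFrame(
    [
        ("category_name1", [[1, 32512], [2, 12345], [3, 32382]]),
        ("category_name2", [[1, 24623], [3, 12345], [9, 25318]]),
    ],
    columns=["Category", "Rank"],
)

# Pair each category name with its list of [rank, id] entries.
sorted_categories = dict(zip(df["Category"], df["Rank"]))

print(sorted_categories)
```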

answered Jun 14, 2021 at 3:31

he xiao


irc1209 · Accepted Answer · 2021-06-13 15:16:09Z

0

Try:

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank'])
df.set_index('Rank', inplace=True)
df.sort_index(inplace=True)
df.reset_index(inplace=True)

Or:

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank'])
df = df.set_index('Rank').sort_index().reset_index()

answered Jun 13, 2021 at 15:16

irc1209


1 Comment

Patrick Over a year ago

It's not working, same result as above, if I sort the list of lists. It's not even sorted by id

Vishnudev Krishnadas · Accepted Answer · 2021-06-14 04:36:35Z

0

It is much more efficient to use df.explode and then sort the values. It will be vectorized.

df = df.explode('Rank')
df['rank_num'] = df.Rank.str[0]

# Parentheses are required for a method chain that spans several lines
(
    df.sort_values(['Category', 'rank_num'])
      .groupby('Category', as_index=False)
      .agg(list)
)

Output

         Category                                  Rank   rank_num
0  category_name1  [[1, 32512], [2, 12345], [3, 32382]]  [1, 2, 3]
1  category_name2  [[1, 24623], [3, 12345], [9, 25318]]  [1, 3, 9]
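If the helper column is not wanted in the result, one option is to drop it before grouping. The following is a minimal sketch of the same explode approach on the question's data; the names exploded and result are only for illustration.

```python
import pandas as pd

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]],
}
df = pd.DataFrame(list(all_categories.items()), columns=["Category", "Rank"])

# One row per [rank, id] pair.
exploded = df.explode("Rank")
# .str[0] picks the first element of each pair, i.e. the rank.
exploded["rank_num"] = exploded["Rank"].str[0]

result = (
    exploded.sort_values(["Category", "rank_num"])
    .drop(columns="rank_num")  # helper column is only needed for sorting
    .groupby("Category", as_index=False)
    .agg(list)
)

print(result)
```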

answered Jun 14, 2021 at 4:36

Vishnudev Krishnadas


1 Comment

tdy Over a year ago

i did some timings and explode seems to be slower than apply in this case (i guess because explode still requires groupby+agg)
