Pandas groupby assign nested array of ndarrays back to dataframe

Question

I have a dataset on which I run a groupby with a comparison based on two columns, and the result is a set of NumPy arrays. What I am trying to do is put them back into the dataframe.

Logic: I have this dataframe df with the following columns: individual (the id), cluster, a, b. Pasting it here for reproduction purposes:

individual  cluster a   b
9710556 0   180.82  140
9710556 0   180.82  140
9710556 0   202.32  145
9710556 1   218.32  145
9710556 1   250.82  140

For every row, I am trying to count the rows whose a and b values are both strictly greater than that row's a and b values, first within every id (the onIndiv column below) and then within every id and cluster (the onIndivCluster column below). This is the desired output:

individual  cluster a   b   onIndiv onIndivCluster
9710556 0   180.82  140 2   1
9710556 0   180.82  140 2   1
9710556 0   202.32  145 0   0
9710556 1   218.32  145 0   0
9710556 1   250.82  140 0   0
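The counting rule described above can be checked by hand on the five sample rows. The following is a minimal sketch that applies the pairwise comparison to the whole sample at once, without any grouping; the variable names (less, dominated, on_indiv) are illustrative and do not come from the original post.

```python
import numpy as np

# (a, b) pairs for the five sample rows of individual 9710556
values = np.array([
    [180.82, 140],
    [180.82, 140],
    [202.32, 145],
    [218.32, 145],
    [250.82, 140],
])

# less[i, j, k] is True when row i is smaller than row j in column k
less = values[:, None, :] < values[None, :, :]   # shape (5, 5, 2)

# dominated[i, j] is True when row i is smaller than row j in BOTH columns
dominated = less.all(axis=2)                     # shape (5, 5)

# For each row i, count the rows j that exceed it in both a and b
on_indiv = dominated.sum(axis=1)
print(on_indiv)   # [2 2 0 0 0]
```

The first two rows are exceeded in both columns by rows 2 and 3 only; the last row has a larger a but the same b, so the strict comparison excludes it.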

This is a function I came up with which does this:

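import numpy as np

def ranker(df):
  values = df[["a", "b"]].values
  # result[i, j, k] is True when row i is smaller than row j in column k
  result = values[:, None] < values
  # keep the pairs that are smaller in both columns, then count them per row
  return np.logical_and.reduce(result, axis = 2).sum(axis = 1)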

df.groupby("individual").apply(ranker)
Out[192]: 
individual
9710556    [2, 2, 0, 0, 0]
dtype: object

df.groupby(["individual", "cluster"]).apply(ranker)

Out[169]:
individual  cluster
9710556     0          [1, 1, 0]
            1             [0, 0]
dtype: object

How can I assign these results to the original dataframe to get my desired output?

jezrael · Accepted Answer · 2019-11-26 15:08:27Z

1

Unfortunately, apply here aggregates each group's rows into a single value, so you get one list per group. To prevent this, return a one-column DataFrame that keeps the group's index:

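import numpy as np
import pandas as pd

def ranker(df):
  values = df[["a", "b"]].values
  result = values[:, None] < values
  a = np.logical_and.reduce(result, axis = 2).sum(axis = 1)
  # keep the group's own index so the result aligns with the original rows
  return pd.DataFrame({0:a}, index=df.index)

# group_keys=False keeps the original row index (pandas 2.0+ would otherwise prepend the group keys)
df['onIndiv'] = df.groupby("individual", group_keys=False).apply(ranker)
df['onIndivCluster'] = df.groupby(["individual", "cluster"], group_keys=False).apply(ranker)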
print (df)
   individual  cluster       a    b  onIndiv  onIndivCluster
0     9710556        0  180.82  140        2               1
1     9710556        0  180.82  140        2               1
2     9710556        0  202.32  145        0               0
3     9710556        1  218.32  145        0               0
4     9710556        1  250.82  140        0               0

Alternatively, add the new column inside the function. For a more flexible solution, pass the new column name through a lambda function:

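def ranker(df, name):
  values = df[["a", "b"]].values
  result = values[:, None] < values
  # assign returns a new frame instead of modifying the group in place
  return df.assign(**{name: np.logical_and.reduce(result, axis = 2).sum(axis = 1)})

# group_keys=False keeps the original row index instead of prepending the group keys
# (pandas 2.2 deprecates passing the grouping columns to the function, so this variant may lose them on later versions)
df = df.groupby("individual", group_keys=False).apply(lambda x: ranker(x, 'onIndiv'))
df = df.groupby(["individual", "cluster"], group_keys=False).apply(lambda x: ranker(x, 'onIndivCluster'))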

print (df)
   individual  cluster       a    b  onIndiv  onIndivCluster
0     9710556        0  180.82  140        2               1
1     9710556        0  180.82  140        2               1
2     9710556        0  202.32  145        0               0
3     9710556        1  218.32  145        0               0
4     9710556        1  250.82  140        0               0
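For comparison, the same two columns can also be filled without groupby.apply at all. This is a minimal sketch of a swapped-in alternative, not the answer's method: the helpers count_dominating and grouped_counts are hypothetical names, and the sketch assumes the dataframe has a unique index so that each group's labels can be written back with .loc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "individual": [9710556] * 5,
    "cluster": [0, 0, 0, 1, 1],
    "a": [180.82, 180.82, 202.32, 218.32, 250.82],
    "b": [140, 140, 145, 145, 140],
})

def count_dominating(values):
    """For each row, count the rows that are strictly larger in every column."""
    less = values[:, None, :] < values[None, :, :]
    return less.all(axis=2).sum(axis=1)

def grouped_counts(frame, keys):
    """Hypothetical helper: run count_dominating per group, aligned to frame.index."""
    out = pd.Series(0, index=frame.index, dtype="int64")
    for _, group in frame.groupby(keys):
        # group.index holds the original row labels (assumed unique)
        out.loc[group.index] = count_dominating(group[["a", "b"]].to_numpy())
    return out

df["onIndiv"] = grouped_counts(df, "individual")
df["onIndivCluster"] = grouped_counts(df, ["individual", "cluster"])
print(df)
```

Because the loop writes each group's counts back under the group's own index labels, the result does not depend on how apply chooses to combine per-group return values.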

edited Nov 26, 2019 at 15:08

answered Nov 26, 2019 at 13:18


2 Comments

Emil Mirzayev Over a year ago

Thank you for your reply! Can you please explain to me the logic behind returning a dataframe?

jezrael Over a year ago

@EmilMirzayev - Unfortunately, apply here aggregates each group's rows, so you get lists. Returning a one-column DataFrame prevents that, and creating the new column inside the function works too.

magraf · Answer · 2019-11-26 13:41:45Z

0

Check out the pandas df.rank() function. It makes tasks like this very easy.

Once you have the ranks of both columns, you could simply select the highest rank from both. However, from what I understand, your basic assumption also includes a dilemma:

If rows i and j have the properties a_i > a_j and b_i < b_j, who gets the higher rank ;) - you will probably have to decide on a first and second level of ranking.

answered Nov 26, 2019 at 13:41


2 Comments

Emil Mirzayev Over a year ago

Yes, I have used ranks for another purpose, where I was not comparing based on BOTH values at the same time. Rank is very handy

magraf Over a year ago

So if you want the strict number, I would guess, you could just use the max value per row of the ranks from both columns as a counter?
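One way to probe the rank idea is to compare per-column ranks against the joint count on a tiny example. The sketch below rests on an interpretation that is not stated in the thread: it treats rank(ascending=False, method="min") - 1 as the number of strictly larger values in a column. The three-row frame toy and both helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def strictly_greater_counts(col):
    """Per-column count of strictly larger values, via descending 'min' ranks."""
    return col.rank(ascending=False, method="min").astype(int) - 1

def joint_counts(frame):
    """Count, per row, the rows that are strictly larger in both a and b."""
    values = frame[["a", "b"]].to_numpy()
    return (values[:, None] < values).all(axis=2).sum(axis=1)

# Hypothetical example: row 0 is beaten once in a (by row 1) and once in b
# (by row 2), but never in both columns by the same row.
toy = pd.DataFrame({"a": [1, 2, 0], "b": [1, 0, 2]})

per_column = pd.DataFrame({
    "greater_a": strictly_greater_counts(toy["a"]),
    "greater_b": strictly_greater_counts(toy["b"]),
})

# The joint count can never exceed the smaller of the two per-column counts
upper_bound = per_column.min(axis=1)
joint = joint_counts(toy)

print(per_column)
print(upper_bound.tolist())  # [1, 0, 0]
print(joint.tolist())        # [0, 0, 0]
```

Under this reading, combining per-column ranks gives only an upper bound on the strict count: rows 1 and 2 are incomparable with row 0 in the sense of the dilemma above, so the pairwise comparison is still needed for an exact result.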
