1

I have a dataset where I use groupby and a comparison based on two columns, and the result comes back as NumPy arrays. What I am trying to do is put them back into the dataframe.

Logic: I have this dataframe df with the following columns: individual, cluster, a, b. Pasting it here for reproduction purposes:

individual  cluster       a    b
   9710556        0  180.82  140
   9710556        0  180.82  140
   9710556        0  202.32  145
   9710556        1  218.32  145
   9710556        1  250.82  140
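
For anyone reproducing this, the table above can be built directly in pandas. This is a minimal sketch: the integer dtype for `b` and the default integer index are assumptions, since the paste does not show them.

```python
import pandas as pd

# Sample data copied from the table above; the dtypes are an assumption
df = pd.DataFrame({
    "individual": [9710556] * 5,
    "cluster": [0, 0, 0, 1, 1],
    "a": [180.82, 180.82, 202.32, 218.32, 250.82],
    "b": [140, 140, 145, 145, 140],
})
print(df)
```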

For every row, I am trying to count the rows whose a and b values are both strictly greater than that row's a and b, first within every individual (the onIndiv column below) and then within every individual and cluster (the onIndivCluster column below). This is the desired output:

individual  cluster       a    b  onIndiv  onIndivCluster
   9710556        0  180.82  140        2               1
   9710556        0  180.82  140        2               1
   9710556        0  202.32  145        0               0
   9710556        1  218.32  145        0               0
   9710556        1  250.82  140        0               0

This is a function I came up with which does this:

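import numpy as np

def ranker(df):
    values = df[["a", "b"]].values
    # result[i, j, k] is True when row i is smaller than row j in column k
    result = values[:, None] < values
    # For each row i, count the rows j that are larger in both columns
    return np.logical_and.reduce(result, axis=2).sum(axis=1)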

df.groupby("individual").apply(ranker)
Out[192]: 
individual
9710556    [2, 2, 0, 0, 0]
dtype: object

df.groupby(["individual", "cluster"]).apply(ranker)

Out[169]:
individual  cluster
9710556     0          [1, 1, 0]
            1             [0, 0]
dtype: object

How can I assign these results to the original dataframe to get my desired output?

2 Answers

1

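Unfortunately, apply here treats the array returned for each group as a single aggregated value, so you get one list per group. To prevent this, return a one-column DataFrame that keeps the group's index: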

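import numpy as np
import pandas as pd

def ranker(df):
    values = df[["a", "b"]].values
    result = values[:, None] < values
    a = np.logical_and.reduce(result, axis=2).sum(axis=1)
    # Keep the group's index so the counts align with the original rows
    return pd.DataFrame({0: a}, index=df.index)

# group_keys=False keeps the original row index in the result
# (newer pandas versions otherwise prepend the group keys)
df['onIndiv'] = df.groupby("individual", group_keys=False).apply(ranker)
df['onIndivCluster'] = df.groupby(["individual", "cluster"], group_keys=False).apply(ranker)
print(df)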
   individual  cluster       a    b  onIndiv  onIndivCluster
0     9710556        0  180.82  140        2               1
1     9710556        0  180.82  140        2               1
2     9710556        0  202.32  145        0               0
3     9710556        1  218.32  145        0               0
4     9710556        1  250.82  140        0               0

Alternatively, add the new column inside the function. For a more flexible solution, pass the new column name in through a lambda function:

def ranker(df, name):
    values = df[["a", "b"]].values
    result = values[:, None] < values
    df[name] = np.logical_and.reduce(result, axis=2).sum(axis=1)
    return df

# group_keys=False keeps the original row index in the result
# (newer pandas versions otherwise prepend the group keys)
df = df.groupby("individual", group_keys=False).apply(lambda x: ranker(x, 'onIndiv'))
df = df.groupby(["individual", "cluster"], group_keys=False).apply(lambda x: ranker(x, 'onIndivCluster'))

print (df)
   individual  cluster       a    b  onIndiv  onIndivCluster
0     9710556        0  180.82  140        2               1
1     9710556        0  180.82  140        2               1
2     9710556        0  202.32  145        0               0
3     9710556        1  218.32  145        0               0
4     9710556        1  250.82  140        0               0
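
If the way `apply` reshapes its results gets in the way, the same columns can also be filled by position. The sketch below is an alternative, not the approach used in this answer: `add_dominated_count` is a hypothetical helper, and it relies on the `indices` attribute of the groupby object, which maps each group label to the integer positions of its rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "individual": [9710556] * 5,
    "cluster": [0, 0, 0, 1, 1],
    "a": [180.82, 180.82, 202.32, 218.32, 250.82],
    "b": [140, 140, 145, 145, 140],
})

def add_dominated_count(frame, keys, name):
    """Hypothetical helper: per row, count rows in the same group that are
    strictly larger in both a and b, and return a copy with that column."""
    values = frame[["a", "b"]].to_numpy()
    counts = np.zeros(len(frame), dtype=int)
    # Each entry of .indices holds the integer row positions of one group
    for positions in frame.groupby(keys).indices.values():
        block = values[positions]
        less = block[:, None, :] < block[None, :, :]
        counts[positions] = less.all(axis=2).sum(axis=1)
    return frame.assign(**{name: counts})

df = add_dominated_count(df, "individual", "onIndiv")
df = add_dominated_count(df, ["individual", "cluster"], "onIndivCluster")
print(df)
```

Because the counts are written into a preallocated array by position, the original row order and index are untouched and nothing has to be realigned afterwards.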

2 Comments

Thank you for your reply! Can you please explain the logic behind returning a DataFrame?
@EmilMirzayev - Unfortunately, apply here treats each returned array as one aggregated value, so you get lists. If you return a one-column DataFrame you can prevent that; creating the new column inside the function works too.
0

Check out the pandas df.rank() function. It makes this kind of thing very easy.

Once you have the ranks of both columns, you could simply select the highest rank from the two. However, from what I understood, your basic assumption also includes a dilemma:

If rows i and j have the properties a_i > a_j and b_i < b_j, who gets the higher rank? ;) You will probably have to decide on a first and a second level of ranking.
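
The dilemma can be made concrete with a small sketch. The three points below are hypothetical, chosen so that no row is larger than another in both columns. Per-column counts derived from `rank` only bound the joint count from above, so they are not a drop-in replacement for the pairwise comparison in the question.

```python
import pandas as pd

# Hypothetical points: no row is larger than another in both columns
pts = pd.DataFrame({"a": [1, 2, 0], "b": [1, 0, 2]})

# With method="min" and descending order, rank - 1 is the number of
# strictly larger values in that column
larger_per_column = (pts.rank(method="min", ascending=False) - 1).astype(int)

# The joint count can never exceed the smaller of the two per-column counts
upper_bound = larger_per_column.min(axis=1)

# Direct pairwise count of rows that are larger in both columns
values = pts.to_numpy()
joint = (values[:, None, :] < values[None, :, :]).all(axis=2).sum(axis=1)

print(upper_bound.tolist())  # [1, 0, 0]
print(joint.tolist())        # [0, 0, 0]
```

For the first row, one other row is larger in `a` and a different row is larger in `b`, so the rank-based bound says 1 while the true joint count is 0.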

2 Comments

Yes, I have used ranks for another purpose, where I was not comparing based on BOTH values at the same time. Rank is very handy.
So if you want the strict number, I would guess you could just use the max value per row of the ranks from both columns as a counter?
