Slice multiple dataframes based on different value ranges in a specific column and categorize them in new columns

Question

Is there any way to select the values of a given column that fall within 5 specific ranges and, for each dataframe, add the matching label in a new column?

I mean, I have a list of dataframes. All dataframes have 2 columns and share the same first column, but differ in the second (header and values). For example:

I would like to:

For each dataframe in the list, calculate the probability of each value occurring within 1 of 5 different ranges, and append a new column with those values;
For each dataframe in the list, attach the respective range label in another new column.

Where the ranges are:

*Range_Values* -> *Range_Label*

   **[0]**     ->   'l1'

  **]0,1]**    ->   'l2'

 **]1,10]**    ->   'l3'

**]10,100]**   ->   'l4'

  **>100**     ->   'l5'

These 2 steps would lead to something like:

>> list_dfs[df1]
   GeneID    A    Prob_val     Exp_prof
      1     0.3     0.4         'l2'
      2     0.0     0.2         'l1'
      3     143     0.2         'l5'
      4      9      0.2         'l3'
      5     0.6     0.4         'l2'

Vivek Kalyanarangan · Accepted Answer · 2018-08-27 11:16:54Z

1

You first have to define the bins and labels. Five labels need six bin edges, and using -inf as the lowest edge gives the exact zeros their own bin (this assumes the column has no negative values) -

import pandas as pd

bins = [float("-inf"), 0, 1, 10, 100, float("inf")]
labels = ['l1', 'l2', 'l3', 'l4', 'l5']

Then use pd.cut(). By default the bins are closed on the right, which matches the ]a,b] ranges in the question -

pd.cut(df1['A'], bins)

There is a labels parameter in pd.cut() that you can use to get labels -

pd.cut(df1['A'], bins, labels=labels)

You can use the bins generated to compute probabilities; I leave it up to you to do that.

You can do this for the rest of the dfs in a loop. First put them in a list -

list_dfs = [df1, df2, ...]

If you have a dynamic number of dfs, use a loop -

for df in list_dfs:
    values = df.iloc[:, 1]  # second column, since its header differs per dataframe
    df['bins'] = pd.cut(values, bins)
    df['label'] = pd.cut(values, bins, labels=labels)
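The probability step that is left open above could be sketched as follows. This is only one possible reading of the question, where the probability of a row is the relative frequency of its label within its own dataframe. The sample frames `df1` and `df2`, the column names `Exp_prof` and `Prob_val`, and the `-inf` lower edge (which assumes no negative values) are assumptions made for illustration.

```python
import pandas as pd

# Assumed edges: -inf as the lowest edge gives exact zeros their own bin,
# which only matches the question's ranges if the data has no negative values.
bins = [float("-inf"), 0, 1, 10, 100, float("inf")]
labels = ["l1", "l2", "l3", "l4", "l5"]

# Hypothetical sample frames; the second column header differs per frame.
df1 = pd.DataFrame({"GeneID": [1, 2, 3, 4, 5], "A": [0.3, 0.0, 143, 9, 0.6]})
df2 = pd.DataFrame({"GeneID": [1, 2, 3, 4, 5], "B": [0.0, 0.0, 50, 2, 7]})
list_dfs = [df1, df2]

for df in list_dfs:
    value_col = df.columns[1]  # second column, whatever its header is
    df["Exp_prof"] = pd.cut(df[value_col], bins, labels=labels)
    # Relative frequency of each label within this frame, as a plain dict
    freq = df["Exp_prof"].value_counts(normalize=True).to_dict()
    # Map on plain strings so the result is an ordinary float column
    df["Prob_val"] = df["Exp_prof"].astype(object).map(freq)

print(df1)
```

Mutating the frames inside the loop keeps the original list intact, so `list_dfs` can be used directly afterwards.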

edited Aug 27, 2018 at 11:16

answered Aug 27, 2018 at 10:43

Vivek Kalyanarangan


3 Comments

João Fernandes Over a year ago

Although it is a good answer, something must be wrong with the labels. They are not working properly: they do not match the proper ranges.

Vivek Kalyanarangan Over a year ago

@JoãoFernandes you were right. I have updated the answer to include l6. The reason is that I decided to include float("inf") to capture values greater than 100.

ysearka Over a year ago

Following your code, the label l1 will be given to the bin [0,1], since the singleton [0] won't be taken into account. You could consider adding a new category for this in order to match the desired mapping.

ysearka · Accepted Answer · 2018-08-27 11:28:17Z

1

For the labels and bins, you can use pandas.cut. Note that you can't use a singleton as a bin in this function. Therefore you will have to create it afterwards. Here is how you can do this.

First I recreate one of your dataframes:

import io

import numpy as np
import pandas as pd

temp = u"""
GeneID    A
      1     0.3
      2     0.0
      3     143
      4      9
      5     0.6"""
# sep=r"\s+" splits the columns on any run of whitespace
foo = pd.read_csv(io.StringIO(temp), sep=r"\s+")

Then I create the new column and fill the NaN values with the label l1 which corresponds to the singleton [0].

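# Right-closed bins ]0,1], ]1,10], ]10,100], ]100,inf]; an exact 0 falls outside them and becomes NaN
foo['Exp_prof'] = pd.cut(foo['A'], bins=[0, 1, 10, 100, np.inf], labels=['l2', 'l3', 'l4', 'l5'])
# 'l1' has to exist as a category before it can be used as a fill value
foo['Exp_prof'] = foo['Exp_prof'].cat.add_categories(['l1'])
foo['Exp_prof'] = foo['Exp_prof'].fillna('l1')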
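To see why the fill step is needed, here is a small check on made-up values (the series `values` is hypothetical). With the default `right=True`, `pd.cut` builds intervals that are open on the left, so the lowest edge itself belongs to no bin.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.0, 0.3, 1.0, 9.0, 143.0])

# Default right=True: each interval is open on the left and closed on the
# right, so the lowest edge (0) belongs to no bin and comes back as NaN.
binned = pd.cut(values, bins=[0, 1, 10, 100, np.inf])
print(binned)

# Only the exact zero is left unbinned.
print(values[binned.isna()].tolist())  # prints [0.0]
```

Note that filling every NaN with 'l1' would also relabel negative or missing values, so this shortcut is only safe when the column is known to hold non-negative numbers.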

And I use this new column to compute the probabilities:

foo['Prob_val'] = foo.Exp_prof.map((foo.Exp_prof.value_counts()/len(foo)).to_dict())

And the output is:

    GeneID  A       Exp_prof    Prob_val
0   1       0.3     l2          0.4
1   2       0.0     l1          0.2
2   3       143.0   l5          0.2
3   4       9.0     l3          0.2
4   5       0.6     l2          0.4
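Since the question is about a list of dataframes whose second column has a different header each time, the steps above could be wrapped in a helper and applied to every frame. This is a sketch only: `add_profile_columns`, `df1`, `df2` and `list_dfs` are hypothetical names, and it assumes the values sit in the second column and are non-negative.

```python
import numpy as np
import pandas as pd

def add_profile_columns(df):
    """Return a copy of df with Exp_prof and Prob_val columns added.

    Hypothetical helper: assumes the values are in the second column and
    are non-negative, so NaN after pd.cut can only mean an exact zero.
    """
    out = df.copy()
    values = out.iloc[:, 1]  # select by position, the header differs per frame
    prof = pd.cut(values, bins=[0, 1, 10, 100, np.inf],
                  labels=["l2", "l3", "l4", "l5"])
    # Exact zeros are NaN at this point; give them the singleton label.
    out["Exp_prof"] = prof.cat.add_categories(["l1"]).fillna("l1")
    # Relative frequency of each label within this frame.
    freq = (out["Exp_prof"].value_counts() / len(out)).to_dict()
    out["Prob_val"] = out["Exp_prof"].astype(object).map(freq)
    return out

# Made-up sample frames sharing the first column.
df1 = pd.DataFrame({"GeneID": [1, 2, 3, 4, 5], "A": [0.3, 0.0, 143, 9, 0.6]})
df2 = pd.DataFrame({"GeneID": [1, 2, 3, 4, 5], "B": [0.0, 0.0, 50, 2, 7]})
list_dfs = [df1, df2]

result = [add_profile_columns(df) for df in list_dfs]
print(result[0])
```

Returning copies leaves the input frames untouched; if in-place changes are preferred, the same body can run directly inside a for loop over the list.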

edited Aug 27, 2018 at 11:28

answered Aug 27, 2018 at 11:02

ysearka


2 Comments

João Fernandes Over a year ago

That works just fine, thank you! The probability calculation is based on the A column range, or on the frequency of the labels just added as well. See, in this case, the label l2 has a prob value of 0.4 since it is 2/5.

ysearka Over a year ago

I edited my answer to add the computation of these probabilities.
