1

I am using the following code to calculate the quartiles of a given data set:

#!/usr/bin/python

import numpy as np

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]

p1 = 25
p2 = 50
p3 = 75

q1 = np.percentile(series,  p1)
q2 = np.percentile(series,  p2)
q3 = np.percentile(series,  p3)

print('percentile(' + str(p1) + '): ' + str(q1))
print('percentile(' + str(p2) + '): ' + str(q2))
print('percentile(' + str(p3) + '): ' + str(q3))

The percentile function returns the quartiles. However, I would also like to get the indices that it used to mark the boundaries of the quartiles. Is there any way to do this?

2 Comments

  • Is the data always sorted? Or else, this question wouldn't make sense, unless I'm missing something. But if it is sorted, then you can directly calculate the index. Commented Mar 22, 2017 at 17:35
  • @juanpa.arrivillaga Yes, the data is always sorted. Commented Mar 22, 2017 at 17:54
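
For sorted data, the direct index calculation suggested in the first comment might look like the sketch below. It assumes NumPy's default linear interpolation, where the q-th percentile of n values sits at the fractional position (n - 1) * q / 100. The helper name percentile_positions is made up for illustration and is not part of NumPy.

```python
import math

import numpy as np


def percentile_positions(n, q):
    """Return (lower_index, upper_index, fraction) for the q-th percentile
    of n sorted values, assuming NumPy's default linear interpolation."""
    pos = (n - 1) * q / 100.0      # fractional position in the sorted data
    lower = math.floor(pos)        # index of the element just below
    upper = math.ceil(pos)         # index of the element just above
    return lower, upper, pos - lower


series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]

for q in (25, 50, 75):
    lo, hi, frac = percentile_positions(len(series), q)
    # Rebuild the percentile from its two neighbours and compare with NumPy.
    value = series[lo] + frac * (series[hi] - series[lo])
    assert np.isclose(value, np.percentile(series, q))
    print(q, lo, hi, value)
```

For the 16-element series this yields the index pairs (3, 4), (7, 8) and (11, 12) for the 25th, 50th and 75th percentiles.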

3 Answers

1

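Since the data is sorted, you can use numpy.searchsorted, which returns the indices at which the values would have to be inserted to maintain sorted order. The side argument ('left' or 'right') controls whether you get the index of the first matching element or the index just past the last one.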

>>> np.searchsorted(series,q1)
1
>>> np.searchsorted(series,q1,side='right')
11
>>> np.searchsorted(series,q2)
1
>>> np.searchsorted(series,q3)
11
>>> np.searchsorted(series,q3,side='right')
13
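
The calls above could be wrapped into one helper that reports the quartile value together with its left and right insertion indices. This is only a sketch: the function name quartile_index_ranges is hypothetical, and it assumes the input is already sorted.

```python
import numpy as np


def quartile_index_ranges(sorted_data, percentiles=(25, 50, 75)):
    """Map each percentile to (value, left_index, right_index).

    Assumes sorted_data is sorted in ascending order.
    """
    values = np.percentile(sorted_data, percentiles)
    left = np.searchsorted(sorted_data, values, side="left")
    right = np.searchsorted(sorted_data, values, side="right")
    return {
        p: (float(v), int(l), int(r))
        for p, v, l, r in zip(percentiles, values, left, right)
    }


series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]
print(quartile_index_ranges(series))
# {25: (2.0, 1, 11), 50: (2.0, 1, 11), 75: (5.0, 11, 13)}
```

If a percentile is interpolated to a value that does not occur in the data, the left and right indices coincide.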

0

Assuming that the data is always sorted (thanks @juanpa.arrivillaga), you can use the rank method from the Pandas Series class. rank() takes several arguments. One of them is pct:

pct : boolean, default False

Computes percentage rank of data

Tied values can be ranked in different ways, and this also affects the percentage rank. The behaviour is controlled by the argument method:

method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}

You need the method "max":

max: highest rank in group

Let's look at the output of the rank() method with these parameters:

import numpy as np
import pandas as pd

series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]

S = pd.Series(series)
percentage_rank = S.rank(method="max", pct=True)
print(percentage_rank)

This gives you the percentage rank of every entry in the Series, i.e. the fraction of values that are less than or equal to it:

0     0.0625
1     0.6875
2     0.6875
3     0.6875
4     0.6875
5     0.6875
6     0.6875
7     0.6875
8     0.6875
9     0.6875
10    0.6875
11    0.8125
12    0.8125
13    0.8750
14    0.9375
15    1.0000
dtype: float64

In order to retrieve the index for the three percentiles, you look up the first element in the Series that has an equal or higher percentage rank than the percentile you're interested in. The index of that element is the index that you need.

index25 = S.index[percentage_rank >= 0.25][0]
index50 = S.index[percentage_rank >= 0.50][0]
index75 = S.index[percentage_rank >= 0.75][0]

print("25 percentile: index {}, value {}".format(index25, S[index25]))
print("50 percentile: index {}, value {}".format(index50, S[index50]))
print("75 percentile: index {}, value {}".format(index75, S[index75]))

This gives you the output:

25 percentile: index 1, value 2
50 percentile: index 1, value 2
75 percentile: index 11, value 5
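
The same lookup can probably be done without pandas. The sketch below assumes the data is sorted in ascending order and that the requested fraction lies in (0, 1]; the function name first_index_at_or_above is made up for illustration.

```python
import numpy as np


def first_index_at_or_above(sorted_data, fraction):
    """Index of the first element whose 'max' percentage rank is >= fraction.

    Assumes sorted_data is ascending and 0 < fraction <= 1.
    """
    a = np.asarray(sorted_data)
    # For sorted data, side="right" counts the values <= each element,
    # which matches rank(method="max"); dividing by the size gives pct=True.
    pct_rank = np.searchsorted(a, a, side="right") / a.size
    # argmax returns the position of the first True in the boolean mask.
    return int(np.argmax(pct_rank >= fraction))


series = [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 6, 7, 8]
for fraction in (0.25, 0.50, 0.75):
    idx = first_index_at_or_above(series, fraction)
    print("fraction {}: index {}, value {}".format(fraction, idx, series[idx]))
```

For this series the indices are 1, 1 and 11, the same as in the pandas output above.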

-1

Try this:

import numpy as np
import pandas as pd
series = [1,2,2,2,2,2,2,2,2,2,2,5,5,6,7,8]
thresholds = [25,50,75]
output = pd.DataFrame([np.percentile(series,x) for x in thresholds], index = thresholds, columns = ['quartiles'])
print(output)

By making it a dataframe, you can assign the index pretty easily.

2 Comments

I'm not sure how this answers the question... I'm not sure I understand the question though...
@juanpa.arrivillaga I assumed that the question was about structuring the output...
