How do I calculate percentiles with python/numpy?

Question

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel's percentile function.

A related question on computation of percentiles from frequencies: stackoverflow.com/questions/25070086/…
– newtover, Commented Oct 10, 2019 at 7:33
A related question for pandas data frame: python - Find percentile stats of a given column
– Timur Shtatland, Commented Nov 16, 2023 at 22:48

Mateen Ulhaq · Accepted Answer · 2023-08-19 11:04:29Z

403

NumPy has np.percentile().

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50)  # return 50th percentile, i.e. median.

>>> print(p)
3.0

SciPy has scipy.stats.scoreatpercentile(), in addition to many other statistical goodies.
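np.percentile also accepts a sequence of percentiles, so several cut points can be computed in one call. A minimal sketch reusing the array from above (the name quartiles is only illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Passing a list of percentiles returns one value per requested percentile.
quartiles = np.percentile(a, [25, 50, 75])
print(quartiles)  # [2. 3. 4.]
```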

edited Aug 19, 2023 at 11:04

Mateen Ulhaq


answered Mar 3, 2010 at 20:24

Jon W


4 Comments

Uri Over a year ago

Thank you! So that's where it's been hiding. I was aware of scipy but I guess I assumed simple things like percentiles would be built into numpy.

Anaphory Over a year ago

By now, a percentile function exists in numpy: docs.scipy.org/doc/numpy/reference/generated/…

patricksurry Over a year ago

You can use it as an aggregation function as well, e.g. to compute the tenth percentile of each group of a value column by key, use df.groupby('key')[['value']].agg(lambda g: np.percentile(g, 10))

Tim Diels Over a year ago

Note that SciPy recommends using np.percentile for NumPy 1.9 and higher.

Boris Gorelik · Accepted Answer · 2011-09-15 06:37:56Z

94

By the way, there is a pure-Python implementation of a percentile function, in case one doesn't want to depend on SciPy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}
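A usage sketch may help with two points raised in the comments: the list has to be sorted before the call, and functools.partial only pre-fills percent, so N is still passed when median is called. The block repeats the recipe's logic in compact form so that it runs on its own; the sample values and the names ordered, pairs and first_field_median are assumptions made for illustration.

```python
import functools
import math

def percentile(N, percent, key=lambda x: x):
    """Linear-interpolation percentile of an already sorted list (same logic as the recipe)."""
    if not N:
        return None
    k = (len(N) - 1) * percent
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return key(N[int(k)])
    return key(N[f]) * (c - k) + key(N[c]) * (k - f)

# partial pre-fills percent; the list is supplied later as the first argument.
median = functools.partial(percentile, percent=0.5)

values = [5, 1, 4, 2, 3]
ordered = sorted(values)          # the recipe requires sorted input
p50 = percentile(ordered, 0.5)    # percent is a fraction from 0.0 to 1.0, not 0-100
p90 = percentile(ordered, 0.9)    # interpolates between 4 and 5
med = median(ordered)

# key extracts the number to use from each element, here the first tuple field.
pairs = sorted([(3, "c"), (1, "a"), (2, "b")])
first_field_median = percentile(pairs, 0.5, key=lambda x: x[0])
```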

edited Sep 15, 2011 at 6:37

answered May 2, 2010 at 11:46

Boris Gorelik


6 Comments

Wai Yip Tung Over a year ago

I am the author of the above recipe. A commenter in ASPN has pointed out the original code has a bug. The formula should be d0 = key(N[int(f)]) * (c-k); d1 = key(N[int(c)]) * (k-f). It has been corrected on ASPN.

Richard Over a year ago

How does percentile know what to use for N? It isn't specified in the function call.

kevin Over a year ago

for those who didn't even read the code, before using it, N must be sorted

dsanchez Over a year ago

I'm confused by the lambda expression. What does it do and how does it do it? I know what lambda expression are so I am not asking what lambda is. I am asking what does this specific lambda expression do and how is it doing it, step-by-step? Thanks!

Elias Schoof Over a year ago

The lambda function lets you transform the data in N before calculating a percentile. Say you actually have a list of tuples N = [(1, 2), (3, 1), ..., (5, 1)] and you want to get the percentile of the first element of the tuples, then you choose key=lambda x: x[0]. You could also apply some (order-changing) transformation to the list elements before calculating a percentile.


Xavier Guihot · Accepted Answer · 2019-04-23 22:48:39Z

47

Starting with Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (i.e. the median)
# 3.0

quantiles returns for a given distribution dist a list of n - 1 cut points separating the n quantile intervals (division of dist into n continuous intervals with equal probability):

statistics.quantiles(dist, *, n=4, method='exclusive')

where n, in our case (percentiles) is 100.
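The method argument changes the result near the extremes. The following sketch, assuming Python 3.8 or later, compares the two methods on the same data; the variable names are illustrative.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]

# Default method='exclusive': cut points may fall outside the observed range.
exclusive = quantiles(data, n=100)
# method='inclusive' assumes the data already contains the extreme values.
inclusive = quantiles(data, n=100, method='inclusive')

p99_exclusive = exclusive[98]   # index 98 is the 99th percentile
p99_inclusive = inclusive[98]
print(p99_exclusive, p99_inclusive)
```

With method='inclusive' every cut point stays between min(data) and max(data), which is usually what people expect when they compare against spreadsheet results.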

answered Apr 23, 2019 at 22:48

Xavier Guihot


1 Comment

Amaimersion Over a year ago

Just a note. With method="exclusive" p99 can be larger than maximum value in original list. If it is not what you want, i.e. you want p100 = max, then use method="inclusive".

richie · Accepted Answer · 2013-06-12 07:45:24Z

37

import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print(np.percentile(a, 95)) # gives the 95th percentile

answered Jun 12, 2013 at 7:45

richie


Comments

Pavel Vlasov · Accepted Answer · 2021-05-19 13:54:25Z

29

Here's how to calculate a percentile without NumPy, using only Python.

import math

def percentile(data, perc: int):
    size = len(data)
    return sorted(data)[int(math.ceil((size * perc) / 100)) - 1]

percentile([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], 90)
# 9.0
percentile([142, 232, 290, 120, 274, 123, 146, 113, 272, 119, 124, 277, 207], 50)
# 146
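One edge case of the function above: for perc=0 the computed index is -1, which returns the largest element. A hedged variant that clamps the rank is sketched below; the name percentile_nearest_rank and the input checks are additions for illustration, not part of the original answer.

```python
import math

def percentile_nearest_rank(data, perc):
    """Nearest-rank percentile; perc is a number from 0 to 100."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= perc <= 100:
        raise ValueError("perc must be between 0 and 100")
    ordered = sorted(data)
    # Clamp the rank to at least 1 so that perc=0 returns the minimum
    # instead of wrapping around to index -1 (the maximum).
    rank = max(1, math.ceil(len(ordered) * perc / 100))
    return ordered[rank - 1]

sample = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
print(percentile_nearest_rank(sample, 90))  # 9.0
print(percentile_nearest_rank(sample, 0))   # 1.0
```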

edited May 19, 2021 at 13:54

Pavel Vlasov


answered Mar 23, 2013 at 16:35

Ashkan


1 Comment

Ashkan Over a year ago

Yes, you have to sort the list before: mylist=sorted(...)

mpounsett · Accepted Answer · 2017-02-08 03:39:05Z

13

The definition of percentile I usually see expects as a result the value from the supplied list below which P percent of values are found... which means the result must be from the set, not an interpolation between set elements. To get that, you can use a simpler function.

import math

def percentile(N, P):
    """
    Find the percentile of a list of values

    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    # Round half up. Python 3's round() rounds halves to the nearest even
    # integer, which would change the results shown below.
    n = int(math.floor(P * len(N) + 1))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print(percentile(A, P=0.3))
# 4
# print(percentile(A, P=0.8))
# 9
# print(percentile(B, P=0.3))
# 20
# print(percentile(B, P=0.8))
# 50

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

def percentile(N, P):
    n = int(math.floor(P * len(N) + 1))
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

def percentile(N, P):
    n = max(int(math.floor(P * len(N) + 1)), 2)
    return N[n-2]

edited Feb 8, 2017 at 3:39

answered Sep 18, 2011 at 20:05

mpounsett


6 Comments

hansaplast Over a year ago

thanks, I also expect percentile/median to result actual values from the sets and not interpolations

marco Over a year ago

Hi @mpounsett. Thank you for the code above. Why does your percentile always return integer values? The percentile function should return the N-th percentile of a list of values, and this can be a float number too. For example, the Excel PERCENTILE function returns the following percentiles for your examples above: 3.7 = percentile(A, P=0.3), 8.2 = percentile(A, P=0.8), 23 = percentile(B, P=0.3), 42 = percentile(B, P=0.8).

mpounsett Over a year ago

It's explained in the first sentence. The more common definition of percentile is that it is the number in a series below which P percent of values in the series are found. Since that is the index number of an item in a list, it cannot be a float.

ijustlovemath Over a year ago

This doesn't work for the 0'th percentile. It returns the maximum value. A quick fix would be to wrap the n = int(...) in a max(int(...), 1) function

mpounsett Over a year ago

To clarify, do you mean in the second example? I get 0 rather than the maximum value. The bug is actually in the else clause.. I printed the index number rather than the value I intended to. Wrapping the assignment of 'n' in a max() call would also fix it, but you'd want the second value to be 2, not 1. You could then eliminate the entire if/else structure and just print the result of N[n-2]. 0th percentile works fine in the first example, returning '1' and '15' respectively.


karthikr · Accepted Answer · 2014-10-21 01:19:35Z

6

Check the scipy.stats module:

 scipy.stats.scoreatpercentile

edited Oct 21, 2014 at 1:19

karthikr


answered Jul 22, 2011 at 0:53

Evert


Comments

Italo Gervasio · Accepted Answer · 2020-01-17 16:24:48Z

A convenient way to calculate percentiles for a one-dimensional numpy sequence or matrix is by using numpy.percentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html>. Example:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # return 50th percentile, i.e. the median.
p90 = np.percentile(a, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

However, if there is any NaN value in your data, the above function will return NaN. The recommended function to use in that case is the numpy.nanpercentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html> function, which ignores NaN values:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # return 50th percentile, i.e. the median.
p90 = np.nanpercentile(a_NaN, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1

With both functions you can also choose the interpolation mode. The examples below show the effect of each mode. Note that NumPy 1.22 renamed the interpolation keyword to method; the old keyword is deprecated in newer releases.

import numpy as np

b = np.array([1,2,3,4,5,6,7,8,9,10])
print('percentiles using default interpolation')
p10 = np.percentile(b, 10) # return 10th percentile.
p50 = np.percentile(b, 50) # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90) # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "linear")
p10 = np.percentile(b, 10,interpolation='linear') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='linear') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='linear') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "lower")
p10 = np.percentile(b, 10,interpolation='lower') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='lower') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='lower') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1 , median =  5  and p90 =  9

print('percentiles using interpolation = ', "higher")
p10 = np.percentile(b, 10,interpolation='higher') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='higher') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='higher') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  6  and p90 =  10

print('percentiles using interpolation = ', "midpoint")
p10 = np.percentile(b, 10,interpolation='midpoint') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='midpoint') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='midpoint') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.5 , median =  5.5  and p90 =  9.5

print('percentiles using interpolation = ', "nearest")
p10 = np.percentile(b, 10,interpolation='nearest') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='nearest') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='nearest') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  5  and p90 =  9

If your input array consists only of integer values, you might want the percentile as an integer. If so, choose an interpolation mode such as 'lower', 'higher', or 'nearest'.
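Since NumPy 1.22 the same modes are selected with the method keyword, and interpolation is deprecated. A small sketch that works on both older and newer releases is shown below; the helper name percentile_with_mode is hypothetical, and the fallback assumes that older NumPy raises TypeError for the unknown method keyword.

```python
import numpy as np

def percentile_with_mode(data, q, mode):
    """Call np.percentile with the given mode on old and new NumPy releases."""
    try:
        # NumPy >= 1.22 spells the option 'method'.
        return np.percentile(data, q, method=mode)
    except TypeError:
        # Older releases only know the 'interpolation' keyword.
        return np.percentile(data, q, interpolation=mode)

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(percentile_with_mode(b, 50, "lower"))     # 5
print(percentile_with_mode(b, 50, "higher"))    # 6
print(percentile_with_mode(b, 50, "midpoint"))  # 5.5
```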

Thanks for mentioning the interpolation option; without it, the outputs were misleading.

Roei Bahumi · Accepted Answer · 2017-08-02 12:54:16Z

2

To calculate the percentile rank of each element in a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print({val: round(percentile, 3) for val, percentile in zip(a, calc_percentile(a))})
# {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}
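If SciPy is not available, the method='min' ranking can be approximated with NumPy alone. This is a sketch under that assumption, not the rankdata implementation itself; the function name percentile_rank_min is made up for the example.

```python
import numpy as np

def percentile_rank_min(values):
    """Rank each value (ties share the lowest rank) and divide by the number of values."""
    arr = np.asarray(values)
    ordered = np.sort(arr)
    # side='left' counts the elements strictly smaller than each value,
    # so adding 1 gives the 'min' rank used for ties.
    ranks = np.searchsorted(ordered, arr, side="left") + 1
    return ranks / arr.size

print(percentile_rank_min([10, 20, 20, 30]))  # [0.25 0.5  0.5  1.  ]
```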

answered Aug 2, 2017 at 12:54

Roei Bahumi


Comments

ClimateUnboxed · Accepted Answer · 2018-03-22 12:55:45Z

1

In case you need the answer to be a member of the input numpy array:

By default, the percentile function in numpy calculates the output as a linear weighted average of the two neighboring entries in the input vector. If you want the returned percentile to be an actual element of the vector, from v1.9.0 onwards you can use the "interpolation" option with either "lower", "higher" or "nearest".

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of the two vector entries that border the percentile.

edited Mar 22, 2018 at 12:55

answered Mar 22, 2018 at 9:09

ClimateUnboxed


Comments

Ropali Munshi · Accepted Answer · 2019-03-06 08:39:22Z

1

For a pandas Series, use the describe function.

Suppose you have a DataFrame df with the columns sales and id, and you want to calculate percentiles for sales. Then it works like this:

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0: minimum
1: maximum
0.1: 10th percentile, and so on
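When only the percentile values are needed, without the count, mean and standard deviation that describe also reports, Series.quantile is a more direct option. A minimal sketch with made-up sales figures (the data and the name deciles are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sales figures, used only for illustration.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "sales": [10, 20, 30, 40, 50]})

# quantile takes fractions between 0 and 1 and interpolates linearly by default.
deciles = df["sales"].quantile([0.1, 0.5, 0.9])
print(deciles)
```

The result is a Series indexed by the requested fractions, so a single value can be read with, for example, deciles.iloc[1] for the median.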

edited Mar 6, 2019 at 8:39

Ropali Munshi


answered Mar 6, 2019 at 6:56

ashwini


Comments

ListenSoftware Louise Ai Agent · Accepted Answer · 2021-03-29 20:32:44Z

I bootstrapped the data and then plotted the confidence interval for 10 samples. The confidence interval shows the range between the 5th and 95th percentiles.

 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 import numpy as np
 import json
 import dc_stat_think as dcst

 data = [154, 400, 1124, 82, 94, 108]
 #print (np.percentile(data,[0.5,95])) # gives the 95th percentile

 bs_data = dcst.draw_bs_reps(data, np.mean, size=6*10)

 #print(np.reshape(bs_data,(24,6)))

 x= np.linspace(1,6,6)
 print(x)
 for (item1,item2,item3,item4,item5,item6) in bs_data.reshape((10,6)):
     line_data=[item1,item2,item3,item4,item5,item6]
     ci=np.percentile(line_data,[5,95]) # np.percentile expects percentages from 0 to 100
     mean_avg=np.mean(line_data)
     fig, ax = plt.subplots()
     ax.plot(x,line_data)
     ax.fill_between(x, (line_data-ci[0]), (line_data+ci[1]), color='b', alpha=.1)
     ax.axhline(mean_avg,color='red')
     plt.show()
