
Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.

For instance, given:

x = array([ 0.62529759, -0.08202699,  0.59220673, -0.09074541,  0.05517865,
        0.20153703,  0.22773723, -0.26229708,  0.76137555, -0.61229314,
        0.27292745,  0.35596795, -0.01373896,  0.32464979, -0.22932331,
        1.14796175,  0.17268531,  0.40692172,  0.13846154,  0.22752953,
        0.13087359,  0.14111479, -0.09932381,  0.12800392,  0.02605917,
        0.18776078,  0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
       -0.09171721,  0.23083624, -0.21603973,  0.05425592,  0.47910286,
        0.26359565, -0.19917942,  0.40182097, -0.0797546 ,  0.47239264,
       -0.36654449,  0.4513859 , -0.00282486, -0.13950512, -0.05375369,
        0.03331833,  0.48951555, -0.13760504,  2.788     , -0.15017848,
        0.02930675,  0.10910646,  0.03868301, -0.048482  ,  0.7277376 ,
        0.08841259, -0.10968462,  0.50371324,  0.86379698,  0.01674877,
        0.19542421, -0.06639165,  0.74500856, -0.10148342,  0.02482331,
        0.79195804,  0.40401969,  0.25120005,  0.21020794, -0.01767013,
       -0.13453783, -0.09605592, -0.88044229,  0.04689623,  0.09043851,
        0.21232286,  0.34129982, -0.3736799 ,  0.17313858])

I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.
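For reference, the purely empirical version of this comparison needs no fitting at all; a minimal sketch, assuming x is the NumPy array above:

import numpy as np

# Empirical CDF at 0.3: the fraction of sample values that 0.3 exceeds.
# Any fitted density should roughly agree with this raw-data baseline.
print(np.mean(x < 0.3))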

  • stackoverflow.com/questions/41974615/… Commented Aug 2, 2017 at 15:28
  • @Joe I don't have the explicit representation of the density function, so I can't pass in anything for scipy integrate. Is there a workaround with just the data? Commented Aug 2, 2017 at 15:30
  • calculate the cdf from the data? stackoverflow.com/a/41980574/7919597 then pdf? Commented Aug 2, 2017 at 15:34
  • stats.stackexchange.com/questions/23/… Commented Aug 2, 2017 at 15:36
  • @Joe thanks for the tip, looks useful. However, my data is not normal so using scipy.norm would provide very inaccurate estimates. If possible, I could equivalently use any way to fit an explicit function to my data such that it can return an equation for me to integrate. Would you know of anything of that nature? Commented Aug 2, 2017 at 15:38

4 Answers


I personally like using the scipy.stats package. It has a useful implementation of Kernel Density Estimation. Basically, it estimates the probability density function of the data as a sum of kernels (Gaussians, in scipy's case) centered on the data points; the smoothing bandwidth is a parameter you can set. Look at the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation

And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation

Once you have built your KDE, you can evaluate it like a pdf and integrate it to get probabilities. For example, to estimate the probability of observing a value as large as or larger than 0.3:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

kde = stats.gaussian_kde(x)

# visualize the KDE over the full range of the data
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-1.5, 3, num=200)
ax.plot(x_eval, kde(x_eval), 'k-')

# probability of a value >= 0.3 under the fitted density
print(kde.integrate_box_1d(0.3, np.inf))

TLDR: Calculate a KDE, then use the KDE as if it were a PDF.
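The question as asked wants the other tail, the fraction of values 0.3 is greater than; a minimal follow-up sketch, reusing the kde object fitted above:

# CDF of the fitted density at 0.3, i.e. the estimated fraction below 0.3.
print(kde.integrate_box_1d(-np.inf, 0.3))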


1 Comment

Kernel density estimation smooths the sequence; for a plain empirical pdf, see the histogram-based answer below.

You can use openTURNS for that: its Gaussian kernel smoothing makes this easy. From the doc:

import openturns as ot

kernel = ot.KernelSmoothing()
# build() expects an openTURNS Sample, so wrap the 1-D array first
estimated = kernel.build(ot.Sample(x, 1))

That's it, now you have a distribution object :)

This library is very cool for statistics! (I am not affiliated with them.)
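A short usage sketch of what you can then do with the estimated distribution object, assuming x is the array from the question (computePDF and computeCDF are standard methods of openTURNS distributions; see also the comments below):

print(estimated.computePDF(0.3))   # density at 0.3
print(estimated.computeCDF(0.3))   # P(X <= 0.3), the fraction 0.3 exceeds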

8 Comments

Thanks! I'm not really familiar with openturns, and from what I can see there isn't an immediate method to calculate cdf of a value after building with the kernel. Are you aware of how to use the distribution object you mentioned (estimated) to calculate cdf of future samples?
@BrianG yes you can. Here is the doc of a distribution object: openturns.github.io/openturns/latest/user_manual/_generated/… — you have a computeCDF method :) Consider accepting the answer to mark it resolved.
@BrianG you're welcome. I know OT pretty well, so do not hesitate to post questions about it.
Hey, I actually encountered this problem when trying to use it:
We had better continue this discussion in Stack Overflow Chat.

We first have to create a Sample from the NumPy array. Then we compute the complementary CDF with the computeComplementaryCDF method of the distribution (a small improvement over Yoda's answer).

import numpy as np
x = np.array([ 0.62529759, -0.08202699,  0.59220673, -0.09074541,  0.05517865,
        0.20153703,  0.22773723, -0.26229708,  0.76137555, -0.61229314,
        0.27292745,  0.35596795, -0.01373896,  0.32464979, -0.22932331,
        1.14796175,  0.17268531,  0.40692172,  0.13846154,  0.22752953,
        0.13087359,  0.14111479, -0.09932381,  0.12800392,  0.02605917,
        0.18776078,  0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
       -0.09171721,  0.23083624, -0.21603973,  0.05425592,  0.47910286,
        0.26359565, -0.19917942,  0.40182097, -0.0797546 ,  0.47239264,
       -0.36654449,  0.4513859 , -0.00282486, -0.13950512, -0.05375369,
        0.03331833,  0.48951555, -0.13760504,  2.788     , -0.15017848,
        0.02930675,  0.10910646,  0.03868301, -0.048482  ,  0.7277376 ,
        0.08841259, -0.10968462,  0.50371324,  0.86379698,  0.01674877,
        0.19542421, -0.06639165,  0.74500856, -0.10148342,  0.02482331,
        0.79195804,  0.40401969,  0.25120005,  0.21020794, -0.01767013,
       -0.13453783, -0.09605592, -0.88044229,  0.04689623,  0.09043851,
        0.21232286,  0.34129982, -0.3736799 ,  0.17313858])
import openturns as ot

kernel = ot.KernelSmoothing()
sample = ot.Sample(x, 1)                       # wrap the 1-D array in a Sample
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)  # P(X > 0.3)
print(q)

which prints:

0.29136124840835353
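As a sanity check (not part of the original answer), compare the smoothed tail probability with the raw empirical fraction; a minimal sketch, assuming x is the array above:

import numpy as np

# Raw fraction of sample values above 0.3; the KDE figure should be in
# the same ballpark, differing only because of the kernel smoothing.
print(np.mean(x > 0.3))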



try this:

import numpy as np
import matplotlib.pyplot as plt

...
# hist, bin_edges = np.histogram(x, bins=10, density=True)
counts_, bins_, patches_ = plt.hist(x, bins=10, density=True)
pdf = counts_ / sum(counts_)   # per-bin probabilities, not density values
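To relate this binned estimate to the original question, the bar heights can be turned into a tail probability; a rough sketch (my addition, accurate only to bin resolution):

probs = counts_ * np.diff(bins_)   # probability mass per bin: density * width
# Approximate P(X > 0.3): sum the mass of bins whose right edge exceeds 0.3.
# The bin containing 0.3 is counted whole, so this is only bin-accurate.
print(probs[bins_[1:] > 0.3].sum())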

