
Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.

For instance, given:

x = array([ 0.62529759, -0.08202699,  0.59220673, -0.09074541,  0.05517865,
        0.20153703,  0.22773723, -0.26229708,  0.76137555, -0.61229314,
        0.27292745,  0.35596795, -0.01373896,  0.32464979, -0.22932331,
        1.14796175,  0.17268531,  0.40692172,  0.13846154,  0.22752953,
        0.13087359,  0.14111479, -0.09932381,  0.12800392,  0.02605917,
        0.18776078,  0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
       -0.09171721,  0.23083624, -0.21603973,  0.05425592,  0.47910286,
        0.26359565, -0.19917942,  0.40182097, -0.0797546 ,  0.47239264,
       -0.36654449,  0.4513859 , -0.00282486, -0.13950512, -0.05375369,
        0.03331833,  0.48951555, -0.13760504,  2.788     , -0.15017848,
        0.02930675,  0.10910646,  0.03868301, -0.048482  ,  0.7277376 ,
        0.08841259, -0.10968462,  0.50371324,  0.86379698,  0.01674877,
        0.19542421, -0.06639165,  0.74500856, -0.10148342,  0.02482331,
        0.79195804,  0.40401969,  0.25120005,  0.21020794, -0.01767013,
       -0.13453783, -0.09605592, -0.88044229,  0.04689623,  0.09043851,
        0.21232286,  0.34129982, -0.3736799 ,  0.17313858])

I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.
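For reference, the purely empirical version of this comparison needs no fitting at all; a minimal sketch, assuming x is the NumPy array above:

import numpy as np

# Empirical CDF at 0.3: the fraction of sample values that 0.3 exceeds.
# Any fitted density should roughly agree with this raw-data baseline.
print(np.mean(x < 0.3))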

  • stackoverflow.com/questions/41974615/… Commented Aug 2, 2017 at 15:28
  • @Joe I don't have the explicit representation of the density function, so I can't pass in anything for scipy integrate. Is there a workaround with just the data? Commented Aug 2, 2017 at 15:30
  • calculate the cdf from the data? stackoverflow.com/a/41980574/7919597 then pdf? Commented Aug 2, 2017 at 15:34
  • stats.stackexchange.com/questions/23/… Commented Aug 2, 2017 at 15:36
  • @Joe thanks for the tip, looks useful. However, my data is not normal so using scipy.norm would provide very inaccurate estimates. If possible, I could equivalently use any way to fit an explicit function to my data such that it can return an equation for me to integrate. Would you know of anything of that nature? Commented Aug 2, 2017 at 15:38

4 Answers


I personally like using the scipy.stats package. It has a useful implementation of Kernel Density Estimation. Basically, it estimates the probability density function of the data as a sum of kernels (Gaussians, in scipy's case) centered on the data points; the smoothing bandwidth is a parameter you can set. Look at the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation

And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation

Once you have built your KDE, you can evaluate it like a pdf and integrate it to get probabilities. For example, to estimate the probability of observing a value as large as or larger than 0.3:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

kde = stats.gaussian_kde(x)

# visualize the KDE over the full range of the data
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-1.5, 3, num=200)
ax.plot(x_eval, kde(x_eval), 'k-')

# probability of a value >= 0.3 under the fitted density
print(kde.integrate_box_1d(0.3, np.inf))

TLDR: Calculate a KDE, then use the KDE as if it were a PDF.
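The question as asked wants the other tail, the fraction of values 0.3 is greater than; a minimal follow-up sketch, reusing the kde object fitted above:

# CDF of the fitted density at 0.3, i.e. the estimated fraction below 0.3.
print(kde.integrate_box_1d(-np.inf, 0.3))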


1 Comment

Kernel density estimation smooths the sequence; for a plain empirical pdf, see the histogram-based answer below.

You can use openTURNS for that: its Gaussian kernel smoothing makes this easy. From the doc:

import openturns as ot

kernel = ot.KernelSmoothing()
# build() expects an openTURNS Sample, so wrap the 1-D array first
estimated = kernel.build(ot.Sample(x, 1))

That's it, now you have a distribution object :)

This library is very cool for statistics! (I am not affiliated with them.)
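A short usage sketch of what you can then do with the estimated distribution object, assuming x is the array from the question (computePDF and computeCDF are standard methods of openTURNS distributions; see also the comments below):

print(estimated.computePDF(0.3))   # density at 0.3
print(estimated.computeCDF(0.3))   # P(X <= 0.3), the fraction 0.3 exceeds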

8 Comments

Thanks! I'm not really familiar with openturns, and from what I can see there isn't an immediate method to calculate cdf of a value after building with the kernel. Are you aware of how to use the distribution object you mentioned (estimated) to calculate cdf of future samples?
@BrianG yes you can. Here is the doc of a distribution object: openturns.github.io/openturns/latest/user_manual/_generated/… — you have a computeCDF method :) Consider accepting the answer to mark it resolved.
@BrianG you're welcome. I know OT pretty well, so do not hesitate to post questions about it.
Hey, I actually encountered this problem when trying to use it:
We had better continue this discussion in Stack Overflow Chat.

We first have to create a Sample from the NumPy array. Then we compute the complementary CDF with the computeComplementaryCDF method of the distribution (a small improvement over Yoda's answer).

import numpy as np
x = np.array([ 0.62529759, -0.08202699,  0.59220673, -0.09074541,  0.05517865,
        0.20153703,  0.22773723, -0.26229708,  0.76137555, -0.61229314,
        0.27292745,  0.35596795, -0.01373896,  0.32464979, -0.22932331,
        1.14796175,  0.17268531,  0.40692172,  0.13846154,  0.22752953,
        0.13087359,  0.14111479, -0.09932381,  0.12800392,  0.02605917,
        0.18776078,  0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
       -0.09171721,  0.23083624, -0.21603973,  0.05425592,  0.47910286,
        0.26359565, -0.19917942,  0.40182097, -0.0797546 ,  0.47239264,
       -0.36654449,  0.4513859 , -0.00282486, -0.13950512, -0.05375369,
        0.03331833,  0.48951555, -0.13760504,  2.788     , -0.15017848,
        0.02930675,  0.10910646,  0.03868301, -0.048482  ,  0.7277376 ,
        0.08841259, -0.10968462,  0.50371324,  0.86379698,  0.01674877,
        0.19542421, -0.06639165,  0.74500856, -0.10148342,  0.02482331,
        0.79195804,  0.40401969,  0.25120005,  0.21020794, -0.01767013,
       -0.13453783, -0.09605592, -0.88044229,  0.04689623,  0.09043851,
        0.21232286,  0.34129982, -0.3736799 ,  0.17313858])
import openturns as ot

kernel = ot.KernelSmoothing()
sample = ot.Sample(x, 1)                       # wrap the 1-D array in a Sample
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)  # P(X > 0.3)
print(q)

which prints:

0.29136124840835353
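As a sanity check (not part of the original answer), compare the smoothed tail probability with the raw empirical fraction; a minimal sketch, assuming x is the array above:

import numpy as np

# Raw fraction of sample values above 0.3; the KDE figure should be in
# the same ballpark, differing only because of the kernel smoothing.
print(np.mean(x > 0.3))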



try this:

import numpy as np
import matplotlib.pyplot as plt

...
# hist, bin_edges = np.histogram(x, bins=10, density=True)
counts_, bins_, patches_ = plt.hist(x, bins=10, density=True)
pdf = counts_ / sum(counts_)   # per-bin probabilities, not density values
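To relate this binned estimate to the original question, the bar heights can be turned into a tail probability; a rough sketch (my addition, accurate only to bin resolution):

probs = counts_ * np.diff(bins_)   # probability mass per bin: density * width
# Approximate P(X > 0.3): sum the mass of bins whose right edge exceeds 0.3.
# The bin containing 0.3 is counted whole, so this is only bin-accurate.
print(probs[bins_[1:] > 0.3].sum())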

