Probability density function for a set of values using numpy

Question

Below is the data for which I want to plot the PDF. https://gist.github.com/ecenm/cbbdcea724e199dc60fe4a38b7791eb8#file-64_general-out

Below is the script

import numpy as np
import matplotlib.pyplot as plt
import pylab

data = np.loadtxt('64_general.out')
H,X1 = np.histogram( data, bins = 10, normed = True, density = True) # Is this the right way to get the PDF ?
plt.xlabel('Latency')
plt.ylabel('PDF')
plt.title('PDF of latency values')

plt.plot(X1[1:], H)
plt.show()

When I plot the above, I get the following.

Is the above the correct way to calculate the PDF of a range of values?
Is there any other way to confirm that the result I get is the actual PDF? For example, how can I show that the area under the PDF equals 1 in my case?

Your data is made of integers only. Is this a discrete or a continuous variable? Also take into account that PDF stands for "Probability Density Function". This means that for sparse data you are estimating a "PDF" from it, not obtaining one. So, depending on your data, 100 bins may approximate it better than 10 (this is an example, don't take the numbers literally).
– armatita, Commented Jun 22, 2016 at 13:37
Thanks for the info. My data is a discrete variable. I didn't understand your last sentence. Could you explain more?
– user2532296, Commented Jun 22, 2016 at 13:49
If it is a discrete variable, then you are probably looking for a PMF, and my last comment won't apply. You can still use the histogram function to do it, but you need to take into account that each bin should correspond to a unique value. See if the answer in this question helps you.
– armatita, Commented Jun 22, 2016 at 14:17
normed = True is deprecated; density = True is enough to make the area under the histogram equal to 1 when binning the data. You can also look at np.digitize if you like.
– JeeyCi, Commented Jul 6, 2024 at 18:16

user1337 · Accepted Answer · 2016-06-22 13:36:52Z

It is a legitimate way of approximating the PDF. Since np.histogram groups the values into bins, you won't get the exact frequency of each number in your input. For a more exact result, count the occurrences of each number and divide by the total count. Also, since these are discrete values, the plot should be drawn as points or bars to give a more correct impression.
In the discrete case, the sum of the relative frequencies should equal 1. In the continuous case you can, for example, use np.trapz() (renamed np.trapezoid() in NumPy 2.0) to approximate the integral.
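
A minimal sketch of the counting approach described above, assuming integer-valued data. The array `data` here is made up for illustration and stands in for the contents of 64_general.out; the integer-centred bin edges are likewise an assumption about how one might get the same numbers out of np.histogram.

```python
import numpy as np

# Hypothetical integer latencies standing in for the real data file.
data = np.array([3, 5, 5, 7, 7, 7, 8, 8, 10, 10])

# Count each distinct value and divide by the total count to get a PMF.
values, counts = np.unique(data, return_counts=True)
pmf = counts / counts.sum()

# Equivalent histogram: one unit-width bin centred on each integer,
# so density=True yields the same relative frequencies.
edges = np.arange(data.min() - 0.5, data.max() + 1.5, 1.0)
hist, _ = np.histogram(data, bins=edges, density=True)

for v, p in zip(values, pmf):
    print(v, p)
print("sum of PMF:", pmf.sum())
```

To plot the result as bars rather than a connected line, something like plt.bar(values, pmf) should work.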


1 Comment

JeeyCi Over a year ago

To the first remark: unique, counts = np.unique(x, return_counts=True), or probs = np.bincount(x) / len(x) for non-negative integer data.

JeeyCi · Accepted Answer · 2024-08-15 17:50:27Z

how can show the area under pdf = 1 for my case.

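For the discrete (binned) case:

import numpy as np

# sample data: 1000 normal draws, scaled to a standard deviation of 0.7
x = np.random.normal(size=1000)
x = x * 0.7

hist, bin_edges = np.histogram(x, density=True)
# hist.sum() alone is not 1 unless every bin has width 1
# the area is the sum of density * bin width
print(np.sum(hist * np.diff(bin_edges)))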

or:

import matplotlib.pyplot as plt

# with density=True, n holds density values rather than raw counts
n, bins, patches = plt.hist(x, bins=10, density=True, edgecolor='black', lw=3, fc=(0, 0, 1, 0.5), alpha=0.2)
# overlay the cumulative counts on a log scale; fc is an RGBA tuple
plt.hist(x, bins=10, cumulative=True, lw=3, fc=(0, 0, 0.5, 0.3), log=True)
# renormalising leaves n unchanged here because the bins have equal width
density = n / (sum(n) * np.diff(bins))
# the area (integral) under the histogram should be 1
print(np.sum(density * np.diff(bins)))
print(np.allclose(np.sum(density * np.diff(bins)), 1))
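
The checks above multiply each density by its bin width before summing. A small sketch of why the width matters, using deliberately unequal bins; the bin edges and the seeded sample are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(scale=0.7, size=1000)

# Unequal bin widths; the outer edges at +/-10 comfortably cover the sample.
edges = np.array([-10.0, -1.0, -0.5, -0.1, 0.0, 0.3, 1.0, 10.0])
density, edges = np.histogram(sample, bins=edges, density=True)
widths = np.diff(edges)

print(density.sum())             # sum of densities: not 1 in general
print(np.sum(density * widths))  # area under the histogram: 1 up to rounding
```

With density=True it is the area, not the plain sum of the returned values, that is normalised.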

For the continuous case:

# https://stackoverflow.com/a/59096585/15893581
# Calculate a KDE, then use the KDE as if it were a PDF
from scipy.stats import gaussian_kde

kde = gaussian_kde(x)
# total probability over the whole real line (should be 1)
print(kde.integrate_box_1d(-np.inf, np.inf))

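Or, as was suggested, with np.trapz:

import matplotlib.pyplot as plt

counts_, bins_, patches_ = plt.hist(x, bins=10, density=True)
# counts_ already holds density values; integrate them over the bin centres
centers_ = 0.5 * (bins_[:-1] + bins_[1:])
# np.trapz was renamed np.trapezoid in NumPy 2.0
# the result is slightly below 1 because the trapezoid rule clips half of each end bin
print(np.trapz(counts_, x=centers_))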

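Or, for a normal distribution, you can integrate the analytic PDF directly:

from scipy.integrate import quad

# standard normal PDF
fun = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
# quad accepts infinite limits, so there is no need to pick a large finite range
y, err = quad(fun, -np.inf, np.inf)
print(y)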

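Or using rv_histogram:

from scipy.stats import rv_histogram

# build a distribution object from the binned data
r = rv_histogram(np.histogram(x, bins=100))
# evaluate the piecewise-constant PDF at a few points
print(r.pdf(np.linspace(0, 1, 5)))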
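
A hedged sketch of how the rv_histogram object could be used to confirm that the area is 1. The seeded sample is an assumption made so that the snippet is self-contained; it is not the data from the question.

```python
import numpy as np
from scipy.stats import rv_histogram

rng = np.random.default_rng(0)
sample = rng.normal(scale=0.7, size=1000)

counts, edges = np.histogram(sample, bins=100)
r = rv_histogram((counts, edges))

# The PDF is constant inside each bin, so the total area is
# sum(pdf at the bin centre * bin width).
centers = 0.5 * (edges[:-1] + edges[1:])
area = np.sum(r.pdf(centers) * np.diff(edges))
print(area)

# The same check through the CDF over the support of the histogram.
print(r.cdf(edges[-1]) - r.cdf(edges[0]))
```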

For a custom distribution see here, and do the integration as previously mentioned.
For getting the CDF from the PDF, see here.
But I agree with the comments here, e.g. @My Work: "it will always return something, the question is what".
See here for the normal distribution: "scipy also has a CDF function that returns the integral from -inf to x": scipy.stats.norm.cdf(np.inf)  # 1.0
From CDF to PDF: take the derivative, e.g. dx = 0.1; np.gradient(cdf, dx), because PDF(x) = d CDF(x) / dx, meaning the probability density is the rate of change of the CDF.
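
The CDF-to-PDF relation can be sketched numerically as follows. This assumes a standard normal distribution and an evenly spaced grid on [-5, 5]; both are illustrative choices, not part of the original answer.

```python
import numpy as np
from scipy.stats import norm

# Evenly spaced grid with spacing dx = 0.1.
grid = np.linspace(-5.0, 5.0, 101)
dx = grid[1] - grid[0]

# Differentiate the CDF numerically to recover the PDF.
cdf = norm.cdf(grid)
pdf_numeric = np.gradient(cdf, dx)

# Largest deviation from the analytic PDF on the grid.
max_err = np.max(np.abs(pdf_numeric - norm.pdf(grid)))
print(max_err)

# Riemann-sum check that the recovered PDF integrates to about 1.
area = np.sum(pdf_numeric) * dx
print(area)
```

A smaller dx should reduce the finite-difference error, at the cost of more CDF evaluations.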
