
Below is the data for which I want to plot the PDF. https://gist.github.com/ecenm/cbbdcea724e199dc60fe4a38b7791eb8#file-64_general-out

Below is the script

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('64_general.out')
# density=True alone normalizes the histogram; the old normed argument was deprecated and later removed from NumPy.
H, X1 = np.histogram(data, bins=10, density=True)  # Is this the right way to get the PDF?
plt.xlabel('Latency')
plt.ylabel('PDF')
plt.title('PDF of latency values')

plt.plot(X1[1:], H)  # X1 holds the bin edges, so H is plotted against each bin's right edge
plt.show()

When I plot the above, I get the following.

  1. Is the above the correct way to calculate the PDF of a range of values?
  2. Is there any other way to confirm that the result I get is the actual PDF? For example, how can I show that the area under the PDF equals 1 in my case?

[Image: the resulting plot]

  • Your data consists of integers only. Is this a discrete or a continuous variable? Also keep in mind that PDF stands for "Probability Density Function". This means that for sparse data you are estimating a "PDF" from it, not obtaining one. So, depending on your data, 100 bins may approximate it better than 10 (this is an example, don't take the numbers literally). Commented Jun 22, 2016 at 13:37
  • Thanks for the info, my data is a discrete variable. I didn't understand your last sentence. Could you explain more? Commented Jun 22, 2016 at 13:49
  • If it is a discrete variable, then you are probably looking for a PMF and my last comment won't apply. You can still use the histogram function to do it, but you need to make sure that each bin corresponds to a unique value. See if the answer in this question helps you. Commented Jun 22, 2016 at 14:17
  • normed = True is deprecated; density = True is enough to normalize the binned data so that the histogram integrates to 1 (the bin values themselves sum to 1 only when every bin has width 1). You can also look at np.digitize if you like. Commented Jul 6, 2024 at 18:16
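
The "one bin per unique value" idea from the comments above can be sketched as follows. This is only an illustration under assumptions: the array `latencies` is made up and stands in for the real data from the gist, and the half-integer bin edges assume the values are integers.

```python
import numpy as np

# Hypothetical integer-valued sample; NOT the data from the gist.
latencies = np.array([3, 3, 4, 5, 5, 5, 7, 7, 8, 8])

# Put the bin edges at half-integers so every integer gets its own bin of width 1.
edges = np.arange(latencies.min() - 0.5, latencies.max() + 1.5, 1.0)
pmf, _ = np.histogram(latencies, bins=edges, density=True)
values = np.arange(latencies.min(), latencies.max() + 1)

# With unit-width bins the density values are themselves probabilities.
for v, p in zip(values, pmf):
    print(v, p)
print(pmf.sum())
```

Because every bin has width 1 here, the sum of the bin values and the area under the histogram coincide, which is not the case for arbitrary bin widths.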

2 Answers

  1. It is a legitimate way of approximating the PDF. Because np.histogram groups the values into bins, you won't get the exact frequency of each number in your input. For an exact result, count the occurrences of each number and divide by the total count. Also, since these are discrete values, plotting them as points or bars gives a more accurate impression than a connected line.

  2. In the discrete case, the relative frequencies should sum to 1. In the continuous case you can, for example, use np.trapz() (available as np.trapezoid in NumPy 2.0 and later) to approximate the integral.
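
A minimal sketch of the counting approach from point 1 and the sum check from point 2, assuming integer-valued data. The array `data` below is hypothetical and only stands in for the values loaded from `64_general.out`.

```python
import numpy as np

# Hypothetical integer sample; NOT the data from the gist.
data = np.array([10, 12, 12, 15, 15, 15, 20, 20])

# Count each distinct value and normalize by the sample size.
values, counts = np.unique(data, return_counts=True)
pmf = counts / counts.sum()

print(dict(zip(values.tolist(), pmf.tolist())))
print(pmf.sum())  # relative frequencies of a PMF add up to 1
```

To draw the result as bars rather than a line, `plt.bar(values, pmf)` from matplotlib is one option.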


1 Comment

Regarding the first remark: unique, counts = np.unique(x, return_counts=True) or probs = np.bincount(x) / len(x) (np.bincount requires non-negative integers).

How can I show that the area under the PDF equals 1 in my case?

For the discrete case:

import numpy as np

x = np.random.normal(size=1000)
x = x * 0.7

hist, bin_edges = np.histogram(x, density=True)
# hist.sum() is generally not 1; the area (density times bin width) is
print(np.sum(hist * np.diff(bin_edges)))

or:

import numpy as np
import matplotlib.pyplot as plt

# x is the sample defined above
n, bins, patches = plt.hist(x, bins=10, density=True, edgecolor='black', lw=3, fc=(0, 0, 1, 0.5), alpha=0.2)
plt.hist(x, bins=10, cumulative=True, lw=3, fc=(0, 0, 0.5, 0.3), log=True)  # fc is an RGBA tuple
# n is already a density here; for equal-width bins this renormalization leaves it unchanged
density = n / (sum(n) * np.diff(bins))
# The area (integral) under the histogram should sum to 1
print(np.sum(density * np.diff(bins)))
print(np.allclose(np.sum(density * np.diff(bins)), 1))

For the continuous case:

# https://stackoverflow.com/a/59096585/15893581
# Calculate a KDE, then use the KDE as if it were a PDF
import numpy as np
from scipy.stats import gaussian_kde

kde = gaussian_kde(x)
# Total probability under the KDE over the whole real line
print(kde.integrate_box_1d(-np.inf, np.inf))

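or, as was suggested, with the trapezoidal rule:

import numpy as np
import matplotlib.pyplot as plt

counts_, bins_, patches_ = plt.hist(x, bins=10, density=True)
# Integrate the density over the bin centers; the result is close to 1,
# slightly below it because half a bin is lost at each end
centers = (bins_[:-1] + bins_[1:]) / 2
print(np.trapz(counts_, x=centers))  # np.trapezoid is the preferred name in NumPy 2.0 and later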

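or, for a normal distribution, you can also integrate the density directly:

import numpy as np
from scipy.integrate import quad

# Standard normal density
fun = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
# Infinite limits are supported by quad and avoid relying on a very wide finite interval
y, err = quad(fun, -np.inf, np.inf)
print(y)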

or using rv_histogram:

import numpy as np
from scipy.stats import rv_histogram

r = rv_histogram(np.histogram(x, bins=100))
# Evaluate the histogram-based PDF at a few points
print(r.pdf(np.linspace(0, 1, 5)))
  • For a custom distribution see here, and do the integration as previously mentioned.
  • CDF from PDF: here.
  • But I agree with the comments here. @My Work: "it will always return something, the question is what".
  • Here, for the normal distribution: "scipy also has a CDF function that returns the integral from -inf to x": scipy.stats.norm.cdf(np.inf)  # 1.0
  • From CDF to PDF: take the derivative, e.g. dx = 0.1; np.gradient(cdf, dx), because PDF(x) = d CDF(x) / dx, meaning that the probability density is the rate of change of the CDF.
  • Performance considerations
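
The CDF-to-PDF relation from the list above can be checked numerically. The sketch below assumes a standard normal distribution and an arbitrarily chosen grid spacing `dx`; it differentiates the CDF with `np.gradient` and compares the result with the analytic density.

```python
import numpy as np
from scipy.stats import norm

dx = 0.01  # grid spacing, chosen for illustration
grid = np.arange(-5.0, 5.0 + dx, dx)

cdf = norm.cdf(grid)
pdf_numeric = np.gradient(cdf, dx)  # PDF(x) = d CDF(x) / dx

# Largest deviation from the analytic density on the grid
max_err = np.max(np.abs(pdf_numeric - norm.pdf(grid)))
print(max_err)

# Riemann-sum estimate of the area under the recovered PDF
print(np.sum(pdf_numeric) * dx)
```

A smaller `dx` reduces the finite-difference error, at the cost of more evaluation points.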
