
I have the following lines of data in a file (with many more lines, of course):

data1 0.20
data2 2.32
data3 0.02
dataX x.xx

data1 1.13
data2 3.10
data3 0.96
dataX x.xx

....

I'd like to create a probability distribution for each data*. I can do that by hand, but maybe there is a library that lets me do it more automatically. Ideally I would like to avoid preformatting the lines (and feed the library the lines above as-is), but if that is not possible I will have to.

UPDATE

Sorry for the inaccuracy. What I want to find is how many numbers fall into custom ranges. Example:

[0.0 - 0.1) - 2 numbers;
[0.1 - 0.2) - 3 numbers;
[0.2 - 0.3) - ...

Of course I would like to be able to easily set different ranges (wider or narrower), and then, having that, generate charts.
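
To illustrate, here is a rough sketch of the counting I have in mind (assuming numpy is an acceptable dependency; the bin edges are just an example):

import numpy as np

values = [0.20, 2.32, 0.02, 1.13, 3.10, 0.96]  # e.g. all data1 values pooled

edges = np.arange(0.0, 4.0, 0.5)               # custom bin edges: [0.0, 0.5), [0.5, 1.0), ...
counts, _ = np.histogram(values, bins=edges)
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"[{lo:.1f} - {hi:.1f}) - {n} numbers")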

  • There's lots of probability stuff in SciPy. What kind of distribution are you after?
  • Do you mean you're trying to build a histogram of the different datasets? It's not very clear what you have in mind.
  • Maybe the statlib module is what you are after.

2 Answers


The concept of 'probability' is a little subtle - if the data are the output of a stationary stochastic process, then you could estimate probabilities of future outputs of that process by measuring past outputs. But the identical dataset could have been generated deterministically, in which case there is no probability involved, and each time you run the process you'll get the same data (instead of different data with a similar distribution).

In either case, you can get a distribution of your data by binning it into histograms. Formatting the data into separate lists can be done with:

import collections, re

data = ["data1 0.20", "data2 2.32", "data3 0.02",
        "data1 1.13", "data2 3.10", "data3 0.96"]

# Group the values by their data* index
hist = collections.defaultdict(list)
for d in data:
    m = re.match(r"data(\d+)\s+(\S+)", d)  # raw string avoids invalid-escape warnings
    if m:
        hist[int(m.group(1))].append(float(m.group(2)))
for k, v in sorted(hist.items()):
    print(k, v)

producing:

1 [0.2, 1.13]
2 [2.32, 3.1]
3 [0.02, 0.96]

You can then build the histograms using Howto bin series of float values into histogram in Python?. Finally, normalize the bin values so that they sum to 1.0 (divide each bin by the total of all bins) to get a probability distribution. It won't be the distribution that generated the data, but it will be an approximation to it.
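
As a sketch of those last steps, assuming numpy is available (the bin edges here are an arbitrary choice):

import numpy as np

values = hist[1]                   # the data1 values collected above

edges = np.arange(0.0, 1.5, 0.1)   # custom bin edges: [0.0, 0.1), [0.1, 0.2), ...
counts, _ = np.histogram(values, bins=edges)

probs = counts / counts.sum()      # normalize so the bins sum to 1.0
print(probs)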



You could use scipy.stats.norm (and collections).

To split up your data (I assume raw_data holds the file contents as one string, in the form you showed):

import collections

# Split into (name, value) pairs, skipping the blank lines between blocks
pairs = (line.split() for line in raw_data.split('\n') if line.strip())

data = collections.defaultdict(list)
for name, value in pairs:
    data[name].append(float(value))  # accumulate values; plain assignment would overwrite

data['data1'] # [0.2, 1.13, ...]

Then for each data set:

import scipy.stats

for i in range(1, X + 1):  # X is the number of data* series
    scipy.stats.norm.fit(data['data' + str(i)])  # (mean, standard deviation)

scipy.stats.norm.fit(data['data1']) # (0.66499999999999992, 0.46499999999999991)

It's unclear precisely what probability you have in mind, but mean and standard deviation are a good start (you can find others among scipy's statistical functions).
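
For example, scipy.stats.describe returns several summary statistics in one call (a sketch, reusing the data1 values from above):

import scipy.stats

result = scipy.stats.describe([0.2, 1.13])
print(result.mean)      # 0.665
print(result.variance)  # sample variance (ddof=1), unlike the ddof=0 scale from norm.fit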

