
I have the following lines of data in a file (with many more lines, of course):

data1 0.20
data2 2.32
data3 0.02
dataX x.xx

data1 1.13
data2 3.10
data3 0.96
dataX x.xx

....

I'd like to create a probability distribution for each data*. I can do that by hand, but maybe there is a library that lets me do it more automatically. Ideally I would like to avoid preformatting the lines (and feed the library the lines above as-is), but if that is not possible I will have to.

UPDATE

Sorry for the inaccuracy. What I want to find is how many numbers fall into custom ranges. Example:

[0.0 - 0.1) - 2 numbers;
[0.1 - 0.2) - 3 numbers;
[0.2 - 0.3) - ...

Of course I would like to be able to easily set different ranges (wider or narrower), and then, having that, generate charts.
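
To illustrate, here is a rough sketch of the counting I have in mind (assuming numpy is an acceptable dependency; the bin edges are just an example):

import numpy as np

values = [0.20, 2.32, 0.02, 1.13, 3.10, 0.96]  # e.g. all data1 values pooled

edges = np.arange(0.0, 4.0, 0.5)               # custom bin edges: [0.0, 0.5), [0.5, 1.0), ...
counts, _ = np.histogram(values, bins=edges)
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"[{lo:.1f} - {hi:.1f}) - {n} numbers")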

  • There's lots of probability stuff in SciPy. What kind of distribution are you after?
  • Do you mean you're trying to build a histogram of the different datasets? It's not very clear what you have in mind.
  • Maybe the statlib module is what you are after.

2 Answers


The concept of 'probability' is a little subtle - if the data are the output of a stationary stochastic process, then you could estimate probabilities of future outputs of that process by measuring past outputs. But the identical dataset could have been generated deterministically, in which case there is no probability involved, and each time you run the process you'll get the same data (instead of different data with a similar distribution).

In either case, you can get a distribution of your data by binning it into histograms. Formatting the data into separate lists can be done with:

import collections, re

data = ["data1 0.20", "data2 2.32", "data3 0.02",
        "data1 1.13", "data2 3.10", "data3 0.96"]

# Group the values by their data* index
hist = collections.defaultdict(list)
for d in data:
    m = re.match(r"data(\d+)\s+(\S+)", d)  # raw string avoids invalid-escape warnings
    if m:
        hist[int(m.group(1))].append(float(m.group(2)))
for k, v in sorted(hist.items()):
    print(k, v)

producing:

1 [0.2, 1.13]
2 [2.32, 3.1]
3 [0.02, 0.96]

You can then build the histograms using Howto bin series of float values into histogram in Python?. Finally, normalize the bin values so that they sum to 1.0 (divide each bin by the total of all bins) to get a probability distribution. It won't be the distribution that generated the data, but it will be an approximation to it.
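
As a sketch of those last steps, assuming numpy is available (the bin edges here are an arbitrary choice):

import numpy as np

values = hist[1]                   # the data1 values collected above

edges = np.arange(0.0, 1.5, 0.1)   # custom bin edges: [0.0, 0.1), [0.1, 0.2), ...
counts, _ = np.histogram(values, bins=edges)

probs = counts / counts.sum()      # normalize so the bins sum to 1.0
print(probs)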



You could use scipy.stats.norm (and collections).

To split up your data (I assume raw_data holds the file contents as one string, in the form you showed):

import collections

# Split into (name, value) pairs, skipping the blank lines between blocks
pairs = (line.split() for line in raw_data.split('\n') if line.strip())

data = collections.defaultdict(list)
for name, value in pairs:
    data[name].append(float(value))  # accumulate values; plain assignment would overwrite

data['data1'] # [0.2, 1.13, ...]

Then for each data set:

import scipy.stats

for i in range(1, X + 1):  # X is the number of data* series
    scipy.stats.norm.fit(data['data' + str(i)])  # (mean, standard deviation)

scipy.stats.norm.fit(data['data1']) # (0.66499999999999992, 0.46499999999999991)

It's unclear precisely what probability you have in mind, but mean and standard deviation are a good start (you can find others among scipy's statistical functions).
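
For example, scipy.stats.describe returns several summary statistics in one call (a sketch, reusing the data1 values from above):

import scipy.stats

result = scipy.stats.describe([0.2, 1.13])
print(result.mean)      # 0.665
print(result.variance)  # sample variance (ddof=1), unlike the ddof=0 scale from norm.fit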

