I'm trying to add a column to a pandas DataFrame whose values are, on average, equal to those of an existing column but deviate slightly (by a few decimal points) on each row. Ideally the deviation would follow a normal distribution, but I'm not sure how to do this.

I've tried the simple code below:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1,99,size=(100000, 1)), columns=["GOD_prob"])

df["GOD_prob"] = df["GOD_prob"] / 100
df["GOD_odd"] = 1 / df["GOD_prob"]

df["market_prob"] = ((df["GOD_prob"] * 100 ) + np.random.randint(-10,10, len(df))) / 100
df["market_price"] = 1 / df["market_prob"] 

The problem is that for values of df["GOD_prob"] under 0.10, I can get negative values in df["market_prob"], which I don't want because these columns represent probabilities.

Afterwards, I'd like to create another column that deviates from df["GOD_prob"] by 5% on average, but I'm not sure how to do this either.

Thanks for helping!

  • I mean get a normal distribution by df[col] = np.random.normal(mean, std, size=len(df)) Commented Aug 24, 2018 at 11:28
  • Thanks! But it still doesn't solve my issue with negative probability numbers. Commented Aug 24, 2018 at 11:30
  • The normal distribution is defined over the range (-inf, inf), so you can't use it and keep values in the [0, inf) range. See this statistics.se answer for alternative distributions. NumPy has a generator, np.random.gamma, but you'll have to do some maths to work out what shape and scale should be. Commented Aug 24, 2018 at 11:36

1 Answer

Since your issue is with negative values, I would suggest either clipping the out-of-range values or redrawing them.

Option 1:

s = df['GOD_prob']
# Draw one value per row (mean s, std 5% of s), then force it into [0, 1]
df['market_prob'] = np.random.normal(s, 0.05*s).clip(0, 1)

Option 2:

s = df['GOD_prob']
df['market_prob'] = np.random.normal(s, 0.05*s)
# Rows whose draw fell outside [0, 1]
cond = (df['market_prob'] < 0) | (df['market_prob'] > 1)
while cond.any():
    # Redraw only the offending rows, using their own means
    s = df.loc[cond, 'GOD_prob']
    df.loc[cond, 'market_prob'] = np.random.normal(s, 0.05*s)
    cond = (df['market_prob'] < 0) | (df['market_prob'] > 1)
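The resampling loop in Option 2 can also be wrapped in a reusable function. The sketch below is one possible way to do that; the function name `resample_within_bounds`, its parameters, and the `max_iter` safety cap are assumptions added for illustration and are not part of the code above.

```python
import numpy as np


def resample_within_bounds(mean, rel_std=0.05, low=0.0, high=1.0,
                           max_iter=100, seed=None):
    """Draw normal(mean, rel_std * mean) per element, redrawing any value
    outside [low, high]. Hypothetical helper; assumes mean >= 0."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    out = rng.normal(mean, rel_std * mean)

    bad = (out < low) | (out > high)
    n_iter = 0
    while bad.any():
        if n_iter >= max_iter:
            # Guard against an endless loop if the bounds are unreachable
            raise RuntimeError("values still out of range after max_iter redraws")
        # Redraw only the out-of-range elements, using their own means
        out[bad] = rng.normal(mean[bad], rel_std * mean[bad])
        bad = (out < low) | (out > high)
        n_iter += 1
    return out


# Example with a few hypothetical probabilities, repeated many times
probs = np.tile([0.01, 0.50, 0.98], 1000)
market_prob = resample_within_bounds(probs, seed=1)
```

With a DataFrame, the result could be assigned with something like `df['market_prob'] = resample_within_bounds(df['GOD_prob'])`.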

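The first option can distort the distribution: every out-of-range draw is moved onto a boundary (0 or 1), which shifts the mean and shrinks the spread for rows close to those boundaries.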
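To get a feel for how much Option 1 changes the data, one can compare the draws before and after clipping. This is only a rough sketch: the seed, the variable names, and the 0.95 threshold used to pick "high" rows are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed fixed only so the run is repeatable
df = pd.DataFrame({"GOD_prob": rng.integers(1, 99, size=100_000) / 100})

s = df["GOD_prob"].to_numpy()
raw = rng.normal(s, 0.05 * s)      # draws before clipping
clipped = np.clip(raw, 0, 1)       # same draws forced into [0, 1]

# Fraction of rows whose value was changed by clipping
share_clipped = float(np.mean(raw != clipped))

# Average change caused by clipping for rows with a high true probability
high = s >= 0.95                   # arbitrary threshold for this illustration
shift_high = float((clipped[high] - raw[high]).mean())

print(f"share of clipped rows: {share_clipped:.4f}")
print(f"mean shift for GOD_prob >= 0.95: {shift_high:.5f}")
```

Because the standard deviation is proportional to the mean here, draws below 0 are extremely unlikely, so in this setup the clipping mostly affects rows whose probability is close to 1.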

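The second option can be slow, because it keeps redrawing until every value is in range, but it preserves the shape of the distribution inside [0, 1] instead of piling values up at the boundaries.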
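On the "5% on average" part of the question: if that means an average absolute deviation of 5% of GOD_prob, note that a normal distribution with standard deviation sigma has a mean absolute deviation of sigma * sqrt(2 / pi), roughly 0.8 * sigma. So a standard deviation of 0.05*s gives about 4% on average, and scaling it by sqrt(pi / 2) gives 5%. The sketch below checks this numerically; the constant probability of 0.5 and the seed are assumptions chosen so that no value falls outside [0, 1], and clipping or redrawing would change the figures slightly.

```python
import numpy as np

rng = np.random.default_rng(42)      # seed fixed only for repeatability
s = np.full(200_000, 0.5)            # hypothetical constant "true" probability

# Standard deviation of 5% of s: mean absolute deviation is about 4% of s
a = rng.normal(s, 0.05 * s)
mad_a = float(np.mean(np.abs(a - s) / s))

# Scale sigma by sqrt(pi / 2) so the mean absolute deviation is 5% of s
sigma = 0.05 * s * np.sqrt(np.pi / 2)
b = rng.normal(s, sigma)
mad_b = float(np.mean(np.abs(b - s) / s))

print(f"std = 5% of s     -> mean relative deviation {mad_a:.4f}")
print(f"std scaled by sqrt(pi/2) -> mean relative deviation {mad_b:.4f}")
```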
