Add column with random number based on other column

Question

I'm trying to add a column to a pandas DataFrame whose values are, on average, equal to an existing column but deviate on each row by a few decimal points. Ideally the deviation would follow a normal distribution, but I'm not sure how to do this.

I've tried some simple code like the one below:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1,99,size=(100000, 1)), columns=["GOD_prob"])

df["GOD_prob"] = df["GOD_prob"] / 100
df["GOD_odd"] = 1 / df["GOD_prob"]

df["market_prob"] = ((df["GOD_prob"] * 100 ) + np.random.randint(-10,10, len(df))) / 100
df["market_price"] = 1 / df["market_prob"]

The problem is that for values of df["GOD_prob"] under 0.10, I can get negative values for df["market_prob"], which I don't want because these columns represent probabilities.

Afterwards I'd like to create another column which deviates from df["GOD_prob"] by 5% on average, but I'm not really sure how to do this.

Thanks for helping!

I mean get a normal distribution by df[col] = np.random.normal(mean, std, size=len(df))
– FHTMitchell, Commented Aug 24, 2018 at 11:28
Thanks! But it still doesn't solve my issue with negative probability numbers.
– wazo, Commented Aug 24, 2018 at 11:30
So the normal distribution is defined over the range (-inf, inf), so you can't use one and keep values in the [0, inf) range. See this statistics.se answer for alternate distributions. There is a numpy generator np.random.gamma but you'll have to do some maths to figure out what shape and scale should be.
– FHTMitchell, Commented Aug 24, 2018 at 11:36
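
The "maths" for the gamma suggestion in the last comment can be sketched as follows. This is only an illustration, assuming a relative standard deviation of 5% per row; the names `god_prob`, `rel_std` and `market_prob` are stand-ins, not part of the original code. A gamma draw is always positive, but it can still exceed 1 for probabilities close to 1, so an upper bound would need separate handling.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Stand-in for df["GOD_prob"]: probabilities 0.01 .. 0.98
god_prob = rng.integers(1, 99, size=100_000) / 100

rel_std = 0.05  # assumed relative standard deviation (5% of each value)

# For a gamma distribution: mean = shape * scale, std = sqrt(shape) * scale.
# Requiring mean = p and std = rel_std * p gives:
shape = 1.0 / rel_std**2       # same shape for every row
scale = god_prob * rel_std**2  # per-row scale, so shape * scale == p

# Broadcasts the scalar shape against the per-row scale
market_prob = rng.gamma(shape, scale)
```

With a DataFrame, the same call could take `df["GOD_prob"] * rel_std**2` as the scale and assign the result to `df["market_prob"]`.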

agastalver · Accepted Answer · 2018-08-24 12:34:15Z

Since your issue is with negative values, I would suggest either clipping them or redrawing the out-of-range values.

Option 1:

s = df['GOD_prob']
df['market_prob'] = np.random.normal(s, 0.05*s).clip(0,1)
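
A note on the `0.05*s` argument: it sets the standard deviation to 5% of each value, which is not quite the same as deviating "5% on average". For a normal distribution the mean absolute deviation is sigma * sqrt(2/pi), so a 5% standard deviation gives roughly a 4% average absolute deviation. If the goal is an average absolute deviation of 5%, the spread can be rescaled as in this sketch (the names `target_mad` and `rel_sigma` are illustrative assumptions, and no clipping is applied here):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Stand-in for df["GOD_prob"]: probabilities 0.01 .. 0.98
god_prob = rng.integers(1, 99, size=100_000) / 100

target_mad = 0.05  # desired mean absolute relative deviation

# For a normal distribution, E|X - mu| = sigma * sqrt(2 / pi),
# so sigma must be slightly larger than the target (about 0.0627 here).
rel_sigma = target_mad * np.sqrt(np.pi / 2)

market_prob = rng.normal(god_prob, rel_sigma * god_prob)

# Empirical check of the average relative deviation
mean_abs_rel_dev = np.mean(np.abs(market_prob - god_prob) / god_prob)
```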

Option 2:

s = df['GOD_prob']
df['market_prob'] = np.random.normal(s, 0.05*s)
# Flag rows whose draw fell outside [0, 1]
cond = (df['market_prob'] < 0) | (df['market_prob'] > 1)
# Redraw only the flagged rows until every value is valid
while cond.any():
    s = df.loc[cond, 'GOD_prob']
    df.loc[cond, 'market_prob'] = np.random.normal(s, 0.05*s)
    cond = (df['market_prob'] < 0) | (df['market_prob'] > 1)

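The first option can shift the distribution: every out-of-range draw is moved to exactly 0 or 1, which pulls the mean toward the inside of the range and shrinks the spread for rows near a bound.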
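
The size of that shift can be checked with a small experiment. This is a sketch for a single probability of 0.98 with the 5% relative standard deviation used above; the variable names are illustrative. At this value the upper bound of 1 is less than half a standard deviation away, so clipping happens often, whereas at the lower bound 0 is twenty standard deviations away and clipping there is practically never triggered.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

p = 0.98          # a probability close to the upper bound
sigma = 0.05 * p  # same relative spread as in Option 1

raw = rng.normal(p, sigma, size=200_000)
clipped = raw.clip(0, 1)

# Fraction of draws that were moved down to exactly 1
share_clipped = np.mean(raw > 1)

# Negative: values above 1 were pulled down, lowering the average
mean_shift = clipped.mean() - p

# Clipping also narrows the spread
spread_ratio = clipped.std() / raw.std()
```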

The second option can be inefficient, because the loop may need several passes, but it keeps the normal shape inside [0, 1] instead of piling values up at the bounds.
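
The redraw loop of Option 2 could also be wrapped in a reusable function with a cap on the number of passes. This is a minimal sketch, not part of the original answer: `sample_within_bounds` and its parameters are hypothetical names, and it works on NumPy arrays rather than directly on the DataFrame.

```python
import numpy as np

def sample_within_bounds(mean, rel_std, low=0.0, high=1.0, rng=None, max_rounds=1000):
    """Draw normal values around `mean`, redrawing any that fall outside [low, high]."""
    rng = np.random.default_rng() if rng is None else rng
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    out = rng.normal(mean, rel_std * mean)
    bad = (out < low) | (out > high)
    rounds = 0
    while bad.any():
        if rounds >= max_rounds:
            # Guard against an endless loop if the bounds are unreachable
            raise RuntimeError("could not draw valid values within max_rounds")
        # Redraw only the entries that are still out of range
        out[bad] = rng.normal(mean[bad], rel_std * mean[bad])
        bad = (out < low) | (out > high)
        rounds += 1
    return out

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
god_prob = rng.integers(1, 99, size=10_000) / 100  # stand-in for df["GOD_prob"]
market_prob = sample_within_bounds(god_prob, 0.05, rng=rng)
```

With a DataFrame, the result could be assigned as `df["market_prob"] = sample_within_bounds(df["GOD_prob"].to_numpy(), 0.05)`.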
