Random sampling of data in Python

Question

I have a dataframe with several columns and I need to re-sample from it, giving more weight to one category. I think np.random.choice should work, but I am not sure how to implement it. Below is the example data. I want to sample from it at random, with a 70% probability of getting an expensive home (Expensive_home = 1) and a 30% probability of getting Expensive_home = 0. How can I create the re-sampled data file? Thank you!

ID  Lot_Area    Year_Built  Full_Bath   Bedroom Sale_Price  Expensive_home
1   31770   1960    1   3   215000  0
2   11622   1961    1   2   105000  0
3   5389    1995    2   2   236500  0
4   8402    1998    2   3   180400  0
5   10176   1990    1   2   171500  0
6   6820    1985    1   1   212000  0
7   53504   2003    3   4   538000  1
8   12134   1988    2   4   164000  0
9   11394   2010    1   1   394432  1
10  19138   1951    1   2   141000  0
11  13175   1978    2   3   210000  0
12  11751   1977    2   3   190000  0
13  10625   1974    2   3   170000  0
14  7500    2000    2   3   216000  0
15  11241   1970    1   2   149000  0
16  2280    1978    2   3   146000  0
17  12858   2009    2   3   376162  1
18  12883   2009    2   3   290941  0
19  12182   2005    2   3   220000  0
20  11520   2005    2   3   275000  0

I want a similar data file, but with more randomly picked rows that have a 1 in the last column.

I'm not 100% sure, but it looks like you might be able to use df.sample with a weights argument, e.g. df.sample(n, weights=df['Expensive_home'].replace({0: 0.3, 1: 0.7})). I'm not going to make that an answer at the moment, though, as I'm not sure it gives you the results you want (you might have to provide different weights to get the desired result).
– Jon Clements, Commented Nov 5, 2019 at 23:19
Thank you. I want a data frame with the same number of cases and the same columns; I just want more of the rows with 1s in the last column.
– Pushpraj Verma, Commented Nov 6, 2019 at 0:56
So what if you specify the sample size as the number of rows and pass replace=True so it can select the same row more than once, e.g. df.sample(len(df), replace=True, weights=df['Expensive_home'].replace({0: 0.3, 1: 0.7}))? Is that (close to) what you want?
– Jon Clements, Commented Nov 6, 2019 at 1:02
Okay... and have you tried the line of code I posted above? It should work fine on your dataframe.
– Jon Clements, Commented Nov 6, 2019 at 1:12
Hi Jon, how would the code change if I want to under-sample the cases with 0s? That is, keep all the cases with Expensive_home=1 and randomly sample 30% of the cases with Expensive_home=0 (without replacement, of course), so my data will have fewer rows than the total of 20 cases. Thank you so much!
– Pushpraj Verma, Commented Nov 6, 2019 at 2:20

Jon Clements · Accepted Answer · 2019-11-06 02:34:10Z

To create a dataframe of the same length, sampling with replacement so that each expensive row has a higher chance of being selected than each non-expensive row, use:

weights = df['Expensive_home'].replace({0: 30, 1: 70})
df1 = df.sample(len(df), replace=True, weights=weights)
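One thing worth checking against the original goal: the weights passed to df.sample apply per row and are normalised to sum to 1. With 3 expensive rows and 17 others, weights of 70 and 30 make each expensive row 70/30 times as likely as each other row, but the chance that a single draw is expensive is 3*70 / (3*70 + 17*30), or about 29%, not 70%. If the aim is a 70% chance per draw of landing in the expensive class, one option is to divide each class's target by the number of rows in that class. The following is only a sketch under that reading of the question; the two-column df is a stand-in carrying the Expensive_home values from the question's table, and target and class_counts are illustrative names.

```python
import pandas as pd

# Stand-in for the question's data: IDs 7, 9 and 17 are the expensive homes.
df = pd.DataFrame({
    "ID": range(1, 21),
    "Expensive_home": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                       0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Assumed goal: probability that a single draw falls in each class.
target = {0: 0.3, 1: 0.7}

# Number of rows in each row's own class (17 for zeros, 3 for ones).
class_counts = df["Expensive_home"].map(df["Expensive_home"].value_counts())

# Spread each class's target probability evenly over the rows of that class.
weights = df["Expensive_home"].map(target) / class_counts

# Same number of rows as the original, drawn with replacement.
df1 = df.sample(len(df), replace=True, weights=weights, random_state=0)
```

The random_state argument is optional and is only there to make the draw repeatable. Because the draw is random, the share of 1s in any one sample of 20 rows will vary around 70% rather than hit it exactly.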

To create a dataframe with all of the expensive rows plus a random 30% of the non-expensive rows (sampled without replacement), you can do:

import pandas as pd
# Boolean mask: True where Expensive_home == 1
expensive = df['Expensive_home'].astype(bool)
df2 = pd.concat([df[expensive], df[~expensive].sample(frac=0.3)])
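For a repeatable version of this under-sampling step, a minimal sketch might look like the following. It assumes a stand-in df holding only the ID and Expensive_home columns from the question; kept_zeros is an illustrative name, and random_state and sort_index are optional additions rather than part of the answer's code.

```python
import pandas as pd

# Stand-in for the question's data: IDs 7, 9 and 17 are the expensive homes.
df = pd.DataFrame({
    "ID": range(1, 21),
    "Expensive_home": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                       0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Boolean mask: True where the home is expensive.
expensive = df["Expensive_home"].astype(bool)

# Draw roughly 30% of the non-expensive rows; sample() is without replacement by default.
kept_zeros = df[~expensive].sample(frac=0.3, random_state=0)

# Keep every expensive row, add the sampled rows, and restore the original row order.
df2 = pd.concat([df[expensive], kept_zeros]).sort_index()
```

Since sampling here is without replacement, no row appears twice in df2, and the result has fewer rows than the original 20.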


1 Comment

Pushpraj Verma Over a year ago

Perfect, appreciate it. Thanks
