I have a dataframe with several columns and I need to re-sample from it, giving more weight to one category. I think np.random.choice should work, but I am not sure how to implement it. Below is the example data. I want to sample from it randomly with a 70% probability of getting an expensive home (Expensive_home = 1) and a 30% probability of getting a non-expensive home (Expensive_home = 0). How can I create the re-sampled data file? Thank you!

ID  Lot_Area  Year_Built  Full_Bath  Bedroom  Sale_Price  Expensive_home
1   31770     1960        1          3        215000      0
2   11622     1961        1          2        105000      0
3   5389      1995        2          2        236500      0
4   8402      1998        2          3        180400      0
5   10176     1990        1          2        171500      0
6   6820      1985        1          1        212000      0
7   53504     2003        3          4        538000      1
8   12134     1988        2          4        164000      0
9   11394     2010        1          1        394432      1
10  19138     1951        1          2        141000      0
11  13175     1978        2          3        210000      0
12  11751     1977        2          3        190000      0
13  10625     1974        2          3        170000      0
14  7500      2000        2          3        216000      0
15  11241     1970        1          2        149000      0
16  2280      1978        2          3        146000      0
17  12858     2009        2          3        376162      1
18  12883     2009        2          3        290941      0
19  12182     2005        2          3        220000      0
20  11520     2005        2          3        275000      0

Desired output: a similar data file, but with more randomly picked 1s in the last column.

  • I'm not 100% sure, but it looks like you might be able to use df.sample with a weights argument, e.g.: df.sample(n, weights=df['Expensive_home'].replace({0:0.3, 1:0.7})) - not going to make that an answer at the moment though, as I'm not sure that gives you the results you want... (you might have to provide different weights to get the desired result) Commented Nov 5, 2019 at 23:19
  • Thank you, I want a data frame with same number of cases and same columns just want more of the rows with 1s in the last column. Commented Nov 6, 2019 at 0:56
  • So what if you specify the sample size as the number of rows and pass replace=True so it can select the same row more than once, eg: df.sample(len(df), replace=True, weights=df['Expensive_home'].replace({0:0.3, 1:0.7})) - is that (close to) what you want? Commented Nov 6, 2019 at 1:02
  • Okay... and have you tried the line of code I posted above that should work fine on your dataframe? Commented Nov 6, 2019 at 1:12
  • Hi Jon - how would the code change if I want to under sample cases with 0s? Keep all the cases with Expensive_home=1 and random sample 30% of cases with Expensive_home=0 (of course without replacement). So my data will have fewer rows than total (20) cases. Thank you so much!!! Commented Nov 6, 2019 at 2:20

1 Answer

To create a dataframe of the same length, sampling with replacement so that each expensive row is more likely to be selected than each non-expensive row, use:

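# Per-row weights: an expensive row is 70/30 times as likely to be drawn as a non-expensive row
weights = df['Expensive_home'].replace({0: 30, 1: 70})
# Draw len(df) rows with replacement; pandas normalizes the weights to sum to 1
df1 = df.sample(len(df), replace=True, weights=weights)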
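
Note that the weights passed to df.sample apply per row, not per category. The example data has 3 expensive rows and 17 non-expensive rows, so row weights of 70 and 30 give each draw a chance of 3*70 / (3*70 + 17*30) = 210/720, or about 29%, of being an expensive home. If the goal is a 70% chance per draw, one option is to divide each category's target share by its row count. This is a minimal sketch under that assumption; the names target, class_counts and class_weights are illustrative, and only two columns of the example data are reproduced:

```python
import pandas as pd

# Subset of the example data from the question: rows 7, 9 and 17 are expensive.
df = pd.DataFrame({
    "ID": range(1, 21),
    "Expensive_home": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                       0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Desired probability that a single draw falls in each category (illustrative name).
target = {0: 0.3, 1: 0.7}

# Size of the category each row belongs to, aligned row by row with df.
class_counts = df["Expensive_home"].map(df["Expensive_home"].value_counts())

# Per-row weight = category share / category size,
# so the weights within each category sum to that category's target.
class_weights = df["Expensive_home"].map(target) / class_counts

# Same number of rows as the original, drawn with replacement.
resampled = df.sample(len(df), replace=True, weights=class_weights, random_state=0)
```

Each draw is still random, so the share of 1s in any single resampled frame will vary around 70%, especially with only 20 rows.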

To create a dataframe with all of the expensive rows plus a random 30% of the non-expensive rows (sampled without replacement), you can do:

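import pandas as pd
expensive = df['Expensive_home'].astype(bool)
# Keep every expensive row, plus 30% of the rest (sample defaults to replace=False)
df2 = pd.concat([df[expensive], df[~expensive].sample(frac=0.3)])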
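
To make the undersampled result repeatable and to check it, the same idea can be run with a fixed seed. This is only a sketch: the random_state value is arbitrary, the sort_index call is an optional extra that restores the original row order, and only two columns of the example data are reproduced:

```python
import pandas as pd

# Subset of the example data from the question: rows 7, 9 and 17 are expensive.
df = pd.DataFrame({
    "ID": range(1, 21),
    "Expensive_home": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                       0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

expensive = df["Expensive_home"].astype(bool)

# Sample 30% of the non-expensive rows without replacement;
# random_state (arbitrary value) makes the selection repeatable.
kept_cheap = df[~expensive].sample(frac=0.3, random_state=42)

# Combine with all expensive rows and restore the original row order.
df2 = pd.concat([df[expensive], kept_cheap]).sort_index()

# Quick check of the class balance in the result.
print(df2["Expensive_home"].value_counts())
```

With 17 non-expensive rows, frac=0.3 keeps roughly 5 of them, so the result has fewer rows than the original 20 while retaining all three expensive homes.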

1 Comment

Perfect, appreciate it. Thanks