I have a df with these values:

name     algo      accuracy
tom       1         88
tommy     2         87
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

How can I randomly pick 4 records from df, with the condition that at least one record is picked for each unique value in the algo column? Here, the algo column has only 3 unique values (1, 2, 3).

Sample output 1:

name     algo      accuracy
tom       1         88
tommy     2         87
stuart    3         100
lincoln   1         88

Sample output 2:

name     algo      accuracy
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

1 Answer

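One way is to sample one row per algo first, then fill the remaining slots from the rows that were not selected: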

import pandas as pd

num_sample, num_algo = 4, 3

# sample exactly one row for each algo
out = df.groupby('algo').sample(n=1)

# add the remaining samples from the rows that didn't get selected
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
out = pd.concat([out, df.drop(out.index).sample(n=num_sample - num_algo)])
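
For reference, here is a minimal self-contained sketch of this first approach, using the data from the question. The helper name `sample_with_coverage`, its `key` and `seed` parameters, and the range check are illustrative assumptions, not part of the original answer. It also assumes the index of df is unique, since `drop` works on index labels.

```python
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_with_coverage(df, num_sample, key="algo", seed=None):
    """Hypothetical helper: one random row per group, then fill from the rest."""
    num_groups = df[key].nunique()
    if not num_groups <= num_sample <= len(df):
        raise ValueError("num_sample must lie between the group count and len(df)")
    # One random row from every group guarantees that each key value appears.
    first = df.groupby(key).sample(n=1, random_state=seed)
    # Fill the remaining slots from rows that were not picked yet.
    rest = df.drop(first.index).sample(n=num_sample - num_groups, random_state=seed)
    return pd.concat([first, rest])


out = sample_with_coverage(df, num_sample=4, seed=0)
print(out)
```

Passing a fixed `seed` only makes the draw repeatable; leave it as `None` for a fresh random pick on every call.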

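Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration and take the required number of samples. This is slightly more code than the first approach, but it needs a single shuffle instead of a per-group sample plus a drop, and it produces more balanced algo counts, because every algo contributes its first row before any algo contributes a second one: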

# shuffle data
df_random = df['algo'].sample(frac=1)

# enumerations of rows with the same algo
enums = df_random.groupby(df_random).cumcount()

# sort by the enumeration so the first row of every algo comes first:
enums = enums.sort_values()

# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
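
The second approach can also be wrapped in a small function, sketched below on the data from the question. The name `sample_balanced`, the `key` and `seed` parameters, and the choice of a stable sort (`kind="mergesort"`) are assumptions added for illustration. The result covers every algo only when `num_sample` is at least the number of distinct algo values, and the index of df is assumed to be unique.

```python
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_balanced(df, num_sample, key="algo", seed=None):
    """Hypothetical helper: shuffle, rank rows within each group, take the lowest ranks."""
    # Shuffle the key column; its index still points at the original rows.
    shuffled = df[key].sample(frac=1, random_state=seed)
    # Position of each row within its group, in shuffled order: 0, 1, 2, ...
    rank = shuffled.groupby(shuffled).cumcount()
    # A stable sort keeps the shuffled order among rows that share a rank.
    rank = rank.sort_values(kind="mergesort")
    # All rank-0 rows (one per group) come first, then rank-1 rows, and so on.
    return df.loc[rank.index[:num_sample]]


out = sample_balanced(df, num_sample=4, seed=0)
print(out["algo"].value_counts())
```

With 4 samples and 3 algo values, three rows are the rank-0 row of each algo and the fourth is a rank-1 row, so no algo appears more than twice.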