Random sample list values from a DataFrame column

Question

I have the following test DataFrame:

| tag      | list                                                | Count |
| -------- | ----------------------------------------------------|-------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |  5    |
| potato   | [['U',0.8],['V',0.7],['W',0.4],['X',0.3]]           |  4    |
| cheese   | [['I',0.2],['J',0.4]]                               |  2    |

I want to randomly sample the list column, picking any 3 items from the first 4 sublists of each row. (For example, ['E',0.1] is never considered for tag = icecream.)

The rule should pick 3 sublists at random from each list of lists. If there are fewer than 3, it should pick whatever is there and shuffle it.

The result should be random on each run, so I need to seed it to get the same output:

| tag      | list                           | 
| -------- | -------------------------------|
| icecream | [['B',0.6],['C',0.5],['A',0.9]]|
| potato   | [['W',0.4],['X',0.3],['U',0.8]]|
| cheese   | [['J',0.4],['I',0.2]]          |

This is what I tried:

import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]], 
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3]]],
        ['cheese',[['I',0.2],['J',0.4]]]]

df = pd.DataFrame(data, columns=['tag', 'list'])
# Number of [label, score] pairs in each row's list
df['Count'] = df['list'].str.len()
df
--

import random
item_top_3 =  []
find = 4
num = 3
for i in range(df.shape[0]):
    item_id = df["tag"].iloc[i]
    whole_list = df["list"].iloc[i]
    item_top_3.append([item_id, random.sample(whole_list[0:find], num)])

--
I get this error:
ValueError: Sample larger than population or is negative.

Can anyone help with randomizing it? The original DataFrame has over 50,000 rows, and the solution should work for any rule: for example, tomorrow someone may want to pick 5 random items from the first 20 elements of each list of lists.

Can you provide a DataFrame constructor of the input?

– mozway, Commented Aug 3, 2022 at 5:56
@mozway - updated it in the question. Can you check?

– trojan horse, Commented Aug 3, 2022 at 6:02

mozway · Accepted Answer · 2022-08-03 06:16:14Z

1

Use a list comprehension combined with random.sample:

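import random

find = 4  # only the first `find` items of each list are candidates
num = 3   # number of items to draw per row
# Cap k at the size of the sliced list so short lists (or find < num) don't raise ValueError
df['list'] = [random.sample(l[:find], k=min(num, len(l[:find]))) for l in df['list']]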

output:

        tag                            list  Count
0  icecream  [[C, 0.5], [B, 0.6], [D, 0.3]]      5
1    potato  [[V, 0.7], [U, 0.8], [X, 0.3]]      4
2    cheese            [[J, 0.4], [I, 0.2]]      2
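
The question also asks for repeatable results. One way to get them is to draw from a dedicated random.Random instance created with a fixed seed. The following is a minimal sketch under that assumption, not part of the original answer; the helper name sample_head, the column name sampled, and the seed value 42 are hypothetical.

```python
import random

import pandas as pd


def sample_head(items, find, num, rng):
    """Return up to `num` distinct items drawn from the first `find` items, in random order."""
    head = items[:find]
    # Cap k at len(head) so short lists (or find < num) never raise ValueError.
    return rng.sample(head, k=min(num, len(head)))


df = pd.DataFrame({
    "tag": ["icecream", "potato", "cheese"],
    "list": [
        [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]],
        [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3]],
        [["I", 0.2], ["J", 0.4]],
    ],
})

# Hypothetical seed: any fixed integer makes the draws repeatable across runs.
rng = random.Random(42)
df["sampled"] = [sample_head(items, find=4, num=3, rng=rng) for items in df["list"]]
```

Re-creating the generator with the same seed and repeating the comprehension should reproduce the same picks, while changing find and num covers rules such as "5 items from the first 20".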

edited Aug 3, 2022 at 6:16

answered Aug 3, 2022 at 6:10


4 Comments

trojan horse Over a year ago

But E shouldn't appear in the results, correct? It is the 5th item in the list for icecream.

trojan horse Over a year ago

I think we have to do find = 3, but this works. Thanks again, mozway. You are awesome!

mozway Over a year ago

I pasted the wrong output. find should be 4 ;)

trojan horse Over a year ago

yeah, it is fine

ko3 · Accepted Answer · 2022-08-03 07:24:41Z

Alternatively, you can combine np.random.choice with apply after creating a temporary list column that contains only the first n elements of your original list column.

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "tag": ["icecream", "potato", "cheese"],
    "list": [[['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]], [['U',0.8],['V',0.7],['W',0.4],['X',0.3]], [['I',0.2],['J',0.4]]],
    "count": [5, 4, 2]
})

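first_n = 4
size = 3
# Temporary column: the first `first_n` items of each list, as a NumPy array
df["ls_tmp"] = df["list"].str[:first_n].apply(np.array)
# np.random.choice draws with replacement by default, so a row can contain duplicates;
# pass replace=False for distinct items. min() caps the draw for short lists.
df["list"] = df["ls_tmp"].apply(lambda x: list(x[np.random.choice(len(x), size=min(size, len(x)))]))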

You can also write a helper function and use map instead of apply, which may be slightly faster and is easier to reuse:

def randomize(x, size=3):
    # Cap the draw at len(x) so short lists are not over-sampled
    return list(x[np.random.choice(len(x), size=min(size, len(x)))])

df["list"] = df["ls_tmp"].map(randomize)

Output:

    tag       list                              count   ls_tmp
0   icecream  [[A, 0.9], [A, 0.9], [C, 0.5]]    5       [[A, 0.9], [B, 0.6], [C, 0.5], [D, 0.3]]
1   potato    [[W, 0.4], [V, 0.7], [V, 0.7]]    4       [[U, 0.8], [V, 0.7], [W, 0.4], [X, 0.3]]
2   cheese    [[J, 0.4], [J, 0.4]]              2       [[I, 0.2], [J, 0.4]]

where the column ls_tmp contains the original first n values.
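
The output above contains repeated items because np.random.choice samples with replacement by default. If distinct, reproducible picks are wanted, a variant along the following lines may help. It is only a sketch: the helper name sample_without_replacement, the column name sampled, and the seed 0 are hypothetical, and it assumes NumPy's Generator API (np.random.default_rng) is available.

```python
import numpy as np
import pandas as pd


def sample_without_replacement(items, first_n, size, rng):
    """Pick up to `size` distinct items from the first `first_n` items."""
    head = items[:first_n]
    # Sample positions rather than the items themselves, so the mixed
    # str/float pairs are never coerced into a single NumPy dtype.
    idx = rng.choice(len(head), size=min(size, len(head)), replace=False)
    return [head[i] for i in idx]


df = pd.DataFrame({
    "tag": ["icecream", "potato", "cheese"],
    "list": [
        [["A", 0.9], ["B", 0.6], ["C", 0.5], ["D", 0.3], ["E", 0.1]],
        [["U", 0.8], ["V", 0.7], ["W", 0.4], ["X", 0.3]],
        [["I", 0.2], ["J", 0.4]],
    ],
})

# Hypothetical seed: a seeded Generator makes the draws repeatable.
rng = np.random.default_rng(0)
df["sampled"] = df["list"].map(
    lambda items: sample_without_replacement(items, first_n=4, size=3, rng=rng)
)
```

Because only indices pass through NumPy, the scores stay floats instead of being converted to strings, which is what happens when the mixed pairs are wrapped in np.array.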
