Data manipulation in Pandas/Python

Question

This seems like a simple data manipulation operation, but I am stuck on it.

I have a recommendation dataset for a campaign.

Masteruserid content 

1             100
1             101
1             102
2             100
2             101
2             110

Now, for each user we want to recommend at least 5 content items. For instance, Masteruserid 1 has three recommendations, so I want to pick the remaining two at random from the globally viewed content, which is a separate dataset (a list). I also have to check for duplicates, in case a randomly picked item is already present for that user in the raw dataset.

global_content
100
300
301
101

In practice I have around 4000+ Masteruserids. I would like help with how to start approaching this.

What exactly is the expected output, or your question/problem? It sounds like you want to select any 2 elements from "global" that are not in the content of the "campaign", which is very similar to a SQL statement.
– OneCricketeer, Commented Aug 23, 2016 at 13:54
Yes, I want 5 elements for each Masteruserid, so any missing element is picked from the global list. I want to do this in Python.
– user2906657, Commented Aug 23, 2016 at 13:58
As far as I know, SQL can be translated into DataFrame logic very well. You should edit your question to include some attempt at the problem.
– OneCricketeer, Commented Aug 23, 2016 at 14:04

piRSquared · Accepted Answer · 2016-08-23 14:27:45Z

1

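import numpy as np
import pandas as pd


def add_content(df, gc, k=5):
    # Pad one user's rows up to k items with random global content.
    # gc is assumed to be a one-column DataFrame (or Series) of content ids.
    n = len(df)
    if n >= k:
        return df  # already enough recommendations; return the group unchanged
    # Only content this user does not already have is eligible.
    choices = list(set(gc.squeeze()).difference(df.content))
    m = min(k - n, len(choices))  # cannot sample more than is available
    mc = np.random.choice(choices, m, replace=False)
    ids = np.repeat(df.Masteruserid.iloc[-1], m)
    data = dict(Masteruserid=ids, content=mc)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    return pd.concat([df, pd.DataFrame(data)], ignore_index=True)


# Iterating over the groups keeps the Masteruserid column available in each
# group, regardless of how groupby.apply treats grouping columns.
groups = [g for _, g in df.groupby('Masteruserid')]
pd.concat([add_content(g, gc) for g in groups], ignore_index=True)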
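To try the idea end to end on the sample data from the question, here is a minimal self-contained sketch. The function name `pad_recommendations` and the `seed` parameter are hypothetical, introduced only for illustration; it uses NumPy's `default_rng` instead of the answer's `np.random.choice`, and it assumes the global content is a plain list.

```python
import numpy as np
import pandas as pd


def pad_recommendations(df, global_content, k=5, seed=None):
    """Pad each user's rows up to k items using unseen global content."""
    rng = np.random.default_rng(seed)
    pool = np.unique(np.asarray(global_content))
    parts = []
    for user, group in df.groupby("Masteruserid"):
        parts.append(group)
        missing = k - len(group)
        # Content the user already has is excluded, so no duplicates arise.
        candidates = np.setdiff1d(pool, group["content"].to_numpy())
        if missing <= 0 or len(candidates) == 0:
            continue
        # Never ask for more items than the candidate pool can supply.
        size = min(missing, len(candidates))
        extra = rng.choice(candidates, size=size, replace=False)
        parts.append(pd.DataFrame({"Masteruserid": user, "content": extra}))
    return pd.concat(parts, ignore_index=True)


# Sample data copied from the question.
df = pd.DataFrame({
    "Masteruserid": [1, 1, 1, 2, 2, 2],
    "content": [100, 101, 102, 100, 101, 110],
})
global_content = [100, 300, 301, 101]

padded = pad_recommendations(df, global_content, k=5, seed=0)
print(padded)
```

With this sample, each user has exactly two unseen global items (300 and 301), so both end up with five rows. When the global list is too short to reach `k`, the sketch returns as many as it can rather than raising.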

answered Aug 23, 2016 at 14:27


Merlin · Accepted Answer · 2016-08-23 20:17:42Z

0

Try this, using the following as the global content list:

df2['global_content']

0    100
1    300
2    301
3    101
4    400
5    500
6    401
7    501

recs = pd.DataFrame()
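# Sample 5 - len(x) extra items so every user ends up with five.
# Masking with ~isin() introduces NaN, so the sampled values come back as floats.
recs['content'] = df.groupby('Masteruserid')['content'].apply(lambda x: list(x) + np.random.choice(df2[~df2.isin(list(x))].dropna().values.flatten(), max(5 - len(x), 0), replace=False).tolist())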
recs

                                    content
Masteruserid                               
1             [100, 101, 102, 300.0, 301.0]
2             [100, 101, 110, 501.0, 301.0]
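If one row per recommendation is preferred, matching the layout of the original dataset, the list-valued column can be flattened with `DataFrame.explode`. This is a sketch under assumptions: the `recs` frame below is hard-coded to resemble the output above (written as integers), and `long_recs` is a name chosen for illustration.

```python
import pandas as pd

# Illustrative stand-in for the list-valued result shown above.
recs = pd.DataFrame(
    {"content": [[100, 101, 102, 300, 301], [100, 101, 110, 501, 301]]},
    index=pd.Index([1, 2], name="Masteruserid"),
)

# explode() emits one row per list element and repeats the index value.
long_recs = recs.explode("content").reset_index()

# explode() leaves an object column; cast back to integers.
long_recs["content"] = long_recs["content"].astype(int)
print(long_recs)
```

The cast to `int` also cleans up float values such as `300.0` if they are present in the lists.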

edited Aug 23, 2016 at 20:17

answered Aug 23, 2016 at 16:47

