4

I have a dataframe like below.

user cat  val
u1   cat1 1
u1   cat2 2
u1   cat3 3
u2   cat1 5
u3   cat4 4

And wish to transform it like this.

user cat  val(avg)
u1   cat1 1
u1   cat2 2
u1   cat3 3
u1   cat4 4
u2   cat1 5
u2   cat2 2
u2   cat3 3
u2   cat4 4
u3   cat1 3
u3   cat2 2
u3   cat3 3
u3   cat4 4

There are also a couple of numeric columns to the right of cat, which hopefully can be filled with either NA; or if possible averaged out.

3
  • Can you clarify what you mean by "averaged out", with an example? Commented Jun 20, 2018 at 11:41
  • Edited with example. the val column is the original value, otherwise the average of existing values. Commented Jun 20, 2018 at 11:51
  • 2
    How you calculated the avg in your example? Commented Jun 20, 2018 at 12:01

4 Answers 4

5

one way to solve this,

l1=df['user'].unique().tolist()
l2=df['cat'].unique().tolist()
new_df =  pd.DataFrame(list(itertools.product(l1,l2))).rename(columns={0:'user',1:'cat'})
new_df=pd.merge(new_df,df,on=['user','cat'],how='left')

Output:

   user   cat  val
0    u1  cat1  1.0
1    u1  cat2  2.0
2    u1  cat3  3.0
3    u1  cat4  NaN
4    u2  cat1  5.0
5    u2  cat2  NaN
6    u2  cat3  NaN
7    u2  cat4  NaN
8    u3  cat1  NaN
9    u3  cat2  NaN
10   u3  cat3  NaN
11   u3  cat4  4.0
Sign up to request clarification or add additional context in comments.

Comments

1

You want to basically reindex your DataFrame.

from itertools import product

idx = list(product(df['user'].unique(), df['cat'].unique()))
df  = df.set_index(['user', 'cat']).reindex(idx).reset_index()

The second part seems to be setting the missing values (in this case NaN after the reindex) to the average value from that category. You can accomplish this with a mapping and groupby to calculate the average.

df.loc[df.val.isnull(), 'val'] = df.loc[df.val.isnull(), 'cat'].map(df.groupby('cat').val.mean())

df is now:

   user   cat  val
0    u1  cat1  1.0
1    u1  cat2  2.0
2    u1  cat3  3.0
3    u1  cat4  4.0
4    u2  cat1  5.0
5    u2  cat2  2.0
6    u2  cat3  3.0
7    u2  cat4  4.0
8    u3  cat1  3.0
9    u3  cat2  2.0
10   u3  cat3  3.0
11   u3  cat4  4.0

Comments

1

You could use the complete function from pyjanitor to expose the missing values :

# pip install pyjanitor
import pandas as pd
import janitor as jn

df.complete('user', 'cat')
   user   cat  val
0    u1  cat1  1.0
1    u1  cat2  2.0
2    u1  cat3  3.0
3    u1  cat4  NaN
4    u2  cat1  5.0
5    u2  cat2  NaN
6    u2  cat3  NaN
7    u2  cat4  NaN
8    u3  cat1  NaN
9    u3  cat2  NaN
10   u3  cat3  NaN
11   u3  cat4  4.0

3 Comments

its very handy to use complete. I think pandas developers should add that in the main library itself.
@ShubhamSharma maybe you can raise a request for it? possibly make a PR?
I guess we could raise a PR, not sure if they will accept it but nevertheless that's a good idea. Let's do it together :-)
0

What you want is a cross join between the unique values of each column

You can do this creating two dataframes with unique values from each

df_col1 = pd.DataFrame(df['user'].unique(), columns=['user'])
df_col2 = pd.DataFrame(df['cat'].unique(), columns=['cat'])

# This step will make a dummy var in each dataframe which will be used to 
# merge the dataframes
merged = df_col1.assign(key=1).merge(df_col2.assign(key=1)).drop(['key'], 1)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.