6

I'd like to convert to below Df1 to Df2.

The empty values would be filled with Nan.

Below Dfs are examples.

  • My data has weeks from 1 to 8.
  • IDs are 100,000.

  • Only week 8 has all IDs, so total rows will be 100,000.

I have Df3 which has 100,000 of id, and I want to merge df1 on Df3 formatted as df2.

ex) pd.merge(df3, df1, on="id", how="left") -> but, formatted as df2

 Df1>
 wk, id, col1, col2  ...
 1    1   0.5  15  
 2    2   0.5  15  
 3    3   0.5  15  
 1    2   0.5  15  
 3    2   0.5  15  

 ------
 Df2>
 wk1, id, col1, col2, wk2, id, col1, col2, wk3,  id, col1, col2,...
 1    1   0.5  15      2    1   Nan   Nan   3    1   Nan   Nan
 1    2   0.5  15      2    2   0.5  15     3    2   0.5    15
 1    3   Nan  Nan     2    3   Nan   Nan   3    3   0.5    15
10
  • did you try transposing your DataFrame; DF2 = DF1.T ? Commented Aug 1, 2018 at 6:09
  • @Hamidreza I don't think that what I asked is about transposing. If so, could you tell me how it works? Commented Aug 1, 2018 at 6:14
  • You see, I need al least a small portion (example) of your code there, please do it. By the way, your wk is not the DF index, so it means you need kind of filtering based on one weeks, right? Commented Aug 1, 2018 at 6:19
  • @Hamidreza Yes, you're right. It's needed to be filtered by wk and added by columns. Wk is not the index. I don't have this code. Above Df1 is an example to explain. If you need bigger one or code about the Df1, I will happily write down. Commented Aug 1, 2018 at 6:25
  • Plus, I think IDs could be the index of D2. Commented Aug 1, 2018 at 6:42

2 Answers 2

4

Use:

#create dictionary for rename columns for correct sorting
d = dict(enumerate(df.columns))
d1 = {v:k for k, v in d.items()}

#first add missing values for each `wk` and `id`
df1 = df.set_index(['wk', 'id']).unstack().stack(dropna=False).reset_index()

#for each id create DataFrame, reshape by unstask and rename columns
df1 = (df1.groupby('id')
       .apply(lambda x: pd.DataFrame(x.values, columns=df.columns))
       .unstack()
       .reset_index(drop=True)
       .rename(columns=d1, level=0)
       .sort_index(axis=1, level=1)
       .rename(columns=d, level=0))

#convert values to integers if necessary
df1.loc[:, ['wk', 'id']] = df1.loc[:, ['wk', 'id']].astype(int)

#flatten MultiIndex in columns
df1.columns = ['{}_{}'.format(a, b) for a, b in df1.columns]
print (df1)

   wk_0  id_0  col1_0  col2_0  wk_1  id_1  col1_1  col2_1  wk_2  id_2  col1_2  \
0     1     1     0.5    15.0     2     1     NaN     NaN     3     1     NaN   
1     1     2     0.5    15.0     2     2     0.5    15.0     3     2     0.5   
2     1     3     NaN     NaN     2     3     NaN     NaN     3     3     0.5   

   col2_2  
0     NaN  
1    15.0  
2    15.0  
Sign up to request clarification or add additional context in comments.

Comments

2

You can use GroupBy + concat. The idea is to create a list of dataframes with appropriately named columns and appropriate index. The concatenate along axis=1:

d = {k: v.reset_index(drop=True) for k, v in df.groupby('wk')}

def formatter(df, key):
    return df.rename(columns={'w': f'wk{key}'}).set_index('id')

L = [formatter(df, key) for key, df in d.items()]
res = pd.concat(L, axis=1).reset_index()

print(res)

   id   wk  col1  col2   wk  col1  col2   wk  col1  col2
0   1  1.0   0.5  15.0  NaN   NaN   NaN  NaN   NaN   NaN
1   2  1.0   0.5  15.0  2.0   0.5  15.0  3.0   0.5  15.0
2   3  NaN   NaN   NaN  NaN   NaN   NaN  3.0   0.5  15.0

Note NaN forces your series to become float. There's no "good" fix for this.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.