1

I have medical data split into two different CSVs, and I need to merge them. One data set contains basic demographic information, and the second contains diagnosis codes. Each patient is assigned a unique identification number called INC_KEY, which I've simplified to simple numbers, as shown in this example:

df1:

INC_KEY   SEX    AGE
1         F      40
2         F      24  
3         M      66

df2:

INC_KEY   DCODE
1         BW241ZZ
1         BW28ZZZ
2         0BH17EZ
3         05H633Z
2         4A103BD
3         BR30ZZZ   
1         BF42ZZZ

I need to merge the two dataframes with the output containing the three rows as seen in df1 with appended columns for each dcode respective to that patient. Like this:

INC_KEY   SEX    AGE   DCODE1     DCODE2     DCODE3
1         F      40    BW241ZZ    BW28ZZZ    BF42ZZZ
2         F      24    0BH17EZ    4A103BD    N/A
3         M      66    05H633Z    BR30ZZZ    N/A 

How can I go about this? I've tried to do a left merge but it does not give the result I am looking for.

2 Answers 2

1

You can combine the two dataframes on the INC_KEY column using .merge. Then, you can use .groupby() and pd.concat() to turn individual rows into the desired columns. Finally, you can drop the original "DCODE" column using .drop():

df = df1.merge(df2, on="INC_KEY", how="right")
df = df.groupby(["INC_KEY", "SEX", "AGE"]).agg({"DCODE": list}).reset_index()
df = pd.concat(
    (df, pd.DataFrame(df["DCODE"].values.tolist()).add_prefix("DCODE")), 
    axis=1
)
df = df.drop("DCODE", axis=1)

This outputs:

   INC_KEY SEX  AGE   DCODE0   DCODE1   DCODE2
0        1   F   40  BW241ZZ  BW28ZZZ  BF42ZZZ
1        2   F   24  0BH17EZ  4A103BD     None
2        3   M   66  05H633Z  BR30ZZZ     None
Sign up to request clarification or add additional context in comments.

Comments

0

Here's another way:

df_out = df1.merge(df2, on='INC_KEY')
df_out = df_out.set_index(['INC_KEY', 'SEX', 'AGE', df_out.groupby('INC_KEY').cumcount()]).unstack()
df_out.columns = [f'{i}{j}' for i, j in df_out.columns]
df_out.reset_index()

Output:

   INC_KEY SEX  AGE   DCODE0   DCODE1   DCODE2
0        1   F   40  BW241ZZ  BW28ZZZ  BF42ZZZ
1        2   F   24  0BH17EZ  4A103BD      NaN
2        3   M   66  05H633Z  BR30ZZZ      NaN

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.