I have the following pandas DataFrame (HC_subset_umls):

    term             code             source  term_normlz      CUI       CODE        SAB   TTY   STR
0   B-cell lymphoma  meddra:10003899  meddra  b-cell lymphoma  C0079731  MTHU019696  OMIM  PTCS  b-cell lymphoma
1   B-cell lymphoma  meddra:10003899  meddra  b-cell lymphoma  C0079731  10003899    MDR   PT    b-cell lymphoma
2   Astrocytoma      meddra:10003571  meddra  astrocytoma      C0004114  10003571    MDR   PT    astrocytoma
3   Astrocytoma      meddra:10003571  meddra  astrocytoma      C0004114  D001254     MSH   MH    astrocytoma

I would like to group rows based on common CUI and generate new columns.

The desired output is:

    term             code             source  term_normlz      CUI       OMIM_CODE   OMIM_TTY  OMIM_STR         MDR_CODE  MDR_TTY  MDR_STR          MSH_CODE  MSH_TTY  MSH_STR
0   B-cell lymphoma  meddra:10003899  meddra  b-cell lymphoma  C0079731  MTHU019696  PTCS      b-cell lymphoma  10003899  PT       b-cell lymphoma  NA        NA       NA
2   Astrocytoma      meddra:10003571  meddra  astrocytoma      C0004114  NA          NA        NA               10003571  PT       astrocytoma      D001254   MH       astrocytoma

I am using the following code:

HC_subset_umls['OMIM_CODE'] = (
    HC_subset_umls['CUI']
    .map(
        HC_subset_umls
        .groupby('CUI')
        .apply(lambda x: x.loc[x['SAB'].isin(['OMIM']), 'CODE'].values[0])
    )
)


HC_subset_umls['OMIM_TERM'] = (
    HC_subset_umls['CUI']
    .map(
        HC_subset_umls
        .groupby('CUI')
        .apply(lambda x: x.loc[x['SAB'].isin(['OMIM']), 'STR'].values[0])
    )
)

HC_subset_umls['OMIM_TTY'] = (
    HC_subset_umls['CUI']
    .map(
        HC_subset_umls
        .groupby('CUI')
        .apply(lambda x: x.loc[x['SAB'].isin(['OMIM']), 'TTY'].values[0])
    )
)

HC_subset_umls = HC_subset_umls[~(HC_subset_umls['SAB'].isin(['OMIM']))]

I then repeat this for the other 'SAB' values, such as 'MDR'. However, I get the following error:

IndexError: index 0 is out of bounds for axis 0 with size 0
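
For reference, here is a minimal sketch that reproduces the error on a hand-built copy of the frame above. The names `toy`, `astro`, `omim_rows` and `first_or_na` are illustrative and not part of my original code. As far as I can tell, the Astrocytoma group has no OMIM row, so the filtered selection is empty and `.values[0]` has nothing to index.

```python
import pandas as pd

# Hand-built copy of the relevant columns from the table above (an assumption,
# the real HC_subset_umls is larger).
toy = pd.DataFrame({
    "CUI":  ["C0079731", "C0079731", "C0004114", "C0004114"],
    "CODE": ["MTHU019696", "10003899", "10003571", "D001254"],
    "SAB":  ["OMIM", "MDR", "MDR", "MSH"],
})

# The Astrocytoma group only has MDR and MSH rows.
astro = toy[toy["CUI"] == "C0004114"]

# Same selection as inside the lambda: it is empty for this group.
omim_rows = astro.loc[astro["SAB"].isin(["OMIM"]), "CODE"]

error_message = None
try:
    omim_rows.values[0]  # indexing an empty array raises IndexError
except IndexError as exc:
    error_message = str(exc)

# A guarded lookup that falls back to a missing value instead of raising.
first_or_na = omim_rows.iloc[0] if len(omim_rows) else pd.NA
```

The guarded lookup on the last line avoids the exception, but it would still have to be repeated for every 'SAB' value and every column.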

Any help is highly appreciated.

  • You need to provide runnable code. It is not clear what HC_subset_umls is. Make your question reproducible. Commented Dec 20, 2022 at 19:02
  • HC_subset_umls is the dataframe. What does 'runnable code' mean? Thanks. Commented Dec 20, 2022 at 19:04
  • Create a toy example. Then people can play with it and help you. This piece of code is not useful on its own. Commented Dec 20, 2022 at 19:16

1 Answer

Try using groupby, unstack, and then flattening the MultiIndex column headers:

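# df is the question's HC_subset_umls DataFrame.
df_out = (df.groupby(['term', 'code', 'source', 'term_normlz', 'CUI', 'SAB'])
            .first()                   # one row per (CUI, SAB) pair
            .unstack()                 # move the SAB values into the column index
            .swaplevel(0, 1, axis=1))  # put SAB ahead of CODE/TTY/STR
# Flatten the MultiIndex columns into names such as 'OMIM_CODE'.
df_out.columns = df_out.columns.map('_'.join)
df_out = df_out.reset_index()
df_out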

Output:

    term             code  source      term_normlz       CUI  MDR_CODE MSH_CODE   OMIM_CODE MDR_TTY MSH_TTY OMIM_TTY          MDR_STR      MSH_STR         OMIM_STR
0      Astrocytoma  meddra:10003571  meddra      astrocytoma  C0004114  10003571  D001254         NaN      PT      MH      NaN      astrocytoma  astrocytoma              NaN
1  B-cell lymphoma  meddra:10003899  meddra  b-cell lymphoma  C0079731  10003899      NaN  MTHU019696      PT     NaN     PTCS  b-cell lymphoma          NaN  b-cell lymphoma
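
To make this reproducible end to end, here is a self-contained sketch that rebuilds the four sample rows by hand and applies the same groupby/unstack idea. The frame contents are copied from the question; the names `keys` and `wide` are illustrative. As a small variation, the flattened column names are built directly from the (field, SAB) tuples instead of calling swaplevel first.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "term": ["B-cell lymphoma", "B-cell lymphoma", "Astrocytoma", "Astrocytoma"],
    "code": ["meddra:10003899", "meddra:10003899",
             "meddra:10003571", "meddra:10003571"],
    "source": ["meddra"] * 4,
    "term_normlz": ["b-cell lymphoma", "b-cell lymphoma",
                    "astrocytoma", "astrocytoma"],
    "CUI": ["C0079731", "C0079731", "C0004114", "C0004114"],
    "CODE": ["MTHU019696", "10003899", "10003571", "D001254"],
    "SAB": ["OMIM", "MDR", "MDR", "MSH"],
    "TTY": ["PTCS", "PT", "PT", "MH"],
    "STR": ["b-cell lymphoma", "b-cell lymphoma", "astrocytoma", "astrocytoma"],
})

# Columns that identify one output row.
keys = ["term", "code", "source", "term_normlz", "CUI"]

# One row per (keys, SAB), then pivot SAB into the column index.
wide = df.groupby(keys + ["SAB"]).first().unstack()

# Columns are (field, SAB) tuples; join them as SAB_field, e.g. 'OMIM_CODE'.
wide.columns = [f"{sab}_{field}" for field, sab in wide.columns]
wide = wide.reset_index()

print(wide)
```

Because unstack fills combinations that do not exist with NaN, a CUI without an OMIM or MSH row simply gets missing values in those columns, and no per-source lookup is needed.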