1

I have this dataframe(df), that looks like

+-----------------+-----------+----------------+---------------------+--------------+-------------+
|      Gene       | Gene name |     Tissue     |      Cell type      |    Level     | Reliability |
+-----------------+-----------+----------------+---------------------+--------------+-------------+
| ENSG00000001561 | ENPP4     | adipose tissue | adipocytes          | Low          | Approved    |
| ENSG00000001561 | ENPP4     | adrenal gland  | glandular cells     | High         | Approved    |
| ENSG00000001561 | ENPP4     | appendix       | glandular cells     | Medium       | Approved    |
| ENSG00000001561 | ENPP4     | appendix       | lymphoid tissue     | Low          | Approved    |
| ENSG00000001561 | ENPP4     | bone marrow    | hematopoietic cells | Medium       | Approved    |
| ENSG00000002586 | CD99      | adipose tissue | adipocytes          | Low          | Supported   |
| ENSG00000002586 | CD99      | adrenal gland  | glandular cells     | Medium       | Supported   |
| ENSG00000002586 | CD99      | appendix       | glandular cells     | Not detected | Supported   |
| ENSG00000002586 | CD99      | appendix       | lymphoid tissue     | Not detected | Supported   |
| ENSG00000002586 | CD99      | bone marrow    | hematopoietic cells | High         | Supported   |
| ENSG00000002586 | CD99      | breast         | adipocytes          | Not detected | Supported   |
| ENSG00000003056 | M6PR      | adipose tissue | adipocytes          | High         | Approved    |
| ENSG00000003056 | M6PR      | adrenal gland  | glandular cells     | High         | Approved    |
| ENSG00000003056 | M6PR      | appendix       | glandular cells     | High         | Approved    |
| ENSG00000003056 | M6PR      | appendix       | lymphoid tissue     | High         | Approved    |
| ENSG00000003056 | M6PR      | bone marrow    | hematopoietic cells | High         | Approved    |
+-----------------+-----------+----------------+---------------------+--------------+-------------+

Expected output:


+-----------+--------+-------------------------------+
| Gene name | Level  |            Tissue             |
+-----------+--------+-------------------------------+
| ENPP4     | Low    | adipose tissue, appendix      |
| ENPP4     | High   | adrenal gland, bronchus       |
| ENPP4     | Medium | appendix, breast, bone marrow |
| CD99      | Low    | adipose tissue, appendix      |
| CD99      | High   | bone marrow                   |
| CD99      | Medium | adrenal gland                 |
| ...       | ...    | ...                           |
+-----------+--------+-------------------------------+

code used (took help from multiple if else conditions in pandas dataframe and derive multiple columns):

def text_df(df):
    if (df[df['Level'].str.match('High')]):
        return (df.assign(Level='High') + df['Tissue'].astype(str))
    elif (df[df['Level'].str.match('Medium')]):
        return (df.assign(Level='Medium') + df['Tissue'].astype(str))
    elif (df[df['Level'].str.match('Low')]):
        return (df.assign(Level='Low') + df['Tissue'].astype(str))

df = df.apply(text_df, axis = 1)

Error: KeyError: ('Level', 'occurred at index 172') I can't understand what am I doing wrong. any suggestion?

1 Answer 1

3

Try:

df.groupby(['Gene name','Level'], as_index=False)['Cell type'].agg(', '.join)

Output:

|    | Gene name   | Level        | Cell type                                                                                                       |
|---:|:------------|:-------------|:----------------------------------------------------------------------------------------------------------------|
|  0 | CD99        | High         | hematopoietic cells                                                                                             |
|  1 | CD99        | Low          | adipocytes                                                                                                      |
|  2 | CD99        | Medium       | glandular cells                                                                                                 |
|  3 | CD99        | Not detected | glandular cells     ,  lymphoid tissue     ,  adipocytes                                                        |
|  4 | ENPP4       | High         | glandular cells                                                                                                 |
|  5 | ENPP4       | Low          | adipocytes          ,  lymphoid tissue                                                                          |
|  6 | ENPP4       | Medium       | glandular cells     ,  hematopoietic cells                                                                      |
|  7 | M6PR        | High         | adipocytes          ,  glandular cells     ,  glandular cells     ,  lymphoid tissue     ,  hematopoietic cells |

Update added per comments below:

(df.groupby(['Gene name','Level'], as_index=False)['Cell type']
   .agg(','.join).set_index(['Gene name','Level'])['Cell type']
   .unstack().reset_index())

Output:

| Gene name   |  High                                                                                                           |  Low                                   |  Medium                                    |  Not detected                                            |
|:------------|:----------------------------------------------------------------------------------------------------------------|:---------------------------------------|:-------------------------------------------|:---------------------------------------------------------|
| CD99        | hematopoietic cells                                                                                             | adipocytes                             | glandular cells                            | glandular cells     ,  lymphoid tissue     ,  adipocytes |
| ENPP4       | glandular cells                                                                                                 | adipocytes          ,  lymphoid tissue | glandular cells     ,  hematopoietic cells | nan                                                      |
| M6PR        | adipocytes          ,  glandular cells     ,  glandular cells     ,  lymphoid tissue     ,  hematopoietic cells | nan                                    | nan                                        | nan                                                      |
Sign up to request clarification or add additional context in comments.

8 Comments

if I may ask Boston, how do you read the data in? I have tried read_clipboard() but what comes out is not right
@sammywemmy I used these statements... df = pd.read_clipboard(sep='|', header=None) df = df.drop([0,7], axis=1).set_axis(['Gene','Gene name','Tissue','Cell type', 'Level','Reliability'], axis=1) Oh.. and I didn't copy the header stuff just that data.
@sammywemmy I could have done a .str.strip to cleanup some of that whitespace too, but I think the bulk of what needs to done is above.
@ScottBoston It worked thanks! I am trying to add another complication to it. I wished to get columns like this in new df: Gene name | Level:High|Level:Medium|Level:Low and these columns to have corresponding values from first df. Any suggestions?
@ScottBoston ..This is great one liner. Thanks for the help (y)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.