4

I currently have a wide dataframe that looks like this:

Index ID1 ID2 Foc_A Foc_B Foc_C Sat_A Sat_B Sat_C
   0   r  1   10    15    17    100   105   107
   1   r  2   20    25    27    110   115   117
   2   s  1   30    35    37    120   125   127
   3   s  2   40    45    47    130   135   137

Each entry has multiple identifier columns (ID1 and ID2). I then have two separate categories of measurements (Foc and Sat) each of which contains multiple identifiers (A, B, C)(the category identifiers are shared between categories). I'm eventually needing to plot it in a facet_grid with x and y as each category of measurement and separated by category identifier, so I'm trying to reshape it so that it looks like this:

Index ID1 ID2 Ch  Foc  Sat 
   0   r  1   A   10   100 
   1   r  1   B   15    105
   2   r  1   C   17    107
   3   r  2   A   20    110 
   4   r  2   B   25    115 
   5   r  2   C   27    117 
   6   s  1   A   30    120 
   7   s  1   B   35    125
   8   s  1   C   37    127

I've been trying .melt, .pivot, and .stack but not understanding what I'm doing well enough to make headway.

5 Answers 5

2

You are thinking in the right way. You can do:

# melt the dataframe
d1 = df.set_index(['Index', 'ID1', 'ID2']).stack().reset_index()

# create separate column
d1[['flag', 'Ch']] = d1['level_3'].str.split('_', expand=True)
d1 = d1.drop('level_3', 1)
d1 = d1.rename(columns = {0: 'Foc'})

# expand the dataframe to wide
d2 = pd.pivot_table(d1, index=['Index', 'ID1', 'ID2', 'Ch'], columns=['flag']).reset_index()

# fix column names
d2.columns = ['Index', 'ID1', 'ID2', 'Ch', 'Foc', 'Sat']

print(d2.head())

   Index ID1  ID2 Ch  Foc  Sat
0      0   r    1  A   10  100
1      0   r    1  B   15  105
2      0   r    1  C   17  107
3      1   r    2  A   20  110
4      1   r    2  B   25  115
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks! I have a further point of confusion if I use my real column names. There are many _ and the number can differ. My actual column names are ImageQuality_Focus and ImageQuality_Saturation (instead of Foc and Sat) and are further complicated by the fact that the A and B are variables set by the user and can contain any number of _. The set I'm currently using has 1 _ per variable for 4 of the variables and 2 _ in the final variable. So I will need to be able to parse "ImageQuality_Focus_Pre_Channel1" and "Image_Quality_Focus_Pre_Channel2_Type"
@ErinW str.split has an argument n using which we can specify at which _ we would like to split the column name: pandas.pydata.org/pandas-docs/stable/reference/api/…
1

I'd set ID columns to the index, split and expand the columns on the '_' character, then stack the dataframe:

from io import StringIO
import pandas

datafile = StringIO("""\
Index ID1 ID2 Foc_A Foc_B Foc_C Sat_A Sat_B Sat_C
   0   r  1   10    15    17    100   105   107
   1   r  2   20    25    27    110   115   117
   2   s  1   30    35    37    120   125   127
   3   s  2   40    45    47    130   135   137
""")

(
    pandas.read_csv(datafile, sep='\s+')
        .set_index(['ID1', 'ID2'])
        .drop(columns='Index')
        .pipe(lambda df:
             df.set_axis(
                 df.columns.str.split('_', expand=True),
                 axis="columns"
            )
        )
        .rename_axis([None, 'Ch'], axis='columns')
        .stack(level='Ch')
        .reset_index()
)

And that give me:

   ID1  ID2 Ch  Foc  Sat
0    r    1  A   10  100
1    r    1  B   15  105
2    r    1  C   17  107
3    r    2  A   20  110
4    r    2  B   25  115
5    r    2  C   27  117
6    s    1  A   30  120
7    s    1  B   35  125
8    s    1  C   37  127
9    s    2  A   40  130
10   s    2  B   45  135
11   s    2  C   47  137

Comments

1

Let us do wide_to_long

df = pd.wide_to_long(df,['Foc','Sat'],i=['ID1','ID2'],j='Ch',sep='_',suffix='\w+').reset_index()
Out[168]: 
   ID1  ID2 Ch  Foc  Sat
0    r    1  A   10  100
1    r    1  B   15  105
2    r    1  C   17  107
3    r    2  A   20  110
4    r    2  B   25  115
5    r    2  C   27  117
6    s    1  A   30  120
7    s    1  B   35  125
8    s    1  C   37  127
9    s    2  A   40  130
10   s    2  B   45  135
11   s    2  C   47  137

Comments

0

you could also achieve this with .melt, .groupby and np.where:

df = pd.melt(df, id_vars=['ID1','ID2','Foc_A', 'Foc_B', 'Foc_C'], var_name='Ch', value_name='Sat') \
    .groupby(['ID1','ID2','Ch']).agg({'Foc_A':'max','Foc_B':'max', 'Foc_C':'max','Sat':'max'}).reset_index()
df['Foc'] = np.where((df['Ch'] == 'Sat_A'), df['Foc_A'], '')
df['Foc'] = np.where((df['Ch'] == 'Sat_B'), df['Foc_B'], df['Foc'])
df['Foc'] = np.where((df['Ch'] == 'Sat_C'), df['Foc_C'], df['Foc'])
df['Ch'] = df['Ch'].str.replace('Sat_', '')
df = df.drop(['Foc_A', 'Foc_B', 'Foc_C'], axis=1)
df

output:

    ID1 ID2 Ch      Sat Foc
0   r   1   A       100 10
1   r   1   B       105 15
2   r   1   C       107 17
3   r   2   A       110 20
4   r   2   B       115 25
5   r   2   C       117 27
6   s   1   A       120 30
7   s   1   B       125 35
8   s   1   C       127 37
9   s   2   A       130 40
10  s   2   B       135 45
11  s   2   C       137 47

Comments

0

Creating a multi_index then using stack

df = df.set_index(['ID1','ID2'])

df.columns = df.columns.str.split('_',expand=True)

df1 = df.stack(1).reset_index().rename(columns={'level_2' : 'Ch'})

   ID1  ID2 Ch  Foc  Sat
0    r    1  A   10  100
1    r    1  B   15  105
2    r    1  C   17  107
3    r    2  A   20  110
4    r    2  B   25  115
5    r    2  C   27  117
6    s    1  A   30  120
7    s    1  B   35  125
8    s    1  C   37  127
9    s    2  A   40  130
10   s    2  B   45  135
11   s    2  C   47  137

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.