I have this data of measurements at two time values with replicates:

name  t  value  replicate
foo   1  0.5    a
foo   1  0.55   b
foo   1  0.6    c
foo   2  0.7    a
foo   2  0.71   b
foo   2  0.72   c
bar   1  0.1    a
bar   1  0.12   b
bar   1  0.3    c
bar   2  0.4    a
bar   2  0.45   b
bar   2  0.44   c

I want to parse it into a DataFrame and get the mean and standard deviation of the replicates for each time point ("t" column) and each sample ("name" column). This can be done with:

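import pandas

df = pandas.read_table("data.txt", sep="\t")
g = df.groupby(["name", "t"])
# Aggregate only the numeric column; pandas 2.0+ no longer silently drops the string "replicate" column
new_df = g[["value"]].agg(["mean", "std"])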

The problem is that new_df has a hierarchical index:

           value          
            mean       std
name t                    
bar  1  0.173333  0.110151
     2  0.430000  0.026458
foo  1  0.550000  0.050000
     2  0.710000  0.010000

How can I get a flat dataframe instead where the mean and std values are just regular columns? I tried reset_index() but that does not do it:

>>> new_df.reset_index()
  name  t     value          
               mean       std
0  bar  1  0.173333  0.110151
1  bar  2  0.430000  0.026458
2  foo  1  0.550000  0.050000
3  foo  2  0.710000  0.010000

I'd like the final DataFrame to have the columns sample, t, mean, std (or value_mean, value_std). How can this be done in pandas?

2 Answers

I would do something slightly different from MaxU. Try resetting the index to a specific column level and then drop the other column level(s).

In [5]: new_df2 = new_df.copy()

In [6]: new_df2 = new_df2.reset_index(col_level=1)

In [7]: new_df2.columns = new_df2.columns.get_level_values(1) # same level=1

In [8]: new_df2
Out[8]: 
  name  t      mean       std
0  bar  1  0.173333  0.110151
1  bar  2  0.430000  0.026458
2  foo  1  0.550000  0.050000
3  foo  2  0.710000  0.010000

Edit:

A MultiIndex can be used to set up a multi-level arrangement of either the row index or the column labels (your case). Internally, the distinct label values are stored as levels, and the position of each label within its level is stored as labels (renamed codes in pandas 0.24 and later). Like this:

In [4]: new_df.columns
Out[4]: 
MultiIndex(levels=[[u'value'], [u'mean', u'std']],
           labels=[[0, 0], [0, 1]])

By doing reset_index(col_level=1), we transform the MultiIndex into

In [5]: new_df.reset_index(col_level=1).columns
Out[5]: 
MultiIndex(levels=[[u'value', u''], [u'mean', u'std', u't', u'name']],
           labels=[[1, 1, 0, 0], [3, 2, 0, 1]])

which moves the row index levels (name and t) out of the index and places their names at level 1 (the second, lower level) of the column MultiIndex. Then columns = columns.get_level_values(1) takes the labels found at level 1 and assigns only those as the column labels, effectively dropping level 0.

Out[6]: Index([u'name', u't', u'mean', u'std'], dtype='object')
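
To make the role of get_level_values more concrete, here is a minimal, self-contained sketch. It rebuilds the question's data inline rather than reading data.txt, and it selects the value column explicitly before aggregating (both are assumptions of this sketch, not part of the session above). The droplevel line at the end is an alternative I believe is equivalent on pandas 0.24 or later, shown only for comparison.

```python
import pandas as pd

# Same measurements as in the question, built inline so the sketch runs on its own.
df = pd.DataFrame({
    "name": ["foo"] * 6 + ["bar"] * 6,
    "t": [1, 1, 1, 2, 2, 2] * 2,
    "value": [0.5, 0.55, 0.6, 0.7, 0.71, 0.72,
              0.1, 0.12, 0.3, 0.4, 0.45, 0.44],
})
new_df = df.groupby(["name", "t"])[["value"]].agg(["mean", "std"])

# Each column label is a tuple with one entry per level.
print(list(new_df.columns))                      # [('value', 'mean'), ('value', 'std')]
# get_level_values(i) returns the i-th entry of every tuple, in column order.
print(list(new_df.columns.get_level_values(0)))  # ['value', 'value']
print(list(new_df.columns.get_level_values(1)))  # ['mean', 'std']

# Put name and t at level 1 of the columns, then keep only that level.
flat = new_df.reset_index(col_level=1)
flat.columns = flat.columns.get_level_values(1)
print(list(flat.columns))                        # ['name', 't', 'mean', 'std']

# Alternative: drop column level 0 first, then reset the row index.
alt = new_df.droplevel(0, axis=1).reset_index()
```

Note that this only works cleanly because the level-1 labels are unique; with several aggregated columns you would get duplicate mean and std names.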

1 Comment

can you explain what get_level_values does here?

Try renaming your columns. First reset the index:

In [9]: new_df.reset_index(inplace=True)

Then set the column names as follows: for each column, take the level-1 label if it is non-empty, otherwise take the level-0 label:

In [14]: new_df.columns = [c[1] if c[1] else c[0] for c in new_df.columns.tolist()]

In [15]: new_df
Out[15]:
  name  t      mean       std
0  bar  1  0.173333  0.110151
1  bar  2  0.430000  0.026458
2  foo  1  0.550000  0.050000
3  foo  2  0.710000  0.010000
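
On whether this generalizes: a hedged sketch of two variants follows. The first joins all non-empty level values with an underscore, which gives the value_mean / value_std names mentioned in the question and stays unambiguous when several columns are aggregated. The second uses named aggregation, a pandas built-in available from version 0.25, which avoids the MultiIndex altogether. The inline data and the output names value_mean and value_std are choices made for this sketch.

```python
import pandas as pd

# Same measurements as in the question, built inline so the sketch runs on its own.
df = pd.DataFrame({
    "name": ["foo"] * 6 + ["bar"] * 6,
    "t": [1, 1, 1, 2, 2, 2] * 2,
    "value": [0.5, 0.55, 0.6, 0.7, 0.71, 0.72,
              0.1, 0.12, 0.3, 0.4, 0.45, 0.44],
})
new_df = df.groupby(["name", "t"])[["value"]].agg(["mean", "std"])

# Variant 1: join the non-empty parts of each column tuple.
# ('name', '') becomes 'name'; ('value', 'mean') becomes 'value_mean'.
flat = new_df.reset_index()
flat.columns = ["_".join(part for part in col if part) for col in flat.columns]
print(list(flat.columns))   # ['name', 't', 'value_mean', 'value_std']

# Variant 2: named aggregation, output_name=(input_column, function).
named = (
    df.groupby(["name", "t"])
      .agg(value_mean=("value", "mean"), value_std=("value", "std"))
      .reset_index()
)
print(list(named.columns))  # ['name', 't', 'value_mean', 'value_std']
```

Both variants assume every level value is a string; non-string labels would need converting with str before joining.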

2 Comments

can you explain what your code does and whether it will generalize? is there a pandas built in that does the same?
@mvd, i've added a comment to my answer - please check