Combining MultiIndex and Index in a PANDAS DataFrame

Question

I'm trying to come up with a DataFrame to do some data analysis and would really benefit from having a data frame that can handle regular indexing and MultiIndexing together in one data frame.

For each patient, I have 6 slices of various types of data (T1avg, T2avg, etc...). Let's call this dataframe1 (from an ipython notebook):

import numpy
import pandas
dat0 = numpy.zeros([6])
dat1 = numpy.zeros([6])
pat0=(['NecS3Hs05']*6)
pat1=(['NecS3Hs06']*6)

slc = ['Slice ' + str(x) for x in range(dat0.shape[-1])]

ind = list(zip(pat0 + pat1, slc + slc))

named_ind = pandas.MultiIndex.from_tuples(ind, names = ['Patients','Slices'])
ser = pandas.Series(numpy.append(dat0,dat1),index = named_ind)
df = pandas.DataFrame(data=ser, columns=['T1avg'])

Image of output: df1

I also have, for each patient, various strings of information (tumour type, number of imaging sessions, treatment type):

pats = ['NecS3Hs05','NecS3Hs06']
tx = ['Control','Treated']
Ttype = ['subcutaneous','orthotopic']
NSessions = ['2','3']

cols = ['Tx Group', 'Tumour Type', 'Imaging Sessions']
dat = numpy.array([tx,Ttype,NSessions]).T

df2 = pandas.DataFrame(dat, index=pats,columns=cols)

[I'd like to post a picture here as well, but I need at least 10 reputation to do so]

Ideally, I want to have a dataframe that looks as follows (sketched it out in an image editor sorry)

Image of desired output: df-desired

But when I try to use the append command,

com = df.append(df2)

I get something undesired: the MultiIndex that I set up in df is gone, replaced with a simple index of tuples (('NecS3Hs05', 'Slice 0'), etc.). The indices from df2 remain plain strings such as 'NecS3Hs05'.

Is this possible to do with PANDAS, or am I barking up the wrong tree here? Also, is this even a recommended way of storing Patient attributes in a dataframe (i.e. is this unpandas)? I think what I would really like is to keep everything a simple index, but instead store N-d arrays inside the elements of the data frame.

For instance, if I try something like:

 com['NecS3Hs05','T1avg']

I want to get an array/tuple of shape/len 6

and when I try to get the tumour type:

com['NecS3Hs05','Tumour Type']

I want to get the string 'subcutaneous'. Obviously I also want to retain the cool features of data frames. It looks like PANDAS is the right way to go here; I just need to understand a bit more about how to set up my dataframe.

I hope this is a sensible question, if not, I'd be happy to re-form it.

P.S. It feels 'wrong' to use (['NecS3Hs05']*6) to fill in entries and set up the MultiIndex this way; does anyone have a better way?
– Firas, Commented Aug 28, 2013 at 4:27
I would just use a regular DataFrame with the tumor etc. info duplicated across multiple rows.
– BrenBarn, Commented Aug 28, 2013 at 4:35
@BrenBarn I believe that's what the OP is indicating in df-desired.
– DrSAR, Commented Aug 28, 2013 at 4:47
Are you sure that df1 is what you want? Is the number of slices always the same from patient to patient? In a MultiIndex that is somewhat required.
– DrSAR, Commented Aug 28, 2013 at 6:24
Hmm, it's not necessarily required for the number of slices to be the same from patient to patient, but you're right in that I probably won't be able to just multiply by a single number and have it all work. I guess that's one reason it felt "wrong" to use MultiIndex that way.
– Firas, Commented Aug 28, 2013 at 8:39

DrSAR · Accepted Answer · 2013-08-28 06:52:55Z

Your problem can be solved, I believe, if you drop the MultiIndex business. Imagine df only has the (non-unique) 'Patients' as its index. 'Slices' would become a simple column.

# Use the (non-unique) patient ID as a plain Index instead of a MultiIndex
named_ind = pandas.Index(pat0 + pat1, name='Patients')
df = pandas.DataFrame({'T1avg': numpy.append(dat0, dat1)}, index=named_ind)
# 'Slice' is now an ordinary column
df['Slice'] = slc + slc

If you had to select on the slice, you can still do that:

df[df['Slice']=='Slice 4']

This will give you Slice 4 for all patients. Note that this layout no longer requires every patient to have that slice.

As long as your new dataframe (df2) defines the same index you can now join on that index quite simply:

df.join(df2)

and you'll get

               T1avg    Slice Tx Group   Tumour Type Imaging Sessions
Patients                                                         
NecS3Hs05      0  Slice 0  Control  subcutaneous                2
NecS3Hs05      0  Slice 1  Control  subcutaneous                2
NecS3Hs05      0  Slice 2  Control  subcutaneous                2
NecS3Hs05      0  Slice 3  Control  subcutaneous                2
NecS3Hs05      0  Slice 4  Control  subcutaneous                2
NecS3Hs05      0  Slice 5  Control  subcutaneous                2
NecS3Hs06      0  Slice 0  Treated    orthotopic                3
NecS3Hs06      0  Slice 1  Treated    orthotopic                3
NecS3Hs06      0  Slice 2  Treated    orthotopic                3
NecS3Hs06      0  Slice 3  Treated    orthotopic                3
NecS3Hs06      0  Slice 4  Treated    orthotopic                3
NecS3Hs06      0  Slice 5  Treated    orthotopic                3
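
For anyone who wants to try this on a current Python 3 / pandas install, the steps above can be sketched end to end as follows. This is a minimal sketch, not the exact code from the answer: the names slices, info, combined and slice4 are made up for illustration, and the zero-filled T1avg values simply mirror the placeholder data in the question.

```python
import numpy as np
import pandas as pd

patients = ['NecS3Hs05', 'NecS3Hs06']
n_slices = 6

# Per-slice data: one row per (patient, slice).
# The patient ID is a plain, non-unique index; the slice is an ordinary column.
slices = pd.DataFrame(
    {
        'T1avg': np.zeros(len(patients) * n_slices),
        'Slice': ['Slice %d' % i for i in range(n_slices)] * len(patients),
    },
    index=pd.Index([p for p in patients for _ in range(n_slices)], name='Patients'),
)

# Per-patient attributes: one row per patient, unique index.
info = pd.DataFrame(
    {
        'Tx Group': ['Control', 'Treated'],
        'Tumour Type': ['subcutaneous', 'orthotopic'],
        'Imaging Sessions': ['2', '3'],
    },
    index=pd.Index(patients, name='Patients'),
)

# Many-to-one join on the index: every slice row picks up its patient's attributes.
combined = slices.join(info)

# Select a single slice across all patients with a boolean mask.
slice4 = combined[combined['Slice'] == 'Slice 4']
```

Here combined has one row per slice with the patient attributes repeated, which is the layout shown in the table above.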

3 Comments

Firas Over a year ago

I had thought of this as well, but as a last resort. A couple of reasons: 1) Data is duplicated unnecessarily. I know the footprint is small, but it still seems wasteful to organize my data this way. 2) If there's anything I want to keep unique in the data frame, it's the patient info, so that it's an easy operation later to "get" all the patients that have orthotopic control tumours and have had 2 imaging sessions (this way, there'll be an extra operation to grab the unique entries).

Firas Over a year ago

3) MultiIndexing just seems so cool when described in the video by Wes and in the documentation pandas.pydata.org/pandas-docs/stable/…. I'm happy to go with this method though if there aren't any other suggestions!
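
One possible way to address the concerns in these two comments, offered only as a sketch and not as something proposed in the thread, is to keep the patient attributes in their own one-row-per-patient frame and leave the slice data MultiIndexed. The names info, wanted and selected below are hypothetical. MultiIndex.from_product is used to avoid the ['NecS3Hs05']*6 construction, under the assumption that every patient has the same slices.

```python
import numpy as np
import pandas as pd

patients = ['NecS3Hs05', 'NecS3Hs06']
slice_names = ['Slice %d' % i for i in range(6)]

# from_product builds every (patient, slice) pair.
# Assumption: all patients have the same number of slices.
idx = pd.MultiIndex.from_product([patients, slice_names], names=['Patients', 'Slices'])
df = pd.DataFrame({'T1avg': np.zeros(len(idx))}, index=idx)

# Patient attributes stay unique: one row per patient.
info = pd.DataFrame(
    {
        'Tx Group': ['Control', 'Treated'],
        'Tumour Type': ['subcutaneous', 'orthotopic'],
        'Imaging Sessions': ['2', '3'],
    },
    index=pd.Index(patients, name='Patients'),
)

# Query the small patient table first ...
wanted = info[(info['Tumour Type'] == 'orthotopic') & (info['Imaging Sessions'] == '3')].index

# ... then keep only the slice rows whose 'Patients' level is in that set.
selected = df[df.index.get_level_values('Patients').isin(wanted)]

# All six T1avg values for one patient, as a NumPy array.
t1_values = df.xs('NecS3Hs05', level='Patients')['T1avg'].values
```

The trade-off is that the data lives in two frames, so any query that mixes patient attributes with slice values needs the extra lookup step shown here.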

Phillip Cloud Over a year ago

IMHO MultiIndex should be avoided unless there's no way around it. You can easily move your MultiIndex to columns using some combination of stack/unstack/reset_index/melt. Queries will be faster and much easier if your index levels are columns in your DataFrame.
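
As a rough illustration of the reset_index route mentioned in this comment (the small frame and the names flat, first_slices and restored are invented for the example), moving the index levels into columns looks something like this. The patients deliberately have different numbers of slices.

```python
import numpy as np
import pandas as pd

# A small MultiIndexed frame with an uneven number of slices per patient.
idx = pd.MultiIndex.from_tuples(
    [('NecS3Hs05', 'Slice 0'), ('NecS3Hs05', 'Slice 1'), ('NecS3Hs06', 'Slice 0')],
    names=['Patients', 'Slices'],
)
df = pd.DataFrame({'T1avg': np.zeros(len(idx))}, index=idx)

# reset_index turns both index levels into ordinary columns.
flat = df.reset_index()

# Plain boolean masks now work on what used to be index levels.
first_slices = flat[flat['Slices'] == 'Slice 0']

# set_index restores the MultiIndex if it is needed again later.
restored = flat.set_index(['Patients', 'Slices'])
```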
