I'm trying to set up a DataFrame for some data analysis, and would really benefit from one that can handle regular indexing and MultiIndexing together.

For each patient, I have 6 slices of various types of data (T1avg, T2avg, etc...). Let's call this dataframe1 (from an ipython notebook):

import numpy
import pandas

# Placeholder data: six slices per patient
dat0 = numpy.zeros([6])
dat1 = numpy.zeros([6])
pat0 = ['NecS3Hs05'] * 6
pat1 = ['NecS3Hs06'] * 6

slc = ['Slice ' + str(x) for x in range(dat0.shape[-1])]

# Pair each patient label with its slice label
ind = list(zip(pat0 + pat1, slc + slc))

named_ind = pandas.MultiIndex.from_tuples(ind, names=['Patients', 'Slices'])
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)
df = pandas.DataFrame({'T1avg': ser})

Image of output: df1

I also have, for each patient, various strings of information (tumour type, number of imaging sessions, treatment type):

pats = ['NecS3Hs05','NecS3Hs06']
tx = ['Control','Treated']
Ttype = ['subcutaneous','orthotopic']
NSessions = ['2','3']

cols = ['Tx Group', 'Tumour Type', 'Imaging Sessions']
dat = numpy.array([tx,Ttype,NSessions]).T

df2 = pandas.DataFrame(dat, index=pats,columns=cols)

[I'd like to post a picture here as well, but I need at least 10 reputation to do so]

Ideally, I want to have a dataframe that looks as follows (sketched it out in an image editor sorry)

Image of desired output: df-desired

But when I try to use the append command,

com = df.append(df2)

I get something undesired: the MultiIndex that I set up in df is gone, replaced with a simple index of tuples (('NecS3Hs05', 'Slice 0'), etc.). The indices from df2 remain the same ('NecS3Hs05').

Is this possible to do with pandas, or am I barking up the wrong tree here? Also, is this even a recommended way of storing patient attributes in a dataframe (i.e. is this unpandas)? I think what I would really like is to keep everything on a simple index, but store N-d arrays inside the elements of the data frame.

For instance, if I try something like:

 com['NecS3Hs05','T1avg']

I want to get an array/tuple of shape/len 6

and when I try to get the tumour type:

com['NecS3Hs05','Tumour Type']

I want to get the string 'subcutaneous'. Obviously I also want to retain the cool features of data frames. It looks like pandas is the right way to go here; I just need to understand a bit more about how to set up my dataframe.

I hope this is a sensible question, if not, I'd be happy to re-form it.

  • P.S. It feels 'wrong' to use (['NecS3Hs05']*6) to fill in entries and set up the MultiIndex this way. Does anyone have a better way? Commented Aug 28, 2013 at 4:27
  • I would just use a regular DataFrame with the tumor etc. info duplicated across multiple rows. Commented Aug 28, 2013 at 4:35
  • @BrenBarn I believe that's what the OP is indicating in df-desired Commented Aug 28, 2013 at 4:47
  • Are you sure that df1 is what you want? Is the number of slices always the same from patient to patient? In a MultiIndex that is somewhat required. Commented Aug 28, 2013 at 6:24
  • Hmm, it's not necessarily required for the number of slices to be the same from patient to patient, but you're right in that I probably won't be able to just multiply by a single number and have it all work. I guess that's one reason it felt "wrong" to use MultiIndex that way. Commented Aug 28, 2013 at 8:39
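
On the P.S. above: one possible alternative to repeating labels by hand is pandas.MultiIndex.from_product, which pairs every patient with every slice. The sketch below assumes two patients with six slices each; the n_slices dictionary and its counts are hypothetical, included only to show how unequal slice counts could be handled with from_tuples instead.

```python
import numpy
import pandas

patients = ['NecS3Hs05', 'NecS3Hs06']
slices = ['Slice ' + str(x) for x in range(6)]

# Cartesian product of the two levels: every patient paired with every slice.
named_ind = pandas.MultiIndex.from_product(
    [patients, slices], names=['Patients', 'Slices'])
df = pandas.DataFrame({'T1avg': numpy.zeros(len(named_ind))}, index=named_ind)

# When the slice count differs per patient, list the pairs explicitly instead.
n_slices = {'NecS3Hs05': 6, 'NecS3Hs06': 4}  # hypothetical counts
pairs = [(pat, 'Slice ' + str(i))
         for pat, n in n_slices.items()
         for i in range(n)]
ragged_ind = pandas.MultiIndex.from_tuples(pairs, names=['Patients', 'Slices'])
```

from_product only fits when every patient has the same set of slices; the explicit list of pairs makes no such assumption.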

1 Answer

Your problem can be solved, I believe, if you drop the MultiIndex business. Imagine df only has the (non-unique) patient name as its index, named 'Patients'. The slices would become a simple column.

# A plain (non-unique) patient index replaces the MultiIndex.
named_ind = pandas.Index(pat0 + pat1, name='Patients')
ser = pandas.Series(numpy.append(dat0, dat1), index=named_ind)
df = pandas.DataFrame({'T1avg': ser})
# The slice label becomes an ordinary column.
df['Slice'] = slc + slc

If you had to select on the slice, you can still do that:

df[df['Slice']=='Slice 4']

This will give you Slice 4 for all patients. Note that with this layout, patients no longer all need to have the same set of slices.
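
To illustrate the point about unequal slice counts, here is a small sketch with made-up data in which the second patient has no 'Slice 4' at all. The values and slice labels are hypothetical; only the column and index names follow the code above.

```python
import pandas

# Hypothetical long-format data: the second patient has only two slices.
df = pandas.DataFrame(
    {'T1avg': [0.0, 0.0, 0.0, 0.0, 0.0],
     'Slice': ['Slice 0', 'Slice 1', 'Slice 4', 'Slice 0', 'Slice 1']},
    index=pandas.Index(['NecS3Hs05'] * 3 + ['NecS3Hs06'] * 2, name='Patients'))

# Boolean mask on the column; patients without 'Slice 4' simply drop out.
slice4 = df[df['Slice'] == 'Slice 4']

# Conditions combine with & (each one wrapped in parentheses).
one_row = df[(df['Slice'] == 'Slice 1') & (df.index == 'NecS3Hs06')]
```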

As long as your new dataframe (df2) defines the same index you can now join on that index quite simply:

df.join(df2)

and you'll get

               T1avg    Slice Tx Group   Tumour Type Imaging Sessions
Patients                                                         
NecS3Hs05      0  Slice 0  Control  subcutaneous                2
NecS3Hs05      0  Slice 1  Control  subcutaneous                2
NecS3Hs05      0  Slice 2  Control  subcutaneous                2
NecS3Hs05      0  Slice 3  Control  subcutaneous                2
NecS3Hs05      0  Slice 4  Control  subcutaneous                2
NecS3Hs05      0  Slice 5  Control  subcutaneous                2
NecS3Hs06      0  Slice 0  Treated    orthotopic                3
NecS3Hs06      0  Slice 1  Treated    orthotopic                3
NecS3Hs06      0  Slice 2  Treated    orthotopic                3
NecS3Hs06      0  Slice 3  Treated    orthotopic                3
NecS3Hs06      0  Slice 4  Treated    orthotopic                3
NecS3Hs06      0  Slice 5  Treated    orthotopic                3
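
The lookups asked for in the question (an array of length 6 for T1avg, a single string for the tumour type) can be approximated with .loc on the joined frame. This is a self-contained sketch that rebuilds df and df2 with the same names and placeholder values as above, assuming df2 has one row per patient; it is one way to do it, not the only one.

```python
import numpy
import pandas

patients = ['NecS3Hs05'] * 6 + ['NecS3Hs06'] * 6
slc = ['Slice ' + str(x) for x in range(6)]
df = pandas.DataFrame(
    {'T1avg': numpy.zeros(12), 'Slice': slc + slc},
    index=pandas.Index(patients, name='Patients'))
df2 = pandas.DataFrame(
    {'Tx Group': ['Control', 'Treated'],
     'Tumour Type': ['subcutaneous', 'orthotopic'],
     'Imaging Sessions': ['2', '3']},
    index=['NecS3Hs05', 'NecS3Hs06'])

com = df.join(df2)

# All six T1avg values for one patient, as a NumPy array of shape (6,).
t1 = com.loc['NecS3Hs05', 'T1avg'].to_numpy()

# Per-patient attributes are unique in df2, so a scalar lookup works there.
tumour = df2.loc['NecS3Hs05', 'Tumour Type']
```

Looking up the attribute in df2 rather than in com avoids getting the same string back six times.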

3 Comments

I had thought of this as well, but as a last resort, for a couple of reasons: 1) Data is duplicated unnecessarily. I know the footprint is small, but it still seems wasteful to organize my data this way. 2) If there's anything I want to keep unique in the data frame, it's the patient info, so that it's an easy operation later to "get" all the patients that have orthotopic control tumours and have had 2 imaging sessions (with this approach, there'll be an extra operation to grab the unique entries).
3) MultiIndexing just seems so cool when described in the video by Wes and in the documentation pandas.pydata.org/pandas-docs/stable/…. I'm happy to go with this method, though, if there aren't any other suggestions!
IMHO MultiIndex should be avoided unless there's no way around it. You can easily move your MultiIndex to columns using some combination of stack/unstack/reset_index/melt. Queries will be faster and much easier if your index levels are columns in your DataFrame.
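
As a rough sketch of the two comments above, reset_index moves the MultiIndex levels into columns, while the patient attributes can stay in their own one-row-per-patient table and be used to pick patients first. The frames are rebuilt here with the placeholder values from the question; the 'orthotopic'/'Treated' filter is just an illustrative choice.

```python
import numpy
import pandas

named_ind = pandas.MultiIndex.from_product(
    [['NecS3Hs05', 'NecS3Hs06'], ['Slice ' + str(x) for x in range(6)]],
    names=['Patients', 'Slices'])
df = pandas.DataFrame({'T1avg': numpy.zeros(12)}, index=named_ind)

# Move both index levels into ordinary columns.
flat = df.reset_index()

# Patient attributes stay in their own table, one row per patient.
df2 = pandas.DataFrame(
    {'Tx Group': ['Control', 'Treated'],
     'Tumour Type': ['subcutaneous', 'orthotopic'],
     'Imaging Sessions': ['2', '3']},
    index=pandas.Index(['NecS3Hs05', 'NecS3Hs06'], name='Patients'))

# Find patients by attribute, then pull only their measurements.
mask = (df2['Tumour Type'] == 'orthotopic') & (df2['Tx Group'] == 'Treated')
wanted = df2[mask].index
subset = flat[flat['Patients'].isin(wanted)]
```

Keeping df2 separate means no attribute is duplicated, and a join can still be done on demand when a combined view is needed.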
