Python - Pandas dataframe with tuples

Question

I have this kind of dataframe:

      A       B       C       D
0   (a,b)   (c,d)   (e,f)   (g,h)
1   (a,b)   (c,d)   (e,f)    NaN
2   (a,b)    NaN    (e,f)    NaN
3   (a,b)    NaN     NaN     NaN

so in each cell there is a tuple, and I want to make it to be like so:

  |    A     |     B     |     C     |     D
0 |  a |  b  |  c  |  d  |  e  |  f  |  g  |  h
1 |  a |  b  |  c  |  d  |  e  |  f  | NaN | NaN
2 |  a |  b  | NaN | NaN |  e  |  f  | NaN | NaN
3 |  a |  b  | NaN | NaN | NaN | NaN | NaN | NaN

so for example, in the column A, there are two columns inside of it.

Thank you.

why don't you want to create two columns per letter (eg. A1 and A2) ? — MMF
– MMF, Commented Dec 14, 2016 at 8:54

jezrael · Accepted Answer · 2016-12-14 09:47:41Z

You can use stack with DataFrame.from_records and then reshape with unstack, swaplevel for change levels in MultiIndex in columns and last sort columns by sort_index:

stacked = df.stack()
df1 = pd.DataFrame.from_records(stacked.tolist(), index = stacked.index)
        .unstack(1)
        .swaplevel(0, 1, 1)
        .sort_index(axis=1)
        .replace({None:np.nan})
print (df1)

   A       B         C         D     
   0  1    0    1    0    1    0    1
0  a  b    c    d    e    f    g    h
1  a  b    c    d    e    f  NaN  NaN
2  a  b  NaN  NaN    e    f  NaN  NaN
3  a  b  NaN  NaN  NaN  NaN  NaN  NaN

Last is possible remove MultiIndex from columns and create new column names:

stacked = df.stack()
df1 = pd.DataFrame.from_records(stacked.tolist(), index = stacked.index)
        .unstack(1)
        .swaplevel(0, 1, 1)
        .sort_index(1)
        .replace({None:np.nan})
df1.columns = ['{}{}'.format(col[0], col[1]) for col in df1.columns]
print (df1)
  A0 A1   B0   B1   C0   C1   D0   D1
0  a  b    c    d    e    f    g    h
1  a  b    c    d    e    f  NaN  NaN
2  a  b  NaN  NaN    e    f  NaN  NaN
3  a  b  NaN  NaN  NaN  NaN  NaN  NaN

Timings:

#len (df)=400

In [220]: %timeit (pir(df))
100 loops, best of 3: 3.45 ms per loop

In [221]: %timeit (jez(df))
100 loops, best of 3: 5.17 ms per loop

In [222]: %timeit (nick(df))
1 loop, best of 3: 231 ms per loop

In [223]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))
10 loops, best of 3: 152 ms per loop


#len (df)=4k

In [216]: %timeit (pir(df))
100 loops, best of 3: 16.5 ms per loop

In [217]: %timeit (jez(df))
100 loops, best of 3: 14.8 ms per loop

In [218]: %timeit (nick(df))
1 loop, best of 3: 2.34 s per loop

In [219]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))
1 loop, best of 3: 1.53 s per loop

Code for timings:

df = pd.DataFrame({"A": [('a','b'),('a','b'),('a','b'),('a','b')], 
                   'B': [('c','d'),('c','d'), np.nan,np.nan], 
                   'C':[('e','f'),('e','f'),('e','f'),np.nan],
                   'D':[('g','h'),np.nan,np.nan,np.nan]})

df = pd.concat([df]*1000).reset_index(drop=True)
print (df)

def jez(df):
    stacked = df.stack()
    return pd.DataFrame.from_records(stacked.tolist(), index = stacked.index).unstack(1).swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan})


print (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))

def nick(df):
    cols = df.columns.values.tolist()
    return pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols)

def pir(df):
    # fillna with (np.nan, np.nan)
    df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2))
    # construct MultiIndex
    col = pd.MultiIndex.from_product([df.columns, [0, 1]])
    # rip off of Nickil's pd.concat but using numpy
    return pd.DataFrame(np.hstack([np.array(s.values.tolist()) for _, s in df_.iteritems()]), columns=col)


print (jez(df))
print (nick(df))
print (pir(df))

Nickil Maveli · Accepted Answer · 2016-12-14 08:58:45Z

2

Split the tuples present in every series into individual elements using apply. Then, concatenate all such columns together columnwise and providing the headers same as the original DF using the keys argument.

cols = df.columns.values.tolist()
pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols)

answered Dec 14, 2016 at 8:58

Nickil Maveli

29.8k10 gold badges86 silver badges88 bronze badges

Comments

piRSquared · Accepted Answer · 2016-12-14 09:41:03Z

1

methon 1
stack + apply

df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1)

method 2

# fillna with (np.nan, np.nan)
df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2))
# construct MultiIndex
col = pd.MultiIndex.from_product([df.columns, [0, 1]])
# rip off of Nickil's pd.concat but using numpy
pd.DataFrame(
    np.hstack(
        [np.array(s.values.tolist()) \
         for _, s in df_.iteritems()]
    ), columns=col)

edited Dec 14, 2016 at 9:41

answered Dec 14, 2016 at 8:56

piRSquared

296k68 gold badges509 silver badges654 bronze badges

3 Comments

jezrael Over a year ago

hmmm, I think it is not good idea duplicates columns in second solution, what do you think?

piRSquared Over a year ago

@jezrael I decided to change it

jezrael Over a year ago

I add timings. It seems in larger dataframe my code is faster.

Collectives™ on Stack Overflow

Python - Pandas dataframe with tuples

3 Answers 3

Comments

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related