3

I have this kind of dataframe:

      A       B       C       D
0   (a,b)   (c,d)   (e,f)   (g,h)
1   (a,b)   (c,d)   (e,f)    NaN
2   (a,b)    NaN    (e,f)    NaN
3   (a,b)    NaN     NaN     NaN

so in each cell there is a tuple, and I want to make it to be like so:

  |    A     |     B     |     C     |     D
0 |  a |  b  |  c  |  d  |  e  |  f  |  g  |  h
1 |  a |  b  |  c  |  d  |  e  |  f  | NaN | NaN
2 |  a |  b  | NaN | NaN |  e  |  f  | NaN | NaN
3 |  a |  b  | NaN | NaN | NaN | NaN | NaN | NaN

so for example, in the column A, there are two columns inside of it.

Thank you.

1
  • why don't you want to create two columns per letter (eg. A1 and A2) ? Commented Dec 14, 2016 at 8:54

3 Answers 3

4

You can use stack with DataFrame.from_records and then reshape with unstack, swaplevel for change levels in MultiIndex in columns and last sort columns by sort_index:

stacked = df.stack()
df1 = pd.DataFrame.from_records(stacked.tolist(), index = stacked.index)
        .unstack(1)
        .swaplevel(0, 1, 1)
        .sort_index(axis=1)
        .replace({None:np.nan})
print (df1)

   A       B         C         D     
   0  1    0    1    0    1    0    1
0  a  b    c    d    e    f    g    h
1  a  b    c    d    e    f  NaN  NaN
2  a  b  NaN  NaN    e    f  NaN  NaN
3  a  b  NaN  NaN  NaN  NaN  NaN  NaN

Last is possible remove MultiIndex from columns and create new column names:

stacked = df.stack()
df1 = pd.DataFrame.from_records(stacked.tolist(), index = stacked.index)
        .unstack(1)
        .swaplevel(0, 1, 1)
        .sort_index(1)
        .replace({None:np.nan})
df1.columns = ['{}{}'.format(col[0], col[1]) for col in df1.columns]
print (df1)
  A0 A1   B0   B1   C0   C1   D0   D1
0  a  b    c    d    e    f    g    h
1  a  b    c    d    e    f  NaN  NaN
2  a  b  NaN  NaN    e    f  NaN  NaN
3  a  b  NaN  NaN  NaN  NaN  NaN  NaN

Timings:

#len (df)=400

In [220]: %timeit (pir(df))
100 loops, best of 3: 3.45 ms per loop

In [221]: %timeit (jez(df))
100 loops, best of 3: 5.17 ms per loop

In [222]: %timeit (nick(df))
1 loop, best of 3: 231 ms per loop

In [223]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))
10 loops, best of 3: 152 ms per loop


#len (df)=4k

In [216]: %timeit (pir(df))
100 loops, best of 3: 16.5 ms per loop

In [217]: %timeit (jez(df))
100 loops, best of 3: 14.8 ms per loop

In [218]: %timeit (nick(df))
1 loop, best of 3: 2.34 s per loop

In [219]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))
1 loop, best of 3: 1.53 s per loop

Code for timings:

df = pd.DataFrame({"A": [('a','b'),('a','b'),('a','b'),('a','b')], 
                   'B': [('c','d'),('c','d'), np.nan,np.nan], 
                   'C':[('e','f'),('e','f'),('e','f'),np.nan],
                   'D':[('g','h'),np.nan,np.nan,np.nan]})

df = pd.concat([df]*1000).reset_index(drop=True)
print (df)

def jez(df):
    stacked = df.stack()
    return pd.DataFrame.from_records(stacked.tolist(), index = stacked.index).unstack(1).swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan})


print (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan}))

def nick(df):
    cols = df.columns.values.tolist()
    return pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols)

def pir(df):
    # fillna with (np.nan, np.nan)
    df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2))
    # construct MultiIndex
    col = pd.MultiIndex.from_product([df.columns, [0, 1]])
    # rip off of Nickil's pd.concat but using numpy
    return pd.DataFrame(np.hstack([np.array(s.values.tolist()) for _, s in df_.iteritems()]), columns=col)


print (jez(df))
print (nick(df))
print (pir(df))
Sign up to request clarification or add additional context in comments.

Comments

2

Split the tuples present in every series into individual elements using apply. Then, concatenate all such columns together columnwise and providing the headers same as the original DF using the keys argument.

cols = df.columns.values.tolist()
pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols)

enter image description here

Comments

1

methon 1
stack + apply

df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1)

enter image description here

method 2

# fillna with (np.nan, np.nan)
df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2))
# construct MultiIndex
col = pd.MultiIndex.from_product([df.columns, [0, 1]])
# rip off of Nickil's pd.concat but using numpy
pd.DataFrame(
    np.hstack(
        [np.array(s.values.tolist()) \
         for _, s in df_.iteritems()]
    ), columns=col)

enter image description here

3 Comments

hmmm, I think it is not good idea duplicates columns in second solution, what do you think?
@jezrael I decided to change it
I add timings. It seems in larger dataframe my code is faster.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.