
I know there have been many questions on this topic, but still:
My input: as dataframe

 task                                            m_label
0  S101-10061  [Cecum Landmark, ICV, Comment, Appendiceal ori...
1  S101-10069  [Rectum RF, ICV, Cecum Landmark, TI, Comment, ...
2  S101-10078  [Appendiceal orifice, ICV, Cecum Landmark, Com...
3  S101-10088  [Cecum Landmark, ICV, Comment, Appendiceal ori...
4  S101-10100  [Transverse, Appendiceal orifice, ICV, Cecum L...
5  S101-10102  [Rectum RF, ICV, Cecum Landmark, Comment, TI, ...
6  S101-10133  [Rectum RF, Transverse, ICV, Cecum Landmark, C...
7  S101YGBgZ2                                          [Comment]

I want to split it like df.m_label.str.split("", expand=True), but it returns NaN. Maybe the problem is with df? I get it from a pandas Series via m_lab_task=data.groupby(['task'])['m_label'].unique(). So maybe it can be corrected in a previous step?
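The NaN result can be reproduced with a minimal sketch (column names taken from the question, data invented for illustration): pandas `.str` methods return NaN for any element that is not a string, and `groupby(...).unique()` produces arrays of labels, not strings.

```python
import pandas as pd

# a minimal frame whose m_label column holds lists/arrays,
# as groupby(...)['m_label'].unique() would produce
df = pd.DataFrame({
    'task': ['S101-10061', 'S101YGBgZ2'],
    'm_label': [['Cecum Landmark', 'ICV'], ['Comment']],
})

# .str methods silently yield NaN for non-string elements
split = df['m_label'].str.split(',', expand=True)
print(split)  # every cell is NaN
```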

Required output:

      task        m_label1             m_label2  m_label3        m_label4             m_label5    m_label6
0   S101-10061    Cecum Landmark       ICV       Comment         Appendiceal orifice
1   S101-10069    Rectum RF            ICV       Cecum Landmark  TI                   Comment     Transverse
2   S101-10078    Appendiceal orifice  ICV       Cecum Landmark  Comment              Transverse  Rectum RF
   
4 Comments

  • Why are some values enclosed in double quotes " and some not? Is m_label a string column, or does it hold Python lists? Commented Aug 12, 2021 at 7:02
  • Sorry for the confusion. Commented Aug 12, 2021 at 7:11
  • Do you want the values inside single quotes '' to be in a separate column? Commented Aug 12, 2021 at 7:13
  • Yes, I want to split the object in column m_label, which was created by pandas.groupby(), into separate "labels". Commented Aug 12, 2021 at 7:14

4 Answers


Use str.findall and pass a regex that captures everything enclosed in single quotes '', then apply pd.Series to convert the matches to columns:

# capture every label between single quotes; one column per match
df = df.set_index('task')['m_label'].str.findall(r"'(.*?)'").apply(pd.Series)
df.columns = [f'm_label{i + 1}' for i in df.columns]  # 1-based column names

OUTPUT:

                       m_label1             m_label2        m_label3               m_label4    m_label5    m_label6             m_label7  
task                                                                                                                                       
S101-10061       Cecum Landmark                  ICV         Comment    Appendiceal orifice         NaN         NaN                  NaN   
S101-10069            Rectum RF                  ICV  Cecum Landmark                     TI     Comment  Transverse                  NaN   
S101-10078  Appendiceal orifice                  ICV  Cecum Landmark                Comment  Transverse   Rectum RF                  NaN   
S101-10088       Cecum Landmark                  ICV         Comment    Appendiceal orifice         NaN         NaN                  NaN   
S101-10100           Transverse  Appendiceal orifice             ICV         Cecum Landmark     Comment         NaN                  NaN   
S101-10102            Rectum RF                  ICV  Cecum Landmark                Comment          TI  Transverse  Appendiceal orifice   
S101-10133            Rectum RF           Transverse             ICV         Cecum Landmark     Comment         NaN                  NaN   
S101YGBgZ2              Comment                  NaN             NaN                    NaN         NaN         NaN                  NaN   
                

If needed, you can reset the index later, and fillna('').
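Putting it all together — a minimal sketch, assuming m_label is a string column that looks like a printed list (the data below is invented for illustration):

```python
import pandas as pd

# m_label stored as strings containing single-quoted labels
df = pd.DataFrame({
    'task': ['S101-10061', 'S101YGBgZ2'],
    'm_label': ["['Cecum Landmark', 'ICV', 'Comment']", "['Comment']"],
})

out = (df.set_index('task')['m_label']
         .str.findall(r"'(.*?)'")   # capture text between single quotes
         .apply(pd.Series))         # one column per captured label
out.columns = [f'm_label{i + 1}' for i in range(out.shape[1])]
out = out.reset_index().fillna('')  # flat frame, blanks instead of NaN
print(out)
```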


6 Comments

I don't understand why it returns NaN for me. Maybe something in the preceding steps?
Now you have updated the question, and it looks like a Python list. Try df['m_label'].apply(type); if it's a list, then you don't need split/findall, just use df.set_index('task').apply(pd.Series)
task S101-10061 <class 'numpy.ndarray'> Name: m_label, dtype: object
You have a numpy array, so just skip the findall part and it should work fine, i.e. df.set_index('task').apply(pd.Series). findall is for string-type columns only, but you have a numpy array.
The command just returns me the same df, with 'task' as index and 'm_label' as a column containing everything. What's wrong? I followed it step by step. Is the problem in the previous steps? To get the pandas Series and convert it to a df I use these commands: df=data.groupby(['task'])['m_label'].unique() and df=pd.DataFrame(mltask); df=mltask.reset_index()
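The snag in this thread may be that df.set_index('task').apply(pd.Series) operates on whole columns and leaves the frame looking unchanged; selecting the m_label Series first is what expands each array into columns. A minimal sketch, assuming m_label holds numpy arrays (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'task': ['S101-10061', 'S101YGBgZ2'],
    'm_label': [np.array(['Cecum Landmark', 'ICV']), np.array(['Comment'])],
})

# select the Series first, THEN expand each array into columns
out = (df.set_index('task')['m_label']
         .apply(pd.Series)
         .add_prefix('m_label'))
print(out)
```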

When you convert the list into a dataframe, string data without separators will combine into a single value. To overcome this, insert the commas before converting into a dataframe, like this:

import pandas as pd
data={"task":["S101-10061","S101-10069","S101-10078","S101-10088","S101-10100","S101-10102","S101-10133","S101YGBgZ2"],
     "m_label":[['Cecum Landmark','ICV' ,'Comment' ,'Appendiceal orifice'],['Rectum RF','ICV','Cecum Landmark','TI','Comment','Transverse']
               ,['Appendiceal orifice' ,'ICV' ,'Cecum Landmark', 'Comment', 'Transverse','Rectum RF'],['Cecum Landmark', 'ICV', 'Comment', 'Appendiceal orifice'],
               ['Transverse' ,'Appendiceal orifice', 'ICV', 'Cecum Landmark', 'Comment'],['Rectum RF' ,'ICV' ,'Cecum Landmark', 'Comment' ,'TI' ,'Transverse','Appendiceal orifice'],
               ['Rectum RF', 'Transverse' ,'ICV' ,'Cecum Landmark', 'Comment'],['Comment']]}
data=pd.DataFrame(data)

The dataframe should look like this:

        task    m_label
0   S101-10061  [Cecum Landmark, ICV, Comment, Appendiceal ori...
1   S101-10069  [Rectum RF, ICV, Cecum Landmark, TI, Comment, ...
2   S101-10078  [Appendiceal orifice, ICV, Cecum Landmark, Com...
3   S101-10088  [Cecum Landmark, ICV, Comment, Appendiceal ori...
4   S101-10100  [Transverse, Appendiceal orifice, ICV, Cecum L...
5   S101-10102  [Rectum RF, ICV, Cecum Landmark, Comment, TI, ...
6   S101-10133  [Rectum RF, Transverse, ICV, Cecum Landmark, C...
7   S101YGBgZ2  [Comment]

Output code:

import numpy as np
data=pd.concat([data["task"],data["m_label"].apply(lambda x:pd.Series(x).add_prefix("m_label"))],axis=1).replace(np.nan," ")

   task        m_label0             m_label1             m_label2        m_label3             m_label4    m_label5    m_label6
0  S101-10061  Cecum Landmark       ICV                  Comment         Appendiceal orifice
1  S101-10069  Rectum RF            ICV                  Cecum Landmark  TI                   Comment     Transverse
2  S101-10078  Appendiceal orifice  ICV                  Cecum Landmark  Comment              Transverse  Rectum RF
3  S101-10088  Cecum Landmark       ICV                  Comment         Appendiceal orifice
4  S101-10100  Transverse           Appendiceal orifice  ICV             Cecum Landmark       Comment
5  S101-10102  Rectum RF            ICV                  Cecum Landmark  Comment              TI          Transverse  Appendiceal orifice
6  S101-10133  Rectum RF            Transverse           ICV             Cecum Landmark       Comment
7  S101YGBgZ2  Comment



Just to add something to ThePyGuy's answer: if you want to rename the columns "on the fly", you can use add_prefix().

df.set_index('task')['m_label'].str.findall(r"'(.*?)'").apply(pd.Series).add_prefix('m_label')

output:

Out[27]: 
                  m_label0 m_label1  ... m_label4    m_label5
task                                 ...                     
S101-10061  Cecum Landmark      ICV  ...      NaN         NaN
S101-10069       Rectum RF      ICV  ...  Comment  Transverse



Building on Mr. Brarath Narayan's code, you can shorten it as follows, without using numpy:

df = pd.concat([df['task'],df['m_label'].apply(pd.Series).add_prefix("m_label").fillna("")], axis = 1)

Output

   task        m_label0             m_label1             m_label2        m_label3             m_label4    m_label5    m_label6
0  S101-10061  Cecum Landmark       ICV                  Comment         Appendiceal orifice
1  S101-10069  Rectum RF            ICV                  Cecum Landmark  TI                   Comment     Transverse
2  S101-10078  Appendiceal orifice  ICV                  Cecum Landmark  Comment              Transverse  Rectum RF
3  S101-10088  Cecum Landmark       ICV                  Comment         Appendiceal orifice
4  S101-10100  Transverse           Appendiceal orifice  ICV             Cecum Landmark       Comment
5  S101-10102  Rectum RF            ICV                  Cecum Landmark  Comment              TI          Transverse  Appendiceal orifice
6  S101-10133  Rectum RF            Transverse           ICV             Cecum Landmark       Comment
7  S101YGBgZ2  Comment
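As a side note (not from the answer above): apply(pd.Series) is known to be slow on large frames. A commonly faster equivalent, assuming m_label holds plain lists (the data below is invented for illustration), builds the wide part directly from a list of lists:

```python
import pandas as pd

df = pd.DataFrame({
    'task': ['S101-10061', 'S101YGBgZ2'],
    'm_label': [['Cecum Landmark', 'ICV'], ['Comment']],
})

# pd.DataFrame on a list of lists pads short rows with NaN automatically
wide = pd.DataFrame(df['m_label'].tolist()).add_prefix('m_label').fillna('')
out = pd.concat([df['task'], wide], axis=1)
print(out)
```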

