
I have a column in a DataFrame (read from a CSV) that contains comma-separated values. I'd like to split this column into multiple columns.

The problem is an old one, and has been discussed here before, but there is one peculiarity: a single entry may contain anywhere from 0 to n comma-separated values. An example:

df.head():

i: vals   | sth_else 
---------------------
1: a,b,c  | ba
2: a,d    | be
3:        | bi
4: e,a,c  | bo
5: e      | bu

I'd like the following output (or similar, e.g. True/False):

i : a | b | c | d | e |  sth_else 
-----------------------------------
1:  1 | 1 | 1 | 0 | 0 | ba
2:  1 | 0 | 0 | 1 | 0 | be
3:  0 | 0 | 0 | 0 | 0 | bi
4:  1 | 0 | 1 | 0 | 1 | bo
5:  0 | 0 | 0 | 0 | 1 | bu

I'm currently experimenting with the Series.str.split and Series.to_dict functions, but without any satisfactory results (it always raises ValueError: arrays must all be same length).

Also, I always try to find elegant solutions that are still easy to understand when revisited a couple of months later ;). In any case, suggestions are highly appreciated!

Here is the dummy.csv for testing.

vals;sth_else 
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu
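
It can be loaded with the semicolon separator, e.g.:

import pandas as pd

df = pd.read_csv('dummy.csv', sep=';')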
    For those who will stumble upon the answers below, I think the answer here is strictly superior: stackoverflow.com/questions/28121682/… Commented Jul 3, 2016 at 4:08
  • @AlexPetralia Indeed so! That is the real pandas way. Why don't you add it as an answer here as well? Commented Jul 4, 2016 at 7:55
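
For reference, the approach in that link boils down to Series.str.get_dummies, which splits on the separator and one-hot encodes in a single step; a minimal sketch against the df loaded above (column names as in the question):

# empty/NaN rows come out as all zeros, matching the desired output
dummies = df['vals'].str.get_dummies(sep=',')
result = dummies.join(df['sth_else'])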

2 Answers

import pandas as pd
from io import StringIO  # Python 3; on Python 2.7 use: from StringIO import StringIO

# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'

df = pd.read_csv(StringIO(csv_buffer), sep=';')

Out[58]: 
    vals sth_else
0  a,b,c       ba
1    a,d       be
2    NaN       bi
3  e,a,c       bo
4      e       bu

# processing
# ==================================================================
# split each row's comma-separated string into one sub-row per value,
# giving a Series with a two-level index: (original row, position in the list)
def func(group):
    return pd.Series(group.vals.str.split(',').values[0], name='vals')

ser = df.groupby(level=0).apply(func)

Out[60]: 
0  0      a
   1      b
   2      c
1  0      a
   1      d
2  0    NaN
3  0      e
   1      a
   2      c
4  0      e
Name: vals, dtype: object


# use get_dummies, and then aggregate for each column of a b c d e to be its max (max is always 1 in this case)
pd.get_dummies(ser)

Out[85]: 
     a  b  c  d  e
0 0  1  0  0  0  0
  1  0  1  0  0  0
  2  0  0  1  0  0
1 0  1  0  0  0  0
  1  0  0  0  1  0
2 0  0  0  0  0  0
3 0  0  0  0  0  1
  1  1  0  0  0  0
  2  0  0  1  0  0
4 0  0  0  0  0  1

# group by the outer index level [0,1,2,3,4] and reduce each inner group
# from multiple rows to a single row by taking the column-wise max
df_dummies = pd.get_dummies(ser).groupby(level=0).max()

Out[64]: 
   a  b  c  d  e
0  1  1  1  0  0
1  1  0  0  1  0
2  0  0  0  0  0
3  1  0  1  0  1
4  0  0  0  0  1


df_dummies['sth_else'] = df.sth_else

Out[67]: 
   a  b  c  d  e sth_else
0  1  1  1  0  0       ba
1  1  0  0  1  0       be
2  0  0  0  0  0       bi
3  1  0  1  0  1       bo
4  0  0  0  0  1       bu
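
As a side note, on newer pandas versions (0.25+), Series.explode collapses the custom groupby/apply above into one chain; a sketch of the same pipeline under that version assumption:

# split to lists, explode to one row per value (the NaN row survives as NaN
# and becomes an all-zero dummy row), then collapse back to one row per index
dummies = pd.get_dummies(df.vals.str.split(',').explode()).groupby(level=0).max()
result = dummies.join(df.sth_else)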

3 Comments

That was fast! And looks quite nice, thanks! Just a little side note: you imported pandas as np - I guess it should be pd ;)
Nothing embarrassing, just a little confusion. I must say that is a very cool solution. But a couple of comments explaining what those two lines do would be great. I am still deciphering =)
@dmeu Just added a few comments and hope it's more clear now. :-)

This is very similar to another question today. As I said in that question, there may be a simple, elegant pandas way to do this, but I also find it convenient to simply create a new data frame and populate it by iterating over the original one in the following fashion:

#import and create your data
import pandas as pd
DF = pd.DataFrame({ 'vals'  : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                    'other' : ['ba', 'be', 'bi', 'bo', 'bu'] 
                  }, dtype = str)

Now create a new data frame with the other column from the DF as the index, and columns drawn from the unique characters found in the vals column of the DF:

# one column per unique letter appearing anywhere in DF.vals, all initialized to 0
New_DF = pd.DataFrame({col : 0 for col in 
                             set([letter for letter in ''.join([char for char in DF.vals.values]) 
                             if letter.isalpha()])},
                             index = DF.other)

In [51]: New_DF
Out[51]: 
       a  b  c  d  e
other               
ba     0  0  0  0  0
be     0  0  0  0  0
bi     0  0  0  0  0
bo     0  0  0  0  0
bu     0  0  0  0  0

Now simply iterate over the index of New_DF, slicing the original DF at that value, and iterate over the columns to see whether they appear in relevant_string:

for ind in New_DF.index:
    # string representation of this row's vals entry, e.g. "['a,b,c']"
    relevant_string = str(DF[DF.other == ind].vals.values)
    for col in list(New_DF.columns):
        if col in relevant_string:
            New_DF.loc[ind, col] += 1

The output looks like this:

In [54]: New_DF
Out[54]: 
       a  b  c  d  e
other               
ba     1  1  1  0  0
be     1  0  0  1  0
bi     0  0  0  0  0
bo     1  0  1  0  1
bu     0  0  0  0  1
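
One caveat: the substring test col in relevant_string only works because the labels here are single characters; with multi-character labels (say, a hypothetical 'ab' next to 'a') it would count false positives. A sketch of an exact, token-based check under that assumption:

for ind in New_DF.index:
    # split the raw entry on commas and test exact membership instead of substrings
    tokens = set(DF.loc[DF.other == ind, 'vals'].values[0].split(','))
    for col in New_DF.columns:
        New_DF.loc[ind, col] = int(col in tokens)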

3 Comments

Hey, also very fast! Thanks for your suggestion.
BTW @Woody Pride, which is the other similar question, so I can have a look?
