
I have a column in a DataFrame (read from a CSV) that contains comma-separated values. I'd like to split this column into multiple columns.

The problem is an old one, and has been discussed here before, but there is one peculiarity: a single entry may contain anywhere from 0 to n comma-separated values. An example:

df.head():

i: vals   | sth_else 
---------------------
1: a,b,c  | ba
2: a,d    | be
3:        | bi
4: e,a,c  | bo
5: e      | bu

I'd like the following output (or similar, e.g. True/False):

i : a | b | c | d | e |  sth_else 
-----------------------------------
1:  1 | 1 | 1 | 0 | 0 | ba
2:  1 | 0 | 0 | 1 | 0 | be
3:  0 | 0 | 0 | 0 | 0 | bi
4:  1 | 0 | 1 | 0 | 1 | bo
5:  0 | 0 | 0 | 0 | 1 | bu

I'm currently experimenting with the Series.str.split and Series.to_dict functions, but without any satisfactory results (it always raises ValueError: arrays must all be same length).

Also, I always try to find elegant solutions that are still easy to understand when revisited a couple of months later ;). In any case, suggestions are highly appreciated!

Here is the dummy.csv for testing.

vals;sth_else 
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu
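
It can be loaded with the semicolon separator, e.g.:

import pandas as pd

df = pd.read_csv('dummy.csv', sep=';')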
    For those who will stumble upon the answers below, I think the answer here is strictly superior: stackoverflow.com/questions/28121682/… Commented Jul 3, 2016 at 4:08
  • @AlexPetralia Indeed so! That is the real pandas way. Why don't you add it as an answer here as well? Commented Jul 4, 2016 at 7:55
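
For reference, the approach in that link boils down to Series.str.get_dummies, which splits on the separator and one-hot encodes in a single step; a minimal sketch against the df loaded above (column names as in the question):

# empty/NaN rows come out as all zeros, matching the desired output
dummies = df['vals'].str.get_dummies(sep=',')
result = dummies.join(df['sth_else'])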

2 Answers

import pandas as pd
from io import StringIO  # Python 3; on Python 2.7 use: from StringIO import StringIO

# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'

df = pd.read_csv(StringIO(csv_buffer), sep=';')

Out[58]: 
    vals sth_else
0  a,b,c       ba
1    a,d       be
2    NaN       bi
3  e,a,c       bo
4      e       bu

# processing
# ==================================================================
# split each row's comma-separated string into one sub-row per value,
# giving a Series with a two-level index: (original row, position in the list)
def func(group):
    return pd.Series(group.vals.str.split(',').values[0], name='vals')

ser = df.groupby(level=0).apply(func)

Out[60]: 
0  0      a
   1      b
   2      c
1  0      a
   1      d
2  0    NaN
3  0      e
   1      a
   2      c
4  0      e
Name: vals, dtype: object


# use get_dummies, and then aggregate for each column of a b c d e to be its max (max is always 1 in this case)
pd.get_dummies(ser)

Out[85]: 
     a  b  c  d  e
0 0  1  0  0  0  0
  1  0  1  0  0  0
  2  0  0  1  0  0
1 0  1  0  0  0  0
  1  0  0  0  1  0
2 0  0  0  0  0  0
3 0  0  0  0  0  1
  1  1  0  0  0  0
  2  0  0  1  0  0
4 0  0  0  0  0  1

# group by the outer index level [0,1,2,3,4] and reduce each inner group
# from multiple rows to a single row by taking the column-wise max
df_dummies = pd.get_dummies(ser).groupby(level=0).max()

Out[64]: 
   a  b  c  d  e
0  1  1  1  0  0
1  1  0  0  1  0
2  0  0  0  0  0
3  1  0  1  0  1
4  0  0  0  0  1


df_dummies['sth_else'] = df.sth_else

Out[67]: 
   a  b  c  d  e sth_else
0  1  1  1  0  0       ba
1  1  0  0  1  0       be
2  0  0  0  0  0       bi
3  1  0  1  0  1       bo
4  0  0  0  0  1       bu
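
As a side note, on newer pandas versions (0.25+), Series.explode collapses the custom groupby/apply above into one chain; a sketch of the same pipeline under that version assumption:

# split to lists, explode to one row per value (the NaN row survives as NaN
# and becomes an all-zero dummy row), then collapse back to one row per index
dummies = pd.get_dummies(df.vals.str.split(',').explode()).groupby(level=0).max()
result = dummies.join(df.sth_else)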

3 Comments

That was fast! And looks quite nice, thanks! Just a little side note: you imported pandas as np - I guess it should be pd ;)
Nothing embarrassing, just a little confusion. I must say that is a very cool solution. But a couple of comments explaining what those two lines do would be great. I am still deciphering =)
@dmeu Just added a few comments and hope it's more clear now. :-)

This is very similar to another question today. As I said in that question, there may be a simple, elegant pandas way to do this, but I also find it convenient to simply create a new data frame and populate it by iterating over the original one in the following fashion:

#import and create your data
import pandas as pd
DF = pd.DataFrame({ 'vals'  : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                    'other' : ['ba', 'be', 'bi', 'bo', 'bu'] 
                  }, dtype = str)

Now create a new data frame with the other column from the DF as the index, and columns drawn from the unique characters found in the vals column of the DF:

# one column per unique letter appearing anywhere in DF.vals, all initialized to 0
New_DF = pd.DataFrame({col : 0 for col in 
                             set([letter for letter in ''.join([char for char in DF.vals.values]) 
                             if letter.isalpha()])},
                             index = DF.other)

In [51]: New_DF
Out[51]: 
       a  b  c  d  e
other               
ba     0  0  0  0  0
be     0  0  0  0  0
bi     0  0  0  0  0
bo     0  0  0  0  0
bu     0  0  0  0  0

Now simply iterate over the index of New_DF, slicing the original DF at that value, and iterate over the columns to see whether they appear in relevant_string:

for ind in New_DF.index:
    # string representation of this row's vals entry, e.g. "['a,b,c']"
    relevant_string = str(DF[DF.other == ind].vals.values)
    for col in list(New_DF.columns):
        if col in relevant_string:
            New_DF.loc[ind, col] += 1

The output looks like this:

In [54]: New_DF
Out[54]: 
       a  b  c  d  e
other               
ba     1  1  1  0  0
be     1  0  0  1  0
bi     0  0  0  0  0
bo     1  0  1  0  1
bu     0  0  0  0  1
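
One caveat: the substring test col in relevant_string only works because the labels here are single characters; with multi-character labels (say, a hypothetical 'ab' next to 'a') it would count false positives. A sketch of an exact, token-based check under that assumption:

for ind in New_DF.index:
    # split the raw entry on commas and test exact membership instead of substrings
    tokens = set(DF.loc[DF.other == ind, 'vals'].values[0].split(','))
    for col in New_DF.columns:
        New_DF.loc[ind, col] = int(col in tokens)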

3 Comments

Hey, also very fast! Thanks for your suggestion.
BTW @Woody Pride, which is the other similar question, so I can have a look?
