How to append columns based on other column values to pandas dataframe

Question

I have the following problem: I want to append columns to a dataframe. These columns are the unique values found in another column of this dataframe, each filled with the number of occurrences of that value in the given row. It looks like this:

df:

   Column1  Column2
0     1       a,b,c
1     2       a,e
2     3       a
3     4       c,f
4     5       c,f

What I am trying to get is:

    Column1  Column2  a  b  c  e  f
0     1       a,b,c   1  1  1
1     2       a,e     1        1
2     3       a       1
3     4       c,f           1     1
4     5       c,f           1     1

(the empty spaces can be nan or 0, it matters not.)

I have now written some code to achieve this, but instead of appending columns, it appends rows, so that my output looks like this:

        Column1  Column2
    0     1       a,b,c
    1     2       a,e
    2     3       a
    3     4       c,f
    4     5       c,f
    a     1        1
    b     1        1
    c     1        1
    e     1        1
    f     1        1

The code looks like this:

def NewCols(x):
    for i, value in df['Column2'].iteritems():
        listi=value.split(',')
        for value in listi:
            string = value
            x[string]=list.count(string)
    return x

df1=df.apply(NewCols)

What I am trying to do here is iterate through each row of the dataframe and split the string (e.g. a,b,c) contained in Column2 at the commas, so that the variable listi is a list containing the separated string values. For each of these values I then want to make a new column and fill it with the number of occurrences of that value in listi. I am confused why the code appends rows instead of columns. Does somebody know why, and how I can correct that?

DSM · Accepted Answer · 2015-10-27 16:29:18Z

4

While we could do this using get_dummies, we can also cheat and use pd.value_counts directly:

>>> import pandas as pd
>>> df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Column2': {0: 'a,b,c', 1: 'a,e', 2: 'a', 3: 'c,f', 4: 'c,f'}})
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1

Step-by-step, we have

>>> df.Column2.str.split(",")
0    [a, b, c]
1       [a, e]
2          [a]
3       [c, f]
4       [c, f]
dtype: object
>>> df.Column2.str.split(",").apply(pd.value_counts)
    a   b   c   e   f
0   1   1   1 NaN NaN
1   1 NaN NaN   1 NaN
2   1 NaN NaN NaN NaN
3 NaN NaN   1 NaN   1
4 NaN NaN   1 NaN   1
>>> df.Column2.str.split(",").apply(pd.value_counts).fillna(0)
   a  b  c  e  f
0  1  1  1  0  0
1  1  0  0  1  0
2  1  0  0  0  0
3  0  0  1  0  1
4  0  0  1  0  1
>>> df.join(df.Column2.str.split(",").apply(pd.value_counts).fillna(0))
   Column1 Column2  a  b  c  e  f
0        1   a,b,c  1  1  1  0  0
1        2     a,e  1  0  0  1  0
2        3       a  1  0  0  0  0
3        4     c,f  0  0  1  0  1
4        5     c,f  0  0  1  0  1
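
For comparison, here is a minimal sketch of the get_dummies route mentioned at the top of this answer. It assumes the same frame as in the question and uses Series.str.get_dummies with sep=",". As far as I know, recent pandas releases deprecate the top-level pd.value_counts, so this variant may also be the safer one on a current install. The names dummies and result are only illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "Column1": [1, 2, 3, 4, 5],
    "Column2": ["a,b,c", "a,e", "a", "c,f", "c,f"],
})

# Split each cell on "," and build one 0/1 indicator column per distinct token.
dummies = df["Column2"].str.get_dummies(sep=",")

# Attach the indicator columns to the original frame (aligned on the index).
result = df.join(dummies)
print(result)
```

Note that this produces indicators rather than counts: a token that appears twice in one cell would still give 1. For the data in the question the two coincide.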


2 Comments

sequence_hard Over a year ago

Works perfectly, thank you for the detailed explanation, I get it :-).

IanS Over a year ago

Why is using value_counts rather than get_dummies considered cheating? :)

BrenBarn · Accepted Answer · 2015-10-27 16:24:00Z

2

When you use apply on a DataFrame, it calls your function once for each column, with that column as an argument. So x in your NewCols will be set to a single column (a Series). When you do x[string] = list.count(string), you are adding a new labelled entry to that column, which shows up as a new row. Since apply is called for each column, you wind up appending the values to both columns in this way.
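
A small sketch may make this easier to see. It assumes the frame from the question; the helper probe and the list seen are made up for illustration. The first part records what DataFrame.apply hands to the function, and the second part shows that assigning to a new label on a Series enlarges its index, which is what ends up looking like extra rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3, 4, 5],
    "Column2": ["a,b,c", "a,e", "a", "c,f", "c,f"],
})

seen = []  # records the type and name of whatever apply passes in

def probe(x):
    # With the default axis, x is one whole column (a Series), not a row.
    seen.append((type(x).__name__, x.name))
    return x

df.apply(probe)
print(seen)

# Assigning to a label that does not exist yet adds an entry to the Series.
col = df["Column2"].copy()
col["a"] = 1
print(col)
```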

apply is not the right choice when your computation depends only on the values of a single column. Instead, use map. In this case, what you need to do is write a NewCol function that accepts a single Column2 value and returns the data for a single row. You can return this as a dict, or, handily, a dict-like object such as a collections.Counter. Then you need to wrap this new row data into a DataFrame and attach it column-wise to your existing data using concat. Here is an example:

import collections
import pandas

def NewCols(val):
    return collections.Counter(val.split(','))

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.map(NewCols))], axis=1)
   Column1 Column2   a   b   c   e   f
0        1   a,b,c   1   1   1 NaN NaN
1        2     a,e   1 NaN NaN   1 NaN
2        3       a   1 NaN NaN NaN NaN
3        4     c,f NaN NaN   1 NaN   1
4        5     c,f NaN NaN   1 NaN   1

For this particular computation, you actually don't need to write your own function at all, because pandas has split built in as an operation under the .str method accessor. So you can do this:

>>> pandas.concat([df, pandas.DataFrame.from_records(df.Column2.str.split(',').map(collections.Counter))], axis=1)
   Column1 Column2   a   b   c   e   f
0        1   a,b,c   1   1   1 NaN NaN
1        2     a,e   1 NaN NaN   1 NaN
2        3       a   1 NaN NaN NaN NaN
3        4     c,f NaN NaN   1 NaN   1
4        5     c,f NaN NaN   1 NaN   1
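
If you would rather have zeros than NaN, or your frame does not use the default 0..n-1 index, a slightly more defensive variant is sketched below. It is only one way to do it: it builds the Counters with a list comprehension, passes a plain list to the DataFrame constructor together with the original index, and so does not depend on how from_records treats a Series. The names row_counts, counts and result are illustrative.

```python
import collections
import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 2, 3, 4, 5],
    "Column2": ["a,b,c", "a,e", "a", "c,f", "c,f"],
})

# One Counter per row, counting the comma-separated tokens in Column2.
row_counts = [collections.Counter(value.split(",")) for value in df["Column2"]]

# A list of dict-like objects becomes one row each; reuse df's index so the
# new columns line up with the original rows.
counts = pd.DataFrame(row_counts, index=df.index)

# Tokens missing from a row come out as NaN; turn them into integer zeros.
counts = counts.fillna(0).astype(int)

result = pd.concat([df, counts], axis=1)
print(result)
```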


2 Comments

sequence_hard Over a year ago

Thanks for the detailed reply! However, the code raises a key error for me: KeyError: 0L. Any idea what the cause could be?

BrenBarn Over a year ago

@sequence_hard: Not sure, I can't reproduce that error.

Roxana · Accepted Answer · 2015-10-27 16:55:42Z

0

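You could use something like:

import pandas as pd
import sklearn.feature_extraction.text

# binary=True yields 0/1 indicators; the token pattern also accepts single-character tokens
vect = sklearn.feature_extraction.text.CountVectorizer(binary=True, token_pattern=u'(?u)\\b\\w+\\b')
df = ...
v = [a for a in df['Column2']]
# newer scikit-learn releases use get_feature_names_out() in place of get_feature_names()
new_df = df.combine_first(pd.DataFrame(vect.fit_transform(v).todense(), columns=vect.get_feature_names_out()))
print(new_df)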

Cheers!


1 Comment

jkalden Over a year ago

You can improve your answer by annotating the code!
