How to add multiple columns to a DataFrame?

Question

I want to add total fields to this DataFrame:

import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6}
])

This code almost works:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    return t1, t2

df_test['cat1tot', 'cat2tot'] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total, axis=1)

Except it results in only 1 new column, labelled with the tuple ('cat1tot', 'cat2tot') and holding both totals as a tuple in each row.

And this code:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    return [t1, t2]

df_test[['cat1tot', 'cat2tot']] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total, axis=1)

Results in: KeyError: "['cat1tot' 'cat2tot'] not in index"

I tried to resolve that with:

my_cols_list=['cat1tot','cat2tot']
df_test.reindex(columns=[*df_test.columns.tolist(), *my_cols_list], fill_value=0)

But that didn't solve the problem. So what am I missing?

Have you tried .withColumn(), possibly with .drop() to remove unneeded source columns? Or df_test.select((df_test.cat1a + df_test.cat1b).alias("cat1tot")), etc?
– 9000, Commented Feb 27, 2018 at 19:28
@9000 Those don't look like valid pandas functions to me... what version are you running?
– cs95, Commented Feb 27, 2018 at 19:39
@BradRhoads, are you looking to just add totals or more complex calculations which cannot be vectorised?
– jpp, Commented Feb 27, 2018 at 19:39
@cᴏʟᴅsᴘᴇᴇᴅ: Ah, mistook Pandas DataFrame for Spark DataFrame! Hence the confusion. Indeed, this won't work with Pandas.
– 9000, Commented Feb 28, 2018 at 1:10

jpp · Accepted Answer · 2018-02-27 19:35:20Z

2

It's generally not a good idea to use df.apply unless you absolutely must. The reason is that these operations are not vectorised, i.e. in the background there is a loop where each row is fed into a function as its own pd.Series.

This would be a vectorised implementation:

df_test['cat1tot'] = df_test['cat1a'] + df_test['cat1b']
df_test['cat2tot'] = df_test['cat2a'] + df_test['cat2b']

#    cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
# 0      3      2      4      3   1        5        7
# 1      7      5      9      6   2       12       15
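
With many groups of columns, the same vectorised idea can be written as a loop over the groups rather than one line per total. This is only a sketch: the `groups` mapping is a hypothetical helper, not part of the question, and it assumes you can list the source columns for each total.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6},
])

# Hypothetical mapping from each new total column to the columns it sums.
groups = {
    'cat1tot': ['cat1a', 'cat1b'],
    'cat2tot': ['cat2a', 'cat2b'],
}

for total_name, source_cols in groups.items():
    # sum(axis=1) adds across the listed columns for all rows at once.
    df_test[total_name] = df_test[source_cols].sum(axis=1)

print(df_test)
```

The loop runs once per group, not once per row, so each total is still computed as a whole-column operation.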


9 Comments

cs95 Over a year ago

I'm assuming this was a toy example, otherwise this would have been my knee-jerk response ;)

jpp Over a year ago

@COLDSPEED, I don't know yet. I'll happily delete this answer if it turns out user wants to do more complex non-parallelisable operations :)

Brad Rhoads Over a year ago

I'm fairly new to Python. Can you explain a bit more? It seems like your solution would need to loop through the dataset twice instead of just once. Yes, this is a toy example. The real data has 10 sets of 7 columns, so I need to add tot1, tot2, ..., tot10.

jpp Over a year ago

Of course, happy to explain. It may seem like you are doing "2 lines of work", but the 2 lines should be significantly faster than the single df.apply. In the background, simple operations such as + and * run as highly optimised calculations in NumPy (a highly efficient array library) over whole columns, which means the calculation is not cycling through one row at a time. See also: "Pandas - Explanation on apply function being slow".

juanpa.arrivillaga Over a year ago

@BradRhoads this approach will be significantly faster than any approach involving .apply.
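
To see the difference discussed in these comments for yourself, something like the following rough sketch could be used. The data is made up (5,000 rows of random small integers), and the helper names `row_wise` and `vectorised` are hypothetical. Timings vary by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: 5,000 rows of small integers, column names borrowed from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(5_000, 2)), columns=['cat1a', 'cat1b'])

def row_wise():
    # Calls the lambda once per row; each row arrives as its own pd.Series.
    return df.apply(lambda row: row['cat1a'] + row['cat1b'], axis=1)

def vectorised():
    # A single NumPy-backed addition over the two whole columns.
    return df['cat1a'] + df['cat1b']

# Both approaches should produce the same totals.
same_result = bool((row_wise() == vectorised()).all())
print('same result:', same_result)

# Total seconds for 3 runs of each approach.
print('apply:     ', timeit.timeit(row_wise, number=3))
print('vectorised:', timeit.timeit(vectorised, number=3))
```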


cs95 · Accepted Answer · 2018-02-27 19:35:27Z

2

Return a Series object instead:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']

    return pd.Series([t1, t2])

And then,

df_test[['cat1tot', 'cat2tot']] = \
      df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

df_test

   cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
0      3      2      4      3   1        5        7
1      7      5      9      6   2       12       15

This works because apply special-cases a Series return value: with axis=1, each returned Series becomes one row of a new DataFrame, so the result has two columns that can be assigned to the two new column names.
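
A small variation on the same idea, offered as a sketch and not as the only way to do it: if the returned Series is labelled, the DataFrame that apply builds already carries the new column names, and it can be joined back onto the original. The intermediate name `totals` is just for illustration.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6},
])

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    # The labels of the returned Series become the column names of the result.
    return pd.Series({'cat1tot': t1, 'cat2tot': t2})

# apply returns a DataFrame here: one row per input row, one column per label.
totals = df_test.apply(add_total, axis=1)

# join aligns on the index and appends the new columns.
df_test = df_test.join(totals)
print(df_test)
```

Recent pandas versions also accept result_type='expand' in DataFrame.apply, which expands list-like return values into columns. Check the documentation for your version before relying on it.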


1 Comment

Brad Rhoads Over a year ago

This works and is the closest to what I was trying to do.

mortysporty · Accepted Answer · 2018-02-27 19:29:20Z

1

How about:

df_test['cat1tot'], df_test['cat2tot'] =\
   df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)
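
One likely pitfall with unpacking like this: iterating over what apply returns yields one item per row, not one per output column, so the values may not land where intended. A common workaround, sketched here on the assumption that add_total returns a (t1, t2) tuple as in the question's first attempt, is to transpose the per-row tuples with zip(*...). The name `row_results` is only for illustration.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6},
])

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    return t1, t2

# A Series holding one (t1, t2) tuple per row.
row_results = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total, axis=1)

# zip(*...) regroups the per-row tuples into one tuple per output column.
df_test['cat1tot'], df_test['cat2tot'] = zip(*row_results)
print(df_test)
```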
