
I want to add total fields to this DataFrame:

import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6}
])

This code almost works:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    return t1, t2

df_test['cat1tot', 'cat2tot'] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

Except that it results in only 1 new column, which holds both totals as a tuple in each row.

And this code:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    return [t1, t2]

df_test[['cat1tot', 'cat2tot']] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

Results in: KeyError: "['cat1tot' 'cat2tot'] not in index"

I tried to resolve that with:

my_cols_list=['cat1tot','cat2tot']
df_test.reindex(columns=[*df_test.columns.tolist(), *my_cols_list], fill_value=0)

But that didn't solve the problem. So what am I missing?

  • Have you tried .withColumn(), possibly with .drop() to remove unneeded source columns? Or df_test.select((df_test.cat1a + df_test.cat1b).alias("cat1tot")), etc? Commented Feb 27, 2018 at 19:28
  • @9000 Those don't look like valid pandas functions to me... what version are you running? Commented Feb 27, 2018 at 19:39
  • @BradRhoads, are you looking to just add totals or more complex calculations which cannot be vectorised? Commented Feb 27, 2018 at 19:39
  • @cᴏʟᴅsᴘᴇᴇᴅ: Ah, mistook Pandas DataFrame for Spark DataFrame! Hence the confusion. Indeed, this won't work with Pandas. Commented Feb 28, 2018 at 1:10

3 Answers

2

It's generally not a good idea to use df.apply unless you absolutely must. The reason is that these operations are not vectorised, i.e. in the background there is a loop where each row is fed into a function as its own pd.Series.

This would be a vectorised implementation:

df_test['cat1tot'] = df_test['cat1a'] + df_test['cat1b']
df_test['cat2tot'] = df_test['cat2a'] + df_test['cat2b']

#    cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
# 0      3      2      4      3   1        5        7
# 1      7      5      9      6   2       12       15
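When there are many groups of columns to total, the same vectorised idea can be driven by a loop over the groups rather than over the rows. The following is only a sketch: the `groups` mapping is a hypothetical helper introduced for illustration, and it assumes you can list the source columns for each total up front.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6},
])

# Hypothetical mapping (not from the question): total column -> source columns.
groups = {
    'cat1tot': ['cat1a', 'cat1b'],
    'cat2tot': ['cat2a', 'cat2b'],
}

# One vectorised row-wise sum per group; the loop runs once per group,
# not once per row.
for total_col, source_cols in groups.items():
    df_test[total_col] = df_test[source_cols].sum(axis=1)

print(df_test)
```

The loop body executes once per total column, so with 10 groups it runs 10 times regardless of how many rows the DataFrame has.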

9 Comments

I'm assuming this was a toy example, otherwise this would have been my knee jerk response ;)
@COLDSPEED, I don't know yet. I'll happily delete this answer if it turns out user wants to do more complex non-parallelisable operations :)
I'm fairly new to Python. Can you explain a bit more? It seems like your solution would need to loop through the dataset twice instead of just once. Yes, this is a toy example. The real data has 10 sets of 7 columns, so I need to add tot1, tot2, ..., tot10.
Of course, happy to explain. It may look as though the 2 lines do twice the work, but they should be significantly faster than the single df.apply. Behind the scenes, pandas hands simple operations such as + and * to highly optimised NumPy routines, which means the calculation is not cycling through one row at a time. See also: Pandas - Explanation on apply function being slow
@BradRhoads this approach will be significantly faster than any approach involving .apply.
2

Return a Series object instead:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']

    return pd.Series([t1, t2])

And then,

df_test[['cat1tot', 'cat2tot']] = \
      df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

df_test

   cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
0      3      2      4      3   1        5        7
1      7      5      9      6   2       12       15

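This works because apply special-cases a Series return value: each returned Series becomes one row of the result, so apply hands back a DataFrame with one column per total, which can then be assigned to the two new columns.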
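If you would rather keep returning a plain tuple, `apply` can be asked to expand list-like results into columns. This is a sketch rather than part of the original answer, and it assumes pandas 0.23 or later, where `DataFrame.apply` accepts the `result_type` parameter; the `totals` variable name is introduced only for illustration.

```python
import pandas as pd

df_test = pd.DataFrame([
    {'id': 1, 'cat1a': 3, 'cat1b': 2, 'cat2a': 4, 'cat2b': 3},
    {'id': 2, 'cat1a': 7, 'cat1b': 5, 'cat2a': 9, 'cat2b': 6},
])

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']
    return t1, t2  # a plain tuple, as in the question

# result_type='expand' turns each returned tuple into a row of a DataFrame
# (assumes pandas >= 0.23).
totals = df_test.apply(add_total, axis=1, result_type='expand')
totals.columns = ['cat1tot', 'cat2tot']

# Attach the new columns by index alignment.
df_test = df_test.join(totals)
print(df_test)
```

Like the Series version, this still calls the function once per row, so it is best kept for calculations that cannot be vectorised.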

1 Comment

This works and is the closest to what I was trying to do.
1

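How about unpacking the result with zip, using the tuple-returning add_total from the question? Without zip, the unpacking iterates over the rows rather than over the two totals.

df_test['cat1tot'], df_test['cat2tot'] = \
    zip(*df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total, axis=1))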
