
This is perfectly legal in Python:

In [1]: 'abc' + 'def'
Out[1]: 'abcdef'

If I have an all-text pandas DataFrame, like the example below:

In [2]: df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                          columns=['C1','C2','C3','C4'])
        df.loc[[0,2], ['C2', 'C3']] = np.nan
        df
Out[2]:     C1  C2  C3  C4
        0   a   NaN NaN d
        1   e   f   g   h
        2   i   NaN NaN l

Is it possible to do the same with columns of the above DataFrame? Something like:

In [3]: df.apply(+, axis=1) # Or
        df.sum(axis=1)

Note that neither of the statements above works. Using .str.cat() in a loop is easy, but I am looking for something better.


Expected output is:

Out[3]:    C
        0  ad
        1  efgh
        2  il
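For reference, the .str.cat() loop alluded to above can be written as a fold over the columns; this is a sketch of that baseline (na_rep='' makes the NaN cells contribute an empty string instead of propagating through the row):

```python
import functools

import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# Fold the remaining columns into the first one with str.cat.
result = functools.reduce(
    lambda acc, col: acc.str.cat(df[col], na_rep=''),
    df.columns[1:],
    df[df.columns[0]],
)
print(result.tolist())  # ['ad', 'efgh', 'il']
```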

3 Answers


You could do

df.fillna('').sum(axis=1)

Of course, this assumes that your dataframe is made up only of strings and NaNs.
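As a quick check on the example frame from the question (a sketch; row-wise sum over an all-string object frame falls back to string concatenation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# NaN -> '' first, so summing across a row concatenates the strings.
out = df.fillna('').sum(axis=1)
print(out.tolist())  # ['ad', 'efgh', 'il']
```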


1 Comment

I assume my dataframe is made up of only strings and NaNs, I hope it is correct! :-P

Option 1
stack

I wanted to add this for demonstration. We don't have to accept the rectangular nature of the dataframe; we can use stack instead. When we do, stack drops NaN by default, leaving us with a vector of strings indexed by a pd.MultiIndex. We can then group by the first level of that pd.MultiIndex (the original row indices) and sum:

df.stack().groupby(level=0).sum()

0      ad
1    efgh
2      il
dtype: object

Option 2
Use masked arrays: np.ma.masked_array
I was motivated by @jezrael to post a faster solution (-:

pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)

0      ad
1    efgh
2      il
dtype: object

Timing

df = pd.concat([df]*1000).reset_index(drop=True)

%%timeit
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)

1000 loops, best of 3: 860 µs per loop

%timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))

1000 loops, best of 3: 1.33 ms per loop

2 Comments

Alternate approaches are always welcome, my friend. We cannot guarantee one size fits all.
@Kartik I couldn't agree more (-:

A slightly faster solution is to convert to a NumPy array via .values and then use numpy.sum:

#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
#print (df)

In [49]: %timeit (df.fillna('').sum(axis=1))
100 loops, best of 3: 4.08 ms per loop

In [50]: %timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.49 ms per loop

In [51]: %timeit (pd.Series(np.sum(df.fillna('').values, axis=1), index=df.index))
1000 loops, best of 3: 1.5 ms per loop
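Spelled out on the small example frame from the question, the last variant looks like this (a sketch; .values yields an object ndarray, so np.sum along axis=1 reduces each row with +, i.e. string concatenation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# Fill NaN with '', drop to a raw object array, reduce each row with np.sum.
vals = df.fillna('').values
out = pd.Series(np.sum(vals, axis=1), index=df.index)
print(out.tolist())  # ['ad', 'efgh', 'il']
```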

1 Comment

hmmm, nice one.
