
This is perfectly legal in Python:

In [1]: 'abc' + 'def'
Out[1]: 'abcdef'

If I have an all-text pandas DataFrame, like the example below:

In [2]: df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                          columns=['C1','C2','C3','C4'])
        df.loc[[0,2], ['C2', 'C3']] = np.nan
        df
Out[2]:     C1  C2  C3  C4
        0   a   NaN NaN d
        1   e   f   g   h
        2   i   NaN NaN l

Is it possible to do the same with columns of the above DataFrame? Something like:

In [3]: df.apply(+, axis=1) # Or
        df.sum(axis=1)

Note that neither of the statements above works. Using .str.cat() in a loop is easy, but I am looking for something better.


Expected output is:

Out[3]:    C
        0  ad
        1  efgh
        2  il
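For reference, the .str.cat() loop alluded to above can be written as a fold over the columns; this is a sketch of that baseline (na_rep='' makes the NaN cells contribute an empty string instead of propagating through the row):

```python
import functools

import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# Fold the remaining columns into the first one with str.cat.
result = functools.reduce(
    lambda acc, col: acc.str.cat(df[col], na_rep=''),
    df.columns[1:],
    df[df.columns[0]],
)
print(result.tolist())  # ['ad', 'efgh', 'il']
```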

3 Answers


You could do

df.fillna('').sum(axis=1)

Of course, this assumes that your dataframe is made up only of strings and NaNs.
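As a quick check on the example frame from the question (a sketch; row-wise sum over an all-string object frame falls back to string concatenation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# NaN -> '' first, so summing across a row concatenates the strings.
out = df.fillna('').sum(axis=1)
print(out.tolist())  # ['ad', 'efgh', 'il']
```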


1 Comment

I assume my dataframe is made up of only strings and NaNs, I hope it is correct! :-P

Option 1
stack

I wanted to add this for demonstration. We don't have to accept the rectangular nature of the dataframe; we can use stack instead. When we do, stack drops NaN by default, leaving us with a vector of strings indexed by a pd.MultiIndex. We can then group by the first level of that pd.MultiIndex (the original row indices) and sum:

df.stack().groupby(level=0).sum()

0      ad
1    efgh
2      il
dtype: object

Option 2
Use masked arrays: np.ma.masked_array
I was motivated by @jezrael to post a faster solution (-:

pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)

0      ad
1    efgh
2      il
dtype: object

Timing

df = pd.concat([df]*1000).reset_index(drop=True)

%%timeit
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)

1000 loops, best of 3: 860 µs per loop

%timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))

1000 loops, best of 3: 1.33 ms per loop

2 Comments

Alternate approaches are always welcome, my friend. We cannot guarantee one size fits all.
@Kartik I couldn't agree more (-:

A slightly faster solution is to convert to a NumPy array via .values and then use numpy.sum:

#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
#print (df)

In [49]: %timeit (df.fillna('').sum(axis=1))
100 loops, best of 3: 4.08 ms per loop

In [50]: %timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.49 ms per loop

In [51]: %timeit (pd.Series(np.sum(df.fillna('').values, axis=1), index=df.index))
1000 loops, best of 3: 1.5 ms per loop
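Spelled out on the small example frame from the question, the last variant looks like this (a sketch; .values yields an object ndarray, so np.sum along axis=1 reduces each row with +, i.e. string concatenation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1', 'C2', 'C3', 'C4'])
df.loc[[0, 2], ['C2', 'C3']] = np.nan

# Fill NaN with '', drop to a raw object array, reduce each row with np.sum.
vals = df.fillna('').values
out = pd.Series(np.sum(vals, axis=1), index=df.index)
print(out.tolist())  # ['ad', 'efgh', 'il']
```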

1 Comment

hmmm, nice one.
