Concatenating two columns in pandas dataframe without adding extra spaces at the end when the second column contains NaN/empty strings

Question

I have the following pandas data frame:

> print(tpl_subset)
>
         Fullname                 Infrasp                   Authorship
Lilium abchasicum                     NaN                        Baker
    Lilium affine                     NaN          Schult. & Schult.f.
Lilium akkusianum                     NaN                     Gämperle
 Lilium albanicum                     NaN                      Griseb.
 Lilium albanicum           subsp. jankae              (A.Kern.) Nyman
Lilium albiflorum                     NaN                        Hook.
     Lilium album                     NaN                       Houtt.
   Lilium amabile             var. flavum                      Y.N.Lee
   Lilium amabile        var. immaculatum                      T.B.Lee
   Lilium amabile     var. kwangnungensis            Y.S.Kim & W.B.Lee
              ...                     ...                          ...

I am trying to concatenate the first two columns into a new one, appending the second column only when its value is not NaN.

What I have been doing so far is simply concatenating the two columns while replacing NaN with an empty string.

tpl_subset['Tmp'] = tpl_subset['Fullname'] + ' ' + tpl_subset['Infrasp'].fillna('')

The problem is that I end up with unwanted trailing whitespace when the value of the second column is NaN (e.g. 'Lilium abchasicum' becomes 'Lilium abchasicum '), which forces me to take extra steps to remove it.
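A minimal reproduction of the problem, using two rows from the table above (the small data frame built here is only an illustrative stand-in for the real data):

```python
import numpy as np
import pandas as pd

# Two illustrative rows taken from the table above.
tpl_subset = pd.DataFrame({
    "Fullname": ["Lilium abchasicum", "Lilium albanicum"],
    "Infrasp": [np.nan, "subsp. jankae"],
})

# The separator is added on every row, so rows whose Infrasp is NaN
# end up with a trailing space once NaN is replaced by ''.
tmp = tpl_subset["Fullname"] + " " + tpl_subset["Infrasp"].fillna("")
print(tmp.tolist())
# ['Lilium abchasicum ', 'Lilium albanicum subsp. jankae']

# The extra clean-up pass that the question would like to avoid.
cleaned = tmp.str.rstrip()
print(cleaned.tolist())
# ['Lilium abchasicum', 'Lilium albanicum subsp. jankae']
```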

These steps will be repeated hundreds of times on datasets containing hundreds of thousands of rows each, so I'm looking for something efficient in terms of performance. Using a for loop with an if/else statement is not an option.

Is there an efficient and more direct way to do this?

The desired column output is:

                               Tmp
                 Lilium abchasicum
                     Lilium affine
                 Lilium akkusianum
                  Lilium albanicum
    Lilium albanicum subsp. jankae
                 Lilium albiflorum
                      Lilium album
        Lilium amabile var. flavum
   Lilium amabile var. immaculatum
Lilium amabile var. kwangnungensis

Edit:

A quick performance comparison between numpy.where() and radd(' ').fillna('') on the whole dataset (~1.2 million rows):

In:
import timeit

s = '''
import pandas
import numpy as np
tpl_data = pandas.read_csv('~/phd/Data/TPL/tpl_all_species.csv', sep = '\t')
tpl_fn = tpl_data['Fullname']
tpl_inf = tpl_data['Infrasp']
tpl_concat = tpl_fn + ' ' + tpl_inf
'''

tmp1 = "tpl_data['tmp1'] = np.where(tpl_inf.isnull(), tpl_fn, tpl_concat)"
tmp2 = "tpl_data['tmp2'] = (tpl_fn + tpl_inf.radd(' ').fillna(''))"

print('np.where():', timeit.Timer(tmp1, setup = s).repeat(repeat = 3, number = 10))
print('radd():', timeit.Timer(tmp2, setup = s).repeat(repeat = 3, number = 10))

Out:
np.where(): [0.7466984760000002, 0.7332379689999993, 0.7483021389999998]
radd(): [2.2832963809999995, 2.320076223000001, 2.299452007000003]
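
Note that the setup string above already computes tpl_concat, so the np.where() timing does not include the concatenation itself, while the radd() timing does. Since the CSV is not available to readers, here is a rough sketch of the same comparison on synthetic data, with both approaches timed end to end. The row count, the 50% share of missing values, and the function names are assumptions made for illustration; timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption: half the rows lack Infrasp).
n = 100_000
fn = pd.Series(["Lilium amabile"] * n)
inf = pd.Series(["var. flavum", np.nan] * (n // 2))

def with_where():
    # Concatenate everywhere, then pick Fullname alone where Infrasp is missing.
    return np.where(inf.isnull(), fn, fn + " " + inf)

def with_radd():
    # Prepend the space only to non-missing values, then fill the gaps with ''.
    return fn + inf.radd(" ").fillna("")

# Both approaches should produce identical values.
assert list(with_where()) == list(with_radd())

for func in (with_where, with_radd):
    best = min(timeit.repeat(func, repeat=3, number=5))
    print(f"{func.__name__}: {best:.3f} s for 5 runs")
```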

U13-Forward · Accepted Answer · 2019-06-26 11:05:46Z

3

Or use np.where:

df['Tmp'] = np.where(df['Infrasp'].isnull(), df['Fullname'], df['Fullname'] + ' ' + df['Infrasp'])
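
A self-contained version of the line above, run on a few rows from the question (the small df is an illustrative stand-in for the real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fullname": ["Lilium abchasicum", "Lilium albanicum", "Lilium amabile"],
    "Infrasp": [np.nan, "subsp. jankae", "var. flavum"],
})

# Concatenating with NaN yields NaN, so this is only valid where Infrasp exists.
joined = df["Fullname"] + " " + df["Infrasp"]

# Take Fullname alone where Infrasp is missing, the joined string otherwise.
df["Tmp"] = np.where(df["Infrasp"].isnull(), df["Fullname"], joined)
print(df["Tmp"].tolist())
# ['Lilium abchasicum', 'Lilium albanicum subsp. jankae', 'Lilium amabile var. flavum']
```

np.where evaluates both branches for every row and then selects between them, so the concatenation is still computed for the rows where it is later discarded.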

edited Jun 26, 2019 at 11:05

answered Jun 26, 2019 at 10:58


9 Comments

roganjosh Over a year ago

But that just ditches numpy/pandas and ends up as a for loop

U13-Forward Over a year ago

@roganjosh If jezrael hadn't posted before me, I would have used his solution :P

roganjosh Over a year ago

I don't think their solution is the most efficient.

U13-Forward Over a year ago

@roganjosh Haha, like me, you refer to jezrael as "their"

U13-Forward Over a year ago

@jezrael Better now?


jezrael · Accepted Answer · 2019-06-26 11:23:27Z

2

The first idea is to prepend a space from the right-hand side with Series.radd. The space is not added to missing values, which stay NaN and are then replaced with empty strings by fillna(''):

tpl_subset['Tmp'] = (tpl_subset['Fullname'] + tpl_subset['Infrasp'].radd(' ').fillna(''))
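
To make the two steps visible, here is the same expression split into intermediates on a two-row sample (the names prefixed and suffix are only illustrative):

```python
import numpy as np
import pandas as pd

tpl_subset = pd.DataFrame({
    "Fullname": ["Lilium albanicum", "Lilium albanicum"],
    "Infrasp": [np.nan, "subsp. jankae"],
})

# Step 1: ' ' + value for real strings; rows holding NaN stay NaN.
prefixed = tpl_subset["Infrasp"].radd(" ")

# Step 2: replace the remaining NaN with '' so that adding it changes nothing.
suffix = prefixed.fillna("")

tpl_subset["Tmp"] = tpl_subset["Fullname"] + suffix
print(tpl_subset["Tmp"].tolist())
# ['Lilium albanicum', 'Lilium albanicum subsp. jankae']
```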

Performance:

print (tpl_subset)
        Fullname     Infrasp
0         Lilium  abchasicum
1  Lilium affine         NaN

#200k rows
tpl_subset = pd.concat([tpl_subset] * 100000, ignore_index=True)


In [235]: %timeit tpl_subset['Tmp1'] = np.where(tpl_subset['Infrasp'].isnull(), tpl_subset['Fullname'], tpl_subset['Fullname'] + ' ' + tpl_subset['Infrasp'])
74.8 ms ± 7.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [236]: %timeit tpl_subset['Tmp2'] = (tpl_subset['Fullname'] + tpl_subset['Infrasp'].radd(' ').fillna(''))
63.2 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit tpl_subset['Tmp3'] = [f'{a} {b}' if b == b else a for a, b in zip(tpl_subset['Fullname'], tpl_subset['Infrasp'])]
72.4 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
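
The `b == b` test in the list comprehension works because NaN is not equal to itself, so the condition is False exactly for missing values. A small sketch of that check in isolation (join_name is a hypothetical helper, not part of pandas):

```python
def join_name(a, b):
    # NaN != NaN, so `b == b` is False only when b is NaN.
    # Assumption: missing values are float NaN. A None would compare
    # equal to itself and end up formatted into the string.
    return f"{a} {b}" if b == b else a

print(join_name("Lilium amabile", "var. flavum"))  # Lilium amabile var. flavum
print(join_name("Lilium album", float("nan")))     # Lilium album
```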

edited Jun 26, 2019 at 11:23

answered Jun 26, 2019 at 10:56


8 Comments

roganjosh Over a year ago

Wouldn't it be faster to use np.where() than string concatenation and then a .str method?

jezrael Over a year ago

@roganjosh - just tested and seems not.

roganjosh Over a year ago

So I'm seeing, and I'm confused by that :)

jezrael Over a year ago

@roganjosh - I think the reason is that np.where computes each Series separately; the explanation is here

roganjosh Over a year ago

I'm still not sure why it would work that way. This almost suggests that it's quicker to do string formatting on every row and then "roll back" on the issues, than do string formatting only when a condition is true. That seems wonky but I guess I have some research to do.

