I have the following pandas data frame:
> print(tpl_subset)
>
Fullname           Infrasp              Authorship
Lilium abchasicum  NaN                  Baker
Lilium affine      NaN                  Schult. & Schult.f.
Lilium akkusianum  NaN                  Gämperle
Lilium albanicum   NaN                  Griseb.
Lilium albanicum   subsp. jankae        (A.Kern.) Nyman
Lilium albiflorum  NaN                  Hook.
Lilium album       NaN                  Houtt.
Lilium amabile     var. flavum          Y.N.Lee
Lilium amabile     var. immaculatum     T.B.Lee
Lilium amabile     var. kwangnungensis  Y.S.Kim & W.B.Lee
...                ...                  ...
I am trying to concatenate the first two columns into a new one only if the value of the second column is not NaN.
What I have been doing so far is simply concatenating the two columns while replacing NaN with an empty string:
tpl_subset['Tmp'] = tpl_subset['Fullname'] + ' ' + tpl_subset['Infrasp'].fillna('')
The problem is that I end up with unwanted trailing whitespace when the value of the second column is NaN (e.g. 'Lilium abchasicum' becomes 'Lilium abchasicum '), which forces me to take an extra step to strip it.
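To make the problem concrete, here is a minimal sketch on a two-row toy frame. The name `sample` is made up for illustration and stands in for `tpl_subset`; the values are taken from the table above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tpl_subset (hypothetical name, values from the table above).
sample = pd.DataFrame({
    "Fullname": ["Lilium abchasicum", "Lilium albanicum"],
    "Infrasp": [np.nan, "subsp. jankae"],
})

# Current approach: NaN becomes '', but the ' ' separator is added on every row.
naive = sample["Fullname"] + " " + sample["Infrasp"].fillna("")

print(naive.tolist())
# ['Lilium abchasicum ', 'Lilium albanicum subsp. jankae']
```

The first value keeps a trailing space, which is the artifact I want to avoid without a separate stripping pass.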
These steps will be repeated hundreds of times on datasets containing hundreds of thousands of rows each, so I'm looking for something efficient in terms of performance. Using a for loop with an if/else statement is not an option.
Q: Is there a more direct and efficient way to do this?
The desired column output is:
Tmp
Lilium abchasicum
Lilium affine
Lilium akkusianum
Lilium albanicum
Lilium albanicum subsp. jankae
Lilium albiflorum
Lilium album
Lilium amabile var. flavum
Lilium amabile var. immaculatum
Lilium amabile var. kwangnungensis
Edit:
A quick performance comparison between numpy.where() and radd(' ').fillna('') on the whole dataset (~1.2 million rows):
In:
import timeit
s = '''
import pandas
import numpy as np
tpl_data = pandas.read_csv('~/phd/Data/TPL/tpl_all_species.csv', sep = '\t')
tpl_fn = tpl_data['Fullname']
tpl_inf = tpl_data['Infrasp']
tpl_concat = tpl_fn + ' ' + tpl_inf
'''
tmp1 = "tpl_data['tmp1'] = np.where(tpl_inf.isnull(), tpl_fn, tpl_concat)"
tmp2 = "tpl_data['tmp2'] = (tpl_fn + tpl_inf.radd(' ').fillna(''))"
print('np.where():', timeit.Timer(tmp1, setup = s).repeat(repeat = 3, number = 10))
print('radd():', timeit.Timer(tmp2, setup = s).repeat(repeat = 3, number = 10))
Out:
np.where(): [0.7466984760000002, 0.7332379689999993, 0.7483021389999998]
radd(): [2.2832963809999995, 2.320076223000001, 2.299452007000003]
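Since the benchmark above depends on a local CSV file, here is a self-contained sketch of the same two approaches on a small toy frame. The frame `sample` and the short names `fn` and `inf` are assumptions for illustration only. It checks that both approaches give the same result; it says nothing about timing, which should be measured on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (hypothetical, values from the table above).
sample = pd.DataFrame({
    "Fullname": ["Lilium abchasicum", "Lilium albanicum", "Lilium amabile"],
    "Infrasp": [np.nan, "subsp. jankae", "var. flavum"],
})
fn = sample["Fullname"]
inf = sample["Infrasp"]

# Approach 1: concatenate everywhere (NaN rows stay NaN), then pick per row:
# the plain name where Infrasp is missing, the concatenation otherwise.
sample["tmp1"] = np.where(inf.isnull(), fn, fn + " " + inf)

# Approach 2: prepend the space to non-null values only (NaN stays NaN),
# replace the remaining NaN with '', then append to the name.
sample["tmp2"] = fn + inf.radd(" ").fillna("")

print(sample["tmp1"].tolist())
# ['Lilium abchasicum', 'Lilium albanicum subsp. jankae', 'Lilium amabile var. flavum']
print(sample["tmp1"].tolist() == sample["tmp2"].tolist())
# True
```

Neither variant leaves a trailing space, so no follow-up stripping step is needed.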