1

I have the following pandas data frame:

print(tpl_subset)

         Fullname                 Infrasp                   Authorship
Lilium abchasicum                     NaN                        Baker
    Lilium affine                     NaN          Schult. & Schult.f.
Lilium akkusianum                     NaN                     Gämperle
 Lilium albanicum                     NaN                      Griseb.
 Lilium albanicum           subsp. jankae              (A.Kern.) Nyman
Lilium albiflorum                     NaN                        Hook.
     Lilium album                     NaN                       Houtt.
   Lilium amabile             var. flavum                      Y.N.Lee
   Lilium amabile        var. immaculatum                      T.B.Lee
   Lilium amabile     var. kwangnungensis            Y.S.Kim & W.B.Lee
              ...                     ...                          ...

I am trying to concatenate the first two columns into a new one only if the value of the second column is not NaN.

What I have been doing so far is simply concatenating the two columns while replacing NaN with an empty string.

tpl_subset['Tmp'] = tpl_subset['Fullname'] + ' ' + tpl_subset['Infrasp'].fillna('')

The problem is that I end up with unwanted trailing whitespace when the value of the second column is NaN (e.g. 'Lilium abchasicum' becomes 'Lilium abchasicum '), which forces me to take an extra step to strip it.

These steps will be repeated hundreds of times on datasets containing hundreds of thousands of rows each, so I'm looking for something efficient in terms of performance. Using a for loop with an if/else statement is not an option.

Question: is there a more direct and efficient way to do this?

The desired column output is:

                               Tmp
                 Lilium abchasicum
                     Lilium affine
                 Lilium akkusianum
                  Lilium albanicum
    Lilium albanicum subsp. jankae
                 Lilium albiflorum
                      Lilium album
        Lilium amabile var. flavum
   Lilium amabile var. immaculatum
Lilium amabile var. kwangnungensis

Edit:

A quick performance comparison between numpy.where() and radd(' ').fillna('') on the whole dataset (~1.2 million rows):

In:
import timeit

s = '''
import pandas
import numpy as np
tpl_data = pandas.read_csv('~/phd/Data/TPL/tpl_all_species.csv', sep = '\t')
tpl_fn = tpl_data['Fullname']
tpl_inf = tpl_data['Infrasp']
tpl_concat = tpl_fn + ' ' + tpl_inf
'''

tmp1 = "tpl_data['tmp1'] = np.where(tpl_inf.isnull(), tpl_fn, tpl_concat)"
tmp2 = "tpl_data['tmp2'] = (tpl_fn + tpl_inf.radd(' ').fillna(''))"

print('np.where():', timeit.Timer(tmp1, setup = s).repeat(repeat = 3, number = 10))
print('radd():', timeit.Timer(tmp2, setup = s).repeat(repeat = 3, number = 10))

Out:
np.where(): [0.7466984760000002, 0.7332379689999993, 0.7483021389999998]
radd(): [2.2832963809999995, 2.320076223000001, 2.299452007000003]

2 Answers

3

Or use np.where:

df['Tmp'] = np.where(df['Infrasp'].isnull(), df['Fullname'], df['Fullname'] + ' ' + df['Infrasp'])
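As a minimal, self-contained sketch of the line above: the small frame below is made up from a few rows of the question's sample, and `df` stands in for `tpl_subset`. Note that both branches passed to np.where are computed for every row before the selection happens; the condition only decides which result is kept.

```python
import numpy as np
import pandas as pd

# Illustrative data only: three rows borrowed from the question's sample.
df = pd.DataFrame({
    "Fullname": ["Lilium abchasicum", "Lilium albanicum", "Lilium amabile"],
    "Infrasp": [np.nan, "subsp. jankae", "var. flavum"],
})

# Where Infrasp is missing, keep Fullname as-is;
# otherwise join the two columns with a single space.
df["Tmp"] = np.where(
    df["Infrasp"].isnull(),
    df["Fullname"],
    df["Fullname"] + " " + df["Infrasp"],
)

print(df["Tmp"].tolist())
```

No trailing space is produced for the rows where Infrasp is NaN, so no later strip step should be needed.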

Comments

But that just ditches numpy/pandas and ends up as a for loop
@roganjosh If jezrael hadn't posted before me, I would have used his solution :P
I don't think their solution is the most efficient.
@roganjosh Haha like me you call that jezrael their
@jezrael Better now?
2

The first idea is to prepend a space to Infrasp with Series.radd. The space is not added to missing values, so they stay NaN and fillna('') then turns them into empty strings:

tpl_subset['Tmp'] = (tpl_subset['Fullname'] + tpl_subset['Infrasp'].radd(' ').fillna(''))
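To make the intermediate steps visible, here is a small sketch that splits the one-liner into stages. The two-row Series and the names `with_space` and `suffix` are illustrative assumptions, not part of the original data or answer.

```python
import numpy as np
import pandas as pd

# Illustrative data only: two rows borrowed from the question's sample.
fullname = pd.Series(["Lilium abchasicum", "Lilium albanicum"])
infrasp = pd.Series([np.nan, "subsp. jankae"])

# radd(' ') computes ' ' + value; missing values are skipped and stay NaN.
with_space = infrasp.radd(" ")

# Replace the remaining NaN with an empty string so nothing is appended.
suffix = with_space.fillna("")

# Plain concatenation now yields no trailing space for missing Infrasp.
tmp = fullname + suffix

print(tmp.tolist())
```

Because the space travels with the Infrasp value rather than with Fullname, rows without an Infrasp get an empty suffix instead of a lone space.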

Performance:

print (tpl_subset)
        Fullname     Infrasp
0         Lilium  abchasicum
1  Lilium affine         NaN

#200k rows
tpl_subset = pd.concat([tpl_subset] * 100000, ignore_index=True)


In [235]: %timeit tpl_subset['Tmp1'] = np.where(tpl_subset['Infrasp'].isnull(), tpl_subset['Fullname'], tpl_subset['Fullname'] + ' ' + tpl_subset['Infrasp'])
74.8 ms ± 7.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [236]: %timeit tpl_subset['Tmp2'] = (tpl_subset['Fullname'] + tpl_subset['Infrasp'].radd(' ').fillna(''))
63.2 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit tpl_subset['Tmp3'] = [f'{a} {b}' if b == b else a for a, b in zip(tpl_subset['Fullname'], tpl_subset['Infrasp'])]
72.4 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
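The `b == b` test in the Tmp3 timing relies on the fact that a float NaN never compares equal to itself, so the condition is False exactly for NaN. A short sketch follows; the helper name `join_name` and the sample lists are hypothetical. One caveat worth keeping in mind: the check only catches float NaN, so a `None` would pass it and be formatted as the text "None".

```python
import numpy as np

def join_name(a, b):
    """Join a and b with a space unless b is a float NaN (NaN != NaN)."""
    return f"{a} {b}" if b == b else a

# Illustrative data only: two rows borrowed from the question's sample.
names = ["Lilium abchasicum", "Lilium albanicum"]
infra = [np.nan, "subsp. jankae"]

result = [join_name(a, b) for a, b in zip(names, infra)]
print(result)

# None is equal to itself, so it is NOT treated as missing here.
print(join_name("Lilium album", None))
```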

Comments

Wouldn't it be faster to use np.where() than string concatenation and then a .str method?
@roganjosh - just tested and seems not.
So I'm seeing, and I'm confused by that :)
@roganjosh - I think the reason is that np.where computes each series separately; an explanation is here
I'm still not sure why it would work that way. This almost suggests that it's quicker to do string formatting on every row and then "roll back" on the issues, than do string formatting only when a condition is true. That seems wonky but I guess I have some research to do.
