How to evaluate and add string to numpy array element

Question

I have this piece of code that I am trying to optimize. It uses a list comprehension and works.

series1 = np.asarray(range(10)).astype(float)
series2 = series1[::-1]

ntup = zip(series1,series2)
[['', 't:'+str(series2)][series1 > series2] for series1,series2 in ntup ]
 #['', '', '', '', '', 't:4.0', 't:3.0', 't:2.0', 't:1.0', 't:0.0']

I am trying to use np.where() here. Is there a solution with numpy (one that does not consume the series)?

series1 = np.asarray(range(10)).astype(float)
series2 = series1[::-1]  

np.where(series1 >  series2 ,'t:'+ str(series2),'' )

The result is this:

array(['', '', '', '', '', 't:[ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.]',
       't:[ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.]',
       't:[ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.]',
       't:[ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.]',
       't:[ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.]'], 
      dtype='|S43')

Do you need those empty ones too, or is something like this okay: ['t:4.0', 't:3.0', 't:2.0', 't:1.0', 't:0.0']?
– Divakar, Commented Aug 31, 2016 at 18:34

Community · Accepted Answer · 2017-05-23 10:29:14Z

We can use a vectorized approach based on two functions:

np.core.defchararray.add, which prepends 't:' to the valid strings, and
np.where, which chooses per element between the prefixed string and the default empty string.

So, we would have an implementation like so -

np.where(series1>series2,np.core.defchararray.add('t:',series2.astype(str)),'')
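As a minimal, self-contained sketch of the line above: str(series2) turns the whole array into one string, whereas series2.astype(str) converts element by element, so np.where can pick one string per position. The sketch assumes np.char.add, the public alias of np.core.defchararray.add; the names labels and result are only illustrative.

```python
import numpy as np

series1 = np.arange(10).astype(float)
series2 = series1[::-1]

# astype(str) converts each element separately, unlike str(series2),
# which would produce a single string for the whole array.
labels = np.char.add('t:', series2.astype(str))

# Keep the label where series1 > series2, otherwise use an empty string.
result = np.where(series1 > series2, labels, '')
print(result.tolist())
```

Both series are left untouched, so they can be reused in further conditions.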

Boost it up!

To improve performance further, we can initialize an output array with the default empty strings, apply np.core.defchararray.add only to the elements selected by the mask series1>series2, and assign just those values into the output.

So, the modified version would look something like this -

mask = series1>series2
out = np.full(series1.size,'',dtype='U34')
out[mask] = np.core.defchararray.add('t:',series2[mask].astype(str))
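The same idea as a runnable sketch, under a couple of assumptions: it uses np.char.add (the public alias of np.core.defchararray.add), and instead of hard-coding 'U34' it takes the output dtype from the prefixed strings. For float64 input that width typically works out to 2 characters for 't:' plus the 32 that astype(str) allocates. The names converted and prefixed are illustrative only.

```python
import numpy as np

series1 = np.arange(10).astype(float)
series2 = series1[::-1]

mask = series1 > series2

# Only the selected elements are converted to strings and prefixed.
converted = series2[mask].astype(str)
prefixed = np.char.add('t:', converted)

# Start from empty strings; reusing prefixed.dtype guarantees that
# every label fits without truncation.
out = np.full(series1.size, '', dtype=prefixed.dtype)
out[mask] = prefixed
print(out.tolist())
```

Deriving the dtype this way is a design choice rather than a requirement; a fixed width such as 'U34' works as long as it is at least as wide as the longest label.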

Runtime test

Vectorized versions as functions :

def vectorized_app1(series1,series2):
    mask = series1>series2
    return np.where(mask,np.core.defchararray.add('t:',series2.astype(str)),'')

def vectorized_app2(series1,series2):
    mask = series1>series2
    out = np.full(series1.size,'',dtype='U34')
    out[mask] = np.core.defchararray.add('t:',series2[mask].astype(str))
    return out

Timings on a bigger dataset -

In [283]: # Setup input arrays
     ...: series1 = np.asarray(range(10000)).astype(float)
     ...: series2 = series1[::-1]
     ...: 

In [284]: %timeit [['', 't:'+str(s2)][s1 > s2] for s1,s2 in zip(series1, series2)]
10 loops, best of 3: 32.1 ms per loop # OP/@hpaulj's soln

In [285]: %timeit vectorized_app1(series1,series2)
10 loops, best of 3: 20.5 ms per loop

In [286]: %timeit vectorized_app2(series1,series2)
100 loops, best of 3: 10.4 ms per loop

As the OP noted in the comments, we can also play around with the dtype of series2 before appending. So, I used series2.astype('U32') inside the np.core.defchararray.add call, which keeps the output dtype the same as with the str dtype. The new timings for the vectorized approaches were -

In [290]: %timeit vectorized_app1(series1,series2)
10 loops, best of 3: 20.1 ms per loop

In [291]: %timeit vectorized_app2(series1,series2)
100 loops, best of 3: 10.1 ms per loop

So, there's some further marginal improvement there!
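To rerun this comparison outside IPython, here is a rough sketch using the standard timeit module. The function names are hypothetical, np.char.add stands in for np.core.defchararray.add, and the loop version uses a conditional expression rather than indexing a two-item list with the comparison result. Absolute numbers will depend on the machine and the NumPy version.

```python
import timeit
import numpy as np

def loop_version(series1, series2):
    # Plain Python loop over pairs of elements.
    return ['t:' + str(s2) if s1 > s2 else ''
            for s1, s2 in zip(series1, series2)]

def where_version(series1, series2):
    # Prefix every element, then select with np.where.
    return np.where(series1 > series2,
                    np.char.add('t:', series2.astype(str)), '')

def masked_version(series1, series2):
    # Prefix only the selected elements and assign them into the output.
    mask = series1 > series2
    labels = np.char.add('t:', series2[mask].astype(str))
    out = np.full(series1.size, '', dtype=labels.dtype)
    out[mask] = labels
    return out

series1 = np.arange(10000).astype(float)
series2 = series1[::-1]

# Check that all three agree before comparing their speed.
expected = loop_version(series1, series2)
assert where_version(series1, series2).tolist() == expected
assert masked_version(series1, series2).tolist() == expected

runs = 10
for fn in (loop_version, where_version, masked_version):
    seconds = timeit.timeit(lambda: fn(series1, series2), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```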

hpaulj · Accepted Answer · 2016-08-31 18:56:17Z

Your list comprehension works just fine for lists; there is no real need to use arrays. And for operations like this, arrays probably won't give any speed advantage.

In [521]: series1=[float(i) for i in range(10)]
In [522]: series2=series1[::-1]
In [523]: [['', 't:'+str(s2)][s1 > s2] for s1,s2 in zip(series1, series2)]
Out[523]: ['', '', '', '', '', 't:4.0', 't:3.0', 't:2.0', 't:1.0', 't:0.0']

As @Divakar noted, there is a np.char.add function that performs string operations on arrays. My experience is that these functions are only marginally faster than list operations, and once you take into account the overhead of creating arrays, they may be slower.

=========

The array version as shown by @Divakar

In [539]: aseries1=np.array(series1)
In [540]: aseries2=np.array(series2)
In [541]: np.where(aseries1>aseries2, np.char.add('t:',aseries2.astype('U3')), '')
Out[541]: 
array(['', '', '', '', '', 't:4.0', 't:3.0', 't:2.0', 't:1.0', 't:0.0'], 
      dtype='<U5')

A couple of time tests:

In [542]: timeit [['', 't:'+str(s2)][s1 > s2] for s1,s2 in zip(series1, series2)]
100000 loops, best of 3: 15.5 µs per loop

In [543]: timeit np.where(aseries1>aseries2, np.char.add('t:',aseries2.astype('U3')), '')
10000 loops, best of 3: 63 µs per loop

edited Aug 31, 2016 at 18:56

answered Aug 31, 2016 at 18:43

5 Comments

Divakar Over a year ago

Maybe benchmark on bigger arrays? Also, isn't OP dealing with arrays as inputs?

Merlin Over a year ago

The code sample is used constantly. The array sizes are 5,000-10,000, and there can be a few arrays in the conditional. Based on benchmarking, it is worth rewriting some of them.

Merlin Over a year ago

np.core.defchararray.add('t:',series2.astype(str)) is faster than np.char.add('t:',aseries2.astype('U3')) -- it's the unicode conversion.

hpaulj Over a year ago

I used 'U3' because I'm on Py3; there str does the same thing. On a Py2 session, str and S3 should be the same. U3 will create longer arrays.

Divakar Over a year ago

@Merlin Incorporated that idea into the timings in my post.

Jules Gagnon-Marchand · Accepted Answer · 2016-08-31 19:26:14Z

This works for me. Fully vectorized.

import numpy as np
series1 = np.arange(10)
series2 = series1[::-1]
empties = np.repeat('', series1.shape[0])
ts = np.repeat('t:', series1.shape[0])
s2str = series2.astype(str)  # builtin str; the np.str alias is gone from recent NumPy
# row 0 holds the empty strings, row 1 the prefixed strings
m = np.vstack([empties, np.core.defchararray.add(ts, s2str)])
cmp = (series1 > series2).astype(np.int64)  # 0 or 1 picks the row per column
idx = np.arange(m.shape[1])
res = m[cmp, idx]
print(res)

answered Aug 31, 2016 at 19:26

1 Comment

Merlin Over a year ago

can you add timing?
