0

So I have a numpy array of strings that contain numeric values separated by spaces, for example:

np.array(['1 2', '3 4'])
array(['1 2', '3 4'], dtype='<U3')

and I want to convert it to a numerical matrix like:

np.array([[1,2],[3,4]])
array([[1, 2],[3, 4]])

I'm looking for an operation that can leverage numpy vectorized operations, as speed is important here. The rows have length 2 in this example, but I need a general approach that works for an arbitrary row length.

Thanks!

6
  • Possible duplicate of Convert string numpy.ndarray to float numpy.ndarray Commented Jul 15, 2019 at 10:03
  • 1
    I came up with two vectorized solutions using np.char.split and pd.Series.str.split but both of them are slower than the native Python loops in the accepted answer of the duplicate target. Commented Jul 15, 2019 at 11:01
  • @Georgy Can you post these solutions? Maybe they are faster with bigger arrays, which is my case. Commented Jul 15, 2019 at 11:18
  • 1
    Posted it under the duplicate question Commented Jul 15, 2019 at 12:22
  • @Georgy Thanks! I think the solution with np.char.split should be the fastest, I posted an issue in the numpy tracker Commented Jul 15, 2019 at 15:25
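
For readers who have not seen it, here is a minimal sketch of what an np.char.split based conversion could look like. It is not necessarily the code posted under the duplicate question, and it assumes every string holds the same number of whitespace-separated integers. Whether it beats a plain loop should be measured on your own data.

```python
import numpy as np

a = np.array(["1 2", "3 4"])

# np.char.split returns an object array whose elements are Python lists of tokens
parts = np.char.split(a)

# Assumption: every row has the same number of tokens, so the nested list is rectangular
m = np.array(parts.tolist()).astype(np.int64)
```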

3 Answers

1

Here is an approach that assumes nonnegative ints coming in pairs, separated by a single space:

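import numpy as np

def to_num(x):
    # View each string as UTF-32 code points; '0' is 48, so digits become 0..9
    # and the space becomes negative. Each column is weighted by a power of 10.
    y = (x[:,None].view(np.int32)-48)*10**np.arange(x.itemsize//4-1,-1,-1)
    # Column index of the separating space in each row
    splt = y.argmin(1)
    z = np.take_along_axis(y.cumsum(1),np.column_stack([splt-1,np.full(*y.shape-np.arange(2))]),1)
    z[:,1]+=10**(y.shape[1]-splt-1)*16-z[:,0]
    z[:,0]//=10**(y.shape[1]-splt)
    end = (y[:,::-1]>=0).argmax(1)
    z[:,1]+=np.concatenate([[0],48*np.cumsum(10**np.arange(end.max()))])[end]
    z[:,1]//=10**end
    return z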
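
To make the first line of to_num less opaque, here is a small sketch of the code-point trick on its own. It uses a reshape instead of x[:,None] purely for illustration; the sample strings are made up. Digits come out as 0..9, the space (code point 32) as -16, and any padding of shorter strings (code point 0) would come out as -48.

```python
import numpy as np

x = np.array(["12 7", "3 45"])  # dtype '<U4': four UTF-32 code points per string

# Reinterpret the string buffer as 32-bit integers, one row per string
codes = x.view(np.int32).reshape(len(x), -1)

# ord('0') == 48, so subtracting 48 turns digit characters into their values
digits = codes - 48
```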

For example, 10 runs over 10^6 pairs take roughly 3 seconds in total on my machine (timeit reports the total for all runs, so about 0.3 seconds per call):

from timeit import timeit

x = np.random.randint(0,1000000,(1000000,2))
x = np.array([" ".join(map(str, y)) for y in x.tolist()])

(to_num(x) == [[int(z) for z in y.split()] for y in x.tolist()]).all()
# True
timeit(lambda:to_num(x), number=10)
# 2.9360161621589214

1 Comment

This seems to work, but it assumes that each row has length 2, and I'm interested in arrays with longer rows. I'll add the info to the question.
0

If it doesn't have to be that fast, you could iterate over the array element-wise and apply this to each string:

import numpy as np

def separate_string(s):
    # Split on single spaces, then convert the tokens to integers
    split_numbers = s.split(' ')
    output = np.asarray(split_numbers).astype(int)

    return output


separate_string('1 1')
# array([1, 1])
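
A variation on the same idea that avoids calling the function once per element: join all strings, split once, and reshape. This is only a sketch under the assumption that every string holds the same number of whitespace-separated integers; strings_to_matrix is a hypothetical name, and I have not timed it against the other approaches.

```python
import numpy as np

def strings_to_matrix(a):
    # Hypothetical helper. Assumption: all rows contain the same number of integers.
    tokens = " ".join(a.tolist()).split()
    # One string-to-int conversion for the whole array, then one row per input string
    return np.array(tokens).astype(np.int64).reshape(len(a), -1)

a = np.array(["1 2 3", "4 5 6"])
m = strings_to_matrix(a)
```

Because the row length is inferred by reshape, the same call works for rows of any (common) length.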

3 Comments

Thanks for the answer, but speed is critical here; I was looking for some type of vectorized operation. I will add the info to the question.
Are there int values only? And are they restricted to [0,9]?
Yes, they are integers, but they are not restricted to [0,9]
0

First, split each string on whitespace; once that is done, take a look at the function numpy.asmatrix() to build the matrix.

1 Comment

np.matrix(';'.join(a)) uses the '1 2; 3 4' syntax that np.matrix accepts. But this is slower than the list comprehensions. np.matrix still has to use string operations to split the lines and numbers, just reversing our join. It doesn't use fast compiled code to do that.
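
For reference, the two variants compared in this comment can be sketched as follows. The sample array is made up, and no timing claim is implied beyond what the comment says; both produce the same integer matrix.

```python
import numpy as np

a = np.array(["1 2", "3 4"])

# Pure-Python baseline: split each string and convert token by token
via_lists = np.array([[int(t) for t in s.split()] for s in a.tolist()])

# np.matrix parses the '1 2;3 4' string form; np.asarray drops the matrix subclass
via_matrix = np.asarray(np.matrix(";".join(a.tolist())))
```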
