Can anyone tell me the fastest way to convert this string array into a number array, as shown below?

import numpy as np
strarray = np.array([["123456"], ["654321"]])

     to

numberarray = np.array([[1,2,3,4,5,6], [6,5,4,3,2,1]])

Mapping each str to a list and then mapping each character to int is too slow for a large array!

Please help!

  • Possible duplicate of How to convert an array of strings to an array of floats in numpy? Commented Feb 24, 2016 at 13:22
  • Is this a typo? ["12456"] -> [1,2,3,4,5,6] Commented Feb 24, 2016 at 13:22
  • Are all elements guaranteed to have the same length (like it's 6 in the sample case)? Commented Feb 24, 2016 at 13:32
  • To lan: Yes, that was a typo! I have already corrected it. Commented Feb 24, 2016 at 13:58
  • To Divakar: Yes, they are guaranteed to have the same length! Commented Feb 24, 2016 at 13:59

2 Answers

You can split the strings into single characters with the array view method:

In [18]: strarray = np.array([[b"123456"], [b"654321"]])

In [19]: strarray.dtype
Out[19]: dtype('S6')

In [20]: strarray.view('S1')
Out[20]: 
array([['1', '2', '3', '4', '5', '6'],
       ['6', '5', '4', '3', '2', '1']], 
      dtype='|S1')

See here for data type character codes.

Then the most obvious next step is to use astype:

In [23]: strarray.view('S1').astype(int)
Out[23]: 
array([[1, 2, 3, 4, 5, 6],
       [6, 5, 4, 3, 2, 1]])

However, it's a lot faster to reinterpret (view) the memory underlying the strings as single byte integers and subtract 48. This works because ASCII characters take up a single byte and the characters '0' through '9' are binary equivalent to (u)int8's 48 through 57 (check the ord builtin).

Speed comparison:

In [26]: ar = np.array([[''.join(np.random.choice(list('123456789'), size=320))] for _ in range(1000)], bytes)

In [27]: %timeit _ = ar.view('S1').astype(np.uint8)
1 loops, best of 3: 284 ms per loop

In [28]: %timeit _ = ar.view(np.uint8) - ord('0')
1000 loops, best of 3: 1.07 ms per loop

If you have Unicode instead of ASCII strings, these steps need to be done slightly differently. Alternatively, just convert to ASCII first with astype(bytes).
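As a minimal sketch of the Unicode case (not part of the original answer), the snippet below shows two ways it might be handled for the array from the question. It assumes every string contains only the ASCII digits '0' to '9' and that all strings have the same length; the names as_bytes, digits_a and digits_b are made up for illustration.

```python
import numpy as np

# Unicode input, as in the question (dtype 'U6' on Python 3)
strarray = np.array([["123456"], ["654321"]])

# Option 1: convert to ASCII bytes first, then reinterpret each byte as uint8
as_bytes = strarray.astype(bytes)                 # dtype 'S6'
digits_a = as_bytes.view(np.uint8) - ord("0")     # shape (2, 6)

# Option 2: NumPy stores each Unicode character as a 4-byte code point,
# so view the original buffer as uint32 instead of uint8 (no copy needed)
digits_b = strarray.view(np.uint32) - ord("0")    # shape (2, 6)

print(digits_a)
print(digits_b)
```

Option 1 makes a converted copy of the data before viewing it, while option 2 reinterprets the existing buffer directly. Both rely on the array being contiguous in memory.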

6 Comments

Could be a version issue, I am getting unicode for strarray.dtype. I am on Python 3.4. And ar.view('S1') has "b'" all over along with the strings themselves.
@Divakar - I changed the strings to bytes for Python 3 compatibility.
But if OP has those as strings, he/she has to convert to byte first, right? How could that be done?
@Divakar - Python 2.x has ASCII strings as default and for those it works.
Ah yes you have mentioned .astype(bytes) for the conversion in the post! Nice, works for me now.

Here's an approach that converts the input strings to N-length numeric arrays, i.e. each string gets converted to a 1D array of length N, where N is the length of each of those strings. It first converts each string to its integer value and divides that value by successive powers of 10. Each digit is then recovered by subtracting 10 times the next quotient from the current one. The implementation looks like this -

A = (strarray.astype(int)/(10**np.arange(len(strarray[0][0])))).astype(int)
out = np.column_stack((A[:,-1],(A[:,:-1] - 10*A[:,1:])[:,::-1]))

Sample run -

In [177]: strarray  = np.array([["0308468"], ["6540542"], ["4973473"]])

In [178]: A = (strarray.astype(int)/(10**np.arange(len(strarray[0][0])))).astype(int)
     ...: out = np.column_stack((A[:,-1],(A[:,:-1] - 10*A[:,1:])[:,::-1]))
     ...: 

In [179]: out
Out[179]: 
array([[0, 3, 0, 8, 4, 6, 8],
       [6, 5, 4, 0, 5, 4, 2],
       [4, 9, 7, 3, 4, 7, 3]])
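A closely related variant, sketched here as an illustration rather than as the answer's own method, isolates each digit with floor division and a modulo instead of differencing adjacent columns. It assumes equal-length strings whose integer values fit in int64 (at most 18 digits); the names values, powers and digits are hypothetical.

```python
import numpy as np

strarray = np.array([["0308468"], ["6540542"], ["4973473"]])
n_digits = len(strarray[0, 0])

# Integer value of each string, shape (rows, 1); leading zeros are dropped here
values = strarray.astype(np.int64)

# Powers of 10 from the most significant position down to 10**0
powers = 10 ** np.arange(n_digits - 1, -1, -1, dtype=np.int64)

# Broadcasting (rows, 1) against (n_digits,) gives (rows, n_digits);
# floor division shifts the wanted digit to the ones place, % 10 extracts it
digits = (values // powers) % 10

print(digits)
```

Using integer floor division throughout avoids the intermediate float array that true division produces on Python 3.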

1 Comment

Tricky solution! Thanks for sharing this method, it was enlightening!
