18

I have a string as follows:

1|234|4456|789

I have to convert it into a NumPy array, and I would like to know the most efficient way, since I will be calling this function more than 50 million times.

3 Answers

25

The fastest way is to use the numpy.fromstring function:

>>> import numpy
>>> data = "1|234|4456|789"
>>> numpy.fromstring(data, dtype=int, sep="|")
array([   1,  234, 4456,  789])
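Since the question mentions calling this tens of millions of times, it may help to wrap the call in a small function. The following is only a sketch: parse_pipe_ints is a hypothetical helper name, and it assumes every field in the string is a valid integer.

```python
import numpy as np

def parse_pipe_ints(text):
    """Parse a '|'-separated string of integers into a 1-D NumPy array.

    parse_pipe_ints is a hypothetical name used for illustration;
    it is not part of NumPy.
    """
    # A non-empty sep makes fromstring parse the input as text.
    return np.fromstring(text, dtype=int, sep="|")

arr = parse_pipe_ints("1|234|4456|789")
print(arr)
```

Passing a non-empty sep is what selects text parsing here; malformed fields are not handled in this sketch.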

2 Comments

Why didn't I think of that? :P
Thank you so much. It is quite efficient, judging by @bernie's timings. :)
8

@jterrace wins one (1) internet.

In the measurements below the example code has been shortened to allow the tests to fit on one line without scrolling where possible.

For those not familiar with timeit, the -s flag allows you to specify a bit of setup code that is executed only once, before the timed statement is run.
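The same split between setup code and timed statement is available from within Python through the timeit module. The sketch below is an illustration only: the statement being timed and the loop count are arbitrary choices, and the timings you get will depend on your machine.

```python
import timeit

# Setup code runs once before timing starts (the equivalent of -s).
setup = "s = '1|234|4456|789'"
# The statement is what gets executed repeatedly and timed.
stmt = "[int(x) for x in s.split('|')]"

loops = 10000  # arbitrary loop count chosen for this sketch
total = timeit.timeit(stmt, setup=setup, number=loops)
print(f"{total / loops * 1e6:.2f} usec per loop")
```

timeit.timeit returns the total time in seconds for all loops, so dividing by the loop count gives a per-loop figure comparable to the command-line output.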


The fastest and least-cluttered way is to use numpy.fromstring as jterrace suggested:

python -mtimeit -s"import numpy;s='1|2'" "numpy.fromstring(s,dtype=int,sep='|')"
100000 loops, best of 3: 1.85 usec per loop

The following three examples use string.split in combination with another tool.

string.split with numpy.fromiter

python -mtimeit -s"import numpy;s='1|2'" "numpy.fromiter(s.split('|'),dtype=int)"
100000 loops, best of 3: 2.24 usec per loop

string.split with int() cast via generator-expression

python -mtimeit -s"import numpy;s='1|2'" "numpy.array(int(x) for x in s.split('|'))"
100000 loops, best of 3: 3.12 usec per loop
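One caveat about the generator-expression measurement above: numpy.array does not iterate over a generator. It stores the generator object itself in a zero-dimensional object array, so that timing does not reflect an actual conversion to integers. A minimal sketch of the difference, using a list comprehension as the working alternative:

```python
import numpy as np

s = "1|2"

# Passing a generator: NumPy wraps the generator object itself.
wrapped = np.array(int(x) for x in s.split("|"))
print(wrapped.shape, wrapped.dtype)

# Passing a list: NumPy builds a 1-D integer array as intended.
converted = np.array([int(x) for x in s.split("|")])
print(converted)
```

To consume an iterator of already-converted values directly, numpy.fromiter is the function designed for that purpose.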

string.split with NumPy array of type int

python -mtimeit -s"import numpy;s='1|2'" "numpy.array(s.split('|'),dtype=int)"
100000 loops, best of 3: 9.22 usec per loop

1 Comment

much better explanation on speed difference!
5

Try this:

import numpy as np
s = '1|234|4456|789'
array = np.array([int(x) for x in s.split('|')])

This assumes that the numbers are all ints. If not, replace int with float in the snippet above.
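For illustration, here is the float variant of the same snippet. The input string is made up for this example and is not from the question.

```python
import numpy as np

# Hypothetical input containing non-integer values.
s = "1.5|2|3.25"

# Same approach as above, with float() instead of int().
array = np.array([float(x) for x in s.split("|")])
print(array)
```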

EDIT 1:

Alternatively, you can do the following, which creates only one intermediate list (the one generated by split()):

array = np.array(s.split('|'), dtype=int)

EDIT 2:

And yet another way, possibly faster (thanks for all the comments, guys!):

array = np.fromiter(s.split("|"), dtype=int)

9 Comments

The problem with this is that it generates an in-memory list of all the parts of the string. If there really are 50 million parts, that's a lot of extra memory for a temporary list.
@AdamMihalcin that really depends on the version of Python in use. In Python 3, the list is evaluated lazily and no intermediate lists are created. Also, the OP said that the function would be called 50 million times, not that there are 50 million elements in the list.
@AdamMihalcin Even if you use imap or a generator expression? Oscar - On Python 3, the list comprehension still will create an intermediate list.
You can simplify it as array = np.array(s.split('|'), dtype=int). But that still leaves you with a large in-memory list.
FWIW I find np.fromiter(s.split("|"), dtype=int) is several times faster.