
So I have this array of strings that I got from a database query:

dat = [['1', '2 3 4 5'],
       ['6', '7 8 9 10'],
       ['11', '12 13 14 15']]

From this 3x2 array I have to make a 3x5 array of floats to do my calculations. For now I'm just saving the array to a tmp file and reading the file back to get the 3x5 array:

np.savetxt(file,dat, fmt="%s\t%s")
np.loadtxt(file)

But other than explicitly looping through the elements, splitting them, and converting them, is there a more efficient numpy way to do this?

  • Use np.vstack(np.char.split(np.asarray(dat)).sum(axis=1)).astype(float) Commented Dec 13, 2018 at 18:50
  • @Mstaino very nice, that should be posted as an answer Commented Dec 13, 2018 at 18:57
  • Out of curiosity: what is wrong with for-loops? Commented Dec 13, 2018 at 18:59
  • Thanks @G.Anderson! Commented Dec 13, 2018 at 19:00
  • @GlobalTraveler especially if you're working in numpy, vectorized operations are generally faster and more efficient. In this example it's not really a big deal, but over larger arrays it can make a large difference. Not to mention it's just better coding practice to vectorize rather than loop. Commented Dec 13, 2018 at 19:04
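To illustrate the point in that last comment, here is a minimal micro-benchmark sketch; the function name `loop_sum`, the array size, and the repeat count are illustrative choices, not anything from the thread:

```python
import timeit

import numpy as np

a = np.arange(100_000, dtype=float)

def loop_sum(arr):
    """Sum the array with an explicit Python-level loop."""
    total = 0.0
    for x in arr:
        total += x
    return total

# Both compute the same result; the vectorized call pushes the loop into C,
# so it is typically orders of magnitude faster on large arrays.
t_loop = timeit.timeit(lambda: loop_sum(a), number=10)
t_vec = timeit.timeit(lambda: a.sum(), number=10)
```

On any reasonably sized array, `t_vec` should come out well below `t_loop`; for a tiny 3x5 array like the one in the question, the difference is negligible.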

2 Answers


You can use the following one-liner:

np.vstack(np.char.split(dat).sum(axis=1)).astype(float)
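To unpack what that one-liner does, here is a step-by-step sketch (the names `step1`, `step2`, and `result` are just for illustration):

```python
import numpy as np

dat = [['1', '2 3 4 5'],
       ['6', '7 8 9 10'],
       ['11', '12 13 14 15']]

# np.char.split turns each string element into a list of tokens,
# giving a 3x2 object array of lists
step1 = np.char.split(np.asarray(dat))

# summing along axis 1 concatenates the lists row by row,
# giving a length-3 object array of 5-element lists
step2 = step1.sum(axis=1)

# stack the per-row lists into a 3x5 array and convert to floats
# (plain float, since np.float is long deprecated)
result = np.vstack(step2).astype(float)
```

The list-summing trick works because `+` on Python lists is concatenation, and NumPy's `sum` applies `+` elementwise along the chosen axis for object arrays.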

3 Comments

Is it really necessary to use np.asarray?
If you omit the np.asarray, np.char.split does it for you. The timing is the same either way.
Didn't know that. Thanks @hpaulj. Adding it to improve the answer.

Using conventional Python iteration:

def foo(row):
    res = []
    for x in row:
        res.extend(x.split())
    return res
In [141]: np.array([foo(row) for row in dat],int)
Out[141]: 
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

It's noticeably faster than the np.char.split approach:

In [143]: timeit np.vstack(np.char.split(dat).sum(axis=1)).astype(int)
61 µs ± 171 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [144]: timeit np.array([foo(row) for row in dat],int)
8.74 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And the rejected fromstring approach:

In [147]: timeit np.array([np.fromstring(' '.join(i), sep=' ') for i in dat],int)
13.9 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And from the comment:

In [256]: timeit np.asarray([' '.join(j for i in dat for j in i).split(' ')], int).reshape(3, 5)
10.1 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [253]: ' '.join(j for i in dat for j in i)
Out[253]: '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15'

In the same spirit - do the string join one row at a time:

In [262]: timeit np.array([' '.join(row).split() for row in dat], int)
7.47 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

2 Comments

I would like to add my contribution: timeit asarray([' '.join(j for i in dat for j in i).split(' ')]).reshape(3, 5) gives 4.47 µs ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each), which outperforms all the methods you posted above.
Forgot the dtype argument above; adding it adds about 1 microsecond.
