
So I have this array of strings that I got from a database query:

dat = [['1', '2 3 4 5'],
       ['6', '7 8 9 10'],
       ['11', '12 13 14 15']]

From this 3x2 array I have to make a 3x5 array of floats to do my calculations. For now I'm just saving the array to a tmp file and reading the file back to get the 3x5 array:

np.savetxt(file,dat, fmt="%s\t%s")
np.loadtxt(file)

But other than explicitly looping through the elements, splitting them, and converting them, is there a more efficient numpy way to do this?

  • Use np.vstack(np.char.split(np.asarray(dat)).sum(axis=1)).astype(float) Commented Dec 13, 2018 at 18:50
  • @Mstaino very nice, that should be posted as an answer Commented Dec 13, 2018 at 18:57
  • Out of curiosity: what is wrong with for-loops? Commented Dec 13, 2018 at 18:59
  • Thanks @G.Anderson! Commented Dec 13, 2018 at 19:00
  • @GlobalTraveler especially if you're working in numpy, vectorized operations are generally faster and more efficient. In this example it's not really a big deal, but over larger arrays it can make a large difference. Not to mention it's just better coding practice to vectorize rather than loop. Commented Dec 13, 2018 at 19:04
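To illustrate the point in that last comment, here is a minimal micro-benchmark sketch; the function name `loop_sum`, the array size, and the repeat count are illustrative choices, not anything from the thread:

```python
import timeit

import numpy as np

a = np.arange(100_000, dtype=float)

def loop_sum(arr):
    """Sum the array with an explicit Python-level loop."""
    total = 0.0
    for x in arr:
        total += x
    return total

# Both compute the same result; the vectorized call pushes the loop into C,
# so it is typically orders of magnitude faster on large arrays.
t_loop = timeit.timeit(lambda: loop_sum(a), number=10)
t_vec = timeit.timeit(lambda: a.sum(), number=10)
```

On any reasonably sized array, `t_vec` should come out well below `t_loop`; for a tiny 3x5 array like the one in the question, the difference is negligible.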

2 Answers


You can use the following one-liner:

np.vstack(np.char.split(dat).sum(axis=1)).astype(float)
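To unpack what that one-liner does, here is a step-by-step sketch (the names `step1`, `step2`, and `result` are just for illustration):

```python
import numpy as np

dat = [['1', '2 3 4 5'],
       ['6', '7 8 9 10'],
       ['11', '12 13 14 15']]

# np.char.split turns each string element into a list of tokens,
# giving a 3x2 object array of lists
step1 = np.char.split(np.asarray(dat))

# summing along axis 1 concatenates the lists row by row,
# giving a length-3 object array of 5-element lists
step2 = step1.sum(axis=1)

# stack the per-row lists into a 3x5 array and convert to floats
# (plain float, since np.float is long deprecated)
result = np.vstack(step2).astype(float)
```

The list-summing trick works because `+` on Python lists is concatenation, and NumPy's `sum` applies `+` elementwise along the chosen axis for object arrays.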

3 Comments

Is it really necessary to use np.asarray?
If you omit the np.asarray, np.char.split does it for you. The timing is the same either way.
Didn't know that. Thanks @hpaulj. Adding it to improve the answer.

Using conventional Python iteration:

def foo(row):
    res = []
    for x in row:
        res.extend(x.split())
    return res
In [141]: np.array([foo(row) for row in dat],int)
Out[141]: 
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])

It's noticeably faster than the np.char.split approach:

In [143]: timeit np.vstack(np.char.split(dat).sum(axis=1)).astype(int)
61 µs ± 171 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [144]: timeit np.array([foo(row) for row in dat],int)
8.74 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And the rejected fromstring approach:

In [147]: timeit np.array([np.fromstring(' '.join(i), sep=' ') for i in dat],int)
13.9 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And from the comment:

In [256]: timeit np.asarray([' '.join(j for i in dat for j in i).split(' ')], int).reshape(3, 5)
10.1 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [253]: ' '.join(j for i in dat for j in i)
Out[253]: '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15'

In the same spirit - do the string join one row at a time:

In [262]: timeit np.array([' '.join(row).split() for row in dat], int)
7.47 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

2 Comments

I would like to add my contribution: timeit asarray([' '.join(j for i in dat for j in i).split(' ')]).reshape(3, 5) gives 4.47 µs ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each), which outperforms all the methods you posted above.
Forgot the dtype argument above; adding it adds about 1 microsecond.
