In Python/NumPy, I have a 10,000x10,000 array named random_matrix. I compute an MD5 hash of str(random_matrix) and of random_matrix itself. The string version takes 0.00754404067993 seconds, while the array version takes 1.6968960762 seconds. With a 20,000x20,000 array, the string version takes 0.0778470039368 seconds and the array version takes 60.641119957 seconds. Why is this? Do NumPy arrays take up much more memory than strings? Also, if I want to derive filenames from these matrices, is converting to a string before hashing a good idea, or are there drawbacks?
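For reference, here is a minimal sketch of the comparison (the question's exact timing code isn't shown, so the array sizes and hashing calls here are assumptions; a 1,000x1,000 array is used to keep it quick):

```python
import hashlib
import numpy as np

a = np.random.random((1000, 1000))  # stand-in for random_matrix

# "String version": hash the printable form of the array
h_str = hashlib.md5(str(a).encode()).hexdigest()

# "Array version": hash the array's raw data buffer
h_arr = hashlib.md5(a.tobytes()).hexdigest()

# The printable form is tiny compared to the 8 MB buffer,
# which is why hashing it is so much faster
print(len(str(a)), a.nbytes)
```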

1 Answer


str(random_matrix) will not include all of the matrix, because numpy elides the middle of large arrays with "...":

>>> import numpy as np
>>> x = np.ones((1000, 1000))
>>> print str(x)
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

So when you hash str(random_matrix), you aren't really hashing all the data.
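To put a number on that (a small check, not from the original answer): the elided string form is only a few hundred characters, while the array's buffer holds eight million bytes.

```python
import numpy as np

x = np.ones((1000, 1000))

# str(x) is the short elided summary shown above;
# x.nbytes is the full buffer: 1,000,000 float64 values = 8,000,000 bytes
print(len(str(x)), x.nbytes)
```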

See this previous question and this one about how to hash numpy arrays.
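One approach along the lines of those linked questions (sketched here as an illustration, not quoted from them): hash the array's full buffer with tobytes(), mixing in the shape and dtype, since the raw bytes alone can't distinguish, say, a 2x6 array from a 3x4 one.

```python
import hashlib
import numpy as np

def array_md5(a):
    """MD5 over an array's full contents (hypothetical helper).

    Shape and dtype are mixed in so that reshaped or re-typed views
    of the same underlying bytes hash differently.
    """
    h = hashlib.md5()
    h.update(str(a.shape).encode())
    h.update(str(a.dtype).encode())
    # ascontiguousarray handles non-contiguous views (e.g. transposes)
    h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

a = np.arange(12, dtype=np.float64)
print(array_md5(a.reshape(2, 6)) == array_md5(a.reshape(3, 4)))  # False
```

The resulting hex digest is safe to use in a filename, which addresses the question's original goal.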
