In Python/NumPy, I have a 10,000x10,000 array named random_matrix. I compute an MD5 hash of str(random_matrix) and of random_matrix itself. The string version takes 0.00754404067993 seconds, while the array version takes 1.6968960762 seconds. With a 20,000x20,000 array, the string version takes 0.0778470039368 seconds and the array version takes 60.641119957 seconds. Why is this? Do NumPy arrays take up much more memory than strings? Also, if I want to derive filenames from these matrices, is converting to a string before hashing a good idea, or are there drawbacks?
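For reference, here is a minimal sketch of the comparison (the question's exact timing code isn't shown, so the array sizes and hashing calls here are assumptions; a 1,000x1,000 array is used to keep it quick):

```python
import hashlib
import numpy as np

a = np.random.random((1000, 1000))  # stand-in for random_matrix

# "String version": hash the printable form of the array
h_str = hashlib.md5(str(a).encode()).hexdigest()

# "Array version": hash the array's raw data buffer
h_arr = hashlib.md5(a.tobytes()).hexdigest()

# The printable form is tiny compared to the 8 MB buffer,
# which is why hashing it is so much faster
print(len(str(a)), a.nbytes)
```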

1 Answer


str(random_matrix) will not include all of the matrix, because numpy elides the middle of large arrays with "...":

>>> import numpy as np
>>> x = np.ones((1000, 1000))
>>> print str(x)
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

So when you hash str(random_matrix), you aren't really hashing all the data.
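To put a number on that (a small check, not from the original answer): the elided string form is only a few hundred characters, while the array's buffer holds eight million bytes.

```python
import numpy as np

x = np.ones((1000, 1000))

# str(x) is the short elided summary shown above;
# x.nbytes is the full buffer: 1,000,000 float64 values = 8,000,000 bytes
print(len(str(x)), x.nbytes)
```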

See this previous question and this one about how to hash numpy arrays.
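One approach along the lines of those linked questions (sketched here as an illustration, not quoted from them): hash the array's full buffer with tobytes(), mixing in the shape and dtype, since the raw bytes alone can't distinguish, say, a 2x6 array from a 3x4 one.

```python
import hashlib
import numpy as np

def array_md5(a):
    """MD5 over an array's full contents (hypothetical helper).

    Shape and dtype are mixed in so that reshaped or re-typed views
    of the same underlying bytes hash differently.
    """
    h = hashlib.md5()
    h.update(str(a.shape).encode())
    h.update(str(a.dtype).encode())
    # ascontiguousarray handles non-contiguous views (e.g. transposes)
    h.update(np.ascontiguousarray(a).tobytes())
    return h.hexdigest()

a = np.arange(12, dtype=np.float64)
print(array_md5(a.reshape(2, 6)) == array_md5(a.reshape(3, 4)))  # False
```

The resulting hex digest is safe to use in a filename, which addresses the question's original goal.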
