I have a vector of strings and I would like to hash each element individually to integers modulo n.

This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:

library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)

So the above approach does not work: it cannot even produce an integer, let alone that integer modulo n.

What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.

  • For speed, you could also consider replacing the modulo reduction with multiply and shift. Commented Dec 1, 2017 at 8:33
  • @ThomasMueller It's not obvious how to do that for a novice. Commented Dec 6, 2017 at 4:29
  • A good description is in the blog post A fast alternative to the modulo reduction Commented Dec 6, 2017 at 5:12
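
To make the multiply-and-shift idea from the comments concrete, here is a minimal sketch in base R. It assumes the hash is an unsigned 32-bit value held in a double, and that n is at most 2^21 so the product stays below 2^53 and is exact in double arithmetic. The name fast_range is hypothetical. Note that the result lies in 0..n-1 but is generally not the same number as x %% n.

```r
# Map an unsigned 32-bit hash x (stored as a double) to the range 0..n-1
# by computing floor(x * n / 2^32), the "multiply and shift" reduction.
# Assumption: n <= 2^21, so x * n < 2^53 and the product is exact.
fast_range <- function(x, n) {
  stopifnot(n >= 1, n <= 2^21)
  floor(x * n / 2^32)
}

fast_range(4192999065, 17)
```

In compiled code the division by 2^32 is a bit shift, which is where the speed-up comes from; in plain R the benefit over %% is likely to be small.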

2 Answers

R uses 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9 (the maximum is .Machine$integer.max, i.e. 2147483647). strtoi returns NA because the number is too big.

The mpfr-function from the Rmpfr package should work for you:

library(Rmpfr)
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065
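
If an extra package is not wanted, a possible base-R alternative is sketched below. It relies on the fact that doubles represent every integer up to 2^53 exactly, so a 32-bit hash fits. The 8-digit hex string is split into two 16-bit halves that strtoi can parse without overflow. The function name hex_to_mod is hypothetical, and the input "f9ec1699" is simply the hexadecimal form of 4192999065, the value shown above.

```r
# Convert 8-digit hex strings (32-bit hashes) to numbers modulo n
# without overflowing R's 32-bit integers.
# Assumption: every element of hex has exactly 8 hex digits.
hex_to_mod <- function(hex, n) {
  stopifnot(all(nchar(hex) == 8))
  hi <- strtoi(substr(hex, 1, 4), 16L)  # upper 16 bits
  lo <- strtoi(substr(hex, 5, 8), 16L)  # lower 16 bits
  (hi * 65536 + lo) %% n                # arithmetic is done in doubles
}

hex <- "f9ec1699"      # hex form of 4192999065
strtoi(hex, 16L)       # NA: the value exceeds .Machine$integer.max
hex_to_mod(hex, 17)    # 14
```

Because substr and strtoi are vectorised, hex_to_mod accepts a whole character vector of digests at once.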

1 Comment

This seems quite slow when I applied it to a large vector.

I made an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.

To use it:

if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)
