5

Below, I have 2 functions computing the sum of squares of their arguments. The first one is nice and functional, but 20x slower than the second one. I presume that the r/map is not taking advantage of aget to retrieve elements from the double-array, whereas I'm explicitly doing this in function 2.

Is there any way I can further typehint or help r/map r/fold to perform faster?

(defn sum-of-squares
  "Given a vector v, compute the sum of the squares of elements."
  ^double [^doubles v]
  (r/fold + (r/map #(* % %) v)))

(defn sum-of-squares2
  "This is much faster than above.  Post to stack-overflow to see."
  ^double [^doubles v]
  (loop [val 0.0
         i (dec (alength v))]
    (if (neg? i)
      val
      (let [x (aget v i)]
        (recur (+ val (* x x)) (dec i))))))

(def a (double-array (range 10)))
(quick-bench (sum-of-squares a))

800 ns

(quick-bench (sum-of-squares2 a))

40 ns

2 Answers 2

7

Before experiments I've added next line in project.clj:

:jvm-opts ^:replace [] ; Makes measurements more accurate

Basic measurements:

(def a (double-array (range 1000000))) ; 10 is too small for performance measurements
(quick-bench (sum-of-squares a)) ; ... Execution time mean : 27.617748 ms ...
(quick-bench (sum-of-squares2 a)) ; ... Execution time mean : 1.259175 ms ...

This is more or less consistent with time difference in the question. Let's try to not use Java arrays (which are not really idiomatic for Clojure):

(def b (mapv (partial * 1.0) (range 1000000))) ; Persistent vector
(quick-bench (sum-of-squares b)) ; ... Execution time mean : 14.808644 ms ...

Almost 2 times faster. Now let's remove type hints:

(defn sum-of-squares3
"Given a vector v, compute the sum of the squares of elements."
[v]
(r/fold + (r/map #(* % %) v)))

(quick-bench (sum-of-squares3 a)) ; Execution time mean : 30.392206 ms
(quick-bench (sum-of-squares3 b)) ; Execution time mean : 15.583379 ms

Execution time increased only marginally comparing to version with type hints. By the way, version with transducers has very similar performance and is much cleaner:

(defn sum-of-squares3 [v]
  (transduce (map #(* % %)) + v))

Now about additional type hinting. We can indeed optimize first sum-of-squares implementation:

(defn square ^double [^double x] (* x x))

(defn sum-of-squares4
  "Given a vector v, compute the sum of the squares of elements."
  [v]
  (r/fold + (r/map square v)))

(quick-bench (sum-of-squares4 b)) ; ... Execution time mean : 12.891831 ms ...

(defn pl
  (^double [] 0.0)
  (^double [^double x] (+ x))
  (^double [^double x ^double y] (+ x y)))

(defn sum-of-squares5
  "Given a vector v, compute the sum of the squares of elements."
  [v]
  (r/fold pl (r/map square v)))

(quick-bench (sum-of-squares5 b)) ; ... Execution time mean : 9.441748 ms ...

Note #1: type hints on arguments and return value of sum-of-squares4 and sum-of-squares5 have no additional performance benefits.

Note #2: It's generally bad practice to start with optimizations. Straight-forward version (apply + (map square v)) will have good enough performance for most situations. sum-of-squares2 is very far from idiomatic and uses literally no Clojure concepts. If this is really performance critical code - better to implement it in Java and use interop. Code will be much cleaner despite of having 2 languages. Or even implement it in unmanaged code (C, C++) and use JNI (not really maintainable but if properly implemented, can give the best possible performance).

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks. I'm aware that my v2 is not idiomatic, but the code is very performance sensitive (trillions of calculations) and I'd hate to have to reach for java everytime I have a patch of hot code. I would naturally prefer to use a generic clojure-esque version, but even a 10:1 performance slowdown is quite significant. So for this particular application I'll stick with my v2. I was just trying to have my cake and eat it too...approach the performance of v2 with the elegance of your transducers v3.
In other words, I'm treating v2 AS a piece of interop, without the overhead of 2 languages.
1

Why not use areduce:

(def sum-of-squares3 ^double [^doubles v]
  (areduce v idx ret 0.0
           (let [item (aget v idx)]
             (+ ret (* item item)))))

On my machine running:

(criterium/bench (sum-of-squares3 (double-array (range 100000))))

Gives a mean execution time of 1.809103 ms, your sum-of-squares2 executes the same calculation in 1.455775 ms. I think this version using areduce is more idiomatic than your version.

For squeezing a little bit more performance you can try using unchecked math (add-unchecked and multiply-unchecked). But beware, you need to be sure that your calculation cannot overflow:

(defn sum-of-squares4 ^double [^doubles v]
  (areduce v idx ret 0.0
           (let [item (aget v idx)]
             (unchecked-add ret (unchecked-multiply item item)))))

Running the same benchmark gives a mean execution time of 1.144197 ms. Your sum-of-squares2 can also benefit from unchecked math with a 1.126001 ms mean execution time.

1 Comment

Thanks Rodrigo. I wasn't aware of areduce. That's exactly what I needed, a way to tell a reduction to use aget...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.