1

I need to take 20 results from a lazy sequence of millions of hash-maps but for the 20 to be based on sorting on various values within the hashmaps.

For example:

(def population [{:id 85187153851 :name "anna" :created #inst "2012-10-23T20:36:25.626-00:00" :rank 77336}
         {:id 12595145186 :name "bob" :created #inst "2011-02-03T20:36:25.626-00:00" :rank 983666}
         {:id 98751563911 :name "cartmen" :created #inst "2007-01-13T20:36:25.626-00:00" :rank 112311}
         ...
         {:id 91514417715 :name "zaphod" :created #inst "2015-02-03T20:36:25.626-00:00" :rank 9866}]

In normal circumstances a simple sort-by would get the job done:

(sort-by :id population)
(sort-by :name population)
(sort-by :created population)
(sort-by :rank population)

But I need to do this across millions of records as fast as possible and want to do it lazily rather than having to realize the entire data set.

I looked around a lot and found a number of implementations of algorithms that work really well for sorting a sequence of values (mostly numeric) but none for a lazy sequence of hash-maps in the way I need.

Speed & efficiency being of prime importance, the best I have found has been the quicksort example from the Joy Of Clojure book (Chapter 6.4) which does just enough work to return the required result.

(ns joy.q)

(defn sort-parts
  "Lazy, tail-recursive, incremental quicksort.  Works against
   and creates partitions based on the pivot, defined as 'work'."
  [work]
  (lazy-seq
   (loop [[part & parts] work]
     (if-let [[pivot & xs] (seq part)]
       (let [smaller? #(< % pivot)]
         (recur (list*
                 (filter smaller? xs)
                 pivot
                 (remove smaller? xs)
                 parts)))
       (when-let [[x & parts] parts]
         (cons x (sort-parts parts)))))))

(defn qsort [xs]
    (sort-parts (list xs))) 

Works really well...

(time (take 10 (qsort (shuffle (range 10000000)))))
"Elapsed time: 551.714003 msecs"
(0 1 2 3 4 5 6 7 8 9)

Great! But...

However much I try I can't seem to work out how to apply this to a sequence of hashmaps.

I need something like:

(take 20 (qsort-by :created population))

1 Answer 1

4

If you only need the top N elements a full sort is too expensive (even a lazy sort as the one in the JoC: it needs to keep nearly the all data set in memory).

You only need to scan (reduce) the dataset and keep the best N items so far.

=> (defn top-by [n k coll]
     (reduce
      (fn [top x]
        (let [top (conj top x)]
          (if (> (count top) n)
            (disj top (first top))
            top)))
      (sorted-set-by #(< (k %1) (k %2))) coll))
#'user/top-by
=> (top-by 3 first [[1 2] [10 2] [9 3] [4 2] [5 6]])
#{[5 6] [9 3] [10 2]}
Sign up to request clarification or add additional context in comments.

1 Comment

Fantastic! Thank you!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.