4

I have 4 arrays that are ordered. I'd like to be able to combine them into a single sorted data structure and lazily take from it.

Is there an efficient way of doing this?

[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10]

=> [1 1 1 2 2 3 3 3 3 4]
1
  • I thought I could find an easy answer, but I couldn't. The most efficient would be some kind of merge sort, I think. You may have to hand-code it using loop/recur (or plain Java for max speed). Commented Jul 9, 2020 at 5:42

5 Answers

3

Clojure comes with a library of functions producing or operating on lazy sequences, such as map, iterate and take-while. I believe a merge algorithm could be expressed by combining them, something like this.

(defn insert-into-sorted [dst x]
  ;; Insert the sequence x into dst, keeping dst ordered by first element.
  (let [x0 (first x)
        a (take-while #(< (first %) x0) dst)
        b (drop (count a) dst)]
    (vec (concat a [x] b))))

(defn next-arrays [arrs]
  ;; Advance the state: pop the head of the first (smallest-headed) sequence
  ;; and re-insert its tail in sorted position.
  (let [[f & r] arrs
        restf (rest f)]
    (if (empty? restf)
      r
      (insert-into-sorted r restf))))

(defn merge-sorted-arrays [arrs]
  (->> arrs
       (filter seq)          ;; discard empty inputs
       (sort-by first)       ;; order the sequences by their heads
       (iterate next-arrays) ;; lazily produce successive states
       (take-while seq)      ;; stop when everything is exhausted
       (map ffirst)))        ;; emit the smallest remaining element each step

And we can call it like this:

(merge-sorted-arrays [[1 3 4 6 9 10 15]
                      [2 3 6 7 8 9 10]
                      [1 3 6 7 8 9 10]
                      [1 2 3 4 8 9 10]])
;; => (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

It is true that you could do something like (sort (apply concat ...)), but that could turn out to be inefficient if you have a lot of data.
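For comparison, a sketch of that naive version (not part of the answer) might look like this; sort is eager, so the whole concatenation gets realized even if you only take the first few elements:

(take 10 (sort (apply concat [[1 3 4 6 9 10 15]
                              [2 3 6 7 8 9 10]
                              [1 3 6 7 8 9 10]
                              [1 2 3 4 8 9 10]])))
;; => (1 1 1 2 2 3 3 3 3 4)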

Update: A previous version of this code contained a call to count that limited its applicability to merging sequences of finite length. By changing it to use empty? instead, there is no such limitation and we can now use it to merge infinite sequences:

(take 12 (merge-sorted-arrays [(iterate (partial + 1.1) 1) (iterate (partial + 1.11) 1)]))
;; => (1 1 2.1 2.1100000000000003 3.2 3.2200000000000006 4.300000000000001 4.330000000000001 5.4 5.440000000000001 6.5 6.550000000000002)

9 Comments

I'm still staring at the code trying to figure out how it works. Thanks heaps.
You're welcome! The most important function is arguably iterate, used to express iterative algorithms and produce a sequence of iterated values. It is useful in many situations where you would otherwise use loop. In our case, the iterated value is the collection of arrays sorted by their first element.
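For illustration, the iterated states can be inspected directly with a small sketch like this (using the functions from the answer); at each step the array with the smallest head sits at the front:

(->> (sort-by first [[2 3] [1 4] [3 5]])
     (iterate next-arrays)
     (take 4)
     (map ffirst))
;; => (1 2 3 3)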
I just discovered a bug that limited its use to finite sequences. See the modified answer above with an example on sequences of infinite length.
I did get a StackOverflowError at about 10000 items. I'll try again with this approach.
I'm still getting the StackOverflowError. It happened at around 25000 elements: at clojure.core$concat$cat__5217$fn__5218.invoke(core.clj:726) at clojure.lang.LazySeq.sval(LazySeq.java:40)
3

There is also a nice way to do it by counting item frequencies into a sorted map and then unwrapping it with repeat:

(def data [[1 3 4 6 9 10 15]
           [2 3 6 7 8 9 10]
           [1 3 6 7 8 9 10]
           [1 2 3 4 8 9 10]])

(->> data
     (apply concat)
     (reduce #(update %1 %2 (fnil inc 0)) (sorted-map)) ;; count occurrences into a sorted map
     (mapcat (fn [[k v]] (repeat v k))))                ;; repeat each key k, v times

;;=> (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
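The reduce into a (sorted-map) is essentially a sorted frequencies; a roughly equivalent, arguably more readable sketch (not from the answer) would be:

(->> data
     (apply concat)
     frequencies
     (into (sorted-map))
     (mapcat (fn [[k v]] (repeat v k))))
;; => (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

Note that either way the whole input is consumed up front, so this approach only suits finite inputs.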

There is also a less hacky approach:

(defn min-first-idx [data]
  ;; Index of the non-empty sequence with the smallest first element, or nil.
  (when-let [items (->> data
                        (keep-indexed (fn [i x] (when (seq x) [x i])))
                        seq)]
    (second (apply min-key ffirst items))))

(defn min-and-more [data-v]
  ;; Return [smallest element, data with that element removed], or nil when all are empty.
  (when-let [i (min-first-idx data-v)]
    [(first (data-v i)) (update data-v i rest)]))

user> (min-and-more [[1 2 3] [0 1 4] [4 5]])
;; [0 [[1 2 3] (1 4) [4 5]]]

So you can use it to iteratively take the smallest item and the rest of the collection:

(->> [nil (vec data)]      
     (iterate (comp min-and-more second))
     rest
     (take-while (comp seq second))
     (map first))

;; (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

Comments

2

I slightly modified Rulle's answer to support sorting maps by a key:

(defn sort-arrays-insert
  ([key dst x]
   (let [x0 (key (first x))
         a  (take-while #(< (key (first %)) x0) dst)
         b  (drop (count a) dst)]
     (concat a [x] b))))

(defn sort-arrays-next
  ([key arrs]
   (let [[f & r] arrs]
     (if (<= (count f) 1)
       r
       (sort-arrays-insert key r (rest f))))))

(defn sort-arrays
  ([key arr0 arr1 & more]
   (->> (apply list arr0 arr1 more)
        (filter seq)
        (sort-by (comp key first))
        (iterate #(sort-arrays-next key %))
        (take-while seq)
        (map ffirst))))
(sort-arrays identity
             [1 3 4 6 9 10 15]
             [2 3 6 7 8 9 10]
             [1 3 6 7 8 9 10]
             [1 2 3 4 8 9 10])

=> (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
(sort-arrays :time
             [{:time 1} {:time 4}]
             [{:time 2} {:time 3}]
             [{:time 3} {:time 5} {:time 7}]
             [{:time 1} {:time 10}])

=> ({:time 1} {:time 1} {:time 2} {:time 3} {:time 3} {:time 4} {:time 5} {:time 7} {:time 10})

Also, benchmarking Rulle's, Peter's and Leetwinski's answers, I found that Peter's answer is about twice as fast as the rest:

(let [L 20000
      N 10]
  (mapv (fn [f]
          (let [arrs (vec (for [i (range N)]
                            (vec (range L))))]
            (time (doall (f arrs)))))
        [merge-sorted    ;; Rulle's
         merge-sorted-2  ;; Peter's
         merge-sorted-3  ;; Leetwinski's
         ]))


"Elapsed time: 721.649222 msecs"  ;; Rulle's
"Elapsed time: 373.058068 msecs"  ;; Peter's
"Elapsed time: 754.717533 msecs"  ;; Leetwinski's

Comments

1

You can just write this out explicitly for a lazy, performant version.

(defn merges
  ;; A single sequence is already merged.
  ([x] x)
  ;; Lazy two-way merge of sorted sequences.
  ([x y]
   (cond
     (empty? x) y
     (empty? y) x
     (< (first x) (first y)) (cons (first x) (lazy-seq (merges y (rest x))))
     :else (cons (first y) (lazy-seq (merges x (rest y))))))
  ;; More than two: merge pairwise, then recurse on the merged halves.
  ([x y & more]
   (apply merges
          (for [[a b] (partition-all 2 (list* x y more))]
            (merges a b)))))


(apply merges [[1 3 4 6 9 10 15]
               [2 3 6 7 8 9 10]
               [1 3 6 7 8 9 10]
               [1 2 3 4 8 9 10]])
;; => (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

Edit: This revision merges pairwise up a binary tree, so the nesting depth is logarithmic in the number of sequences, rather than the prior linear reduction.
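Because the merge is built with lazy-seq, only as much of each input is realized as is consumed, so it also works with infinite sequences; for example (a quick check, not from the original answer):

(take 5 (merges (iterate inc 0) (iterate #(+ % 2) 1)))
;; => (0 1 1 2 3)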

3 Comments

This can lead to stack overflow (e.g. for an input of 10000 sequences) because it "stacks" a lot of lazy sequences.
Thanks for pointing this out, but is it the stack that bears the burden or the heap/GC? Clojure will typically unwind nested lazy-seq thunks in place. Under zcaudate's benchmark with 20000/10000 parameters the heap/GC error occurs for me by simply attempting to create the arrs input! But, I was a lazy programmer in handling the more-than-two case with a linear reduction. I replaced it with a tree approach. This handles tens of thousands of input sequences easily, so long as each isn't too long for memory to bear in total.
First of all, I think it should be clarified that, although Clojure does not use extra stack space when forcing a LazySeq thunk returns another such thunk, it does so when forcing a thunk entails forcing another one, which is the case in this answer. Of course, after the edit, the use of stack space is so little that it is unproblematic. It should also be stressed that the efficiency of this solution after the edit is not so much a result of the lower level implementation (explicit recursion etc.) but of the fact that the algorithm implemented has lower asymptotic time complexity.
1

Both Rulle's solution and leetwinski's second one use iterate in quite an anamorphic manner (especially the latter). Let's define unfold using iterate (usually the opposite is done) and write an explicitly anamorphic solution:

(defn unfold [f s]
  (->> s
       (list nil)
       (iterate (comp f second))
       rest
       (take-while some?)
       (map first)))

(defn merge-sorted [s]
  (->> s
       (filter seq)
       (unfold
         (fn [s]
           (if (seq s)
             (loop [[[mf & mn :as m] & s] s, r ()]
               (if-let [[[xf :as x] & s] s]
                 (let [[m x] (if (< xf mf) [x m] [m x])]
                   (recur (cons m s) (cons x r)))
                 (list mf (if mn (cons mn r) r)))))))))
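Called on the question's data, this gives the same result as the other answers, and, being built on unfold/iterate, only realizes as much of the result as is taken; a small usage sketch:

(take 10 (merge-sorted [[1 3 4 6 9 10 15]
                        [2 3 6 7 8 9 10]
                        [1 3 6 7 8 9 10]
                        [1 2 3 4 8 9 10]]))
;; => (1 1 1 2 2 3 3 3 3 4)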

UPDATE

Here is a version of merge-sorted that uses reduce instead of loop and recur:

(defn merge-sorted [s]
  (->> s
       (filter seq)
       (unfold
         (fn [s]
           (if (seq s)
             (let [[[mf & mn] r]
                   (reduce
                     (fn [[m r] x]
                       (if (< (first x) (first m))
                         [x (cons m r)]
                         [m (cons x r)]))
                     [(first s) ()]
                     (rest s))]
               (list mf (if mn (cons mn r) r))))))))

UPDATE'

Impressed by the efficiency of A. Webb's solution after the edit and considering this problem interesting and relatively important, I had a look at the Wikipedia articles on merge and k-way merge algorithms and at this paper. I found out that there is plenty of room for analysis/experimentation/improvement and decided to (re)implement and test several algorithms. Here they are, packed in a map, preceded by some helper functions and followed by some functions useful for testing:

(require ['clojure.core.reducers :as 'reducers])

(defn mapmap [f m]
  (reduce #(update %1 %2 f) m (keys m)))

(defn unfold [f s]
  (->> s
       (list nil)
       (iterate (comp f second))
       rest
       (take-while some?)
       (map first)))

(defn foldh [f s]
  ((fn rec [v]
     (f (if (> (count v) 2)
          (let [h (quot (count v) 2)]
            (map rec [(subvec v 0 h) (subvec v h)]))
          v)))
   (vec s)))

(defn fold2 [f s]
  (loop [s s]
    (if (nnext s)
      (recur (map f (partition-all 2 s)))
      (f s))))

(def merge-sorted
  (merge
    ;direct lazy algorithms
    (mapmap
      (fn [[prepare choose insert]]
        (fn [s]
          (->> s
               (filter seq)
               prepare
               (unfold
                 (fn [s]
                   (if (seq s)
                     (let [[[xf & xn] s] (choose s)]
                       [xf (if xn (insert s xn) s)])))))))
      {:min
       [identity
        (fn [s]
          (reduce
            (fn [[x s] y]
              (if (< (first x) (first y))
                [x (cons y s)]
                [y (cons x s)]))
            [(first s) ()]
            (rest s)))
        conj]
       :sort
       [(partial sort-by first)
        (juxt first rest)
        (fn [s [xf :as x]]
          (let [[a b] (loop [a () b (seq s)]
                        (if-let [[bf & bn] b]
                          (if (< (first bf) xf)
                            (recur (cons bf a) bn)
                            [a b])
                          [a b]))]
            (into (cons x b) a)))]
       :lsort
       [(partial sort-by first)
        (juxt first rest)
        (fn [s [xf :as x]]
          ((fn rec [s]
             (lazy-seq
               (if-let [[sf] (seq s)]
                 (if (< (first sf) xf)
                   (cons sf (rec (rest s)))
                   (cons x s))
                 (list x))))
           s))]
       :heap
       [(fn [s]
          (let [h (java.util.PriorityQueue.
                    (count s)
                    #(< (first %1) (first %2)))]
            (run! #(.add h %) s)
            h))
        (fn [h] [(.poll h) h])
        (fn [h x] (.add h x) h)]})
    ;folding lazy algorithms
    (mapmap
      (letfn [(merge2 [s]
                (lazy-seq
                  (if-let [[x & s] (seq (filter seq s))]
                    (if-let [[y] s]
                      ((fn rec [x y]
                         (lazy-seq
                           (let [[[xf & xn] y]
                                 (if (< (first x) (first y))
                                   [x y]
                                   [y x])]
                             (cons xf (if xn (rec xn y) y)))))
                       x y)
                      x))))]
        (fn [fold] (partial fold merge2)))
      {:foldl #(reduce (comp %1 list) %2)
       :foldh foldh
       :fold2 fold2})
    ;folding eager algorithms
    (mapmap
      (letfn [(merge2 [s]
                (if-let [[x & s] (seq (filter seq s))]
                  (if-let [[y] s]
                    (loop [x x y y acc ()]
                      (let [[[xf & xn] y]
                            (if (< (first x) (first y))
                              [x y]
                              [y x])
                            acc (conj acc xf)]
                        (if xn
                          (recur xn y acc)
                          (into y acc))))
                    x)
                  ()))]
        (fn [fold] (partial fold merge2)))
      {:efoldp #(reducers/fold 2 (comp %1 list) (comp %1 list) (vec %2))
       :efoldh foldh
       :efold2 fold2})))

(defn gen-inp [m n]
  (->> 0
       (repeat m)
       (map
         (comp
           doall
           (partial take n)
           rest
           (partial iterate #(+ % (rand-int 100)))))
       doall))

(defn test-merge-sorted [m n & algs]
   (->> (or algs (sort (keys merge-sorted)))
        (map (juxt name merge-sorted))
        (run!
          (let [inp (gen-inp m n)]
            (fn [[id alg]]
              (println id)
              ;(java.lang.System/gc)
              (try
                (time (doall (alg inp)))
                (catch java.lang.StackOverflowError _
                  (prn "Stack overflow"))))))))

The direct lazy algorithms follow a common scheme parameterized by how the following are done:

  • preprocessing the input
  • choosing one sequence and computing the remaining ones
  • inserting the tail of the chosen sequence back into the remaining ones

:min is like my first solution, which calculates a minimum at each iteration.

:sort is like Rulle's solution, which sorts the sequences initially and makes a sorted-insertion at each iteration.

:lsort is like :sort but with lazy insertion. It can cause stack overflow because of nested lazy sequences.

:heap is a simple but suboptimal implementation of heap-merge using Java's PriorityQueue.

The folding lazy algorithms follow a common scheme parameterized by how a <=2-ary merge is extended to arbitrary arities.

:foldl is like A. Webb's solution before the edit, which does a left folding using reduce. It can cause stack overflow because of nested lazy sequences.

:foldh is an implementation of divide-and-conquer-merge, which folds by splitting in half.

:fold2 is like A. Webb's solution after the edit, which folds by splitting in pairs.

The folding eager algorithms follow a scheme like that of the lazy ones but using an eager <=2-ary merge.

:efoldp is a parallel implementation of divide-and-conquer-merge using clojure.core.reducers/fold, which does various <=2-ary merges concurrently and possibly in parallel, by "forking" every time it splits in half.

:efoldh and :efold2 are like :foldh and :fold2 but eager.

As a short remark on performance I would say that, for fast lazy merging, one of :foldh, :fold2 or maybe :heap should be chosen. As for eager merging, depending on the hardware's parallelism ability and on the input's shape, :efoldp can be faster than the rest. For more information see the linked articles.
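For example, an algorithm can be picked out of the map and run directly, and test-merge-sorted can time a chosen subset (illustrative calls; timings will depend on your machine):

;; merge the question's data with the pairwise-folding lazy algorithm
((merge-sorted :fold2) [[1 3 4 6 9 10 15]
                        [2 3 6 7 8 9 10]
                        [1 3 6 7 8 9 10]
                        [1 2 3 4 8 9 10]])
;; => (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

;; time a few of the algorithms on 100 random sorted inputs of length 1000
(test-merge-sorted 100 1000 :fold2 :foldh :heap)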

Comments
