3

I implemented a few lazy sort & merge functions that are used heavily in my ReBAC authorization library, EACL, to lazily "merge" & deduplicate ~1M datoms emitted from Datomic's d/index-range when traversing the permission graph in parallel.

These functions behave like (sort-by pred (concat lazy-seqs)) or (dedupe (sort-by pred (concat lazy-seqs))), but with support for lazy sequences, which may be unbounded, so you can call (take 1000 ...) for cursor-pagination.

I found this great answer by peter pun which should improve the core performance of EACL, but it does not support deduplication. I have considered calling dedupe on the resulting merged sequence (haven't benchmarked this yet), but I suspect it will be faster to deduplicate while advancing the sequences.

Can someone smarter than me please help select the best algorithm from Peter's answer, and help me add deduplication support?

Here is my implementation of lazy-merge-dedupe-sort:

(defn lazy-merge-dedupe-sort
  "Lazily merges multiple _sorted_ sequences, deduplicating values.
   Takes an optional initial lowest value (default: 0) and a collection of sorted sequences.
   Returns a lazy sequence of deduplicated, sorted values.
   If a sequence is not sorted, any value than prior lowest will be discarded, but this is undefined behaviour."
  ([lazy-colls] (lazy-merge-dedupe-sort 0 lazy-colls))
  ([lowest lazy-colls]
   (lazy-seq
     (when-let [non-empty-seqs (seq (filter seq lazy-colls))]
       (let [heads         (map first non-empty-seqs)
             min-val       (apply min heads)
             ;; Advance sequences that have the minimum value at their head
             advanced-seqs (map (fn [s]
                                  (if (and (seq s) (<= (first s) min-val))
                                    (rest s)
                                    s))
                             lazy-colls)]
         (if (> min-val lowest)
           ;; Yield the value and continue with new lowest
           (cons min-val (lazy-merge-dedupe-sort min-val advanced-seqs))
           ;; Skip the value (deduplicate) and continue
           (lazy-merge-dedupe-sort lowest advanced-seqs)))))))

Example Usage of lazy-merge-dedupe-sort

(let [seq1 [1 3 5 7 9]
      seq2 [2 3 6 -1 8 9 10]   ; note the unsorted -1 will be skipped because not sorted
      seq3 [1 4 5 6 11]]
  (take 20 (lazy-merge-dedupe-sort [seq1 seq2 seq3])))
; => '(1 2 3 4 5 6 7 8 9 10 11)

I then added support for a predicate pred in lazy-merge-dedupe-sort-by:

(defn lazy-merge-dedupe-sort-by
  "Lazily merges multiple _sorted_ sequences, deduplicating based on pred values.
   Takes a predicate function, an optional initial lowest pred value, and a collection of sorted sequences.
   Returns a lazy sequence of deduplicated, sorted values (original values, not pred results).
   If a sequence is not sorted by pred, any value with pred result less than prior lowest will be discarded."
  ([pred lazy-colls] (lazy-merge-dedupe-sort-by pred 0 lazy-colls))
  ([pred lowest lazy-colls]
   (lazy-seq
     (when-let [non-empty-seqs (seq (filter seq lazy-colls))]
       (let [heads         (map first non-empty-seqs)
             pred-vals     (map pred heads)
             min-pred-val  (apply min pred-vals)
             ;; Find the first value with the minimum pred value
             min-val       (first (filter #(= (pred %) min-pred-val) heads)) ; this can be optimized.
             ;; Advance sequences that have pred values <= minimum pred value
             advanced-seqs (map (fn [s]
                                  (if (and (seq s) (<= (pred (first s)) min-pred-val))
                                    (rest s)
                                    s))
                             lazy-colls)]
         (if (> min-pred-val lowest)
           ;; Yield the value and continue with new lowest
           (cons min-val (lazy-merge-dedupe-sort-by pred min-pred-val advanced-seqs))
           ;; Skip the value (deduplicate) and continue
           (lazy-merge-dedupe-sort-by pred lowest advanced-seqs)))))))

Example Usage of lazy-merge-dedupe-sort-by

; some sequences which contain usages and an unsorted element to show behaviour:
(def seq1 [[1 :a1] [3 :c1] [5 :e1] [7 :g1] [9 :i1]])
(def seq2 [[2 :b2] [3 :c3] [6 :f1] [-1 :bad2] [8 :h2] [9 :i2] [10 :j2]]) ; note unsorted -1 will be skipped by dedupe because undefined behaviour if unsorted.
(def seq3 [[1 :a3] [4 :d3] [5 :e3] [6 :f3] [11 :k3]])

(is (= '([1 :a1] [2 :b2] [3 :c1]
         [4 :d3] [5 :e1] [6 :f1]
         [7 :g1] [8 :h2] [9 :i1] [10 :j2] [11 :k3])
      (take 20 (lazy-merge-dedupe-sort-by first [seq1 seq2 seq3]))))

I extracted Peter's :efold2 algorithm, which performs significantly better than my implementation when only merging, but it does not deduplicate:

(defn lazy-merge2
  "Lazily merges two sorted sequences using a comparison function.
   cmp should return true if first arg should come before second arg."
  [cmp x y]
  (lazy-seq
    (cond
      (nil? (seq x)) y
      (nil? (seq y)) x
      :else
      (let [[xf & xn] x
            [yf & yn] y]
        (if (cmp xf yf)
          (cons xf (lazy-merge2 cmp xn y))
          (cons yf (lazy-merge2 cmp x yn)))))))

(defn lazy-merge-all
  "Lazily merges a collection of sorted sequences.
   Returns empty seq if input is empty."
  [cmp seqs]
  (lazy-seq
    (let [non-empty (seq (filter seq seqs))]
      (when non-empty
        (if-let [[y] (next non-empty)]
          ;; Two or more sequences
          (lazy-merge2 cmp (first non-empty) y)
          ;; Only one sequence left
          (first non-empty))))))

(defn fold2
  "Repeatedly applies function f to pairs of elements until one remains.
   Uses a tournament-style folding approach."
  [f s]
  (loop [s s]
    (if (next (next s)) ; if more than 2 elements
      (recur (map f (partition-all 2 s)))
      (f s))))

(defn lazy-fold2-merge-sorted
  "Merges multiple sorted sequences using lazy fold2 algorithm.
   Sequences should already be sorted according to cmp.
   cmp is a comparison function that returns true if first arg < second arg."
  [cmp seqs]
  (fold2 (partial lazy-merge-all cmp) seqs))

(defn lazy-fold2-merge-sorted-by
  "Merges multiple sorted sequences using lazy fold2 algorithm.
   keyfn extracts the comparison key from each element.
   Sequences should already be sorted according to (keyfn element).

   Example:
   (lazy-fold2-merge-sorted-by identity [[1 3 5] [2 4 6] [0 7 8]])"
  [keyfn seqs]
  (lazy-fold2-merge-sorted #(< (keyfn %1) (keyfn %2)) seqs))

Example Usage of Peter's

;; Merge sorted sequences of numbers
(lazy-fold2-merge-sorted-by identity
  [[1 3 5 7]
   [1 2 4 6 8]
   [0 9 10]])
;; => (0 1 2 3 4 5 6 7 8 9 10)

;; Merge sorted sequences by custom key
(lazy-fold2-merge-sorted-by :priority
  [{:priority 1 :name "low"}
   {:priority 5 :name "high"}]
  [{:priority 2 :name "medium"}
   {:priority 3 :name "mid"}])

;; Works lazily - doesn't realize entire sequence at once
(take 5
  (lazy-fold2-merge-sorted-by identity
    [(range 0 1000 2)
     (range 1 1000 2)
     (range 0 1000 3)]))

How would I add deduplication to Peter's implementation to improve the performance of EACL?

4
  • 1
    I managed to get lazy fold2 merge dedupe working. Big performance improvement. Will post answer with impl. later. Also tried calling (dedupe) on the result of lazy-fold2-merge-sorted-by, but it was wayyy too slow. Commented Oct 16 at 12:04
  • Peter's lazy-merge-all does not do what it says. (lazy-merge-all < [[1 7 9] [2 2 5] [1 4 8]]) produces => (1 2 2 5 7 9). It ignores all but the first two non-empty sequences. Commented Oct 23 at 9:42
  • That is expected: lazy-merge-all is distinct from lazy-merge-dedupe. I added dedupe behaviour on my end. lazy-merge would behave like (sort-by pred (apply concat seqs)), whereas dedupe behaviour would behave as (sort-by pred (distinct (apply concat seqs))) (although distinct does not consider pred). Commented Oct 26 at 20:38
  • 1. In Peter's code, lazy-merge-all can, I think, be simplified to (defn lazy-merge-all [cmp [xs ys]] (lazy-merge2 cmp xs ys)). This gets rid of one lazy sequence per level. 2. Your first example of Peter's code is wrong. It doesn't de-dupe. The result should be(0 1 1 2 3 4 5 6 7 8 9 10). I've tried it with both the original and simplified lazy-merge-all. Commented Oct 28 at 8:44

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.