3

I implemented a function that returns the n-grams of a given input collection as a lazy seq.

(defn gen-ngrams
  [n coll]
  (if (>= (count coll) n)
    (lazy-seq (cons (take n coll) (gen-ngrams n (rest coll))))))

When I call this function with larger input collections, I would expect to see a linear increase in execution time. However, the timing I observe is worse than that:

user> (time (count (gen-ngrams 3 (take 1000 corpus))))
"Elapsed time: 59.426 msecs"
998
user> (time (count (gen-ngrams 3 (take 10000 corpus))))
"Elapsed time: 5863.971 msecs"
9998
user> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 23584.226 msecs"
19998
user> (time (count (gen-ngrams 3 (take 30000 corpus))))
"Elapsed time: 54905.999 msecs"
29998
user> (time (count (gen-ngrams 3 (take 40000 corpus))))
"Elapsed time: 100978.962 msecs"
39998

corpus is a Cons of string tokens.

What causes this behavior and how can I improve the performance?

2 Answers 2

5

I think your issue is with "(count coll)", which is iterating over the coll for each call to ngrams.

The solution would be to use the build in partition function:

user=> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 6212.894932 msecs"
19998
user=> (time (count (partition 3 1 (take 20000 corpus))))
"Elapsed time: 12.57996 msecs"
19998

Have a look in the partition source if curious about the implementation http://clojuredocs.org/clojure_core/clojure.core/partition

Sign up to request clarification or add additional context in comments.

Comments

0

I am far from a Clojure expert, but I think the cons function causes this problem. Try to use list instead:

(defn gen-ngrams
  [n coll]
  (if (>= (count coll) n)
    (lazy-seq (list (take n coll) (gen-ngrams n (rest coll))))))

I think cons construct a new seq which is more generic than a list, and therefore is slower.

Edit: and if "corpus is a Cons of string tokens", then try to make it a list...

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.