8

I'm writing a simple desktop search engine in Clojure as a way to learn more about the language. Until now, the performance during the text processing phase of my program is really bad.

During the text processing I've to:

  • Clean up unwanted characters;
  • Convert the string to lowercase;
  • Split the document to get a list of words;
  • Build a map which associates each word to its occurrences in the document.

Here is the code:

(ns txt-processing.core
  (:require [clojure.java.io :as cjio])
  (:require [clojure.string :as cjstr])
  (:gen-class))

(defn all-files [path]
  (let [entries (file-seq (cjio/file path))]
    (filter (memfn isFile) entries)))

(def char-val
  (let [value #(Character/getNumericValue %)]
    {:a (value \a) :z (value \z)
     :A (value \A) :Z (value \Z)
     :0 (value \0) :9 (value \9)}))

(defn is-ascii-alpha-num [c]
  (let [n (Character/getNumericValue c)]
    (or (and (>= n (char-val :a)) (<= n (char-val :z)))
        (and (>= n (char-val :A)) (<= n (char-val :Z)))
        (and (>= n (char-val :0)) (<= n (char-val :9))))))

(defn is-valid [c]
    (or (is-ascii-alpha-num c)
        (Character/isSpaceChar c)
        (.equals (str \newline) (str c))))

(defn lower-and-replace [c]
  (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))

(defn tokenize [content]
  (let [filtered (filter is-valid content)
        lowered (map lower-and-replace filtered)]
    (cjstr/split (apply str lowered) #"\s+")))

(defn process-content [content]
  (let [words (tokenize content)]
    (loop [ws words i 0 hmap (hash-map)]
      (if (empty? ws)
        hmap
        (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))

(defn -main [& args]
  (doseq [file (all-files (first args))]
    (let [content (slurp file)
          oc-list (process-content content)]
      (println "File:" (.getPath file)
               "| Words to be indexed:" (count oc-list )))))

As I have another implementation of this problem in Haskell, I compared both as you can see in the following outputs.

Clojure version:

$ lein uberjar
Compiling txt-processing.core
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
Including txt-processing-0.1.0-SNAPSHOT.jar
Including clojure-1.5.1.jar
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642

real    2m2.164s
user    2m3.868s
sys     0m0.978s

Haskell version:

$ ghc -rtsopts --make txt-processing.hs 
[1 of 1] Compiling Main             ( txt-processing.hs, txt-processing.o )
Linking txt-processing ...
$ time ./txt-processing ../data/ +RTS -K12m
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642

real    0m9.086s
user    0m8.591s
sys     0m0.463s

I think the (string -> lazy sequence) conversion in the Clojure implementation is killing the performance. How can I improve it?

P.S: All the code and data used in these tests can be downloaded here.

6
  • 1
    what is the JVM startup time for the jar? Does Haskell have similar overhead? Commented Apr 27, 2013 at 23:40
  • the JVM startup time in my machine is around one second. I think Haskell has some overhead due the runtime system (RTS), but I should be considerably lower than the JVM. Commented Apr 28, 2013 at 0:41
  • s/I should/it should/ Commented Apr 28, 2013 at 1:18
  • 1
    Use (inc i) instead of (+ i 1) Commented Apr 28, 2013 at 1:48
  • @luisgabriel: Could you post the end version and it's results as well ? Commented Apr 29, 2013 at 12:50

2 Answers 2

4

Some things you could do that would probably speed this code up:

1) Instead of mapping your chars to char-val, just do direct value comparisons between the characters. This is faster for the same reason it would faster in Java.

2) You repeatedly use str to convert single-character values to full-fledged strings. Again, consider using the character values directly. Again, object creation is slow, same as in Java.

3) You should replace process-content with clojure.core/frequencies. Perhaps inspect frequencies source to see how it is faster.

4) If you must update a (hash-map) in a loop, use transient. See: http://clojuredocs.org/clojure_core/clojure.core/transient

Also note that (hash-map) returns a PersistentArrayMap, so you are creating new instances with each call to update-in - hence slow and why you should use transients.

5) This is your friend: (set! *warn-on-reflection* true) - You have quite a bit of reflection that could benefit from type hints

 Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
 Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
 Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
 Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
 Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
 Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
 Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
Sign up to request clarification or add additional context in comments.

2 Comments

perfect! just putting the type hints the total time decreased to 7s! ;)
I cannot use clojure.core/frequencies because I need the word positions to further phases like indexing and querying.
0

Just for comparison's sake, here's a regexp based Clojure version

(defn re-index
     "Returns lazy sequence of vectors of regexp matches and their start index" 
     [^java.util.regex.Pattern re s]
     (let [m (re-matcher re s)]
       ((fn step []
          (when (. m (find))
            (cons (vector (re-groups m)(.start m)) (lazy-seq (step))))))))

(defn group-by-keep
  "Returns a map of the elements of coll keyed by the result of
  f on each element. The value at each key will be a vector of the
  results of r on the corresponding elements."
  [f r coll]  
  (persistent!
    (reduce
      (fn [ret x]
        (let [k (f x)]
          (assoc! ret k (conj (get ret k []) (r x)))))
      (transient {}) coll)))

(defn word-indexed
  [s]
  (group-by-keep
    (comp clojure.string/lower-case first)
    second
    (re-index #"\w+" s)))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.