3

Suppose I have 60 words and I want to check if an input is any of these words

Which is faster:
1) making a regex and ORing the words in it
2) looping on an array and searching?

8
  • 3
    What have you tried? This is something that is very easily tested. Commented Apr 28, 2014 at 1:55
  • I am still trying it but I am thinking it may depend on the input words themselves I donno?! Commented Apr 28, 2014 at 2:01
  • You can run the same test on different sets of test words. Commented Apr 28, 2014 at 2:02
  • 2
    @Doggynub If you just want to know the fastest way, assuming the list of 60 words is constant and you are going to vary the input, it would be to put all words in a Set<String> and test .contains() for each input string. This will compute a hashCode for each word in the Set at insertion time, and compare the input string to a known set of hash codes, in constant time. Commented Apr 28, 2014 at 2:03
  • @ColinMorelli this is fastest than brute and regex? Commented Apr 28, 2014 at 2:08

1 Answer 1

3

You can easily test this yourself. Out of curiosity, I created a test case of four different scenarios:

  1. Pattern.matcher().matches() with an on-demand Pattern instance (created for each run)
  2. Pattern.matcher().matches() with a cached Pattern instance (created before all runs)
  3. String.equals() for each element in the array, executed within a loop
  4. Set.contains() on a cached Set (created before all runs)

Data Set: Input array containing 6000 randomly generated strings of 6 characters each. Each test was executed 10,000 times, the results of all runs were totaled and averaged.

The results (all times in ms - lower is obviously better). The first number is the total execution time of all 10,000 runs, the second number is the average of each run:

On-Demand Regex:    12934 (1.29 avg)
Pre-compile Regex:    458 (0.05 avg)
Loop:                  77 (0.01 avg)
Set.contains:           4 (0.00 avg)

Long story short: if you're going to use a regular expression (which you shouldn't) at least create and cache the Pattern. But assuming performance is what matters, you're not going to beat Set.contains() if you know the list of words ahead of time.

Note The On-demand regex test includes the cost of constructing the StringBuilder instance that is given to the Pattern.compile() method, so not necessarily all of the extra time is spent in regex compilation. The Set.contains test also has a slight advantage in that it's inlined, and avoids the extra stack creation of the method call. I modified the test to have that execute inside of a separate method, but it didn't materially affect the results.

Sign up to request clarification or add additional context in comments.

1 Comment

Awesome results! How is the set searching so fast? Is there anywhere I can learn more about why set searching is faster than regex?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.