14

Given a String containing a comma-delimited list of proper noun & category/description pairs, what are the pros & cons of using String.split() versus a Pattern & Matcher approach to find a particular proper noun and extract its associated category/description?

The haystack String format will not change. It will always contain comma-delimited data in the form PROPER_NOUN|CATEGORY/DESCRIPTION.

Common variables for both approaches:

String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;

Using String.split():

for (String current : haystack.split(","))
    if (current.contains(needle))
    {
        result=current.split("\\|")[1];
        break; // *edit* Not part of original code - added in response to comment from Pshemo
    }

Using Pattern & Matcher:

// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("(" + needle + "\\|)(\\w+/\\w+)");
Matcher matches = pattern.matcher(haystack);

if (matches.find())
    result=matches.group(2);

Both approaches provide the information I require.

I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project so this approach will require imports from java.util.regex

And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.

Thank you for your time!

Conclusion

I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.

For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:

.split()/.contains()/.split()

304,212,973

Pattern/Matcher w/ Pattern.compile() invoked for each iteration

230,511,000

Pattern/Matcher w/Pattern.compile() invoked prior to iteration

111,545,646
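
The harness behind these numbers is not shown, so the following is only a rough sketch of how the three variants could be timed with System.nanoTime(). The class name RoughTiming and its method names are made up for illustration, and the needle is quoted with Pattern.quote() as suggested in the comments. A loop like this ignores JIT warm-up and other effects, so its numbers are indicative at best; a dedicated benchmarking tool would be more trustworthy.

```java
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoughTiming {
    static final String HAYSTACK =
        "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
    static final String NEEDLE = "PLUTO";
    static final String REGEX = "(" + Pattern.quote(NEEDLE) + "\\|)(\\w+/\\w+)";
    static final Pattern PRECOMPILED = Pattern.compile(REGEX);

    // Variant 1: split on commas, test with contains(), split again on the pipe.
    static String viaSplit() {
        for (String current : HAYSTACK.split(",")) {
            if (current.contains(NEEDLE)) {
                return current.split("\\|")[1];
            }
        }
        return null;
    }

    // Variant 2: compile the pattern on every call.
    static String viaCompileEachTime() {
        Matcher m = Pattern.compile(REGEX).matcher(HAYSTACK);
        return m.find() ? m.group(2) : null;
    }

    // Variant 3: reuse a pattern compiled once up front.
    static String viaPrecompiled() {
        Matcher m = PRECOMPILED.matcher(HAYSTACK);
        return m.find() ? m.group(2) : null;
    }

    // Returns the elapsed nanoseconds for the given number of calls.
    static long time(Supplier<String> variant, int iterations) {
        int sink = 0; // consume each result so the calls cannot be optimized away
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += variant.get().length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink); // keeps sink observable
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("split/contains/split: " + time(RoughTiming::viaSplit, n));
        System.out.println("compile each time:    " + time(RoughTiming::viaCompileEachTime, n));
        System.out.println("precompiled:          " + time(RoughTiming::viaPrecompiled, n));
    }
}
```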

5
  • 3
    Just a small comment: if you're constructing a pattern manually from user input, always use Pattern.quote() to escape the string. Commented Jul 17, 2014 at 21:23
  • 1
    The only advantage of the Pattern/Matcher solution is that it stops scanning your input as soon as it finds needle|\w+/\w+, whereas split(",") iterates over the entire input, and then your loop iterates again until it finds a string that contains the needle. I am not sure contains() is the right method here, unless you are sure the searched noun will never appear as part of a category/description pair. Commented Jul 17, 2014 at 21:31
  • @biziclop: Thanks for the .quote() tip, I wasn't aware of that method. Commented Jul 17, 2014 at 22:21
  • @Pshemo: Thank you VERY much for the contains() callout. There shouldn't be duplicate nouns, but I work with other humans and we trend away from infallibility. If I go with the .split() route I'll include a 'break;' Commented Jul 17, 2014 at 22:23
  • Note that your two implementations behave differently on some input strings. For example, if the needle appears on the right side of the |, or if the right side does not contain a /, the String.split()-based implementation will accept it but the Pattern implementation will not. Commented Mar 18, 2016 at 1:23
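
The points raised in these comments can be sketched as follows. This is an illustrative variant, not the code from the question: the class name NeedleLookup and its method names are hypothetical. The regex version quotes the needle with Pattern.quote() and ties the match to the start of an entry, while the split version compares the whole left-hand side with equals() instead of contains().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NeedleLookup {
    // Regex variant: Pattern.quote() makes any metacharacters in the needle literal,
    // and (?:^|,) only lets the needle match at the beginning of an entry.
    static String findWithRegex(String haystack, String needle) {
        Pattern p = Pattern.compile("(?:^|,)" + Pattern.quote(needle) + "\\|(\\w+/\\w+)");
        Matcher m = p.matcher(haystack);
        return m.find() ? m.group(1) : null;
    }

    // Split variant: match the whole proper noun rather than any substring of the entry.
    static String findWithSplit(String haystack, String needle) {
        for (String entry : haystack.split(",")) {
            String[] parts = entry.split("\\|", 2);
            if (parts.length == 2 && parts[0].equals(needle)) {
                return parts[1]; // first match wins
            }
        }
        return null;
    }
}
```

With these changes, a needle such as PLANET no longer matches entries where it only appears in the category. The two variants still differ on malformed entries: the split version returns whatever follows the |, while the regex version requires the \w+/\w+ shape.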

3 Answers

16

In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.

Most String methods that take regular expressions (such as matches(), split(), replaceAll(), etc.) use Pattern/Matcher internally. Each call therefore typically compiles the regex and creates a new Matcher object, which is inefficient inside a large loop.

Thus if you really want speed, you can use Pattern/Matcher directly, compile the Pattern only once and, ideally, create only a single Matcher object.
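
A minimal sketch of that idea, assuming the entry format from the question: the class name EntryScanner is made up for illustration, and the pattern is a generic one that matches any entry (so it can be compiled once, independent of the needle) rather than the needle-specific pattern from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntryScanner {
    // Compiled once; Pattern instances are immutable and can be shared.
    private static final Pattern ENTRY = Pattern.compile("(\\w+)\\|(\\w+/\\w+)");

    // One Matcher per scanner; reset(...) rebinds it to new input instead of allocating.
    // Matcher is not thread-safe, so an EntryScanner should stay on a single thread.
    private final Matcher matcher = ENTRY.matcher("");

    String find(String haystack, String needle) {
        matcher.reset(haystack);
        while (matcher.find()) {
            if (matcher.group(1).equals(needle)) {
                return matcher.group(2); // stop at the first matching entry
            }
        }
        return null;
    }
}
```

Whether reusing the Matcher pays off over simply calling ENTRY.matcher(haystack) each time depends on the workload; compiling the Pattern once is usually the larger saving.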

1 Comment

Thanks Xinzz! This answer captures what I was looking for, and does it succinctly. +1 @Pshemo, though, for the potential pitfalls of the way in which I was implementing contains().
1

There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.

You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.

When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.
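
If the same haystack is searched more than once, one split-based option is to parse it into a map once and look nouns up by key. This is not from the question or the answers; it is a sketch under the assumption that the format stays PROPER_NOUN|CATEGORY/DESCRIPTION, and the class name HaystackIndex is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HaystackIndex {
    // Parses "NOUN|CATEGORY/DESCRIPTION,..." into a noun -> category/description map.
    static Map<String, String> parse(String haystack) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String entry : haystack.split(",")) {
            String[] parts = entry.split("\\|", 2);
            if (parts.length == 2) {
                index.putIfAbsent(parts[0], parts[1]); // keep the first occurrence of a noun
            }
        }
        return index;
    }
}
```

A lookup is then HaystackIndex.parse(haystack).get(needle), which returns null when the noun is absent and matches the whole noun rather than a substring.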

Comments

1

I would say that the split() version is much better here due to the following reasons:

  • The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
  • Regular expressions are more complex, and therefore the code becomes more error-prone.

3 Comments

I fully agree that .split()/.contains()/.split() is the more legible of the two options, and that some people are regexaphobes & see how regex usage can cause problems. I am a little uncertain on your third point, though. After reading @Xinzz's answer I ran some very rudimentary benchmarks and Pattern/Matcher was faster, even with instantiating multiple Matcher objects. In terms of cost do you mean CPU usage?
@Idus: You are probably right. I have been thinking about this a bit more, and the third point should be removed. It is the least significant to consider anyway.
The parameter to String.split() is also a regex.
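
A small sketch of why that matters for the | delimiter used in the question (the class name SplitIsRegex is made up for illustration). Because | is the regex alternation operator, it has to be escaped or quoted to be treated as a literal separator.

```java
import java.util.regex.Pattern;

class SplitIsRegex {
    // "\\|" is the regex \| : a literal pipe.
    static String[] splitEscaped(String entry) {
        return entry.split("\\|");
    }

    // Pattern.quote() produces a regex that matches its argument literally.
    static String[] splitQuoted(String entry) {
        return entry.split(Pattern.quote("|"));
    }

    // Unescaped, "|" is an alternation of two empty patterns, so it matches
    // between characters and does not split on the pipe as intended.
    static String[] splitUnescaped(String entry) {
        return entry.split("|");
    }
}
```

For an entry such as PLUTO|DWARF_PLANET/FARAWAY, the first two methods return the noun and the category/description as two elements, while the unescaped version does not.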
