How to detect duplicate words from a String in Java?

Question

What are the ways to detect duplicate words in a String?

For example, "this is a test message for duplicate test" contains one duplicate word: test.

The objective is to detect every duplicate word that occurs in a String.

A solution based on a regular expression is preferred.

Stephen C · Accepted Answer · 2012-09-19 15:03:52Z

8

The best you can do with regexes is O(N^2) search complexity. You can easily achieve O(N) time and space search complexity by splitting the input into words and using a HashSet to detect duplicates.
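A minimal sketch of the split-and-HashSet approach described above. This is an illustration, not the answerer's own code: the class name DuplicateWordFinder and the method findDuplicates are hypothetical, and it assumes that a "word" is a run of \w characters and that words should be compared case-insensitively. The split still uses a regular expression, but a simple one without back-references.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; the name is chosen for illustration only.
class DuplicateWordFinder {

    // Returns every word that occurs more than once, lower-cased,
    // in the order in which each word is first seen a second time.
    static Set<String> findDuplicates(String text) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        // Assumption: anything that is not a \w character separates words.
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            String word = token.toLowerCase(Locale.ROOT);
            // Set.add() returns false when the element was already present.
            if (!seen.add(word)) {
                duplicates.add(word);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // prints [test]
        System.out.println(findDuplicates("this is a test message for duplicate test"));
    }
}
```

Each word is hashed once, so the work grows roughly linearly with the input, at the cost of keeping one set entry per distinct word.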

answered Sep 19, 2012 at 15:03

Stephen C


6 Comments

gtgaxiola Over a year ago

Then the tradeoff again is time vs space since you need a backing data structure for detection

Stephen C Over a year ago

Yes, but as I said the space overhead is O(N); i.e. proportional to the size of the input.

Debadyuti Maiti Over a year ago

@StephenC But can you provide any link which shows O(N^2) time complexity? Because this link claims it as O(N). stackoverflow.com/questions/5892115/…

Stephen C Over a year ago

That Answer is referring to real regular expressions (in the theoretical sense). A real regex does not allow back-references. And if you don't believe me, I suggest that you do some experiments to see how the performance of your regex scales for larger and larger input strings.
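One way such an experiment could be set up is sketched below. It is not from the thread: the class RegexScalingExperiment and its methods are hypothetical names, the pattern is the one from the regex-based answer on this page, and the input is built from distinct words so that the matcher has to try every starting word without ever succeeding. Timings depend on the machine and JVM, so none are quoted here.

```java
import java.util.regex.Pattern;

// Hypothetical experiment class; names and input sizes are illustrative assumptions.
class RegexScalingExperiment {

    // Back-reference pattern taken from the regex-based answer in this thread.
    static final Pattern DUPLICATE =
            Pattern.compile("(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b");

    // Builds "w0 w1 w2 ..." with n distinct words, so the pattern never matches.
    static String distinctWords(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('w').append(i).append(' ');
        }
        return sb.toString();
    }

    // Times a single find() call over the whole input.
    static long timeNanos(String input) {
        long start = System.nanoTime();
        DUPLICATE.matcher(input).find();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Double the input size each round and compare how the time grows.
        for (int n = 1000; n <= 4000; n *= 2) {
            String input = distinctWords(n);
            System.out.printf("%d words: %d ms%n", n, timeNanos(input) / 1_000_000);
        }
    }
}
```

If the time roughly quadruples each time the input doubles, that is consistent with quadratic behaviour; a single un-warmed run is only a rough indication.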

Debadyuti Maiti Over a year ago

@StephenC Can you provide a code example [i.e. one using HashSet]? I think I would have to use a regular expression for "splitting the input into words" anyway. Each word would also have to be converted to lower case or upper case; otherwise I don't think HashSet will treat duplicate Strings with mixed case as equal. So, for large input, the number of String objects created [just for comparison] will be very high, and lower-casing and splitting the input into words should add some performance overhead.


Debadyuti Maiti · Accepted Answer · 2012-09-20 08:23:26Z

3

The following Java code detects duplicate words in a String. It also works when the duplicate words are separated by newlines or punctuation symbols.

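    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // (?i) ignores case; group 1 captures a word and \1 requires the same word to appear again later
    String duplicatePattern = "(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b";
    Pattern p = Pattern.compile(duplicatePattern);
    String phrase = "this is#$;%@;<>?|\\` p is a is Test\n of duplicate test";
    Matcher m = p.matcher(phrase);
    while (m.find()) {
        // group() is the whole matched segment; group(1) is the repeated word
        System.out.println("Matching segment is \"" + m.group() + "\"");
        System.out.println("Duplicate word: " + m.group(1) + "\n");
    }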

The output of the code will be:

Matching segment is "is#$;%@;<>?|\` p is a is"
Duplicate word: is

Matching segment is "Test
 of duplicate test"
Duplicate word: Test

Here, m.group(1) returns the text captured by the first group of the pattern [here, it's (\\w+)].
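Because the pattern above consumes everything between the first and last occurrence of a word, a duplicate that lies inside an already matched segment can be skipped; in "a b a b", for instance, only a is reported. A possible variation, sketched here as an assumption and not as part of the original answer, moves the back-reference into a lookahead so that each word is tested without consuming the text after it. The class name LookaheadDuplicates and the method findDuplicates are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; a sketch of a lookahead-based variation on the pattern above.
class LookaheadDuplicates {

    // (?=...) only checks that group 1 occurs again later; it does not consume that text.
    static final Pattern REPEATED_LATER =
            Pattern.compile("(?i)\\b(\\w+)\\b(?=[\\w\\W]*\\b\\1\\b)");

    // Collects each word that has at least one later occurrence, lower-cased.
    static Set<String> findDuplicates(String text) {
        Set<String> duplicates = new LinkedHashSet<>();
        Matcher m = REPEATED_LATER.matcher(text);
        while (m.find()) {
            duplicates.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // prints [a, b]
        System.out.println(findDuplicates("a b a b"));
    }
}
```

This keeps the solution regex-based, but every word still triggers a scan of the rest of the input, so the scaling concern raised in the other answer applies here as well.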

edited Sep 20, 2012 at 8:23

answered Sep 19, 2012 at 14:55

Debadyuti Maiti


2 Comments

Debadyuti Maiti Over a year ago

@BrianAgnew If there's any issue with the code for some edge test cases, please inform me.

Brian Agnew Over a year ago

@DebadyutiMaiti - I'm not worried about edge cases so much as how it performs with increasing amounts of text (see Stephen C's answer above)
