0

What are the ways by which duplicate word in a String can be detected?

e.g. "this is a test message for duplicate test" contains one duplicate word test.

Here, the objective is to detect all duplicate words which occur in a String.

Use of regular expression is preferable to achieve the goal.

0

2 Answers 2

8

The best you can do with regexes is O(N^2) search complexity. You can easily achieve O(N) time and space search complexity by splitting the input into words and using a HashSet to detect duplicates.

Sign up to request clarification or add additional context in comments.

6 Comments

Then the tradeoff again is time vs space since you need a backing data structure for detection
Yes, but as I said the space overhead is O(N); i.e. proportional to the size of the input.
@StephenC But can you provide any link which shows O(N^2) time complexity? Because this link claims it as O(N). stackoverflow.com/questions/5892115/…
That Answer is referring to real regular expressions (in the theoretical sense). A real regex does not allow back-references. And if you don't believe me, I suggest that you do some experiments to see how the performance of your regex scales for larger and larger input strings.
@StephenC Can you provide code example [i.e. dealing with HashSet]? Because I think for "splitting the input into words", I have to use regular expression. Again each word has to be changed to lowerCase or upperCase, otherwise I don't think HashSet will be able to distinguish between duplicate Strings with mixed cases. So, for large input,the String objects [just for comparing] created will be very high, & for changing lower case,splitting the input to words altogether should have some performance overhead.
|
3

The following Java code resolves the problem of detecting duplicates from a String. There should not be any problem if the duplicate word is separated by newline or punctuation symbols.

    String duplicatePattern = "(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b";
    Pattern p = Pattern.compile(duplicatePattern);
    String phrase = "this is#$;%@;<>?|\\` p is a is Test\n of duplicate test";
    Matcher m = p.matcher(phrase);
    String val = null;
    while (m.find()) {
        val = m.group();
        System.out.println("Matching segment is \"" + val + "\"");
        System.out.println("Duplicate word: " + m.group(1)+ "\n");
    }

The output of the code will be:

Matching segment is "is#$;%@;<>?|\` p is a is"
Duplicate word: is

Matching segment is "Test
 of duplicate test"
Duplicate word: Test

Here, m.group(1) statement represents the String matched against 1st group of Pattern [here, it's (\\w+)].

2 Comments

@BrianAgnew If there's any issue with the code for some edge test cases, please inform me.
@DebadyutiMaiti - I'm not worried about edge cases so much as how it performs with increasing amounts of text (see Stephen C's answer above)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.