Find the smallest unique substring for each string in an array

Question

(I'm writing this in the context of JavaScript, but will accept an algorithmically correct answer in any language)

How do you find the shortest substring of each element in an array of strings where the substring is NOT contained within any of the other elements, ignoring case?

Suppose I have an input array such as:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];

The output should be something like:

var uniqueNames = ["ne", "h", "ua", "ka", "i", "r"];

For my purposes, you can safely assume that no element will be wholly contained within another element.

My Thoughts:
It seems that one could probably brute force this, along the lines of:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], nameInd, windowSize, substrInd, substr, otherNameInd, foundMatch;
// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            foundMatch = false;
            // For each other name
            for (otherNameInd = 0; otherNameInd < names.length; otherNameInd++)
            {
                if (nameInd != otherNameInd && names[otherNameInd].toLowerCase().indexOf(substr) > -1)
                {
                    foundMatch = true;
                    break;
                }
            }

            if (!foundMatch)
            {
                // This substr works!
                uniqueNames[nameInd] = substr;
                break windowLoop;
            }
        }
    }
}

But I have to imagine there's a more elegant solution using tries/prefix trees, suffix arrays, or something interesting like that.

Edit: I believe this is the form the selected answer would take programmatically in JavaScript:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], permutations = {}, permutation, nameInd, windowSize, substrInd, substr;

// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            permutations[substr] = (typeof permutations[substr] === "undefined")?nameInd:-1;
        }
    }
}

for (substr in permutations)
{
    permutation = permutations[substr];
    if (permutation !== -1 && ((typeof uniqueNames[permutation] === "string" && substr.length < uniqueNames[permutation].length) || typeof uniqueNames[permutation] === "undefined"))
    {
        uniqueNames[permutation] = substr;
    }
}

Is your sample output incorrect? I don't see s and y in there whereas is see i, h and r... — Icarus
– Icarus, Commented Jun 28, 2012 at 15:05
@Icarus Ah, good point. s and y are not present only because I'm not looking for all smallest substrings that fit the criteria, rather any one is good enough. I would accept an answer that returned a two dimensional array of all of them, but I don't really need that level of detail. An equally valid output could be var uniqueNames = ["ne", "y", "ua", "ka", "i", "s"]; — Patrick
– Patrick, Commented Jun 28, 2012 at 18:00
Is it possible to limit your input alphabet to 26 characters (or something like this, just limit it)? — Saeed Amiri
– Saeed Amiri, Commented Jun 29, 2012 at 8:13
@SaeedAmiri I'm not really sure exactly what route you're headed down, but my actual input contains only characters from [0-9a-zA-Z_-'&,\.\s] in the input and you can limit output to only contain alphanumeric characters, though I will probably choose an answer with fewer limitations over one with more, you know? — Patrick
– Patrick, Commented Jun 29, 2012 at 15:44
@Patrick There is an O(M) solution using suffix arrays; where M is the sum of the lengths of all the strings. — bloops
– bloops, Commented Jul 16, 2012 at 19:50

Mukesh · Accepted Answer · 2013-08-28 05:13:04Z

5

This problem can be solved in O(N*L*L*L) complexity. The approach will be using suffix tries. Each node of the trie will be also storing the prefix count which will refer to the number of times the substring formed while traversing to that node from the root have appeared in all of the suffixes inserted till now.

We will be constructing N+1 tries. The first trie will be global and we will be inserting all the suffixes of all N string into it. The next N tries will be local for each of the N strings containing corresponding suffixes.

This preprocessing step of constructing tries will be done in O(N*L*L).

Now once the tries have been constructed, for each string, we can start quering the number of times a substring ( starting from minimum length) has occured in the global trie and the trie corresponding to that string. If it is same in both then it implies that it is not included in any other strings except itself. This can be achieved in O(N*L*L*L). The complexity can be explained as N for each string, L*L for considering each substring and L for performing query in the trie.

answered Aug 28, 2013 at 5:13

Mukesh

3035 silver badges11 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Mike Over a year ago

I think you meant "substring" by "suffix".

KWillets · Accepted Answer · 2015-05-28 00:16:11Z

If you build a generalized suffix tree, you just need to find the shallowest point at which an infix of each string branches off from the infixes of the other strings, and take the label to that branching point plus one "distinguishing" character. The kicker is that there has to be such an extra character (it could be branching off only at the metacharacter stuck on the end of each string), and the branching-off point might not lead to a leaf, it might lead to a subtree with leaves all from the same string (so internal nodes have to be considered).

For each string S find the shallowest (by parent label depth) node N that only contains leaves from S, and whose edge label contains at least one character. The path label from root to the parent of N, plus one character from the edge label leading to N, is the shortest infix of S not found in other strings.

I believe the labeling of nodes that only contain leaves from one string can be done during construction or by O(N) scan of the GST; then it's a simple matter to scan the final tree and keep a running minimum for each string. So it's all O(N).

(edit -- I can't reply to comments yet)

To clarify, every suffix in a suffix tree has a node where it branches off from the other suffixes; the goal here is to find the/a suffix for each string that branches off from suffixes of all the other strings at the minimum depth, as measured by the path label to that node. All we need is one extra character after that point to have a substring that doesn't appear in any other string.

Example:

Strings: abbc, abc

Using Ukonnen's algorithm, after the first string we have a suffix tree of just the suffixes from that string; I'll label them with [1] here:

abbc[1]
b
 bc[1]
 c[1]
c[1]

Next we insert string 2's suffixes:

ab
  bc[1]
  c[2]
b
 bc[1]
 c
  [1]
  [2]
c
 [1]
 [2]

Now we want to find the shortest string that leads to a branch with only [1]'s under it; we can do that by scanning all the [1]'s and looking at their immediate parents, which I will list here by path label, plus one character (which I will use below):

abbc:  abb
bbc: bb
bc: bc[1]
c: c[1]

Note that I've included [1] since it's the metacharacter that distinguishes otherwise identical suffixes of [1] and [2]. This is handy when identifying substrings that repeat in multiple strings, but it's not useful for our problem, since if we delete the [1] we end up with a string that occurs in [2] also, ie it's not a candidate.

Now, none of the labels on the right occurs in any other string, so we choose the shortest one not including a metacharacter, which is bb.

Similarly, the second string has these candidates:

abc: abc
bc: bc[2]
c: c[2]

Only one doesn't have a metacharacter at the end, so we have to go with abc.

My final point is that this per-string minimum-finding doesn't have to happen one-at-a-time; the GST can be scanned once to label nodes as either containing leaves from one string ([1],[2],..[n]) or "mixed", and then the per-string minimum non-shared strings (I would call these "distinguishing infixes") can be calculated in one pass as well.

That sounds like the interesting approach I imagined might exist, but I'm still not quite visualizing what this would look like. Could I trouble you to add something like pseudo code or algorithm steps. If I can wrap my head around getting this into O(N), I'll definitely move my selection to this answer.
This is an alternate explanation of the same algorithm: reddit.com/r/algorithms/comments/372egn/…

hamstergene · Accepted Answer · 2012-07-01 08:52:55Z

2

Say N is number of strings and L is maximum length of string. You're doing up to N*L*L*N iterations.

I can only improve it a bit by trading one iteration for extra memory. For each possible substring length (L iterations),

enumerate all substrings of that length in each name (N*L), and store it among with name's index into a hashtable (1). If there is already an index for this substring, you know it won't work, then you replace index with some special value, like -1.
walk the hashtable, picking up substrings for which index is not -1 — that are the answers for their corresponding indexes, but only use them if that names don't already have a shorter answer from a previous iteration

The memory usage can be greatly reduced by storing reference back into existing string instead of copying substrings.

answered Jul 1, 2012 at 8:52

hamstergene

24.5k5 gold badges60 silver badges73 bronze badges

2 Comments

Patrick Over a year ago

Since it seems no one is really suggesting an entirely different algorithm than the initially provided brute force, I'm going to accept this answer as the more clearly defined improvement suggestion.

Patrick Over a year ago

I would slightly disagree with your big O estimation, though. Since indexOf is an iterative operation over L, I believe the original brute force would be more like O(N*L*L*N*L). So, removing the last N*L and instead iterating over a hash table of all possible permutations of all elements of the original array seems only marginally better. With a canary array, the iterated array could be smaller, though.

Rohit Katta · Accepted Answer · 2021-05-23 16:57:39Z

   for(String s : strArr) { //O(n)
      //Assume the given string as shortest and override with shortest
       result.put(s, s);   
       for(int i = 0; i < s.length(); i++) { // O(m)              
          for (int j = i + 1; j <=s.length(); j++) {
               String subStr = s.substring(i, j);
               boolean isValid = true;
               for(String str2: strArr) { // O(n)
                   if(str2.equals(s)) // Same string cannot be a substring
                     continue;
                     
                    if(str2.contains(subStr)) {
                        isValid = false;
                        break;
                    }
               }

               if(isValid && subStr.length() < result.get(s).length()) 
                   result.put(s, subStr);
           }
        } 
   } 
    
   return result;

Collectives™ on Stack Overflow

Find the smallest unique substring for each string in an array

4 Answers 4

1 Comment

2 Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

2 Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related