12

(I'm writing this in the context of JavaScript, but will accept an algorithmically correct answer in any language)

How do you find the shortest substring of each element in an array of strings where the substring is NOT contained within any of the other elements, ignoring case?

Suppose I have an input array such as:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];

The output should be something like:

var uniqueNames = ["ne", "h", "ua", "ka", "i", "r"];

For my purposes, you can safely assume that no element will be wholly contained within another element.

My Thoughts:
It seems that one could probably brute force this, along the lines of:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], nameInd, windowSize, substrInd, substr, otherNameInd, foundMatch;
// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            foundMatch = false;
            // For each other name
            for (otherNameInd = 0; otherNameInd < names.length; otherNameInd++)
            {
                if (nameInd != otherNameInd && names[otherNameInd].toLowerCase().indexOf(substr) > -1)
                {
                    foundMatch = true;
                    break;
                }
            }

            if (!foundMatch)
            {
                // This substr works!
                uniqueNames[nameInd] = substr;
                break windowLoop;
            }
        }
    }
}

But I have to imagine there's a more elegant solution using tries/prefix trees, suffix arrays, or something interesting like that.

Edit: I believe this is the form the selected answer would take programmatically in JavaScript:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], permutations = {}, permutation, nameInd, windowSize, substrInd, substr;

// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            permutations[substr] = (typeof permutations[substr] === "undefined")?nameInd:-1;
        }
    }
}

for (substr in permutations)
{
    permutation = permutations[substr];
    if (permutation !== -1 && ((typeof uniqueNames[permutation] === "string" && substr.length < uniqueNames[permutation].length) || typeof uniqueNames[permutation] === "undefined"))
    {
        uniqueNames[permutation] = substr;
    }
}
6
  • Is your sample output incorrect? I don't see s and y in there whereas is see i, h and r... Commented Jun 28, 2012 at 15:05
  • @Icarus Ah, good point. s and y are not present only because I'm not looking for all smallest substrings that fit the criteria, rather any one is good enough. I would accept an answer that returned a two dimensional array of all of them, but I don't really need that level of detail. An equally valid output could be var uniqueNames = ["ne", "y", "ua", "ka", "i", "s"]; Commented Jun 28, 2012 at 18:00
  • Is it possible to limit your input alphabet to 26 characters (or something like this, just limit it)? Commented Jun 29, 2012 at 8:13
  • @SaeedAmiri I'm not really sure exactly what route you're headed down, but my actual input contains only characters from [0-9a-zA-Z_-'&,\.\s] in the input and you can limit output to only contain alphanumeric characters, though I will probably choose an answer with fewer limitations over one with more, you know? Commented Jun 29, 2012 at 15:44
  • @Patrick There is an O(M) solution using suffix arrays; where M is the sum of the lengths of all the strings. Commented Jul 16, 2012 at 19:50

4 Answers 4

5

This problem can be solved in O(N*L*L*L) complexity. The approach will be using suffix tries. Each node of the trie will be also storing the prefix count which will refer to the number of times the substring formed while traversing to that node from the root have appeared in all of the suffixes inserted till now.

We will be constructing N+1 tries. The first trie will be global and we will be inserting all the suffixes of all N string into it. The next N tries will be local for each of the N strings containing corresponding suffixes.

This preprocessing step of constructing tries will be done in O(N*L*L).

Now once the tries have been constructed, for each string, we can start quering the number of times a substring ( starting from minimum length) has occured in the global trie and the trie corresponding to that string. If it is same in both then it implies that it is not included in any other strings except itself. This can be achieved in O(N*L*L*L). The complexity can be explained as N for each string, L*L for considering each substring and L for performing query in the trie.

Sign up to request clarification or add additional context in comments.

1 Comment

I think you meant "substring" by "suffix".
3

If you build a generalized suffix tree, you just need to find the shallowest point at which an infix of each string branches off from the infixes of the other strings, and take the label to that branching point plus one "distinguishing" character. The kicker is that there has to be such an extra character (it could be branching off only at the metacharacter stuck on the end of each string), and the branching-off point might not lead to a leaf, it might lead to a subtree with leaves all from the same string (so internal nodes have to be considered).

For each string S find the shallowest (by parent label depth) node N that only contains leaves from S, and whose edge label contains at least one character. The path label from root to the parent of N, plus one character from the edge label leading to N, is the shortest infix of S not found in other strings.

I believe the labeling of nodes that only contain leaves from one string can be done during construction or by O(N) scan of the GST; then it's a simple matter to scan the final tree and keep a running minimum for each string. So it's all O(N).

(edit -- I can't reply to comments yet)

To clarify, every suffix in a suffix tree has a node where it branches off from the other suffixes; the goal here is to find the/a suffix for each string that branches off from suffixes of all the other strings at the minimum depth, as measured by the path label to that node. All we need is one extra character after that point to have a substring that doesn't appear in any other string.

Example:

Strings: abbc, abc

Using Ukonnen's algorithm, after the first string we have a suffix tree of just the suffixes from that string; I'll label them with [1] here:

abbc[1]
b
 bc[1]
 c[1]
c[1]

Next we insert string 2's suffixes:

ab
  bc[1]
  c[2]
b
 bc[1]
 c
  [1]
  [2]
c
 [1]
 [2]

Now we want to find the shortest string that leads to a branch with only [1]'s under it; we can do that by scanning all the [1]'s and looking at their immediate parents, which I will list here by path label, plus one character (which I will use below):

abbc:  abb
bbc: bb
bc: bc[1]
c: c[1]

Note that I've included [1] since it's the metacharacter that distinguishes otherwise identical suffixes of [1] and [2]. This is handy when identifying substrings that repeat in multiple strings, but it's not useful for our problem, since if we delete the [1] we end up with a string that occurs in [2] also, ie it's not a candidate.

Now, none of the labels on the right occurs in any other string, so we choose the shortest one not including a metacharacter, which is bb.

Similarly, the second string has these candidates:

abc: abc
bc: bc[2]
c: c[2]

Only one doesn't have a metacharacter at the end, so we have to go with abc.

My final point is that this per-string minimum-finding doesn't have to happen one-at-a-time; the GST can be scanned once to label nodes as either containing leaves from one string ([1],[2],..[n]) or "mixed", and then the per-string minimum non-shared strings (I would call these "distinguishing infixes") can be calculated in one pass as well.

2 Comments

That sounds like the interesting approach I imagined might exist, but I'm still not quite visualizing what this would look like. Could I trouble you to add something like pseudo code or algorithm steps. If I can wrap my head around getting this into O(N), I'll definitely move my selection to this answer.
This is an alternate explanation of the same algorithm: reddit.com/r/algorithms/comments/372egn/…
2

Say N is number of strings and L is maximum length of string. You're doing up to N*L*L*N iterations.

I can only improve it a bit by trading one iteration for extra memory. For each possible substring length (L iterations),

  • enumerate all substrings of that length in each name (N*L), and store it among with name's index into a hashtable (1). If there is already an index for this substring, you know it won't work, then you replace index with some special value, like -1.

  • walk the hashtable, picking up substrings for which index is not -1 — that are the answers for their corresponding indexes, but only use them if that names don't already have a shorter answer from a previous iteration

The memory usage can be greatly reduced by storing reference back into existing string instead of copying substrings.

2 Comments

Since it seems no one is really suggesting an entirely different algorithm than the initially provided brute force, I'm going to accept this answer as the more clearly defined improvement suggestion.
I would slightly disagree with your big O estimation, though. Since indexOf is an iterative operation over L, I believe the original brute force would be more like O(N*L*L*N*L). So, removing the last N*L and instead iterating over a hash table of all possible permutations of all elements of the original array seems only marginally better. With a canary array, the iterated array could be smaller, though.
-1
   for(String s : strArr) { //O(n)
      //Assume the given string as shortest and override with shortest
       result.put(s, s);   
       for(int i = 0; i < s.length(); i++) { // O(m)              
          for (int j = i + 1; j <=s.length(); j++) {
               String subStr = s.substring(i, j);
               boolean isValid = true;
               for(String str2: strArr) { // O(n)
                   if(str2.equals(s)) // Same string cannot be a substring
                     continue;
                     
                    if(str2.contains(subStr)) {
                        isValid = false;
                        break;
                    }
               }

               if(isValid && subStr.length() < result.get(s).length()) 
                   result.put(s, subStr);
           }
        } 
   } 
    
   return result;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.