Algorithm for checking if a string was built from a list of substrings

Question

You are given a string and an array of strings. How to quickly check, if this string can be built by concatenating some of the strings in the array?

This is a theoretical question, I don't need it for practical reasons. But I would like to know, if there is some good algorithm for this.

EDIT Reading some answer I have noticed, that this is probably NP-Complete problem. Even finding a subset of strings, that will together have same length, as a given string is a classic subset sum problem.

So I guess there is no easy answer to this.

EDIT

Now it seems, that it is not a NP-Complete problem after all. That's way cooler :-)

EDIT

I have came up with a solution that passes some tests:

def can_build_from_substrings(string, substrings):
    prefixes = [True] + [False] * (len(string) - 1)
    while True:
        old = list(prefixes)
        for s in substrings:
            for index, is_set in enumerate(prefixes):
                if is_set and string[index:].startswith(s):
                    if string[index:] == s:
                        return True
                    prefixes[index + len(s)] = True
        if old == prefixes: # nothing has changed in this iteration
            return False

I believe the time is O(n * m^3), where n is length of substrings and m is length of string. What do you think?

Feels like a variant of the knapsack problem.

Anders Lindahl
– Anders Lindahl

2011-05-13 19:15:11 +00:00
Commented May 13, 2011 at 19:15 — Anders Lindahl
– Anders Lindahl, Commented May 13, 2011 at 19:15
Can you use a string from the array only once?

Anders Lindahl
– Anders Lindahl

2011-05-13 19:22:08 +00:00
Commented May 13, 2011 at 19:22 — Anders Lindahl
– Anders Lindahl, Commented May 13, 2011 at 19:22

Tess Rosania · Accepted Answer · 2011-05-15 17:07:16Z

10

Note: I assume here that you can use each substring more than once. You can generalize the solution to include this restriction by changing how we define subproblems. That will have a negative impact on space as well as expected runtime, but the problem remains polynomial.

This is a dynamic programming problem. (And a great question!)

Let's define composable(S, W) to be true if the string S can be written using the list of substrings W.

S is composable if and only if:

S starts with a substring w in W.
The remainder of S after w is also composable.

Let's write some pseudocode:

COMPOSABLE(S, W):
  return TRUE if S = "" # Base case
  return memo[S] if memo[S]

  memo[S] = false

  for w in W:
    length <- LENGTH(w)
    start  <- S[1..length]
    rest   <- S[length+1..-1]
    if start = w AND COMPOSABLE(rest, W) :
      memo[S] = true # Memoize

  return memo[S]

This algorithm has O(m*n) runtime, assuming the length of the substrings is not linear w/r/t to the string itself, in which case runtime would be O(m*n^2) (where m is the size of the substring list and n is the length of the string in question). It uses O(n) space for memoization.

(N.B. as written the pseudocode uses O(n^2) space, but hashing the memoization keys would alleviate this.)

EDIT

Here is a working Ruby implementation:

def composable(str, words)
  composable_aux(str, words, {})
end

def composable_aux(str, words, memo)
  return true if str == ""                # The base case
  return memo[str] unless memo[str].nil?  # Return the answer if we already know it

  memo[str] = false              # Assume the answer is `false`

  words.each do |word|           # For each word in the list:
    length = word.length
    start  = str[0..length-1]
    rest   = str[length..-1]

    # If the test string starts with this word,
    # and the remaining part of the test string
    # is also composable, the answer is true.
    if start == word and composable_aux(rest, words, memo)
      memo[str] = true           # Mark the answer as true
    end
  end

  memo[str]                      # Return the answer
end

edited May 15, 2011 at 17:07

answered May 13, 2011 at 19:41

Tess Rosania

10.1k2 gold badges22 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

14 Comments

gruszczy Over a year ago

This sounds much better. Does it mean it is not really NP-Complete or is the runtime pseudo-polynomial here?

Tess Rosania Over a year ago

My answer assumes you can use substrings more than once, which simplifies the solution a bit. However, even if you add that restriction the problem is not NP-Complete. (Although you'll use a lot more memory!) Dynamic programming requires optimal substructure and overlapping subproblems. You have both of those properties here.

Aryabhatta Over a year ago

@grus: The difference here is that the size of input is different, so a pseudo-polynomial algorithm for the corresponding NP-Complete problem, actually is polynomial time algorithm for your problem. For instance if the input to Subset-Sum is in unary, it is in P. That is why people speak of Strongly NP-Complete, which remains NP-Complete, even when input is unary, like bin-packing.

Aryabhatta Over a year ago

@grus: For instance, if your alphabet was just one character 'a'. and the input was an array of lengths (instead of the strings themselves), then your problem would be NP-Complete.

gruszczy Over a year ago

Cool, I can see a different approach to the problem. There is however, another problem: how to solve this problem if you can't choose the same string twice? I wonder, if same trick as in 0-1 knapsack problem can be used here. Anyway, thinking about this problem taught me a lot about DP. Thanks a lot :-)

|

cnicutar · Accepted Answer · 2011-05-13 19:55:57Z

2

It's definitely not quick but you here's an idea:

Iterate over all the strings, checking if the target string "begins" with any of them
Take the longest string with which the target string begins, remove it from the list and trim it from the main string
Rinse, repeat

Stop when you're left with a 0 length target string.

As I said before, this is definitely not fast but should give you a baseline ("it shouldn't get much worse than this").

EDIT

As pointed out in the comments, this will not work. You will have to store the partial matches and fall back on them when you find there is no way further.

When you find that a string is the head of the target, push it onto a list. After building the list, you will naturally try the biggest "head" of the target
When you find that the head you tried doesn't fit with what's left, try the next best head

This way, eventually you'll explore the entire space of solutions. For every candidate head you'll try every possible tail.

edited May 13, 2011 at 19:55

answered May 13, 2011 at 19:16

cnicutar

184k26 gold badges378 silver badges398 bronze badges

13 Comments

user180247 Over a year ago

Depending on how big your main string is, you might want to spot substring matches to the array items using a state machine constructed from the strings in that array. When running, you need to take into account that a string may start anywhere, so the initial (no chars checked) state is a possibility for every position - needs the usual multiple-states-valid-at-a-time nondeterministic finite automaton style of engine, even though the state model can be deterministic (maybe a simple tree, if you don't want to minimise). Basically a regex search with a finite choice regex.

user180247 Over a year ago

Once you have a set of (probably overlapping) matched substrings, that's the time to get greedy (or dynamic).

Marino Šimić Over a year ago

this can give false negatives however

S M Kamran Over a year ago

The biggest substring approach with whome the string start will be incorrect... fore axample the string is "abcdefght" and available substrings are "ab", "abc", "abcde", "cdef", "ght"... In this case the biggest starting substring is "abcde" but the substring which makes the string actually are "ab", "cdef", "ght".... so in first pass you'll find all the substrings and then try variations with backtracking...

cnicutar Over a year ago

@S M Kamran Good catch, you are right. One pass is not enough.

|

Localghost · Accepted Answer · 2011-05-13 19:13:20Z

1

This is how I would do it.

Determine the length of the target string.
Determine the length of each string in the substring array
Determine which combination of substrings would yield a string with the same length as the target string (If any, if not you're done)
Generate all permutations of the substring combinations determined in step 3. Check if any of them match the target string.

Generating all permutations is a processor heavy task, so if you can cut down on your 'n' (input size), you'll gain some considerable efficiency.

answered May 13, 2011 at 19:13

Localghost

7203 silver badges7 bronze badges

2 Comments

Bill the Lizard Over a year ago

Steps 1 through 3 are equivalent to the subset sum problem, but this is still a considerable improvement over brute force. :)

user180247 Over a year ago

+1 for checking if a string of the right length can be formed first - may well save a lot of time. However, as you said, generating permutations is slow (because you get so many of them), and there are ways to avoid considering most of them based on substring matching.

Anders Lindahl · Accepted Answer · 2011-05-13 20:03:16Z

1

Inspired by @cnicutars answer:

function Possible(array A, string s)
- If s is empty, return true.
- compute the array P of all strings in A that are a prefix of s.
- If P is empty, return false.
- for each string p in P:
  - if Possible(A with p removed, s with prefix p removed) return true
- return false

edited May 13, 2011 at 20:03

answered May 13, 2011 at 19:41

Anders Lindahl

43.3k9 gold badges93 silver badges94 bronze badges

2 Comments

Marino Šimić Over a year ago

s with p removed -> removed where, multiple occurencies are not the same, if you remove at one place or the other it changes a lot, you can give false negatives: "1234123" : "12","34","123"

Anders Lindahl Over a year ago

p should be removed from the start of S, as it is a prefix. I have updated my answer to clarify.

lupos · Accepted Answer · 2011-05-13 19:17:01Z

two options sprint to mind but neither of them seem very elegant.

1) brute force: do it like you would a password generator i.e. word1+word1+word1 > word1+word1+word2 > word1+word1+word3 etc etc etc

the trick there is the length so youd have to try all combinations of 2 or more words and you don't know where to set the limit. Very time consuming.

2) take the string in question and run a find in on it for every word you have 1 at a time. maybe check the length and if its greater than 0 do it again. keep doing it till you hit zero it cant find any more results. if you hit 0 its a win if not its a lose. I think this method would be a lot better than the first but I imagine someone will have a better suggestion.

S M Kamran · Accepted Answer · 2011-05-13 19:20:20Z

0

It seems to me a problem can be solved by simple linearly traversing of array and comparison. However there could be multiple pass. You can devise a strategy to minimize the passes. For example constructing a sub array of all the substrings of the original string in first pass. Then try out different variations linearly.

answered May 13, 2011 at 19:20

S M Kamran

4,5317 gold badges27 silver badges36 bronze badges

2 Comments

Marino Šimić Over a year ago

something like this - called permutations of N - for each permutation of size N from the substrings that is equal in size to the original string check it, if none, do a permutation of N+1 items, end so forth

S M Kamran Over a year ago

Thanks for giving this a fancy keyword :P

pstrjds · Accepted Answer · 2011-05-13 19:21:53Z

0

Here is a rough idea that should work.

Copy the source string into a new string
While the copy string still has data and there are still substrings a. Grab a sub string, if copy.contains(substr) copy.remove(substr)
If the copy is now empty then yes, you could construct the string
If copy is not empty, throw out the first substr that was removed from the string and repeat.
If all substrings are gone and copy is still not empty then no, you can't construct it.

Edit: A way to possibly improve this would be to first iterate all of the substrings and throw out any that are not contained in the main string. Then go through the above steps.

answered May 13, 2011 at 19:21

pstrjds

17.6k7 gold badges54 silver badges61 bronze badges

3 Comments

pstrjds Over a year ago

You could certainly tweak this with length checks and other things to improve it, its just a rough idea.

Marino Šimić Over a year ago

substrings can occur on multiple positions, and removing from one position or the other is not the same, and this like the other ideas will give false negatives

pstrjds Over a year ago

@Marino Good call, hadn't thought that all the way through.

zubrabubra · Accepted Answer · 2011-05-13 20:12:19Z

0

Let me suggest using Suffix Trees (using Ukkonen's online algorithm to build it) which seems to be suitable in terms of searching common substrings in two texts. You could find more information in wikipedia/special sources. The task is

Find all z occurrences of the patterns P1..Pn of total length m
enter code hereas substrings in O(m + z) time.

so you see there exists very cool solution. Hope this will work for you. This is actually more suitable for repeating scans, rather than a single scan.

answered May 13, 2011 at 20:12

zubrabubra

5143 silver badges8 bronze badges

Comments

Marino Šimić · Accepted Answer · 2011-05-15 02:14:34Z

If each substring must be used only once but not all of them must be used...

For each permutation of size N from the substrings that is equal in size to the original string check it, if none, do a permutation of N+1 items, end so forth, until you exhaust all the permutations.

Of course NP complete, slow as hell, but i think that no normal solutions exist.

To explain why the solutions where removing substrings from the original string won't ever work:

Have a string "1234123" and array "12","34","123". If you remove "123" from the start, you have a false negative. A similar example where removing from the end would be: "1234123" : "23,"41","123".

With backtracking with greedy: (m string length 7, n num elements 3) - take the longest: 123 - remove it from first occurence O(3) - try other two with the rest: no go + O((n-1)*(m-3)) - backtrack O(1) - remove from second: O(m-3) - try other two O((n-1)*m-3) = O(30)

Permutations of 1 + 2 + 3 = O(3) + O(4) + O(6) = O(13). So for small subset lenght permutations are actually faster than backtracking. This will change if you ask for a lot of substrings to find (in most cases but not all).

You can remove only the nonexisting substrings from the array to lower the number of permutations from n^n to n^(n-1) for each removed nonexisting substring.

AJed · Accepted Answer · 2011-05-15 04:21:30Z

What you are looking for is a parser. A parser will check whether a certain word belongs to a certain language. I am not sure of the exact computattional complexity of your problem. Some of the above seems to be correct (there is no need at all for exhaustive search). One thing for sure, it s not NP-Complete.

The alphabet of your language would be all the small substrings. The word you are looking for is the string you have. A regular expression can be a simple Kleene star, or a a very simply context free grammar that is nothing but Or's.

The main issue in the algorithm is: what if the some of the substrings are actually substrings to other substrings ... that is, what if we have substrings: "ab", "abc", "abcd", ... , In this case, the order of checking the substrings will change the complexity. For this, we have LR-parsers. I guess they are the best in solving such problems.

I will find you the exact solution soon.

Mussab Elfatih · Accepted Answer · 2022-08-31 09:41:12Z

0

I have come up with a one liner for this using dynamic programming principles. The function foo takes a string and a list of substrings to decide if it's possible to construct the given string from the given list

foo = lambda s, arr:  not s or any(foo(s[len(n):], arr) for n in arr if s[:len(n)]==n)

The one liner broken down for readability:

def foo(s, arr):
if not s: return True
for sub in arr:
    if s[:len(sub)] == sub:
        if foo(s[len(sub):], arr):
            return True
return False

answered Aug 31, 2022 at 9:41

Mussab Elfatih

1

Comments

AtjovDavid · Accepted Answer · 2023-07-11 09:09:23Z

0

What about this?

substrings = ["a", "bb", "c"]

def checkString(string):
    for substring in substrings:
        if substring in string:
            string = string.replace(substring, "")
    if string == "":
        return True
    else:
        print(string)
        return False

It esentially just goes through all the allowed substrings and if they're present in the string it removes them and then checks if the resulting string is "".

answered Jul 11, 2023 at 9:09

AtjovDavid

113 bronze badges

Collectives™ on Stack Overflow

Algorithm for checking if a string was built from a list of substrings

12 Answers 12

14 Comments

13 Comments

2 Comments

2 Comments

Comments

2 Comments

3 Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

12 Answers 12

14 Comments

13 Comments

2 Comments

2 Comments

Comments

2 Comments

3 Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related