5

I read through some of the posts about determining whether a set A is a subset of another set B. But I find it hard to determine what algorithm to use. Here is an outline of the problem:

  • I have an array of strings A which I receive at the beginning of my program. Not much is known about the structure. Each string in the array can be arbitrarily long and the number of entries is not limited. Though usually it can be assumed that the number of entries in the array won't be overly large (< 100).
  • Then I iterate through a list of objects of length n.
  • Each of the n objects will also have an array of strings B, i.e. there will be n B arrays. Once the program runs the Bs will be fixed, i.e. they do not change during runtime.
  • I want to determine for each object if A is a subset of B.

Now, I thought about hashtables. However, they would, in my opinion only be efficient if the was only one B and a lot of As. Then I could make a hashtable for B and check each string array of each object against my hashtable. But this is not the case because there is only one A but n Bs. What would be an efficient algorithm to do this?

Example:

A:  ["A", "G", "T"]
B1: ["C", "G"]
B2: ["K", "A", "U", "T", "G"]
.
.
.
Bn: ["T", "I", "G", "O", "L"]

Here A is a subset of B2 but not of B1, and not of Bn.

8
  • How often do you need to check whether some string in A is a substring of B? Is your program running for a long time during which you check the same list of strings A multiple times for the (maybe changed) B? Commented Jan 18, 2016 at 11:07
  • At each run of the program A is fixed at the beginning. The B are also fixed during the runtime of the program. They do not change. But there are a lot of Bs. Commented Jan 18, 2016 at 11:11
  • "while B varies from object to object" so if B is fixed, what does that mean. Could you edit your answer and post some example data for A and B? Commented Jan 18, 2016 at 11:13
  • What can you tell us about the distribution of the string length ? (In your example, they are all 1.) Commented Jan 18, 2016 at 11:38
  • 1
    Sorry, I am not satisfied with this answer. Are the strings short or long or very long (1 megachar) ? All of them ? ... Commented Jan 18, 2016 at 11:42

3 Answers 3

2

An efficient approach is to represent the set A as a trie. This allows to check if a given string belongs to A in time linear in the string length.

Then there is no better way then checking exhaustively for all Bi and all strings in Bi if it belongs to A. The search stops as soon as all strings in A have been matched (flagging a string when it has been found).

The running time will be at worse proportional to the total number of characters in all strings in all B. In practice, a significant fraction of the characters will be skipped, as

  • the search for a string not in A can terminate early,

  • the subset test can conclude positively even if there are strings left in Bi,

  • the subset test can conclude negatively when there are more unmatched strings in A than strings left in Bi.

This approach is certainly worst-case optimal as you read the characters at most once and perform a constant number of operations per character.

Sign up to request clarification or add additional context in comments.

3 Comments

Hm, I like that approach. I just recently implemented a trie so that comes in handy as well.
I'm just thinking. It can probably be made even more space efficient by using a ternary-search-tree.
Yep, if you need space efficiency, there are more compact variants of the tries. But in the case of alphabetic words, a trie "only" consumes 26 pointers per letter.
1

As a first approach I would pre-calculate some general properties of sets, which (hopefully) will allow you to filter some of Bs quickly. These may be, for example:

  • a number of strings – A certainly can't be a subset of B if it contains more elements than B;
  • a length of the longest string – A certainly is not a subset of B if the longest string in A is longer than the longest string in B;
  • a sum of strings' lengths.

For easier checking you might require every set to be ordered alphabetically. That will allow checking A against a single B in a (linear) scan through both sets of strings.

For small A and big B sets it might be more efficient to look for a string in B with binary search rather than with a linear scan; that would require pre-sorting of B, too.

2 Comments

All these pre-calculations should be done as the actual algorithm is run in the form of some heuristic as doing them before running some algorithm requires multiple loops over the potential big input. Also binary search would be pretty neat if A is much smaller than any element in B.
@Glubus All these parameters require just one scan through each set with just one length test of every string, so it makes a strictly linear time cost and minimum memory cost (storing three integers per string set). // Yes, as I said above, the binary search makes sense for size A much less than size B. For comparable sizes semi-parallel linear scan would be easier. However binary search requires strings in B to be ordered, which may cause additional cost of sorting.
1

As you know A beforehand, you can design a collision-free hash function to hash all elements of A.

Then operate only on the hashes in the search step, not the strings. For each element of B, compute its hash and then use it to look up an element of A. If an element is found, it means that the hashes match; then you'll also need to compare the strings to detect if its a true positive or just an accidental match.

Count the number of matches. When that number is equal to the size of A, stop and return a positive result. If all elements of a B have been processed and the number of matches is less than size of A, return a negative result.

3 Comments

How would you detect A to be a subset of an element out of B if the element in B is a strict superset of A? Their hashes won't be the same right?
@Glubus oh, I misread the question. Updated the answer for now describing the simple algorithm. It's possible to do better I think.
Yea I see what you mean now, and this makes more sense. I do like the idea of binary search suggested by CiaPan though, if A is much smaller than some B, the search might be more efficient than walking over the entire list.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.