Efficient algorithm to check for subset against a range of sets

Question

I read through some of the posts about determining whether a set A is a subset of another set B. But I find it hard to determine what algorithm to use. Here is an outline of the problem:

I have an array of strings A which I receive at the beginning of my program. Not much is known about the structure. Each string in the array can be arbitrarily long and the number of entries is not limited. Though usually it can be assumed that the number of entries in the array won't be overly large (< 100).
Then I iterate through a list of objects of length n.
Each of the n objects will also have an array of strings B, i.e. there will be n B arrays. Once the program runs the Bs will be fixed, i.e. they do not change during runtime.
I want to determine for each object if A is a subset of B.

Now, I thought about hashtables. However, they would, in my opinion only be efficient if the was only one B and a lot of As. Then I could make a hashtable for B and check each string array of each object against my hashtable. But this is not the case because there is only one A but n Bs. What would be an efficient algorithm to do this?

Example:

A:  ["A", "G", "T"]
B1: ["C", "G"]
B2: ["K", "A", "U", "T", "G"]
.
.
.
Bn: ["T", "I", "G", "O", "L"]

Here A is a subset of B2 but not of B1, and not of Bn.

How often do you need to check whether some string in A is a substring of B? Is your program running for a long time during which you check the same list of strings A multiple times for the (maybe changed) B? — Glubus
– Glubus, Commented Jan 18, 2016 at 11:07
At each run of the program A is fixed at the beginning. The B are also fixed during the runtime of the program. They do not change. But there are a lot of Bs. — lord.garbage
– lord.garbage, Commented Jan 18, 2016 at 11:11
"while B varies from object to object" so if B is fixed, what does that mean. Could you edit your answer and post some example data for A and B? — Glubus
– Glubus, Commented Jan 18, 2016 at 11:13
What can you tell us about the distribution of the string length ? (In your example, they are all 1.) — user1196549
– user1196549, Commented Jan 18, 2016 at 11:38
Sorry, I am not satisfied with this answer. Are the strings short or long or very long (1 megachar) ? All of them ? ... — user1196549
– user1196549, Commented Jan 18, 2016 at 11:42

score 2 · Accepted Answer · 2016-01-18 12:29:46Z

2

An efficient approach is to represent the set A as a trie. This allows to check if a given string belongs to A in time linear in the string length.

Then there is no better way then checking exhaustively for all Bi and all strings in Bi if it belongs to A. The search stops as soon as all strings in A have been matched (flagging a string when it has been found).

The running time will be at worse proportional to the total number of characters in all strings in all B. In practice, a significant fraction of the characters will be skipped, as

the search for a string not in A can terminate early,
the subset test can conclude positively even if there are strings left in Bi,
the subset test can conclude negatively when there are more unmatched strings in A than strings left in Bi.

This approach is certainly worst-case optimal as you read the characters at most once and perform a constant number of operations per character.

edited Jan 18, 2016 at 12:29

answered Jan 18, 2016 at 11:53

user1196549

Sign up to request clarification or add additional context in comments.

3 Comments

lord.garbage Over a year ago

Hm, I like that approach. I just recently implemented a trie so that comes in handy as well.

lord.garbage Over a year ago

I'm just thinking. It can probably be made even more space efficient by using a ternary-search-tree.

user1196549 Over a year ago

Yep, if you need space efficiency, there are more compact variants of the tries. But in the case of alphabetic words, a trie "only" consumes 26 pointers per letter.

CiaPan · Accepted Answer · 2016-01-18 11:32:54Z

1

As a first approach I would pre-calculate some general properties of sets, which (hopefully) will allow you to filter some of Bs quickly. These may be, for example:

a number of strings – A certainly can't be a subset of B if it contains more elements than B;
a length of the longest string – A certainly is not a subset of B if the longest string in A is longer than the longest string in B;
a sum of strings' lengths.

For easier checking you might require every set to be ordered alphabetically. That will allow checking A against a single B in a (linear) scan through both sets of strings.

For small A and big B sets it might be more efficient to look for a string in B with binary search rather than with a linear scan; that would require pre-sorting of B, too.

answered Jan 18, 2016 at 11:32

CiaPan

9,5802 gold badges23 silver badges38 bronze badges

2 Comments

Glubus Over a year ago

All these pre-calculations should be done as the actual algorithm is run in the form of some heuristic as doing them before running some algorithm requires multiple loops over the potential big input. Also binary search would be pretty neat if A is much smaller than any element in B.

CiaPan Over a year ago

@Glubus All these parameters require just one scan through each set with just one length test of every string, so it makes a strictly linear time cost and minimum memory cost (storing three integers per string set). // Yes, as I said above, the binary search makes sense for size A much less than size B. For comparable sizes semi-parallel linear scan would be easier. However binary search requires strings in B to be ordered, which may cause additional cost of sorting.

Борат Сагдиев · Accepted Answer · 2021-05-21 14:03:22Z

1

As you know A beforehand, you can design a collision-free hash function to hash all elements of A.

Then operate only on the hashes in the search step, not the strings. For each element of B, compute its hash and then use it to look up an element of A. If an element is found, it means that the hashes match; then you'll also need to compare the strings to detect if its a true positive or just an accidental match.

Count the number of matches. When that number is equal to the size of A, stop and return a positive result. If all elements of a B have been processed and the number of matches is less than size of A, return a negative result.

edited May 21, 2021 at 14:03

Борат Сагдиев

2651 silver badge14 bronze badges

answered Jan 18, 2016 at 11:24

kfx

8,5773 gold badges33 silver badges60 bronze badges

3 Comments

Glubus Over a year ago

How would you detect A to be a subset of an element out of B if the element in B is a strict superset of A? Their hashes won't be the same right?

kfx Over a year ago

@Glubus oh, I misread the question. Updated the answer for now describing the simple algorithm. It's possible to do better I think.

Glubus Over a year ago

Yea I see what you mean now, and this makes more sense. I do like the idea of binary search suggested by CiaPan though, if A is much smaller than some B, the search might be more efficient than walking over the entire list.

Collectives™ on Stack Overflow

Efficient algorithm to check for subset against a range of sets

3 Answers 3

3 Comments

2 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

3 Comments

2 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related