I am working on a project in which I need to generate several identifiers for combinatorial pooling of different molecules. To do so, I assign each molecule an n-bit string, where n is the number of pools I have (in this case, 79), and each string has 4 "on" bits (4 bits equal to 1) corresponding to the pools that molecule will appear in. Next, I want to pare down the set of strings such that no two molecules share more than two pools (in other words, the number of overlapping "on" bits between any two strings can be no greater than 2).
To do this, I: 1) compiled a list of all n-bit strings with k "on" bits, 2) generated a list of lists, where each element is the list of indices at which a string's bits are on (found with re.finditer), and 3) iterated through the list, comparing strings and adding only the strings that meet my criteria to my final list.
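(A minimal sketch of steps 1 and 2 using itertools.combinations, so the candidates never need to be parsed with re.finditer; the variable names here are illustrative, not from my actual code, and the indices are 0-based:)

from itertools import combinations

n, k_on = 79, 4  # number of pools and "on" bits per identifier

# Steps 1 and 2 combined: each candidate is just the tuple of its k_on
# "on"-bit indices; combinations() enumerates all C(79, 4) of them in order.
bit_list = list(combinations(range(n), k_on))

def to_string(bits, n=n):
    # rebuild the n-bit string from its "on" indices when needed
    on = set(bits)
    return ''.join('1' if i in on else '0' for i in range(n))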
The code I use to compare strings:
drug_strings = []  # to store strings suitable under the combinatorial pooling rules

class badString(Exception): pass

for k in range(len(strings)):
    bit_current = bit_list[k]
    try:
        for bits in bit_list[:k]:
            intersect = set(bit_current) & set(bits)
            if len(intersect) > 2:
                # overlap with an earlier string is too large; pass on to next iteration
                raise badString()
        drug_strings.append(strings[k])
    except badString:
        pass
However, this code takes forever to run. I am running it with n = 79-bit strings and k = 4 "on" bits per string (~1.5 million possible strings), so I assume the long runtime is because I am comparing each string to every previous string. Is there an alternative/smarter way to go about this? An algorithm that would work faster and be more robust?
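(To put a number on that, my arithmetic rather than measured timings: checking every candidate against every earlier candidate is quadratic, and with C(79, 4) candidates that is on the order of a trillion set intersections:)

from math import comb

n_candidates = comb(79, 4)       # 1,502,501 candidate strings
n_pairs = comb(n_candidates, 2)  # ~1.13e12 pairwise comparisons
print(n_candidates, n_pairs)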
EDIT: I realized that a simpler approach than identifying the entire subset of suitable strings for my project is to randomly sample the larger set of n-bit strings with k "on" bits, store only the strings that fit my criteria, and then, once I have an appropriate number of suitable strings, simply take as many as I need from those. The new code is as follows:
import re
import numpy as np

my_strings = []
my_bits = []
for k in range(2000):
    idx = np.random.randint(0, len(strings_77))
    string = strings_77.pop(idx)
    bits = [m.start() + 1 for m in re.finditer('1', string)]
    # accept the string only if it shares <= 2 "on" bits with every accepted string
    if all(len(set(bits) & set(my_bit)) <= 2 for my_bit in my_bits):
        my_strings.append(string)
        my_bits.append(bits)
Now I only have to compare each new string against the strings I've already accepted (at most 1999 previous strings instead of up to ~1.5 million). It runs much more quickly this way. Thanks for the help!
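(For anyone pushing this further, one possible refinement, as an untested sketch on my part, assuming Python 3.10+ for int.bit_count; names like sample_codes are mine: each identifier fits in a single 79-bit integer, so the overlap test can be a bitwise AND plus a popcount instead of two set constructions per comparison:)

import random

N_POOLS, N_ON, MAX_OVERLAP = 79, 4, 2

def sample_codes(target, max_tries=100_000, seed=0):
    """Greedily collect identifiers (as int bitmasks) whose pairwise
    number of shared 'on' bits never exceeds MAX_OVERLAP."""
    rng = random.Random(seed)
    accepted = []
    tries = 0
    while len(accepted) < target and tries < max_tries:
        tries += 1
        mask = 0
        for pos in rng.sample(range(N_POOLS), N_ON):  # 4 distinct pool indices
            mask |= 1 << pos
        # popcount of the AND = number of pools the two identifiers share
        if all((mask & prev).bit_count() <= MAX_OVERLAP for prev in accepted):
            accepted.append(mask)
    return accepted

codes = sample_codes(200)
id_strings = [format(c, f'0{N_POOLS}b') for c in codes]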