1

I am working on a project in which I need to generate several identifiers for combinatorial pooling of different molecules. To do so, I assign each molecule an n-bit string (where n is the number of pools I have. In this case, 79 pools) and each string has 4 "on" bits (4 bits equal to 1) corresponding to which pools that molecule will appear in. Next, I want to pare down the number of strings such that no two molecules appear in the same pool more than twice (in other words, the greatest number of overlapping bits between two strings can be no greater than 2).

To do this, I: 1) compiled a list of all n-bit strings with k "on" bits, 2) generated a list of lists where each element is a list of indices where the bit is on using re.finditer and 3) iterate through the list to compare strings, adding only strings that meet my criteria into my final list of strings.

The code I use to compare strings:

drug_strings = [] #To store suitable strings for combinatorial pooling rules

class badString(Exception): pass

for k in range(len(strings)):
    bit_current = bit_list[k]
    try:
        for bits in bit_list[:k]:
            intersect = set.intersection(set(bit_current),set(bits))
            if len(intersect) > 2:
                raise badString() #pass on to next iteration if string has overlaps in previous set
        drug_strings.append(strings[k])
    except badString:
        pass 

However, this code takes forever to run. I am running this with n=79-bit strings with k=4 "on" bits per string (~1.5M possible strings) so I assume that the long runtime is because I am comparing each string to every previous string. Is there an alternative/smarter way to go about doing this? An algorithm that would work faster and be more robust?

EDIT: I realized that the simpler way to approach this problem instead of identifying the entire subset of strings that would be suitable for my project was to just randomly sample the larger set of n-bit strings with k "on" bits, store only the strings that fit my criteria, and then once I have an appropriate amount of suitable strings, simply take as many as I need from those. New code is as follows:

my_strings = []
my_bits = []

for k in range(2000):
    random = np.random.randint(0, len(strings_77))
    string = strings_77.pop(random)
    bits = [m.start()+1 for m in re.finditer('1',string)]
      
    if all(len(set(bits) & set(my_bit)) <= 2
           for my_bit in my_bits[:k]):
        my_strings.append(string)
        my_bits.append(bits)

Now I only have to compare against strings I've already pulled (at most 1999 previous strings instead of up to 1 million). It runs much more quickly this way. Thanks for the help!

0

2 Answers 2

1

Raising exceptions is expensive. A complex data structure is created and the stack has to be unwound. In fact, setting up a try/except block is expensive.

Really you're wanting to check that all intersections have length less than or equal to two, and then append. There is no need for exceptions.

for k in range(len(strings)):
    bit_current = bit_list[k]

    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(strings[k])

Also, instead of having to look up the strings and bit_list index, you can iterate over all the parts you need at the same time. You still need the index for the bit_list slice:

for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    if all(len(set(bit_current) & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(drug_string)

You can also avoid creating the bit_current set with each loop:

for index, (drug_string, bit_current) in enumerate(zip(strings, bit_list)):
    bit_set = set(bit_current)
    if all(len(bit_set & set(bits)) <= 2
           for bits in bit_list[:k]):
        drug_strings.append(drug_string)
Sign up to request clarification or add additional context in comments.

1 Comment

Something that carries over from my original code is that each bit_set is compared to the set of bits from every string that comes before the current string. Would it be faster to restructure the code to look ahead instead of backward? As it is, each iteration becomes progressively longer, but if restructured, each iteration becomes shorter. Is there a benefit either way?
1

Some minor things that I would improve in your code that may be causing some overhead:

  1. set(bit_current) move this outside the inner loop;
  2. remove the raise except part;
  3. Since you have this condition if len(intersect) > 2: you could try to implement the interception method to stop when that condition is meet. So that you avoid unnecessary computation.

So the code would become:

for k in range(len(strings)):
    bit_current = set(bit_list[k])
    intersect = []
    for bits in bit_list[:k]:
        intersect = []
        b = set(bits)
        for i in bit_current:
            if i in b:
                intersect.append(i)
            if len(intersect) > 2:
                break

        if len(intersect) > 2:
            break
    if len(intersect) <= 2:
       drug_strings.append(strings[k])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.