I have a data frame that contains 2000 words. I am trying to create permutations of all the words, 4 at a time.

   import itertools as it
   import hashlib

   perms = it.permutations(words.col, 4)
   done = object()
   nextperm = next(perms, done)
   while nextperm is not done:
       # perform operations...
       permutedString = ' '.join(nextperm)
       if (permutedString.count('a') == 2 and permutedString.count('b') == 3
               and permutedString.count('c') == 1):
           # calculate the MD5 hash of the candidate phrase
           hash = hashlib.md5(permutedString.encode('utf-8')).hexdigest()
       nextperm = next(perms, done)

The above script is taking too long for me; it's been running for hours. Is there any way to improve the performance of this?

Any help on this is highly appreciated.

  • Every permutation of 4 words from a set of 2000 is a HUGE number, just under 16 trillion. How long did you expect it to take? Commented Jul 7, 2018 at 13:28
  • Yes, the number of permutations will be huge. I am new to Python and I'm trying to optimize my script for better performance. Commented Jul 7, 2018 at 13:30
  • The operations you're performing inside the loop will be a far bigger factor than the perm generation. Commented Jul 7, 2018 at 13:32
  • You need to change your algorithm. Nothing you do in Python or C or whatever language will make this tractable. Commented Jul 7, 2018 at 13:34
  • If there are only one or two matching strings, you absolutely do not need to enumerate every possible permutation and check whether there's a match! Please put in the rest of your code and add some explanation; there has to be a better way to do this. Commented Jul 7, 2018 at 13:36
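
The "just under 16 trillion" figure quoted above is quick to verify with the standard library's math.perm (available since Python 3.8):

    import math

    # ordered draws of 4 distinct words from 2000: 2000 * 1999 * 1998 * 1997
    print(math.perm(2000, 4))  # 15952043988000, i.e. just under 16 trillion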

2 Answers

As tzaman has pointed out, what's going on inside your loop is key to coming up with a better way to address the problem. The fundamental idea is that you want to minimize the amount of pairwise (or N-wise) operations you need to perform.

In your case, you're apparently trying to select phrases with the right number of specific letters, trying to crack some kind of "correct horse battery staple" password scheme with the benefit of constraints on the letter counts. In that case, since you're only allowed 1 letter 'c', there's no point in dealing with any permutation which has two 'c's. Etc.
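
As a concrete sketch of that pruning (assuming the letter counts from the question, i.e. at most 2 'a's, 3 'b's and 1 'c' in a valid phrase, and some iterable of candidate words named words), a linear prefilter can throw such words away before any permutations are generated:

    from collections import Counter

    # letter budget implied by the question's condition (a == 2, b == 3, c == 1)
    budget = {'a': 2, 'b': 3, 'c': 1}

    def within_budget(word):
        # a word that already exceeds any per-letter limit can never be part
        # of a matching 4-word phrase, so it can be dropped up front
        counts = Counter(word)
        return all(counts[letter] <= limit for letter, limit in budget.items())

    usable_words = [w for w in words if within_budget(w)]

Since the number of 4-permutations scales roughly like n^4, keeping only a fraction q of the words cuts the search space to roughly q^4 of its original size.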

We can do better than just ruling out useless words, too: we don't actually need to test every 4-word draw individually, because we can compare letter-count groups instead. That is, we can group all the words by their 'a', 'b', and 'c' counts (which takes linear time), and then iterate over combinations of four of those count groups to see whether they sum to the right totals. Now the counting logic runs over elements drawn from a set of tens of groups, not ~2000 words. (Really we could do even better, by building the valid count combinations directly, either recursively or with partition techniques, but let's start simple.)

Now, you've said "There will be only one or two strings that will match this condition", and I'm going to take you at your word, and limit the amount of optimization I'm going to do to handle a case like that.

If only a handful of phrases satisfy the constraint, there can only be a few letter-count group combinations that satisfy it too, and not many words in those groups. So something like this should work:

from collections import Counter
from itertools import permutations, product, combinations_with_replacement
import hashlib

# make a fake set of words
with open('/usr/share/dict/words') as fp:
    words = [word.lower() for word in fp.read().split()]
words = set(words[::len(words)//2000][:2000])

# set the target to something which has <2000 matching 4-words
target_counts = Counter({"a": 5, "b": 4, "d": 8})

# collect the words by counts
by_count = {}
for word in words:
    lcount = {letter: word.count(letter) for letter in target_counts}
    by_count.setdefault(tuple(sorted(lcount.items())), []).append(word)

collected_hashes = {}
# loop over every possible collection of word count groups
for i, groups in enumerate(combinations_with_replacement(by_count, 4)):
    if i % 10000 == 0:
        print(i, groups)

    # check to see whether the letter set sums appropriately
    total_count = sum((Counter(dict(group)) for group in groups), Counter())
    if total_count != target_counts:
        continue

    # the sums are right! loop over every word draw; for simplicity
    # we won't worry about duplicate word draws, we'll just skip
    # them if we see them
    for choices in product(*(by_count[group] for group in groups)):
        if len(set(choices)) != len(choices):
            # skip duplicate words
            continue
        for perm in permutations(choices):
            joined = ' '.join(perm)
            hashed = hashlib.md5(joined.encode("utf-8")).hexdigest()
            collected_hashes.setdefault(hashed, set()).add(joined)

which takes about 10 seconds to produce a dictionary of hashes, e.g.

In [25]: len(collected_hashes)
Out[25]: 960

In [26]: collected_hashes
Out[26]: 
{'67a0da67d938a6090beedcb849f66596': {'barbed badlands saddlebag skidded'},
 '8da55149bf66f89f27b658a7ef7d7126': {'barbed badlands skidded saddlebag'},
 '69dd0f1c7af2e9973fedcc585af15df4': {'barbed saddlebag badlands skidded'},
 '106a366ef772a5f68ebb4800b2281a09': {'barbed saddlebag skidded badlands'},
...

where each password indeed has the right counts for the target letters:

In [30]: c = Counter('barbed badlands saddlebag skidded')

In [31]: c['a'], c['b'], c['d']
Out[31]: (5, 4, 8)
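
Once collected_hashes is built, testing a candidate digest is a single dictionary lookup; for instance, reusing the first digest from the output above:

    target_digest = '67a0da67d938a6090beedcb849f66596'  # taken from the output above

    # returns the phrase(s) that hashed to it, or an empty set if none did
    print(collected_hashes.get(target_digest, set()))
    # {'barbed badlands saddlebag skidded'}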

Outside of the fact that P(2000, 4) is an enormous number, you don't need to manually check your loop with a sentinel; you can just iterate directly over the perms object. That's about as much optimization as can be expected from the code you've provided; whatever is happening inside the loop will be the determining factor.

perms = it.permutations(words.col, 4)
for perm in perms:
    # do stuff with perm

5 Comments

Will changing to a for loop make a difference?
It saves you an object comparison on every loop, but it's just a minor benefit. You'd be far better served using a different approach entirely.
A minor benefit means a lot here. Will do that.
@Ajith: minor benefits are of no use at this scale. You need a change of approach, not simply rearranging your current algorithm.
Please suggest a better approach.
