
Here is my code. It takes 17 hours to complete. Could you please suggest an alternative that reduces the computation time?

# test algorithm 1 - fuzzy matching with fuzzywuzzy
from fuzzywuzzy import fuzz

matched_pair = []
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))
            print((x, y))

I tried different approaches, but none of them worked.

dataset1 has 10k rows and dataset2 has 1M rows. fuzz.token_sort_ratio(x, y) is a function that takes two strings and returns an integer: the similarity of the two strings.
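For readers unfamiliar with it: fuzz.token_sort_ratio comes from the fuzzywuzzy package. As a rough illustration of what it computes (not fuzzywuzzy's actual implementation), a stdlib-only stand-in might look like this:

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> int:
    """Rough stand-in for fuzz.token_sort_ratio: sort each string's
    tokens, then score the sorted strings' similarity on a 0-100 scale."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, a_sorted, b_sorted).ratio() * 100)

# word order is ignored, so reordered names still score as identical
print(token_sort_ratio("john smith", "smith john"))  # prints 100
```

Note the cost of the nested loops above: 10k × 1M is 10^10 scorer calls, which is why even a fast scorer takes hours.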

  • Can you provide more details? What is dataset1? How big is it? Can you post sample data? What is fuzz? Commented Apr 29, 2020 at 7:28
  • Split your list and process it in parallel. Commented Apr 29, 2020 at 7:34
  • Please see How to Ask in the help center. Commented Apr 29, 2020 at 7:42
  • Please see, I have edited the question and added some details. Commented Apr 29, 2020 at 7:47
  • You can have a look at locality-sensitive hashing (LSH) for faster similar-string search. Here is an article explaining it. Commented Apr 29, 2020 at 7:47
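The "split your list and process it in parallel" suggestion from the comments could be sketched as follows. This is a minimal illustration with toy data, the scorer swapped for a stdlib stand-in, and an arbitrary worker count; in the real case each worker would get a slice of dataset1 and compare it against all of dataset2:

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

def score(a: str, b: str) -> int:
    # stdlib stand-in for fuzz.token_sort_ratio (0-100 scale)
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, a_sorted, b_sorted).ratio() * 100)

DATASET2 = ["smith john", "jane doe"]  # toy stand-in for the 1M-row list

def match_chunk(chunk):
    # each worker compares its slice of dataset1 against all of DATASET2
    return [(x, y) for x in chunk for y in DATASET2 if score(x, y) > 85]

if __name__ == "__main__":
    dataset1 = ["john smith", "alice brown"]       # toy stand-in for 10k rows
    chunks = [dataset1[i::4] for i in range(4)]    # split into 4 slices
    with Pool(4) as pool:
        matched_pair = [p for part in pool.map(match_chunk, chunks) for p in part]
    print(matched_pair)  # prints [('john smith', 'smith john')]
```

Parallelism divides the wall-clock time by roughly the core count, but the total number of comparisons stays the same; the LSH suggestion attacks that number directly.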

1 Answer


Since the dataframe is not really used here, I will simply work with the following two lists:

import string
import random

random.seed(18)
dataset1 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
dataset2 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]

I used these two lists with the code you provided (using fuzzywuzzy) as the baseline. As a first change you could use RapidFuzz (I am the author), which does basically the same as FuzzyWuzzy but is quite a bit faster: on my test lists it was about 7 times as fast as your code. Another issue is that fuzz.token_sort_ratio always lowercases the strings and removes e.g. punctuation. While this makes sense for the string matching, you are doing it repeatedly for each string in the list, which adds up when working with bigger lists. Using RapidFuzz and preprocessing only once is about 14 times as fast on these lists.

from rapidfuzz import fuzz, utils

# preprocess (lowercase, strip non-alphanumeric characters) each string only once
dataset1_processed = [utils.default_process(x) for x in dataset1]
dataset2_processed = [utils.default_process(x) for x in dataset2]

matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
    for word2, word2_processed in zip(dataset2, dataset2_processed):
        # score_cutoff=85 makes the call return 0 (falsy) below the cutoff,
        # so the result can be used directly as the condition
        if fuzz.token_sort_ratio(word1_processed, word2_processed, processor=None, score_cutoff=85):
            matched_pair.append((word1, word2))

1 Comment

Thank you! A weaker algorithm, but 5 times faster!
