
Here is my code. It takes 17 hours to complete. Could you please suggest an alternative that reduces the computation time?

# test algorithm 1 - fuzzy matching with fuzzywuzzy
from fuzzywuzzy import fuzz

matched_pair = []
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))
            print((x, y))

I tried different approaches, but none of them worked.

dataset1 has 10k rows and dataset2 has 1M rows. fuzz.token_sort_ratio(x, y) is a function that takes two strings and returns an integer: the similarity of the two strings.
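For readers unfamiliar with it: fuzz.token_sort_ratio comes from the fuzzywuzzy package. As a rough illustration of what it computes (not fuzzywuzzy's actual implementation), a stdlib-only stand-in might look like this:

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> int:
    """Rough stand-in for fuzz.token_sort_ratio: sort each string's
    tokens, then score the sorted strings' similarity on a 0-100 scale."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, a_sorted, b_sorted).ratio() * 100)

# word order is ignored, so reordered names still score as identical
print(token_sort_ratio("john smith", "smith john"))  # prints 100
```

Note the cost of the nested loops above: 10k × 1M is 10^10 scorer calls, which is why even a fast scorer takes hours.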

  • Can you provide more details? What is dataset1? How big is it? Can you post sample data? What is fuzz? Commented Apr 29, 2020 at 7:28
  • Split your list and process it in parallel. Commented Apr 29, 2020 at 7:34
  • Please see How to Ask in the help center. Commented Apr 29, 2020 at 7:42
  • Please see, I have edited the question and added some details. Commented Apr 29, 2020 at 7:47
  • You can have a look at locality-sensitive hashing (LSH) for faster similar-string search. Here is an article explaining it. Commented Apr 29, 2020 at 7:47
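The "split your list and process it in parallel" suggestion from the comments could be sketched as follows. This is a minimal illustration with toy data, the scorer swapped for a stdlib stand-in, and an arbitrary worker count; in the real case each worker would get a slice of dataset1 and compare it against all of dataset2:

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

def score(a: str, b: str) -> int:
    # stdlib stand-in for fuzz.token_sort_ratio (0-100 scale)
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, a_sorted, b_sorted).ratio() * 100)

DATASET2 = ["smith john", "jane doe"]  # toy stand-in for the 1M-row list

def match_chunk(chunk):
    # each worker compares its slice of dataset1 against all of DATASET2
    return [(x, y) for x in chunk for y in DATASET2 if score(x, y) > 85]

if __name__ == "__main__":
    dataset1 = ["john smith", "alice brown"]       # toy stand-in for 10k rows
    chunks = [dataset1[i::4] for i in range(4)]    # split into 4 slices
    with Pool(4) as pool:
        matched_pair = [p for part in pool.map(match_chunk, chunks) for p in part]
    print(matched_pair)  # prints [('john smith', 'smith john')]
```

Parallelism divides the wall-clock time by roughly the core count, but the total number of comparisons stays the same; the LSH suggestion attacks that number directly.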

1 Answer


Since the dataframe is not really used here, I will simply work with the following two lists:

import string
import random

random.seed(18)
dataset1 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
dataset2 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]

I used these two lists with the code you provided (using fuzzywuzzy) as the baseline. As a first change you could use RapidFuzz (I am the author), which does basically the same as FuzzyWuzzy but is quite a bit faster: on my test lists it was about 7 times as fast as your code. Another issue is that fuzz.token_sort_ratio always lowercases the strings and removes e.g. punctuation. While this makes sense for the string matching, you are doing it repeatedly for each string in the list, which adds up when working with bigger lists. Using RapidFuzz and preprocessing only once is about 14 times as fast on these lists.

from rapidfuzz import fuzz, utils

# preprocess (lowercase, strip non-alphanumeric characters) each string only once
dataset1_processed = [utils.default_process(x) for x in dataset1]
dataset2_processed = [utils.default_process(x) for x in dataset2]

matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
    for word2, word2_processed in zip(dataset2, dataset2_processed):
        # score_cutoff=85 makes the call return 0 (falsy) below the cutoff,
        # so the result can be used directly as the condition
        if fuzz.token_sort_ratio(word1_processed, word2_processed, processor=None, score_cutoff=85):
            matched_pair.append((word1, word2))

1 Comment

Thank you! A weaker algorithm, but 5 times faster!
