
I have to loop through a list of over 4000 items and check their similarity with a recommendation algorithm in Python.

The script takes a long time to run (10-11 hours) and I wanted to incorporate multi-threading to improve speed, but I don't know exactly how to do it.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from multiprocessing import Pool

    data=pd.read_csv('data.csv',index_col=0, encoding="ISO-8859-1")       

    # Get list of unique items
    itemList=list(set(data["product_ref"].tolist()))

    # Get count of customers
    userCount=len(set(data["customer_id"].tolist()))

    # Create an empty data frame to store item affinity scores for items.
    itemAffinity= pd.DataFrame(columns=('item1', 'item2', 'score'))

    def itemUsers(ind):
        # Customers who bought the item at position ind of itemList
        return data[data.product_ref==itemList[ind]]["customer_id"].tolist()

    rowCount=0
    for ind1 in range(len(itemList)): 
        item1Users = itemUsers(ind1) 
        # My attempt at adding parallelism: loop2 and data_inputs are
        # placeholders, this is the part I don't know how to write
        pool = Pool()
        pool.map(loop2, data_inputs)
        for ind2 in range(ind1+1, len(itemList)): 
            print(ind1, ":", ind2)       
            item2Users = itemUsers(ind2) 
            commonUsers= len(set(item1Users).intersection(set(item2Users))) 
            score=commonUsers / userCount
            itemAffinity.loc[rowCount] = [itemList[ind1],itemList[ind2],score] 
            rowCount +=1
6 Comments
  • What exactly is itemUsers doing? Can you give a small example input and expected output for this? Commented Jun 19, 2018 at 14:29
  • Hold on, this is in a DataFrame? Have you not considered doing this using Pandas tools to try and vectorize the calculation? I'm not clear on any restriction that would force you to drop out of pandas; you should be able to do this in a couple of seconds (at a very conservative estimate); a vectorized sketch follows these comments. Commented Jun 19, 2018 at 14:32
  • I updated my post; it only returns the users who bought the item at index ind Commented Jun 19, 2018 at 14:32
  • @roganjosh Yes, I'm using DataFrames. How do I vectorize the calculation using pandas? Commented Jun 19, 2018 at 14:34
  • Please consider rewriting this as an MCVE for a Pandas solution. At the moment we don't have anything to test on. An efficient python solution in the meantime will be of interest, but you should be using pandas where possible. Commented Jun 19, 2018 at 14:35
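
For reference, here is a minimal sketch of the pandas/NumPy route suggested in the comments. It is untested and assumes data.csv has the same product_ref and customer_id columns as in the question: build a binary customer-by-item matrix, and one matrix product then yields the common-customer count for every item pair at once.

    import numpy as np
    import pandas as pd

    data = pd.read_csv('data.csv', index_col=0, encoding="ISO-8859-1")
    userCount = data["customer_id"].nunique()

    # Binary customer x item matrix: 1 if the customer ever bought the item.
    basket = (pd.crosstab(data["customer_id"], data["product_ref"]) > 0).astype(int)

    # basket.T @ basket counts, for every item pair, the customers who bought
    # both items; dividing by userCount turns the counts into affinity scores.
    scores = basket.T.dot(basket) / userCount

    # Keep each unordered pair once (upper triangle, diagonal excluded) and
    # reshape into the item1/item2/score layout used in the question.
    iu = np.triu_indices(len(scores), k=1)
    itemAffinity = pd.DataFrame({
        "item1": scores.index.values[iu[0]],
        "item2": scores.columns.values[iu[1]],
        "score": scores.values[iu],
    })

This removes the double Python loop, the repeated DataFrame filtering in itemUsers, and the row-by-row itemAffinity.loc[rowCount] = ... assignment, which is most likely where the 10-11 hours are going.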

1 Answer


Incorporating multi-threading will not improve your running time.

Think about it this way: when you multi-thread, you spread the same computational work across multiple threads that still take turns on the CPU (Python's GIL keeps them from running bytecode in parallel), so you gain nothing over doing it all in one process.

Threading can help when, for example, one thread is waiting on user input or I/O and you want to keep computing in the meantime, but that isn't your case.


2 Comments

He's looking for multiprocessing.
@Rohi Yes, exactly, I'm looking for multiprocessing.
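
Since multiprocessing (not threading) is what's being asked for, here is a minimal sketch of pushing the outer loop into a multiprocessing.Pool. It is untested, assumes the same data.csv layout as in the question, and pairScores and usersByItem are hypothetical names, not part of the original code:

    from multiprocessing import Pool

    import pandas as pd

    data = pd.read_csv('data.csv', index_col=0, encoding="ISO-8859-1")
    itemList = sorted(set(data["product_ref"].tolist()))  # sorted so every worker sees the same order
    userCount = data["customer_id"].nunique()

    # Precompute each item's customer set once, instead of filtering the
    # DataFrame inside the inner loop for every pair.
    usersByItem = data.groupby("product_ref")["customer_id"].apply(set).to_dict()

    def pairScores(ind1):
        # One outer-loop step: itemList[ind1] scored against every later item.
        item1, item1Users = itemList[ind1], usersByItem[itemList[ind1]]
        return [(item1, itemList[ind2],
                 len(item1Users & usersByItem[itemList[ind2]]) / userCount)
                for ind2 in range(ind1 + 1, len(itemList))]

    if __name__ == '__main__':
        with Pool() as pool:
            chunks = pool.map(pairScores, range(len(itemList)))
        rows = [row for chunk in chunks for row in chunk]
        itemAffinity = pd.DataFrame(rows, columns=('item1', 'item2', 'score'))

Precomputing the per-item customer sets and building the DataFrame from a flat list of tuples at the end also removes two serial costs that are likely larger than the set intersections themselves (the repeated filtering and the row-by-row .loc assignment), so the Pool only has to parallelise the pairwise scoring.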
