
I am trying to construct hierarchies from a dataset where each row represents a student, a course they've taken, and some other metadata. From this dataset, I'm trying to build an adjacency matrix and determine the hierarchies based on which classes students have taken, and the paths different students follow when choosing classes.

That being said, constructing this adjacency matrix is computationally expensive. Here is the code I have currently, which has been running for around two hours.

def have_taken_both_courses(course1, course2, studentID):
    hasTakenFirstCourse = len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0
    if hasTakenFirstCourse:
        return len(df.loc[(df['Course_Title'] == course2) & (df['Id'] == studentID)]) > 0
    return False


uniqueStudentIds = df.Id.unique()
uniqueClasses = df['Course_Title'].unique()
for studentID in uniqueStudentIds:
    for course1 in uniqueClasses:
        for course2 in uniqueClasses:
            if course1 != course2 and have_taken_both_courses(course1, course2, studentID):
                x = vertexDict[course1]
                y = vertexDict[course2]
                # Assuming symmetry
                adjacency_matrix[x][y] += 1
                adjacency_matrix[y][x] += 1
                print(course1 + ', ' + course2)

Given my very large dataset, I have tried to consult online resources on parallelizing/multithreading this computationally expensive for loop. However, I'm new to Python and multiprocessing, so any guidance would be greatly appreciated!

  • You might be better off doing this in SQL. You're just looking for all courses x, y such that there is a student who took both x and y? One thing to look at is using sets instead of lists; they are faster when doing a lookup. Commented May 19, 2018 at 5:28
  • Although I'd like to do this in SQL, I need to construct the matrix and then do some data formatting to funnel into an API. I'm not sure how I'd do that given that my data file is a CSV and I eventually have to return a JSON object that represents a graph. Commented May 19, 2018 at 5:31
  • 1
    You are essentially doing linear scans for every unique pair to get counts, redunantly iterating every time. This: len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0 is very expensive in your tight loop. Parallelization isn't going to help nearly as much as counting more efficiently. Also, if you are going to be for-looping over unique id's, just convert to lists and don't use numpy arrays Commented May 19, 2018 at 6:00
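A sketch of the set-based lookup the first comment suggests, using made-up toy data with the question's column names: each student's course set is precomputed once, so the membership check inside the loop becomes O(1) instead of a full DataFrame scan.

```python
import pandas as pd

# Hypothetical toy data mirroring the question's columns.
df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Course_Title': ['Math', 'Physics', 'Math', 'Chemistry', 'Physics'],
})

# Precompute each student's set of courses in one pass over the rows.
courses_by_student = df.groupby('Id')['Course_Title'].agg(set).to_dict()

def have_taken_both_courses(course1, course2, student_id):
    # O(1) set membership checks instead of two DataFrame scans.
    taken = courses_by_student.get(student_id, set())
    return course1 in taken and course2 in taken
```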

1 Answer


It appears you are looping way more than you have to. For every student you do NxN iterations, where N is the total number of classes. But your student has only taken a subset of those classes, so you can cut down on iterations significantly.

Your have_taken_both_courses() lookup is also more expensive than it needs to be.

Something like this will probably go a lot faster:

import numpy as np
import itertools
import pandas as pd

df = pd.read_table('/path/to/data.tsv')

students_df = pd.DataFrame(df['student'].unique())
students_lkp = {x[1][0]: x[0] for x in students_df.iterrows()}

classes_df = pd.DataFrame(df['class'].unique())
classes_lkp = {x[1][0]: x[0] for x in classes_df.iterrows()}

df['student_key'] = df['student'].apply(lambda x: students_lkp[x])
df['class_key'] = df['class'].apply(lambda x: classes_lkp[x])

df.set_index(['student_key', 'class_key'], inplace=True)

matr = np.zeros((len(classes_df), len(classes_df)))

for s in range(len(students_df)):
    print(s)
    # get all the classes for this student
    classes = df.loc[s].index.unique().tolist()
    for x, y in itertools.permutations(classes, 2):
        matr[x, y] += 1
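An alternative sketch that drops the per-student Python loop entirely: build a student-by-course incidence matrix with pd.crosstab and take a dot product, so entry (x, y) counts the students who took both courses. The column names follow the question; the toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy data with the question's column names.
df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Course_Title': ['Math', 'Physics', 'Math', 'Chemistry', 'Physics'],
})

# Student x course incidence matrix: 1 if the student took the course.
incidence = (pd.crosstab(df['Id'], df['Course_Title']) > 0).astype(int)

# Course x course co-occurrence counts; entry (x, y) = number of
# students who took both x and y.
cooc = incidence.T.dot(incidence)

# Zero the diagonal ("course with itself" counts).
cooc -= np.diag(np.diag(cooc))
```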

4 Comments

Thanks! This definitely will be a lot faster. Just a couple of semantic things: at the bottom, the x and y, what do those numbers represent?
Those are the surrogate keys for the classes, i.e. the index for the class in classes_df.
And when I want to retrieve the metadata of a row after scanning the matrix, what command would I use given the class_key indexes?
classes_df.loc[x], where x is the key you're looking up.
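Since the question mentions eventually returning a JSON object that represents a graph, a minimal sketch of that last step might look like this. The toy matrix, class names, and the d3-style "nodes"/"links" key names are all assumptions, not from the post.

```python
import json
import numpy as np

# Hypothetical inputs: a co-occurrence matrix and the class labels
# (in the answer's terms, the labels would come from classes_df).
class_names = ['Math', 'Physics', 'Chemistry']
matr = np.array([[0, 2, 1],
                 [2, 0, 1],
                 [1, 1, 0]])

# One edge per unordered pair with a nonzero count.
graph = {
    'nodes': [{'id': i, 'name': name} for i, name in enumerate(class_names)],
    'links': [
        {'source': x, 'target': y, 'weight': int(matr[x][y])}
        for x in range(len(class_names))
        for y in range(x + 1, len(class_names))
        if matr[x][y] > 0
    ],
}
graph_json = json.dumps(graph)
```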
