I am trying to construct hierarchies from a dataset in which each row represents a student, a course they've taken, and some other metadata. From this dataset, I'm trying to build an adjacency matrix and determine the hierarchies based on which classes students have taken and the paths different students follow when choosing classes.
Constructing this adjacency matrix is computationally expensive, though. Here is my current code, which has been running for around 2 hours.

    uniqueStudentIds = df.Id.unique()
    uniqueClasses = df['Course_Title'].unique()

    for studentID in uniqueStudentIds:
        for course1 in uniqueClasses:
            for course2 in uniqueClasses:
                if (course1 != course2 and have_taken_both_courses(course1, course2, studentID)):
                    x = vertexDict[course1]
                    y = vertexDict[course2]
                    # Assuming symmetry
                    adjacency_matrix[x][y] += 1
                    adjacency_matrix[y][x] += 1
                    print(course1 + ', ' + course2)

    def have_taken_both_courses(course1, course2, studentID):
        hasTakenFirstCourse = len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0
        if hasTakenFirstCourse:
            return len(df.loc[(df['Course_Title'] == course2) & (df['Id'] == studentID)]) > 0
        else:
            return False

Because my dataset is very large, I have looked at online resources on parallelizing/multithreading this computationally expensive for loop. However, I'm new to Python and multiprocessing, so any guidance would be greatly appreciated!
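len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0 is very expensive inside your tight loop, because every call scans the whole DataFrame, and it runs once or twice for every (student, course1, course2) combination. Parallelization isn't going to help nearly as much as counting more efficiently. Also, if you are going to for-loop over the unique IDs, convert them to plain lists first (for example with .tolist()) rather than iterating over NumPy arrays.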
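
As a rough sketch of what "counting more efficiently" could look like, one option is to build a student-by-course indicator table once and get all pairwise counts from a single matrix product. This is only one possible approach, not necessarily what the comment had in mind. It assumes the column names Id and Course_Title from the question; the function name cooccurrence_matrix and the tiny sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def cooccurrence_matrix(df, id_col="Id", course_col="Course_Title"):
    """Count, for every pair of courses, how many students took both.

    Hypothetical helper: the name and defaults are assumptions based on
    the column names used in the question.
    """
    # One row per student, one column per course; 1 if the student took
    # the course at least once (repeat enrollments collapse to 1).
    taken = pd.crosstab(df[id_col], df[course_col]).gt(0).astype(int)

    # (courses x students) @ (students x courses): entry [i, j] is the
    # number of students who took both course i and course j.
    values = taken.to_numpy()
    counts = values.T @ values

    # The diagonal holds per-course enrollment, not a pair; zero it to
    # mirror the course1 != course2 check in the original loop.
    np.fill_diagonal(counts, 0)

    return pd.DataFrame(counts, index=taken.columns, columns=taken.columns)


# Made-up sample data, only to show the call.
sample = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 3],
    "Course_Title": ["A", "B", "C", "A", "B", "A"],
})
adjacency = cooccurrence_matrix(sample)
print(adjacency)
```

Note that this counts each student once per course pair. The loop in the question visits both (course1, course2) and (course2, course1) and increments both cells each time, so its totals would be twice these values; multiply by 2 if that behaviour is actually wanted. For a very large number of courses or students, a sparse matrix might be a better fit than a dense crosstab.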