
I am trying to construct hierarchies from a dataset where each row represents a student, a course they've taken, and some other metadata. From this dataset, I'm trying to build an adjacency matrix and determine the hierarchies based on which classes students have taken, and the paths different students follow when choosing classes.

That being said, constructing this adjacency matrix is computationally expensive. Here is the code I have currently, which has been running for around two hours.

def have_taken_both_courses(course1, course2, studentID):
    hasTakenFirstCourse = len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0
    if hasTakenFirstCourse:
        return len(df.loc[(df['Course_Title'] == course2) & (df['Id'] == studentID)]) > 0
    return False


uniqueStudentIds = df.Id.unique()
uniqueClasses = df['Course_Title'].unique()
for studentID in uniqueStudentIds:
    for course1 in uniqueClasses:
        for course2 in uniqueClasses:
            if course1 != course2 and have_taken_both_courses(course1, course2, studentID):
                x = vertexDict[course1]
                y = vertexDict[course2]
                # Assuming symmetry
                adjacency_matrix[x][y] += 1
                adjacency_matrix[y][x] += 1
                print(course1 + ', ' + course2)

Given my very large dataset, I have tried to consult online resources on parallelizing/multithreading this computationally expensive for loop. However, I'm new to Python and multiprocessing, so any guidance would be greatly appreciated!

  • You might be better off doing this in SQL. You're just looking for all courses x, y such that there is a student who took both x and y? One thing to look at is using sets instead of lists; they are faster when doing a lookup. Commented May 19, 2018 at 5:28
  • Although I'd like to do this in SQL, I need to construct the matrix and then do some data formatting to funnel into an API. I'm not sure how I'd do that given that my data file is a CSV and I eventually have to return a JSON object that represents a graph. Commented May 19, 2018 at 5:31
  • 1
    You are essentially doing linear scans for every unique pair to get counts, redunantly iterating every time. This: len(df.loc[(df['Course_Title'] == course1) & (df['Id'] == studentID)]) > 0 is very expensive in your tight loop. Parallelization isn't going to help nearly as much as counting more efficiently. Also, if you are going to be for-looping over unique id's, just convert to lists and don't use numpy arrays Commented May 19, 2018 at 6:00
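A sketch of the set-based lookup the first comment suggests, using made-up toy data with the question's column names: each student's course set is precomputed once, so the membership check inside the loop becomes O(1) instead of a full DataFrame scan.

```python
import pandas as pd

# Hypothetical toy data mirroring the question's columns.
df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Course_Title': ['Math', 'Physics', 'Math', 'Chemistry', 'Physics'],
})

# Precompute each student's set of courses in one pass over the rows.
courses_by_student = df.groupby('Id')['Course_Title'].agg(set).to_dict()

def have_taken_both_courses(course1, course2, student_id):
    # O(1) set membership checks instead of two DataFrame scans.
    taken = courses_by_student.get(student_id, set())
    return course1 in taken and course2 in taken
```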

1 Answer


It appears you are looping way more than you have to. For every student you do NxN iterations, where N is the total number of classes. But your student has only taken a subset of those classes, so you can cut down on iterations significantly.

Your have_taken_both_courses() lookup is also more expensive than it needs to be.

Something like this will probably go a lot faster:

import numpy as np
import itertools
import pandas as pd

df = pd.read_table('/path/to/data.tsv')

students_df = pd.DataFrame(df['student'].unique())
students_lkp = {x[1][0]: x[0] for x in students_df.iterrows()}

classes_df = pd.DataFrame(df['class'].unique())
classes_lkp = {x[1][0]: x[0] for x in classes_df.iterrows()}

df['student_key'] = df['student'].apply(lambda x: students_lkp[x])
df['class_key'] = df['class'].apply(lambda x: classes_lkp[x])

df.set_index(['student_key', 'class_key'], inplace=True)

matr = np.zeros((len(classes_df), len(classes_df)))

for s in range(len(students_df)):
    print(s)
    # get all the classes for this student
    classes = df.loc[s].index.unique().tolist()
    for x, y in itertools.permutations(classes, 2):
        matr[x, y] += 1
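An alternative sketch that drops the per-student Python loop entirely: build a student-by-course incidence matrix with pd.crosstab and take a dot product, so entry (x, y) counts the students who took both courses. The column names follow the question; the toy data is made up.

```python
import numpy as np
import pandas as pd

# Toy data with the question's column names.
df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Course_Title': ['Math', 'Physics', 'Math', 'Chemistry', 'Physics'],
})

# Student x course incidence matrix: 1 if the student took the course.
incidence = (pd.crosstab(df['Id'], df['Course_Title']) > 0).astype(int)

# Course x course co-occurrence counts; entry (x, y) = number of
# students who took both x and y.
cooc = incidence.T.dot(incidence)

# Zero the diagonal ("course with itself" counts).
cooc -= np.diag(np.diag(cooc))
```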

4 Comments

Thanks! This definitely will be a lot faster. Just a couple of semantic things: at the bottom, the x and y, what do those numbers represent?
Those are the surrogate keys for the classes, i.e. the index for the class in classes_df.
And when I want to retrieve the metadata of a row after scanning the matrix, what command would I use given the class_key indexes?
classes_df.loc[x], where x is the key you're looking up.
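Since the question mentions eventually returning a JSON object that represents a graph, a minimal sketch of that last step might look like this. The toy matrix, class names, and the d3-style "nodes"/"links" key names are all assumptions, not from the post.

```python
import json
import numpy as np

# Hypothetical inputs: a co-occurrence matrix and the class labels
# (in the answer's terms, the labels would come from classes_df).
class_names = ['Math', 'Physics', 'Chemistry']
matr = np.array([[0, 2, 1],
                 [2, 0, 1],
                 [1, 1, 0]])

# One edge per unordered pair with a nonzero count.
graph = {
    'nodes': [{'id': i, 'name': name} for i, name in enumerate(class_names)],
    'links': [
        {'source': x, 'target': y, 'weight': int(matr[x][y])}
        for x in range(len(class_names))
        for y in range(x + 1, len(class_names))
        if matr[x][y] > 0
    ],
}
graph_json = json.dumps(graph)
```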
