The question I'm asking is similar to the one I posted here a while ago: Comparing 2 Pandas dataframes row by row and performing a calculation on each row
I got a very helpful answer to that question and I'm trying to use that information to help me answer my current question.
Task: Group a dataframe by the columns trial, RECORDING_SESSION_LABEL, and IP_INDEX. Within each group, for every row from Row 2 to Row n, I need to calculate the Euclidean distance between that row and each row above it, using the values in columns CURRENT_FIX_X and CURRENT_FIX_Y. If a distance is less than 58.93, I need to add the CURRENT_FIX_INDEX of the earlier row (the one being compared to, not the current row) to a list. That list then needs to be joined into a comma-separated string and stored in a new column (refix_list) on the current row.
Example: I'm on Row 7, so I'm comparing Row 7 to Rows 6, 5, 4, 3, 2, and 1 of that group. If the distances between Row 7 and Rows 5, 3, and 1 are each less than 58.93, I want the refix_list column at Row 7 to contain a comma-separated string of the CURRENT_FIX_INDEX values of those 3 rows.
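To make the rule concrete, here is a minimal sketch for a single group, using the coordinates of the trial 1 / IP_INDEX 2 rows from the expected-output table further down. The names `THRESHOLD` and `refixations_for_row` are made up for illustration, and listing the nearest-preceding row first is an assumption based on how the expected output is ordered.

```python
import numpy as np

# One group (trial 1, IP_INDEX 2), values as shown in the expected-output table
fix_index = np.array([1, 2, 3, 4, 5])
x = np.array([300, 500, 500, 275, 550], dtype=float)
y = np.array([650, 300, 325, 675, 300], dtype=float)

THRESHOLD = 58.93  # hypothetical constant name for the distance cutoff


def refixations_for_row(i):
    """Return CURRENT_FIX_INDEX values of the rows above row i (0-based)
    whose distance to row i is below THRESHOLD, nearest-preceding row first."""
    # Distances from row i to every earlier row; empty when i == 0
    dist = np.hypot(x[:i] - x[i], y[:i] - y[i])
    hits = fix_index[:i][dist < THRESHOLD]
    return [int(v) for v in hits[::-1]]


# Last row of the group: close to the rows with CURRENT_FIX_INDEX 3 and 2
print(", ".join(map(str, refixations_for_row(4))))  # prints "3, 2"
```

The first row of a group has nothing above it, so the function returns an empty list there, which would become an empty string in refix_list.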
Problem: I'm not sure whether the code I'm working with is correct, because I get 'ValueError: Length of values (0) does not match length of index (297)' when I try to print the df. So I know there's an issue either in creating the list or, more likely, in concatenating it into a string and assigning it to the specific row.
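As far as I can tell, an error of this kind can be reproduced on its own by assigning an empty list as a new column of a non-empty DataFrame, so that may be where it comes from. A minimal sketch (the `small` frame and the `new_col` name are made up for illustration; the exact wording of the message may vary between pandas versions):

```python
import pandas as pd

# Tiny stand-in frame with 3 rows
small = pd.DataFrame({"a": [1, 2, 3]})

try:
    # An empty list supplies 0 values for 3 rows, so pandas refuses it
    small["new_col"] = []
    message = ""
except ValueError as exc:
    message = str(exc)

print(message)

# A scalar is broadcast to every row, so this creates an empty placeholder column
small["new_col"] = ""
print(small)
```

If that is the cause, the column would need to be created from a scalar or from a sequence with exactly one value per row.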
Here's the code I'm working with (with sample data for 1 participant):

    import pandas as pd
    import numpy as np

    data_df = {
        'IP_INDEX': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4],
        'RECORDING_SESSION_LABEL': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'],
        'trial': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'CURRENT_FIX_INDEX': [1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 1, 1, 2, 1, 1, 2, 3, 1, 2, 3, 4],
        'CURRENT_FIX_X': [550, 575, 250, 300, 500, 500, 275, 550, 675, 650, 800, 325, 450, 400, 375, 650, 700, 675, 825, 400, 375, 150],
        'CURRENT_FIX_Y': [275, 250, 600, 650, 300, 325, 675, 300, 850, 875, 250, 625, 225, 150, 675, 250, 300, 275, 150, 225, 250, 650]
    }

    # Create DF1
    df = pd.DataFrame(data_df)

    # Define a function to calculate Euclidean distance
    def euclidean_distance(x1, y1, x2, y2):
        return np.sqrt((x1 - x2)**2 + (y1 - y2)**2)

    # Grouping the DataFrame by RECORDING_SESSION_LABEL, trial, and IP_INDEX
    grouped = df.groupby(['RECORDING_SESSION_LABEL', 'trial', 'IP_INDEX'])

    # List to store CURRENT_FIX_INDEX for each row
    index_list = []
    refix_values = []

    # Iterate over each group
    for group_name, group_df in grouped:
        # Sort the group_df by some unique column
        group_df = group_df.sort_values(by='trial')

        # Calculate Euclidean distance for each row
        for i, row in group_df.iterrows():
            current_x = row['CURRENT_FIX_X']
            current_y = row['CURRENT_FIX_Y']

            # Calculate distance with every row above it
            for j, prev_row in group_df.iloc[:i].iterrows():
                current_index = prev_row['CURRENT_FIX_INDEX']
                prev_x = prev_row['CURRENT_FIX_X']
                prev_y = prev_row['CURRENT_FIX_Y']
                distance = euclidean_distance(current_x, current_y, prev_x, prev_y)

                # If distance is less than or equal to 58.93, store CURRENT_FIX_INDEX
                if distance <= 58.93:
                    index_list.append(current_index)
                    refix_values.append(','.join(map(str, index_list)))  # Add list of matching INDEX values to list of lists

    df['refix_list'] = []

    # Iterate over the DataFrame to access each row and its index
    for index, row in df.iterrows():
        # Assign the list to the current row in the specified column
        df.at[index, refix_list] = refix_values

    print(df)

Expected Output:
| IP_INDEX | RECORDING_SESSION_LABEL | trial | CURRENT_FIX_INDEX | CURRENT_FIX_X | CURRENT_FIX_Y | refix_list |
|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 550 | 275 | |
| 1 | a | 1 | 2 | 575 | 250 | 1 |
| 1 | a | 1 | 3 | 250 | 600 | |
| 2 | a | 1 | 1 | 300 | 650 | |
| 2 | a | 1 | 2 | 500 | 300 | |
| 2 | a | 1 | 3 | 500 | 325 | 2 |
| 2 | a | 1 | 4 | 275 | 675 | 1 |
| 2 | a | 1 | 5 | 550 | 300 | 3, 2 |
| 3 | a | 1 | 1 | 675 | 850 | |
| 3 | a | 1 | 2 | 650 | 875 | 1 |
| 3 | a | 1 | 3 | 800 | 250 | |
| 4 | a | 1 | 1 | 325 | 625 | |
| 1 | a | 2 | 1 | 450 | 225 | |
| 1 | a | 2 | 2 | 400 | 150 | |
| 2 | a | 2 | 1 | 375 | 675 | |
| 3 | a | 2 | 1 | 650 | 250 | |
| 3 | a | 2 | 2 | 700 | 300 | |
| 3 | a | 2 | 3 | 675 | 275 | 2, 1 |
| 4 | a | 2 | 1 | 825 | 150 | |
| 4 | a | 2 | 2 | 400 | 225 | |
| 4 | a | 2 | 3 | 375 | 250 | 2 |
| 4 | a | 2 | 4 | 150 | 650 | |
From my limited knowledge, I'm guessing the issue is in the last block of code, but I'm not positive. Any help is appreciated!
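In case it helps frame an answer, this is a rough sketch of the per-group direction I think the code needs to go in. It is not a definitive implementation: the helper name `refix_strings`, the `THRESHOLD` constant, and the small `demo` frame (a few coordinates from the sample data, relabelled into two groups) are all made up for illustration, and it assumes the rows of each group are already in fixation order.

```python
import numpy as np
import pandas as pd

THRESHOLD = 58.93  # distance cutoff from the task description


def refix_strings(group):
    """For each row of one group, join the CURRENT_FIX_INDEX values of all
    earlier rows closer than THRESHOLD, nearest-preceding row first."""
    xs = group["CURRENT_FIX_X"].to_numpy(dtype=float)
    ys = group["CURRENT_FIX_Y"].to_numpy(dtype=float)
    idx = group["CURRENT_FIX_INDEX"].to_numpy()
    out = []
    for i in range(len(group)):
        # Distances from row i to the rows above it (positions within the group)
        dist = np.hypot(xs[:i] - xs[i], ys[:i] - ys[i])
        hits = idx[:i][dist < THRESHOLD][::-1]
        out.append(", ".join(str(v) for v in hits))
    # Keep the original row labels so the result aligns with the full frame
    return pd.Series(out, index=group.index)


# Illustrative frame with two groups (trial 1 and trial 2)
demo = pd.DataFrame({
    "RECORDING_SESSION_LABEL": ["a"] * 5,
    "trial": [1, 1, 1, 2, 2],
    "IP_INDEX": [1, 1, 1, 1, 1],
    "CURRENT_FIX_INDEX": [1, 2, 3, 1, 2],
    "CURRENT_FIX_X": [650, 700, 675, 450, 400],
    "CURRENT_FIX_Y": [250, 300, 275, 225, 150],
})

keys = ["RECORDING_SESSION_LABEL", "trial", "IP_INDEX"]
parts = [refix_strings(g) for _, g in demo.groupby(keys, sort=False)]

# One string per row; assignment aligns on the index labels
demo["refix_list"] = pd.concat(parts)
print(demo)
```

The sketch slices by position inside each group (`xs[:i]`) rather than by the DataFrame's index labels, and builds one string per row, so the new column always has exactly as many values as the frame has rows.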