This question is similar to one I posted here a while ago: "Comparing 2 Pandas dataframes row by row and performing a calculation on each row".

I got a very helpful answer to that question and I'm trying to use that information to help me answer my current question.

Task: Group a dataframe by the columns trial, RECORDING_SESSION_LABEL, and IP_INDEX. Within each group, for every row from Row 2 to Row n, I need to calculate the Euclidean distance between that row and each row above it, using the values in columns CURRENT_FIX_X and CURRENT_FIX_Y. If a distance is less than 58.93, I need to add the CURRENT_FIX_INDEX value of the earlier row (the row I'm comparing to, not the current row) to a list. That list should then be joined into a string and stored in a new column (refix_list) on the current row.

Example: I'm on Row 7, so I'm comparing Row 7 to Rows 6, 5, 4, 3, 2, and 1 of that group. If the distances between Row 7 and Rows 5, 3, and 1 are each less than 58.93, I want the refix_list column at Row 7 to hold a comma-separated string containing the CURRENT_FIX_INDEX values of those 3 rows.
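The rule described above can be sketched as follows. This is only a minimal illustration, not a tested fix for the code further down: the helper names `refix_for_group` and `add_refix_list` are hypothetical, the sample frame contains just two of the groups from the expected output, and it assumes the rows are already in fixation order within each group and that the dataframe index is unique.

```python
import numpy as np
import pandas as pd

THRESHOLD = 58.93  # distance cutoff from the task description
KEYS = ["RECORDING_SESSION_LABEL", "trial", "IP_INDEX"]

def refix_for_group(group: pd.DataFrame) -> pd.Series:
    """For each row, join the CURRENT_FIX_INDEX of earlier rows within THRESHOLD."""
    xs = group["CURRENT_FIX_X"].to_numpy(dtype=float)
    ys = group["CURRENT_FIX_Y"].to_numpy(dtype=float)
    fix_idx = group["CURRENT_FIX_INDEX"].to_numpy()
    out = []
    for pos in range(len(group)):
        # Euclidean distance from the current row to every row above it.
        dist = np.hypot(xs[:pos] - xs[pos], ys[:pos] - ys[pos])
        # Walk upwards so the nearest previous row comes first in the string.
        hits = [str(fix_idx[k]) for k in range(pos - 1, -1, -1)
                if dist[k] < THRESHOLD]
        out.append(", ".join(hits))
    # Keep the original row labels so the result can be aligned back onto df.
    return pd.Series(out, index=group.index)

def add_refix_list(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a refix_list column computed per group."""
    result = df.copy()
    pieces = [refix_for_group(g) for _, g in result.groupby(KEYS, sort=False)]
    result["refix_list"] = pd.concat(pieces)  # aligned on the row index
    return result

# Two groups taken from the expected output (assumed sample, not the full data).
sample = pd.DataFrame({
    "IP_INDEX": [2, 2, 2, 2, 2, 3, 3, 3],
    "RECORDING_SESSION_LABEL": ["a"] * 8,
    "trial": [1, 1, 1, 1, 1, 2, 2, 2],
    "CURRENT_FIX_INDEX": [1, 2, 3, 4, 5, 1, 2, 3],
    "CURRENT_FIX_X": [300, 500, 500, 275, 550, 650, 700, 675],
    "CURRENT_FIX_Y": [650, 300, 325, 675, 300, 250, 300, 275],
})
result = add_refix_list(sample)
print(result)
```

Returning one Series per group with the original row labels means each string lands on the row it was computed for, without building one shared list across groups.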

Problem: I'm not sure whether the code I'm working with does what I want, because I get 'ValueError: Length of values (0) does not match length of index (297)' when I try to print the df. So I know there's an issue either in creating the list or, more likely, in concatenating it into a string and assigning it to the specific row.
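For reference, this kind of ValueError can be reproduced in isolation by assigning a list whose length differs from the number of rows to a new column. The snippet below is a small sketch on a made-up 3-row frame, assuming a recent pandas version; it is not the full dataset.

```python
import pandas as pd

# Hypothetical 3-row frame, used only to trigger the length check.
df = pd.DataFrame({"CURRENT_FIX_INDEX": [1, 2, 3]})

try:
    df["refix_list"] = []  # empty list assigned to a frame with 3 rows
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# A list with exactly one entry per row is accepted.
df["refix_list"] = ["", "1", ""]
print(df)
```

The posted code contains the same pattern (`df['refix_list'] = []`), so initialising the column with a scalar such as an empty string, or with one value per row, may be a place to start.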

Here's the code I'm working with (with sample data for 1 participant):

import pandas as pd
import numpy as np

data_df = {
    'IP_INDEX': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4],
    'RECORDING_SESSION_LABEL': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'],
    'trial': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'CURRENT_FIX_INDEX': [1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 1, 1, 2, 1, 1, 2, 3, 1, 2, 3, 4],
    'CURRENT_FIX_X': [550, 575, 250, 300, 500, 500, 275, 550, 675, 650, 800, 325, 450, 400, 375, 650, 700, 675, 825, 400, 375, 150],
    'CURRENT_FIX_Y': [275, 250, 600, 650, 300, 325, 675, 300, 850, 875, 250, 625, 225, 150, 675, 250, 300, 275, 150, 225, 250, 650]

}

# Create DF1
df = pd.DataFrame(data_df)

# Define a function to calculate Euclidean distance
def euclidean_distance(x1, y1, x2, y2):
    return np.sqrt((x1 - x2)**2 + (y1 - y2)**2)

# Grouping the DataFrame by RECORDING_SESSION_LABEL, trial, and IP_INDEX
grouped = df.groupby(['RECORDING_SESSION_LABEL', 'trial', 'IP_INDEX'])

# List to store CURRENT_FIX_INDEX for each row
index_list = []
refix_values = []

# Iterate over each group
for group_name, group_df in grouped:
    # Sort the group_df by some unique column
    group_df = group_df.sort_values(by='trial')
    
    # Calculate Euclidean distance for each row
    for i, row in group_df.iterrows():
        current_x = row['CURRENT_FIX_X']
        current_y = row['CURRENT_FIX_Y']
        
        # Calculate distance with every row above it
        for j, prev_row in group_df.iloc[:i].iterrows():
            current_index = prev_row['CURRENT_FIX_INDEX']
            prev_x = prev_row['CURRENT_FIX_X']
            prev_y = prev_row['CURRENT_FIX_Y']
            
            distance = euclidean_distance(current_x, current_y, prev_x, prev_y)
            
            # If distance is less than 58.93, store CURRENT_FIX_INDEX
            if distance < 58.93:
                index_list.append(current_index)
    refix_values.append(','.join(map(str, index_list))) #Add list of matching INDEX values to list of lists

df['refix_list'] = []

# Iterate over the DataFrame to access each row and its index
for index, row in df.iterrows():
    # Assign the list to the current row in the specified column
    df.at[index, refix_list] = refix_values

print(df)

Expected Output:

IP_INDEX  RECORDING_SESSION_LABEL  trial  CURRENT_FIX_INDEX  CURRENT_FIX_X  CURRENT_FIX_Y  refix_list
1         a                        1      1                  550            275
1         a                        1      2                  575            250            1
1         a                        1      3                  250            600
2         a                        1      1                  300            650
2         a                        1      2                  500            300
2         a                        1      3                  500            325            2
2         a                        1      4                  275            675            1
2         a                        1      5                  550            300            3, 2
3         a                        1      1                  675            850
3         a                        1      2                  650            875            1
3         a                        1      3                  800            250
4         a                        1      1                  325            625
1         a                        2      1                  450            225
1         a                        2      2                  400            150
2         a                        2      1                  375            675
3         a                        2      1                  650            250
3         a                        2      2                  700            300
3         a                        2      3                  675            275            2, 1
4         a                        2      1                  825            150
4         a                        2      2                  400            225
4         a                        2      3                  375            250            2
4         a                        2      4                  150            650

From my limited knowledge, I'm guessing the issue is in the last block of code, but I'm not positive. Any help is appreciated!

  • You should provide an example of what your input and output data are supposed to look like. Commented May 10, 2024 at 18:09
  • Are you entirely sure of row 8? `2 a 1 5 550 300 3, 2`? Commented May 11, 2024 at 19:33
  • @SergedeGossondeVarennes yes. The distance between row 8 and rows 5&6 is less than the specified amount. Since I need the data to be grouped by trial, recording_session_label, and ip_index, the calculation for row 8 would stop at row 4. Thus, it wouldn't check if row 8 matched with any rows above that. Commented May 12, 2024 at 14:50
