Pandas function vectorization

Question

I developed a program that needs to calibrate more than 1 million data points, and I want to vectorize it to improve run time.

I have a dataframe with columns ['time', 'raw'] and I want to create a new column with the calibrated data.

I have another dataframe that holds the calibration data, with columns ['calibration_name', 'raw_value', 'calibrated_value'].

I wrote a function that retrieves the calibrated_value, and I can run it with the apply method:

import pandas as pd

def calibrate(value, calibration):
    # Note: the calibration file is re-read on every call
    df_calibrations = pd.read_csv('calibration_data.csv', usecols=['calibration_name', 'raw_value', 'calibrated_value'])
    y_out = df_calibrations.loc[df_calibrations ['calibration_name'] == value]['calibrated_value'].iloc[0]
    return y_out


df = pd.read_csv('data_to_calibrate.csv', usecols=['time', 'raw'])
calibration = 'calibration_name'
df['eng'] = df['raw'].apply(calibrate, calibration=calibration)

My code works, but I want to improve performance, so I tried to vectorize it as:

df['eng'] = calibrate(df['raw'], calibration)

However, I get this error:

('Lengths must match to compare', (11,), (7630,))

I cannot come up with a solution to vectorize the line:

y_out = df_calibrations.loc[df_calibrations ['calibration_name'] == value]['calibrated_value'].iloc[0]

Is there a way to do so?

data_to_calibrate.csv:

time,raw
1571348671638000000,1
1571348676493000000,3
1571348681180000000,2

calibration_data.csv:

calibration_name,raw_value,calibrated_value
XXXX01,0,A
XXXX01,1,B
XXXX01,2,C
XXXX01,3,D

Can you use merge instead of apply with two dataframes? This looks really inefficient.
– alparslan mimaroğlu, Commented Aug 30, 2021 at 12:42
How would I use merge? The correspondence between the raw and calibrated values is in a different file.
– angie866, Commented Aug 30, 2021 at 12:50
Can you share a sample of both datasets? It seems like you only have to merge on calibration_name. It should be relatively easy.
– alparslan mimaroğlu, Commented Aug 30, 2021 at 12:52
I have added it to the question so it is more readable. In the example, the new column of data_to_calibrate.csv should be B-D-C.
– angie866, Commented Aug 30, 2021 at 12:57

alparslan mimaroğlu · Accepted Answer · 2021-08-30 13:26:57Z

By merging on the common column, you can perform the whole lookup in a vectorized manner:

data_to_calibrate = data_to_calibrate.merge(calibration_data, how='left', left_on='raw', right_on='raw_value')

# Rows without a matching raw_value keep NaN in the new column
data_to_calibrate['eng'] = data_to_calibrate['calibrated_value']
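As a minimal, self-contained sketch of the merge approach, the sample rows from the question can be inlined and merged as below. The names `table`, `merged`, and the `eng` output column (borrowed from the question's code) are choices made for this illustration, and filtering on a single `calibration_name` before merging is an assumption about how the calibration file is meant to be used.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without files.
data_csv = io.StringIO(
    "time,raw\n"
    "1571348671638000000,1\n"
    "1571348676493000000,3\n"
    "1571348681180000000,2\n"
)
calibration_csv = io.StringIO(
    "calibration_name,raw_value,calibrated_value\n"
    "XXXX01,0,A\n"
    "XXXX01,1,B\n"
    "XXXX01,2,C\n"
    "XXXX01,3,D\n"
)

data_to_calibrate = pd.read_csv(data_csv)
calibration_data = pd.read_csv(calibration_csv)

# Assumption: only one calibration applies, so keep just its rows.
# This makes each raw_value appear at most once in the lookup table.
calibration = "XXXX01"
table = calibration_data.loc[
    calibration_data["calibration_name"] == calibration,
    ["raw_value", "calibrated_value"],
]

# A left merge keeps every measurement row; validate raises if the
# lookup table contains duplicate raw_value keys.
merged = data_to_calibrate.merge(
    table,
    how="left",
    left_on="raw",
    right_on="raw_value",
    validate="many_to_one",
)

# Drop the helper key and name the result column as in the question.
merged = merged.drop(columns="raw_value").rename(
    columns={"calibrated_value": "eng"}
)

print(merged["eng"].tolist())  # ['B', 'D', 'C']
```

The result matches the B-D-C sequence the asker expects for the sample data. Raw values that are missing from the calibration table would come out as NaN rather than raising an error.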

1 Comment

angie866 Over a year ago

Thank you. I have actually used a different method. I transformed my data_to_calibrate dataframe to a dictionary and I have mapped the data as: df['eng'] = df.raw.map(df2)
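The comment does not show how `df2` was built. One possible reading, assuming the dictionary maps each `raw_value` of a single calibration to its `calibrated_value`, is sketched below; `lookup` is a hypothetical name standing in for `df2`, and the sample rows are taken from the question.

```python
import pandas as pd

# Sample data from the question, built in memory for a runnable sketch.
df = pd.DataFrame(
    {
        "time": [
            1571348671638000000,
            1571348676493000000,
            1571348681180000000,
        ],
        "raw": [1, 3, 2],
    }
)
calibration_data = pd.DataFrame(
    {
        "calibration_name": ["XXXX01"] * 4,
        "raw_value": [0, 1, 2, 3],
        "calibrated_value": ["A", "B", "C", "D"],
    }
)

# Assumption: select one calibration, then build a raw -> calibrated dict.
selected = calibration_data[calibration_data["calibration_name"] == "XXXX01"]
lookup = selected.set_index("raw_value")["calibrated_value"].to_dict()

# Series.map looks up every raw value in one vectorized pass;
# values missing from the dictionary become NaN.
df["eng"] = df["raw"].map(lookup)

print(df["eng"].tolist())  # ['B', 'D', 'C']
```

For a single lookup column this does the same job as the merge, without adding and then dropping a helper key column.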
