I have the following dataset (the real one has different values; this example just repeats the same two rows). I need to combine the columns and hash the result, specifically with the hashlib library and the MD5 algorithm.
The problem is that it takes too long: with millions of rows it runs for hours, even when hashing only four columns' values. The function is pretty simple and I feel it could be vectorized, but I am struggling to implement that.
import pandas as pd
import hashlib

data = pd.DataFrame({'first_identifier': ['ALP1x', 'RDX2b'] * 100000,
                     'second_identifier': ['RED413', 'BLU031'] * 100000})

def _mutate_hash(row):
    # Concatenate all column values, lowercase, and MD5-hash the result
    return hashlib.md5(row.sum().lower().encode()).hexdigest()

%timeit data['row_hash'] = data.apply(_mutate_hash, axis=1)
Comments:
- You could try map(), but swifter or numba may be fastest. Here are 12 ways to do it: towardsdatascience.com/… And a handy comparison graph here.
- map would be a good start. But I can't even run your code snippet to see how slow it is: the data = pd.DataFrame line yields ValueError: arrays must all be same length.
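For reference, here is a minimal sketch of the map-based approach the comments point at: instead of apply(axis=1), build the combined string per row with vectorized string concatenation, then map the hash over a single Series. It assumes the same combine-lowercase-MD5 logic as _mutate_hash above; the helper name hash_rows_vectorized and the column list are illustrative, not from the original post.

import pandas as pd
import hashlib

def hash_rows_vectorized(df, cols):
    # Hypothetical helper: concatenate the given columns with vectorized
    # string ops, lowercase the result, then MD5-hash each combined string.
    # Series.map over one column avoids the per-row apply(axis=1) overhead.
    combined = df[cols[0]].str.cat([df[c] for c in cols[1:]]).str.lower()
    return combined.map(lambda s: hashlib.md5(s.encode()).hexdigest())

data['row_hash'] = hash_rows_vectorized(data, ['first_identifier', 'second_identifier'])

The MD5 call itself still runs once per row (hashlib has no batch API), but replacing apply(axis=1) with vectorized concatenation plus Series.map typically removes most of the per-row overhead.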