
I have the following dataset (with different values in practice; the rows here are just repeated). I need to combine the columns and hash the result, specifically with the hashlib library and the algorithm shown below.

The problem is that it takes too long. The function is pretty simple and I have the feeling it could be vectorized, but I am not an expert and am struggling to implement it.

I am working with millions of rows, and it takes hours even when hashing only 4 columns' values.

import pandas as pd
import hashlib

data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* 100000,'second_identifier':['RED413','BLU031']* 100000})

def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()

%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)

  • Not a full answer, just sharing: my first instinct would be to use map() (see the sketch after these comments), but swifter or numba may be fastest. Here are 12 ways to do it: towardsdatascience.com/…, along with a handy comparison graph. Commented Sep 7, 2021 at 23:34
  • @sh37211 Thanks for sharing, but I can't use external frameworks. Commented Sep 7, 2021 at 23:36
  • OK, then map would be a good start. But I can't even run your code snippet to see how slow it is: the data = pd.DataFrame line yields ValueError: arrays must all be same length. Commented Sep 7, 2021 at 23:40
  • @sh37211 I am very sorry, I forgot to add "* 100000" on my second row when creating the dataframe. It should be working now. Thanks for the answers so far. Commented Sep 7, 2021 at 23:52
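
A minimal sketch of the map() idea from the comment above (my own illustration, not code from the question or comments), assuming the corrected two-column frame:

import hashlib
import pandas as pd

data = pd.DataFrame({'first_identifier': ['ALP1x', 'RDX2b'] * 100000,
                     'second_identifier': ['RED413', 'BLU031'] * 100000})

# Concatenate the columns with vectorized string addition, lowercase once,
# then map a plain-Python hashing function over the combined Series.
combined = (data['first_identifier'] + data['second_identifier']).str.lower()
data['row_hash'] = combined.map(lambda s: hashlib.md5(s.encode()).hexdigest())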

2 Answers


Using a list comprehension will get you a significant speedup.

First your original:

import pandas as pd
import hashlib

n = 100000
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})

def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()

%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)

1 loop, best of 5: 26.1 s per loop

Then as a list comprehension:

data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})

def list_comp(df):
    return pd.Series([ _mutate_hash(row) for row in df.to_numpy() ])

%timeit data['row_hash']=list_comp(data)

1 loop, best of 5: 872 ms per loop

...i.e., a speedup of ~30x.

As a sanity check, you can verify that the two methods yield equivalent results by storing the first result in "data2" and the second in "data3", then comparing them:

data2, data3 = pd.DataFrame([]), pd.DataFrame([])
%timeit data2['row_hash']=data.apply(_mutate_hash,axis=1)
...
%timeit data3['row_hash']=list_comp(data)
...
data2.equals(data3)
True

Comments

@Marat For further evidence of list comprehensions outperforming .apply(), see the link I provided in the comment above on "12 ways to do it" and the scaling figure at the end: towardsdatascience.com/… He shows .apply() is ~25 times slower than a list comprehension for 10,000 to 1,000,000 rows.
I am getting the following error after copy-pasting your code and running list_comp: "ValueError: ndarray is not C-contiguous"
Sorry, I deleted that comment after noticing it is actually applied to 10k rows. (for those who missed it, TLDR: the answer is misleading because apply has overhead due to optimization).
@AlejandroA Uh...that error message is produced by Cython. Not sure why you're seeing a Cython error message. Here's the Colab where I wrote & ran the code above: colab.research.google.com/drive/… Just re-ran it after Factory Reset. Still works.
@Marat Confirmed that yours is fastest so far by more than 2x over mine: 1 loop, best of 5: 390 ms per loop, for a total speedup of ~66x over the original. Added your code to the Colab linked in the previous comment. Good job! You should add it as an answer. (I would not have imagined writing all that.)

The easiest performance boost comes from using vectorized string operations. If you do the string prep (lowercasing and encoding) before applying the hash function, your performance is much more reasonable.

import hashlib

import pandas as pd

data = pd.DataFrame(
    {
        "first_identifier": ["ALP1x", "RDX2b"] * 1000000,
        "second_identifier": ["RED413", "BLU031"] * 1000000,
    }
)


def _mutate_hash(value):
    # value is already lowercased and UTF-8 encoded, so hash it directly
    return hashlib.md5(value).hexdigest()


# Lowercase and encode each column with vectorized string methods,
# then concatenate the resulting byte strings row-wise.
prepped_data = data.apply(lambda col: col.str.lower().str.encode("utf8")).sum(axis=1)

data["row_hash"] = prepped_data.map(_mutate_hash)

I see ~25x speedup with that change.
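
As a quick sanity check (my own addition, not part of the original answer): lowercasing each column before concatenation yields the same bytes as lowercasing the concatenated string, so the vectorized result should match the original row-wise apply. A minimal sketch on a small slice:

# Compare against the original row-wise approach on the first 1,000 rows.
# Column order matters, since md5 is sensitive to concatenation order.
sample = data[["first_identifier", "second_identifier"]].head(1000)
original = sample.apply(lambda row: hashlib.md5(row.sum().lower().encode()).hexdigest(), axis=1)
assert original.equals(data["row_hash"].head(1000))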
