
I have the following dataset (with different values in practice; the rows here are just repeated). I need to combine the columns and hash the result, specifically with the hashlib library and the algorithm shown below.

The problem is that it takes too long. The function is pretty simple and I have the feeling it could be vectorized, but I am not an expert and am struggling to implement it.

I am working with millions of rows, and it takes hours even when hashing only 4 columns' values.

import pandas as pd
import hashlib

data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* 100000,'second_identifier':['RED413','BLU031']* 100000})

def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()

%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)

  • Not a full answer, just sharing: my first instinct would be to use map() (see the sketch after these comments), but swifter or numba may be fastest. Here are 12 ways to do it: towardsdatascience.com/…, along with a handy comparison graph. Commented Sep 7, 2021 at 23:34
  • @sh37211 Thanks for sharing, but I can't use external frameworks. Commented Sep 7, 2021 at 23:36
  • OK, then map would be a good start. But I can't even run your code snippet to see how slow it is: the data = pd.DataFrame line yields ValueError: arrays must all be same length. Commented Sep 7, 2021 at 23:40
  • @sh37211 I am very sorry, I forgot to add "* 100000" on my second row when creating the dataframe. It should be working now. Thanks for the answers so far. Commented Sep 7, 2021 at 23:52
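
A minimal sketch of the map() idea from the comment above (my own illustration, not code from the question or comments), assuming the corrected two-column frame:

import hashlib
import pandas as pd

data = pd.DataFrame({'first_identifier': ['ALP1x', 'RDX2b'] * 100000,
                     'second_identifier': ['RED413', 'BLU031'] * 100000})

# Concatenate the columns with vectorized string addition, lowercase once,
# then map a plain-Python hashing function over the combined Series.
combined = (data['first_identifier'] + data['second_identifier']).str.lower()
data['row_hash'] = combined.map(lambda s: hashlib.md5(s.encode()).hexdigest())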

2 Answers


Using a list comprehension will get you a significant speedup.

First your original:

import pandas as pd
import hashlib

n = 100000
data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})

def _mutate_hash(row):
    return hashlib.md5(row.sum().lower().encode()).hexdigest()

%timeit data['row_hash']=data.apply(_mutate_hash,axis=1)

1 loop, best of 5: 26.1 s per loop

Then as a list comprehension:

data = pd.DataFrame({'first_identifier':['ALP1x','RDX2b']* n,'second_identifier':['RED413','BLU031']* n})

def list_comp(df):
    return pd.Series([ _mutate_hash(row) for row in df.to_numpy() ])

%timeit data['row_hash']=list_comp(data)

1 loop, best of 5: 872 ms per loop

...i.e., a speedup of ~30x.

As a sanity check, you can verify that the two methods yield equivalent results by storing the first result in "data2" and the second in "data3", then comparing them:

data2, data3 = pd.DataFrame([]), pd.DataFrame([])
%timeit data2['row_hash']=data.apply(_mutate_hash,axis=1)
...
%timeit data3['row_hash']=list_comp(data)
...
data2.equals(data3)
True

Comments

@Marat For further evidence of list comprehensions outperforming .apply(), see the link I provided in the comment above on "12 ways to do it" and the scaling figure at the end: towardsdatascience.com/… He shows .apply() is ~25 times slower than a list comprehension for 10,000 to 1,000,000 rows.
I am getting the following error after copy-pasting your code and running list_comp: "ValueError: ndarray is not C-contiguous"
Sorry, I deleted that comment after noticing it is actually applied to 10k rows. (for those who missed it, TLDR: the answer is misleading because apply has overhead due to optimization).
@AlejandroA Uh...that error message is produced by Cython. Not sure why you're seeing a Cython error message. Here's the Colab where I wrote & ran the code above: colab.research.google.com/drive/… Just re-ran it after Factory Reset. Still works.
@Marat Confirmed that yours is fastest so far by more than 2x over mine: 1 loop, best of 5: 390 ms per loop, for a total speedup of ~66x over the original. Added your code to the Colab linked in the previous comment. Good job! You should add it as an answer. (I would not have imagined writing all that.)

The easiest performance boost comes from using vectorized string operations. If you do the string prep (lowercasing and encoding) before applying the hash function, your performance is much more reasonable.

import hashlib

import pandas as pd

data = pd.DataFrame(
    {
        "first_identifier": ["ALP1x", "RDX2b"] * 1000000,
        "second_identifier": ["RED413", "BLU031"] * 1000000,
    }
)


def _mutate_hash(value):
    # value is already lowercased and UTF-8 encoded, so hash it directly
    return hashlib.md5(value).hexdigest()


# Lowercase and encode each column with vectorized string methods,
# then concatenate the resulting byte strings row-wise.
prepped_data = data.apply(lambda col: col.str.lower().str.encode("utf8")).sum(axis=1)

data["row_hash"] = prepped_data.map(_mutate_hash)

I see ~25x speedup with that change.
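
As a quick sanity check (my own addition, not part of the original answer): lowercasing each column before concatenation yields the same bytes as lowercasing the concatenated string, so the vectorized result should match the original row-wise apply. A minimal sketch on a small slice:

# Compare against the original row-wise approach on the first 1,000 rows.
# Column order matters, since md5 is sensitive to concatenation order.
sample = data[["first_identifier", "second_identifier"]].head(1000)
original = sample.apply(lambda row: hashlib.md5(row.sum().lower().encode()).hexdigest(), axis=1)
assert original.equals(data["row_hash"].head(1000))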
