
I am trying to generate an MD5 hash for each row of a DataFrame using hashlib.md5 in PySpark. It only accepts a string as input.

So I need to convert each row of the DataFrame to a string.

I tried the concat_ws function to concatenate all columns into a single string, but it did not work.

My DataFrame has the columns id, name, and marks.

I tried:

str=df.select(concat_ws("id","name","marks"))

print(hashlib.md5(str.encode(encoding='utf_8', errors='strict')).hexdigest())

I got this error:

AttributeError: 'DataFrame' object has no attribute 'encode'
  • Why don't you use Spark's standard md5 function? Commented Dec 21, 2017 at 21:04

1 Answer


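df.select(...) returns a DataFrame, not a string, which is why .encode fails. Try mapping over the underlying RDD and hashing each row's value instead: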

df.select("colname").rdd.map(lambda x: hashlib.md5(str(x[0]).encode(encoding='utf_8', errors='strict')).hexdigest()).collect()

You should see a list of hex digests, something like:

['1dd55a7d40667d697743612f826b71e1', '64a537f89bd95f34374b619452b1a5ab']

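In your case (note that the first argument of concat_ws is the separator, not a column):

from pyspark.sql.functions import expr
df.select(expr("concat_ws(',', id, name, marks)").alias("mycolumn")).rdd.map(lambda x: hashlib.md5(x[0].encode(encoding='utf_8', errors='strict')).hexdigest()).collect()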
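
The per-row hashing can also be pulled out of the lambda into a small helper, which makes it easier to check on its own. The following is only a sketch: the names row_md5 and with_row_hash, the "row_hash" column name, and the "," separator are made up for illustration and do not come from the question. The second function follows the suggestion in the comment above and uses Spark's built-in md5 together with concat_ws, assuming PySpark is installed.

```python
import hashlib


def row_md5(values, sep=","):
    """Join one row's values into a string and return its MD5 hex digest."""
    # None values are skipped, mirroring how Spark's concat_ws treats nulls.
    joined = sep.join(str(v) for v in values if v is not None)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def with_row_hash(df, cols, sep=","):
    """Add a hypothetical 'row_hash' column computed inside Spark."""
    # Imported lazily so row_md5 stays usable without PySpark installed.
    from pyspark.sql.functions import col, concat_ws, md5

    # Cast every column to string explicitly before concatenating.
    as_strings = [col(c).cast("string") for c in cols]
    return df.withColumn("row_hash", md5(concat_ws(sep, *as_strings)))
```

With a DataFrame like the one in the question, usage would be roughly df.rdd.map(row_md5).collect() for the hashlib route, or with_row_hash(df, ["id", "name", "marks"]) to stay inside Spark and avoid the RDD round trip. The two routes should agree for string and integer columns, but other types (floats, timestamps) may be rendered differently by Python's str and Spark's string cast, so that is worth checking on real data.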