We have an old, custom C# hashing algorithm that we use to mask e-mail addresses for PII purposes. I'm trying to build a Python version of this algorithm, but I'm struggling with the differences in how C# and Python handle bytes and byte arrays, and my version produces the wrong hash value. For reference, this is Python 2.7, but a Python 3+ solution would work just as well.

C# code:

using System;
using System.Text;
using System.Security;
using System.Security.Cryptography;

public class Program
{
    public static void Main()
    {
        string emailAddressStr = "[email protected]";
        emailAddressStr = emailAddressStr.Trim().ToLower();
        SHA256 objCrypt = new SHA256Managed();
        byte[] b = (new ASCIIEncoding()).GetBytes(emailAddressStr);
        byte[] bRet = objCrypt.ComputeHash(b);

        string retStr = "";
        byte c;

        for (int i = 0; i < bRet.Length; i++)
        {
            c = (byte)bRet[i];
            retStr += ((char)(c / 10 + 97)).ToString().ToLower();
            retStr += ((char)(c % 10 + 97)).ToString().ToLower();
        }
        Console.WriteLine(retStr);
    }
}

The (correct) value that gets returned is uhgbnaijlgchcfqcrgpicdvczapepbtifiwagitbecjfqalhufudieofyfdhzera

Python translation:

import hashlib

emltst = "[email protected]"

emltst = emltst.strip().lower()
b = bytearray(bytes(emltst).encode("ascii"))
bRet = bytearray(bytes(hashlib.sha256(b)))

emailhash=""

for i in bRet:
    c = bytes(i)
    emailhash = emailhash + str(chr((i / 10) + 97)).lower()
    emailhash = emailhash + str(chr((i % 10) + 97)).lower()

print(emailhash)

The (incorrect) value I get here is galfkejhfafdfedchcgfidhcdclbjikgkbjjlgdcgedceimaejeifakajhfekceifggc

The "business end" of the code is the loop, where c does not translate well between the two languages. C# produces a numeric value for the calculation, but in Python c is a string (so I'm using i instead). I've stepped through both sets of code, and I believe I'm producing the same hash value right before the loop. I hope someone here might be able to help me out. TIA!
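One way to test the assumption that both versions feed the same bytes into the loop is to invert the letter mapping. The sketch below is not part of the original post, and the helper name decode_pairs is hypothetical. It turns each letter pair back into the byte it came from. Applied to the incorrect value above, it yields 34 bytes that spell out the text form of a hash object rather than a 32-byte digest, which suggests that the missing .digest() call, and not the loop arithmetic, is the main problem in the Python 2.7 translation.

```python
def decode_pairs(masked):
    """Invert the letter-pair mapping, e.g. 'ga' -> 6 * 10 + 0 = 60."""
    values = []
    for k in range(0, len(masked), 2):
        tens = ord(masked[k]) - 97      # first letter encodes value // 10
        ones = ord(masked[k + 1]) - 97  # second letter encodes value % 10
        values.append(tens * 10 + ones)
    # bytes(bytearray(...)) behaves the same on Python 2.7 and Python 3
    return bytes(bytearray(values))

wrong = "galfkejhfafdfedchcgfidhcdclbjikgkbjjlgdcgedceimaejeifakajhfekceifggc"
decoded = decode_pairs(wrong)
print(len(decoded))             # 34, whereas a SHA-256 digest has 32 bytes
print(decoded.decode("ascii"))  # <sha256 HASH object @ 0x102da6f08>
```

In other words, bytes(hashlib.sha256(b)) in Python 2.7 appears to stringify the hash object itself; calling .digest() on it returns the raw bytes that the C# ComputeHash call produces.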

EDIT (2020-04-09)

Oguz Ozgul has a good solution below. I also found a savvy programmer at work who suggested this working Python 3 solution (it includes code for the broader task of ingesting a list of e-mails and using PySpark to write a table):

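import sys
import hashlib
from pyspark.sql.types import StringType

# `spark` is assumed to be an existing SparkSession (as in the pyspark shell).

myfile=sys.argv[1]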
with open(myfile) as fql:
    insql=fql.read()

emails=[]
emails=insql.splitlines()

mytable=sys.argv[2]

def getSha256Hash(email):
    b = bytearray(bytes(email, 'ascii'))
    res = hashlib.sha256(b)
    bRet = bytearray.fromhex(res.hexdigest())
    emailhash=""
    for i in bRet:
        c1 = i / 10 + 97
        c2 = i % 10 + 97
        c1 = int(c1)
        c2 = int(c2)
        emailhash = emailhash + str(chr(c1)).lower()
        emailhash = emailhash + str(chr(c2)).lower()
    return(emailhash)
###################################

emailhashes = []

isascii = lambda s: len(s) == len(s.encode())

for e in emails:
    e = e.strip().lower()
    if isascii(e) == True:
        emailhashret = getSha256Hash(e)
        emailhashes.append(emailhashret)

findf = spark.createDataFrame(emailhashes, StringType())

spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")

findf.repartition(1).write.format("parquet").mode("overwrite").saveAsTable(mytable)
  • I don't think this algorithm passes legal requirements for PII handling. Commented Apr 8, 2020 at 20:04
  • You've tagged this as both python-3.x and python-2.7; which one are you using? It looks like 2.x from your code. Commented Apr 8, 2020 at 20:06
  • Are you absolutely sure that "bRet" contains the same bytes in both algorithms? Have you tested the loops with a simple byte sequence such as (0, 1, 2, 3) as "bRet"? Commented Apr 8, 2020 at 20:14

1 Answer


Here you go (Python 3):

Notes:

  1. hashAlgorithm.update expects bytes rather than a text string, hence b"[email protected]"
  2. chr((i / 10) + 97) raises a TypeError in Python 3 because / returns a float, hence //
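The two notes can be checked in isolation with a small sketch. The two-byte value below is a made-up stand-in for a real digest, used only for illustration. Wrapping the data in bytearray makes iteration yield integers on both Python 2.7 and Python 3, and // keeps the result an integer on both, whereas / would return a float on Python 3.

```python
# Stand-in for a real digest (hypothetical values, chosen for illustration).
digest = bytes(bytearray([200, 7]))

# Iterating over a bytearray yields ints on both Python 2.7 and Python 3.
values = list(bytearray(digest))
print(values)  # [200, 7]

# Floor division stays an int, so chr() accepts the result.
pairs = [chr(v // 10 + 97) + chr(v % 10 + 97) for v in values]
print(pairs)  # ['ua', 'ah']
```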

import hashlib

emltst = b"[email protected]"

emltst = emltst.strip().lower()

hashAlgorithm = hashlib.sha256()
hashAlgorithm.update(emltst)
# digest() already returns bytes, so wrapping it in bytearray(bytes(...))
# is redundant (thanks to Mark Meyer for pointing this out).
bRet = hashAlgorithm.digest()

emailhash=""

for i in bRet:
    # Iterating over bytes yields ints in Python 3, so no conversion is needed.
    emailhash = emailhash + chr((i // 10) + 97)
    emailhash = emailhash + chr((i % 10) + 97)

print(emailhash)

OUTPUT:

uhgbnaijlgchcfqcrgpicdvczapepbtifiwagitbecjfqalhufudieofyfdhzera
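For reuse, the same steps can be wrapped in functions. This is a minimal Python 3 sketch, not the original author's code: the names mask_digest and mask_email are hypothetical, and the sample address is a placeholder. It also assumes that the C# ASCIIEncoding replaces non-ASCII characters with '?' (its default behavior, as far as I know), which errors="replace" imitates; if the C# side is configured differently, or if non-ASCII addresses are filtered out beforehand as in the question's edit, that argument should be adjusted.

```python
import hashlib

def mask_digest(digest):
    """Map each byte to two letters: value // 10 and value % 10, offset from 'a'."""
    return "".join(chr(b // 10 + 97) + chr(b % 10 + 97) for b in bytearray(digest))

def mask_email(email):
    """Trim, lowercase, ASCII-encode, SHA-256 hash, then letter-encode the digest."""
    normalized = email.strip().lower()
    # Assumption: '?' substitution mirrors the default C# ASCIIEncoding behavior.
    data = normalized.encode("ascii", errors="replace")
    return mask_digest(hashlib.sha256(data).digest())

print(mask_digest(bytes(bytearray([0, 60, 255]))))  # aagazf
print(len(mask_email("  Someone@Example.com ")))    # 64
```

Each of the 32 digest bytes becomes two letters, so the result is always 64 characters long. The first letter ranges from 'a' to 'z' (0 to 25) and the second from 'a' to 'j' (0 to 9).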

1 Comment

hashAlgorithm.digest() returns bytes. bRet = hashAlgorithm.digest() should be enough.
