Need Help Converting C# "Byte Math" to Python

Question

We have an old, custom C# hashing algorithm that we use to mask e-mail addresses for PII purposes. I'm trying to build a Python version of it, but I'm struggling with the differences in how C# and Python handle bytes and byte arrays, and my version produces the wrong hash value. For reference, this is Python 2.7, but a Python 3+ solution would work just as well.

C# code:

using System;
using System.Text;
using System.Security;
using System.Security.Cryptography;

public class Program
{
    public static void Main()
    {
        string emailAddressStr = "[email protected]";
        emailAddressStr = emailAddressStr.Trim().ToLower();
        SHA256 objCrypt = new SHA256Managed();
        byte[] b = (new ASCIIEncoding()).GetBytes(emailAddressStr);
        byte[] bRet = objCrypt.ComputeHash(b);

        string retStr = "";
        byte c;

        for (int i = 0; i < bRet.Length; i++)
        {
            c = (byte)bRet[i];
            retStr += ((char)(c / 10 + 97)).ToString().ToLower();
            retStr += ((char)(c % 10 + 97)).ToString().ToLower();
        }
        Console.WriteLine(retStr);
    }
}

The (correct) value that gets returned is uhgbnaijlgchcfqcrgpicdvczapepbtifiwagitbecjfqalhufudieofyfdhzera

Python translation:

import hashlib

emltst = "[email protected]"

emltst = emltst.strip().lower()
b = bytearray(bytes(emltst).encode("ascii"))
bRet = bytearray(bytes(hashlib.sha256(b)))

emailhash=""

for i in bRet:
    c = bytes(i)
    emailhash = emailhash + str(chr((i / 10) + 97)).lower()
    emailhash = emailhash + str(chr((i % 10) + 97)).lower()

print(emailhash)

The (incorrect) value I get here is galfkejhfafdfedchcgfidhcdclbjikgkbjjlgdcgedceimaejeifakajhfekceifggc

The "business end" of the code is the loop, where c does not translate well between the two languages. C# produces a numeric value for the calculation, but in Python c is a string (so I'm using i instead). I've stepped through both sets of code and I believe I'm producing the same hash value right before the loop. I hope someone here might be able to help me out. TIA!

EDIT (2020-04-09)

Oguz Ozgul has a good solution below. I also found a savvy programmer at work who suggested the following working Python 3 solution (it includes code for the broader task of ingesting a list of e-mails and using PySpark to write a table):

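import sys
import hashlib

from pyspark.sql.types import StringType

# Assumes `spark` is an existing SparkSession (e.g. the one the pyspark shell provides).

# First argument: path to a text file with one e-mail address per line.
myfile = sys.argv[1]
with open(myfile) as fql:
    insql = fql.read()

emails = insql.splitlines()

# Second argument: name of the table to write.
mytable = sys.argv[2]

def getSha256Hash(email):
    # SHA-256 over the ASCII bytes; digest() returns the 32 raw bytes.
    bRet = hashlib.sha256(email.encode('ascii')).digest()
    emailhash = ""
    for i in bRet:
        # Integer division mirrors the C# byte arithmetic.
        emailhash = emailhash + chr(i // 10 + 97)
        emailhash = emailhash + chr(i % 10 + 97)
    return emailhash
###################################

emailhashes = []

# True when every character encodes to a single UTF-8 byte, i.e. the string is ASCII.
isascii = lambda s: len(s) == len(s.encode())

for e in emails:
    e = e.strip().lower()
    if isascii(e):
        emailhashret = getSha256Hash(e)
        emailhashes.append(emailhashret)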

findf = spark.createDataFrame(emailhashes, StringType())

spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")

findf.repartition(1).write.format("parquet").mode("overwrite").saveAsTable(mytable)

I don't think this algorithm passes legal requirements for PII handling.
– user2357112, Commented Apr 8, 2020 at 20:04
You've tagged this as both python-3.x and python-2.7. Which one are you using? It looks like 2.x from your code.
– Mark, Commented Apr 8, 2020 at 20:06
Are you absolutely sure that "bRet" contains the same bytes in both algorithms? Have you tested the loops with a simple byte sequence, e.g. (0, 1, 2, 3), as "bRet"?
– Michael Butscher, Commented Apr 8, 2020 at 20:14

Oguz Ozgul · Accepted Answer · 2020-04-08 21:07:04Z

1

Here you go (Python 3.0):

Notes:

hashAlgorithm.update expects bytes (an encoded string), hence b"[email protected]"
chr((i / 10) + 97) fails with a TypeError because / returns a float in Python 3, hence //
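
Both notes can be checked in isolation. The snippet below is only an illustrative sketch for Python 3: the input "text" and the byte value 200 are arbitrary choices made for the example, not values from the question.

```python
import hashlib

# Note 1: in Python 3, hashlib only accepts bytes-like input, not str.
try:
    hashlib.sha256().update("text")
    str_accepted = True
except TypeError:
    str_accepted = False

# Encoding the str first (or using a b"..." literal) works.
digest = hashlib.sha256("text".encode("ascii")).digest()

# Note 2: "/" is true division in Python 3 and returns a float, which chr()
# rejects; "//" is floor division and keeps the value an int.
sample = 200  # arbitrary byte value, chosen only for illustration
try:
    chr(sample / 10 + 97)
    float_accepted = True
except TypeError:
    float_accepted = False

first = chr(sample // 10 + 97)   # 200 // 10 = 20, 20 + 97 = 117 -> 'u'
second = chr(sample % 10 + 97)   # 200 % 10 = 0,   0 + 97 = 97  -> 'a'
print(str_accepted, float_accepted, first + second)  # False False ua
```

With those two points in mind, the full translation is: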

import hashlib

emltst = b"[email protected]"

emltst = emltst.strip().lower()

hashAlgorithm = hashlib.sha256()
hashAlgorithm.update(emltst)
# Thanks to Mark Meyer for pointing out that wrapping the result in
# bytearray(bytes(...)) is redundant: digest() already returns bytes.
bRet = hashAlgorithm.digest()

emailhash=""

for i in bRet:
    # Iterating over bytes yields ints in Python 3, so i can be used directly.
    emailhash = emailhash + chr((i // 10) + 97)
    emailhash = emailhash + chr((i % 10) + 97)

print(emailhash)

OUTPUT:

uhgbnaijlgchcfqcrgpicdvczapepbtifiwagitbecjfqalhufudieofyfdhzera
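
As a side check on where the question's Python 2 attempt went wrong, the letter encoding can be inverted, since each pair of letters stores one byte as (byte // 10, byte % 10). The sketch below is not part of the original code; the helper name `decode_letters` is made up for illustration, and the input is the incorrect value quoted in the question.

```python
def decode_letters(encoded):
    """Invert the two-letters-per-byte encoding used by the hashing loop."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        tens = ord(encoded[k]) - 97       # originally byte // 10
        ones = ord(encoded[k + 1]) - 97   # originally byte % 10
        out.append(tens * 10 + ones)
    return bytes(out)

# The incorrect value reported in the question.
wrong = "galfkejhfafdfedchcgfidhcdclbjikgkbjjlgdcgedceimaejeifakajhfekceifggc"
decoded = decode_letters(wrong)
print(len(decoded), decoded)
# 34 b'<sha256 HASH object @ 0x102da6f08>'
```

The result is 34 bytes of text rather than a 32-byte digest, which suggests that bytes(hashlib.sha256(b)) in Python 2 converted the hash object itself to a string. If that reading is right, the loop arithmetic was fine and the missing piece was the call to digest(), as used in the code above.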

edited Apr 8, 2020 at 21:07

answered Apr 8, 2020 at 20:25

Oguz Ozgul


1 Comment

Mark Over a year ago

hashAlgorithm.digest() returns bytes. bRet = hashAlgorithm.digest() should be enough.
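
Since the question mentions Python 2.7, it may help to note where the bytearray wrapper does matter: in Python 2, digest() returns a str, and iterating over it yields one-character strings rather than ints. A minimal sketch that should behave the same on Python 2.7 and Python 3 is shown below. The function name `mask_email` and the address someone@example.com are hypothetical, chosen only for this example.

```python
import hashlib

def mask_email(email):
    """Hypothetical helper: hash a normalized e-mail into the a-z pair format."""
    normalized = email.strip().lower()
    # bytearray yields ints when iterated on both Python 2.7 and Python 3,
    # whereas a Python 2 str from digest() would yield 1-character strings.
    digest = bytearray(hashlib.sha256(normalized.encode("ascii")).digest())
    letters = []
    for value in digest:
        letters.append(chr(value // 10 + 97))  # tens part: 'a'..'z'
        letters.append(chr(value % 10 + 97))   # ones part: 'a'..'j'
    return "".join(letters)

masked = mask_email("  Someone@Example.COM ")
print(len(masked), masked == mask_email("someone@example.com"))  # 64 True
```

Each of the 32 digest bytes becomes two letters, so the result is always 64 characters long, and trimming plus lowercasing makes differently formatted copies of the same address map to the same value.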
