I'm using the HASHBYTES function in T-SQL to generate an MD5 hash of some data, but I am getting some unexpected results, even though hashing the same data. What am I doing wrong here?
For demonstration purposes I'll create a table and insert a random guid as the 'CustomerId' and a random email address as the 'EmailAddress'. The 'ConcatHash' is a computed column which should create an MD5 hash of the two columns joined together by the pipe character. So it's easier to see whats going on I have also added a ConcatColumn so you can see what the CONCAT_WS is doing.
CREATE TABLE dbo.CustomerTest
(
CustomerId UNIQUEIDENTIFIER NOT NULL
, EmailAddress VARCHAR(255) NOT NULL
, ConcatColumn AS (CONCAT_WS('|', CustomerId, EmailAddress))
, ConcatHash AS (HASHBYTES('MD5', CONCAT_WS('|', CustomerId, EmailAddress))) PERSISTED
)
GO
INSERT INTO dbo.CustomerTest
VALUES
('8E38101D-988E-4BF1-B8F1-E8E0B8DAA891', '[email protected]')
GO
SELECT * FROM dbo.CustomerTest
I'll now query the same data from a different table, using CONCAT_WS and HASHBYTES in exactly the same way as I did previously.
SELECT CustomerId
, Email
, CONCAT_WS('|', CustomerId, Email) As ConcatColumn
, HASHBYTES('MD5', CONCAT_WS('|', CustomerId, Email)) AS ConcatHash
FROM dbo.Customers
WHERE CustomerId = '8E38101D-988E-4BF1-B8F1-E8E0B8DAA891'
Here are the results side-by-side, and you can see the data is the same, the concatanated data is the same, yet the MD5 is different...

To save you the trouble of looking at the 'ConcatColumn' column letter by letter, I have already verified they are identical. So why is the MD5 hash different?


NVARCHAR, orVARCHARwith identical collations.VARCHARfields, collation also determines how characters are encoded. ForNVARCHARfields it does not, as they are always UTF-16. (OK, technically the collations do not need to be identical --Latin1_General_CI_ASandLatin1_General_CI_AIencode the same because only accent sensitivity rules are different, for example. But, say,Japanese_is quite different.)