I'm using the HASHBYTES function in T-SQL to generate an MD5 hash of some data, but I'm getting unexpected results even though I'm hashing the same data. What am I doing wrong?

For demonstration purposes, I'll create a table and insert a random GUID as the 'CustomerId' and a random email address as the 'EmailAddress'. 'ConcatHash' is a computed column that should contain an MD5 hash of the two columns joined by the pipe character. To make it easier to see what's going on, I have also added a 'ConcatColumn' so you can see what CONCAT_WS produces.

CREATE TABLE dbo.CustomerTest
(
    CustomerId UNIQUEIDENTIFIER NOT NULL
  , EmailAddress VARCHAR(255) NOT NULL
  , ConcatColumn AS (CONCAT_WS('|', CustomerId, EmailAddress))
  , ConcatHash AS (HASHBYTES('MD5', CONCAT_WS('|', CustomerId, EmailAddress))) PERSISTED
)
GO

INSERT INTO dbo.CustomerTest
VALUES
('8E38101D-988E-4BF1-B8F1-E8E0B8DAA891', '[email protected]')
GO

SELECT * FROM dbo.CustomerTest

Here is the result: [screenshot]

I'll now query the same data from a different table, using CONCAT_WS and HASHBYTES in exactly the same way as I did previously.

SELECT CustomerId
     , Email
     , CONCAT_WS('|', CustomerId, Email)                   As ConcatColumn
     , HASHBYTES('MD5', CONCAT_WS('|', CustomerId, Email)) AS ConcatHash
FROM dbo.Customers
WHERE CustomerId = '8E38101D-988E-4BF1-B8F1-E8E0B8DAA891'

Here is the result: [screenshot]

Here are the results side by side. The data is the same and the concatenated data is the same, yet the MD5 hash is different: [screenshot]

To save you the trouble of looking at the 'ConcatColumn' column letter by letter, I have already verified they are identical. So why is the MD5 hash different?

  • Hash functions operate on bytes, not characters. Both columns need to be either NVARCHAR, or VARCHAR with identical collations. Commented May 25, 2021 at 21:57
  • …is it collation dependent? Commented May 25, 2021 at 22:12
  • Nailed it @lptr, thanks! Commented May 25, 2021 at 22:14
  • @lptr: Yes, because for VARCHAR fields, collation also determines how characters are encoded. For NVARCHAR fields it does not, as they are always UTF-16. (OK, technically the collations do not need to be identical -- Latin1_General_CI_AS and Latin1_General_CI_AI encode the same because only accent sensitivity rules are different, for example. But, say, Japanese_ is quite different.) Commented May 25, 2021 at 22:15
  • ..@BrokenBad … 👍… Commented May 25, 2021 at 22:15

1 Answer

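varchar and nvarchar columns do not produce the same hash results. HASHBYTES hashes the underlying bytes, and the two types encode the same characters differently: nvarchar is UTF-16, while varchar uses the code page of the column's collation. When any argument to CONCAT_WS is nvarchar, the concatenated result is nvarchar as well, so the hash input changes even though the text looks identical. The following demo reproduces the difference: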

-- Setup demo data...
create table dbo.Customers1 (
  CustomerId varchar(255),
  Email varchar(255),
);
insert dbo.Customers1 (CustomerId, Email) values
  ('8E38101D-988E-4BF1-B8F1-E8E0B8DAA891', '[email protected]');

create table dbo.Customers2 (
  CustomerId varchar(255),
  Email nvarchar(255),
);
insert dbo.Customers2 (CustomerId, Email) values
  ('8E38101D-988E-4BF1-B8F1-E8E0B8DAA891', '[email protected]');

-- Query data...
SELECT CustomerId
     , Email
     , HASHBYTES('MD5', CONCAT_WS('|', CustomerId, Email)) AS ConcatHash
FROM dbo.Customers1
WHERE CustomerId = '8E38101D-988E-4BF1-B8F1-E8E0B8DAA891'

SELECT CustomerId
     , Email
     , HASHBYTES('MD5', CONCAT_WS('|', CustomerId, Email)) AS ConcatHash
FROM dbo.Customers2
WHERE CustomerId = '8E38101D-988E-4BF1-B8F1-E8E0B8DAA891'

Which yields...

CustomerId Email ConcatHash
8E38101D-988E-4BF1-B8F1-E8E0B8DAA891 [email protected] 0xB3CF062CD2FAB8601A1B58E53D1F705B

and...

CustomerId Email ConcatHash
8E38101D-988E-4BF1-B8F1-E8E0B8DAA891 [email protected] 0xFACC935D24A15B73B4F6B864D3BA536
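
The same effect can be reproduced outside SQL Server. The following is a minimal Python sketch, not part of the original answer. It assumes the varchar column uses a Latin1 collation (code page 1252) and uses a hypothetical email address, since the real one is redacted in the post. It shows that identical text yields different MD5 hashes once it is encoded as varchar-style bytes versus nvarchar-style (UTF-16 little-endian) bytes.

```python
import hashlib

# Hypothetical sample value; the email address in the post is redacted.
text = "8E38101D-988E-4BF1-B8F1-E8E0B8DAA891|user@example.com"

# Assumption: VARCHAR under a Latin1 collation, i.e. code page 1252,
# which stores one byte per character for this ASCII-only string.
varchar_bytes = text.encode("cp1252")

# NVARCHAR is stored as UTF-16 little-endian: two bytes per character here.
nvarchar_bytes = text.encode("utf-16-le")

varchar_md5 = hashlib.md5(varchar_bytes).hexdigest()
nvarchar_md5 = hashlib.md5(nvarchar_bytes).hexdigest()

print(len(varchar_bytes), varchar_md5)
print(len(nvarchar_bytes), nvarchar_md5)

# Same characters, different bytes, therefore different hashes.
print(varchar_md5 == nvarchar_md5)
```

In T-SQL, the usual remedy is to make both sides hash the same type, for example by explicitly converting the concatenated value to nvarchar (or to varchar with a known collation) before passing it to HASHBYTES.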