
On SQL Server (2016+), I have data stored in a varbinary column, saved by some Java application, which contains a mixture of binary data and ASCII text. I want to search the column using a like operator or otherwise to look for certain ASCII strings, and then view the returned values as ASCII (so that I can read the surrounding text).

The data contains non-printable bytes such as 0x00, and these seem to stop SQL Server from converting the string as might otherwise be possible according to the answers at "Hex to ASCII string conversion on the fly". In the example below, it can be seen that the byte 0x00 stops the parsing of the ASCII.

select convert(varchar(max),0x48454C4C004F205000455445,0) as v1       -- HELL
select convert(varchar(max),0x48454C4C4F205000455445,0) as v2         -- HELLO P
select convert(varchar(max),0x48454C4C4F2050455445,0) as v3           -- HELLO PETE

How can I have

 select convert(varchar(max), 0x48454C4C004F205000455445, 0)

...return something like this?:

HELL?O P?ETE

(Or, less ideally, have an expression similar to

convert(varchar(max), 0x48454C4C004F205000455445, 0) like '%HE%ETE%'

...return the row?)

It works on the website https://www.rapidtables.com/convert/number/hex-to-ascii.html with 48454C4C004F205000455445 as input.


I'm not overly concerned about performance, but I want to stay within SQL Server, and ideally within the scope of T-SQL which can be copied and pasted easily.

I've tried using replace on "00", but this could cause problems with bytes ending in 0, as in "5000" in the examples above. There may be bytes other than 0x00 which cause string conversion to stop as well.

  • See dba.stackexchange.com/questions/132996/… for an existing similar question and possible answers. Commented Jul 23, 2020 at 10:55
  • If you get ? it means the data is not ASCII, or at least, it's not in the server's codepage. ? is an error character returned when a conversion from one codepage to another fails. You're trying to fix an application bug (the bad storage format) after the fact, without even knowing what the data is - is it really mixed text and binary? Or is it an unfortunate attempt to store UTF8 as binary instead of using nvarchar? Did the application try to "fix" Unicode storage by breaking it? Commented Jul 23, 2020 at 11:14
  • "It works on the website": no, it doesn't. That site fails to convert those characters but doesn't emit the error character. The real fix would be to fix the application bug. Which SQL Server version are you using? 2019 added UTF8 as an encoding, so maybe you could decode that string. You won't be able to index or easily search that field even then. Commented Jul 23, 2020 at 11:15
  • Did the application try to normalise accents, replacing the normalised diacritics with 0x00 perhaps? You'll have to remove all 0x00 bytes from the string before converting. You should really get the application developers to fix this, or store the converted text separately. The conversion means that no query will be able to use any indexes, resulting in a full table scan and conversion each time you try to search the data. At the very least, you should consider a trigger or persisted computed column that removes 0x00 and converts the bytes to text. Commented Jul 23, 2020 at 11:26
  • There's no second byte, 00 is a single byte. Commented Jul 23, 2020 at 11:34

1 Answer

To return the row (the more limited version of this question), a simple like operator on the value appears to work when run directly on the binary value, despite the intervening 0x00 values:

    0x48454C4C004F205000455445 like 'HE%ETE%'

In other words, like can cope where convert can't.

To view the actual value, the best I've managed so far is this:

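select convert(varchar(max), convert(varbinary(max),
  replace(
    convert(varchar(max), 0x48454C4C004F205000455445, 1)  -- style 1: hex text '0x4845...'
    , '00', ''                                            -- strip every '00' pair
  )
, 1), 0) as v                                             -- back to binary, then to text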

This gives HELLO PETE, and works well enough on the actual data, getting to its end.

(It depends on the heuristic of not caring about converting e.g. 0x50 0x03 to 0x53 and similar, but I can live with that, as 0x0z, where z is 1 to f, represents control characters, which don't occur around the text I'm interested in).
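
For anyone who needs the placeholder output asked for in the question (HELL?O P?ETE) rather than the stripped text, a byte-by-byte loop sidesteps the misaligned-pair problem, because it never touches the hex string. This is only a sketch and not part of what I actually ran on the real data: the variable names (@bin, @pos, @byte, @out) are made up for illustration, and treating only bytes 32 to 126 as printable ASCII is an assumption about what counts as readable text.

```sql
-- Sketch: walk the varbinary one byte at a time and swap anything
-- outside printable ASCII (assumed here to be 32..126) for '?'.
DECLARE @bin  varbinary(max) = 0x48454C4C004F205000455445;
DECLARE @pos  int = 1;
DECLARE @byte int;
DECLARE @out  varchar(max) = '';

WHILE @pos <= DATALENGTH(@bin)
BEGIN
    -- SUBSTRING on varbinary returns a single byte; convert it to its numeric value
    SET @byte = CONVERT(int, SUBSTRING(@bin, @pos, 1));

    SET @out = @out + CASE
                          WHEN @byte BETWEEN 32 AND 126 THEN CHAR(@byte)
                          ELSE '?'
                      END;
    SET @pos = @pos + 1;
END;

SELECT @out AS readable;  -- HELL?O P?ETE
```

The loop is slow on large values, so it is probably best kept for inspecting the handful of rows that the like filter has already narrowed down.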

(thanks to Panagiotis Kanavos for prodding me in a useful direction!)
