In MySQL there is a function called UNHEX which takes a hexadecimal string like '1DB8948899F511E6A18374D02B45FC30' and turns it into a sequence of bytes, a binary field. It is what I use for storing UUIDs. The reverse operation is implemented in the function HEX.
I store protein sequences. Each protein sequence is a finite sequence of letters, and there are at most 21 different letters. Instead of storing each sequence as clear text, I would like to store them as binary fields.
Thus I would like to implement a custom function, similar to UNHEX, which replaces each letter by a fixed sequence of 5 bits (5 bits are enough for 21 letters). I will also implement the inverse function.
How can I implement such a function?
I looked into the function COMPRESS, but it produces longer output when run on sequences of length around 63 or 64, and its compression factor for sequences of length below 150 is less than the factor of 1.6 (8 bits per letter reduced to 5) that I would achieve with my custom function. Sequences of length below 150 are numerous, so I would not gain much by using COMPRESS.
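To make the size comparison concrete, here is a rough Python sketch, not MySQL code. It assumes that COMPRESS stores a 4-byte length prefix followed by a zlib stream, and it uses uniformly random letters as a stand-in for real protein sequences, so the printed numbers are only indicative. The names `packed_size` and `mysql_compress_size` are made up for this illustration.

```python
import random
import zlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTU"  # 21 placeholder letters, A to U


def packed_size(n: int) -> int:
    """Bytes needed for n letters at 5 bits each, rounded up to whole bytes."""
    return (5 * n + 7) // 8


def mysql_compress_size(seq: str) -> int:
    """Approximate size of COMPRESS(seq).

    Assumption: a 4-byte uncompressed-length prefix plus a zlib stream,
    and an empty result for an empty input.
    """
    if not seq:
        return 0
    return 4 + len(zlib.compress(seq.encode("ascii")))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (32, 64, 150, 500):
        seq = "".join(rng.choice(ALPHABET) for _ in range(n))
        print(n, packed_size(n), mysql_compress_size(seq))
```

For a sequence of 64 letters the packed form takes 40 bytes, which is where the factor of 64 / 40 = 1.6 comes from.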
My MySQL version is 14.14 Distrib 5.5.52, for debian-linux-gnu (x86_64), and you can think of a protein sequence as a finite sequence of letters from A to U (the actual letters are not relevant here, I will adapt the code).
What I would like to make is a function which takes a string made of these letters from A to U as its argument and turns it into a sequence of bits. Since 2^4 < 21 <= 2^5, 5 bits per letter are both necessary and sufficient.
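To show the encoding I have in mind, here is a minimal Python sketch of the packing and its inverse. It is only an illustration of the bit layout, not MySQL code. The names `pack5` and `unpack5` are made up, and reserving code 0 as padding is my own assumption: without it, up to 7 zero padding bits in the last byte could be read back as an extra letter.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTU"  # 21 placeholder letters, A to U

# Codes 1..21; code 0 is reserved for padding (an assumption of this sketch).
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
LETTER = {code: ch for ch, code in CODE.items()}


def pack5(seq: str) -> bytes:
    """Pack each letter into 5 bits, most significant bit first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits in acc
    out = bytearray()
    for ch in seq:
        acc = (acc << 5) | CODE[ch]
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        # Left-align the remaining bits; the low bits are zero padding.
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack5(data: bytes) -> str:
    """Inverse of pack5: read 5-bit codes until the data or padding ends."""
    acc = 0
    nbits = 0
    letters = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 5:
            nbits -= 5
            code = (acc >> nbits) & 0x1F
            if code == 0:  # reached the padding
                return "".join(letters)
            letters.append(LETTER[code])
        acc &= (1 << nbits) - 1
    return "".join(letters)


if __name__ == "__main__":
    packed = pack5("ACDU")
    print(packed.hex(), unpack5(packed))
```

A sequence of n letters takes ceil(5n / 8) bytes in this layout, so 64 letters fit in 40 bytes.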
I am looking into making a plugin for MySQL which will define both these functions. Am I going too far? Is there an easier way? If so, will the functions gain efficiency by being programmed into a plugin?
HEX() and UNHEX() backwards. HEX() unpacks 8 bits per byte of binary data into 4 bits per byte of hexadecimal, UNHEX() packs 4 bits per byte of hex-encoded data into 8 bits per byte of binary data.