Here is a method using a lookup table:
>>> alphabet = np.array(list('ACGT'))
>>> alphabet
array(['A', 'C', 'G', 'T'], dtype='<U1')
To use a lookup table we need to reinterpret the letters as indices; this is done via view casting:
>>> alph_as_num = alphabet.view(np.int32)
>>> alph_as_num
array([65, 67, 71, 84], dtype=int32)
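To see why the view cast yields usable indices, here is a minimal sketch that compares the result with `ord()` and checks that no data is copied. It assumes NumPy's usual unicode layout of 4 bytes per character; the name `code_points` is only for illustration.

```python
import numpy as np

alphabet = np.array(list('ACGT'))

# Each unicode element of length 1 occupies 4 bytes (UCS-4), so it can be
# reinterpreted as one 32-bit integer without touching the buffer.
alph_as_num = alphabet.view(np.int32)

# The integers are the Unicode code points, i.e. what ord() returns.
code_points = [ord(c) for c in 'ACGT']
print(alph_as_num.tolist() == code_points)

# The view shares its memory with the original array (no copy was made).
print(np.shares_memory(alphabet, alph_as_num))
```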
We can now build the lookup table. It needs 85 slots, of which we will actually use only 4, namely 65, 67, 71 and 84. As for the output format, we are free to choose whatever best meets our requirements:
Example one - output as bytestring:
>>> lookup_1 = np.zeros((alph_as_num.max()+1), dtype='S4')
>>> lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]
Example two - output as uint8:
>>> lookup_2 = np.zeros((alph_as_num.max()+1), dtype=np.uint8)
>>> lookup_2[alph_as_num] = 1 << np.arange(4)
Example three - output as four uint8 per letter:
>>> lookup_3 = np.zeros((alph_as_num.max()+1, 4), dtype=np.uint8)
>>> lookup_3[alph_as_num[::-1]] = np.identity(4)
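To make the contents of the three tables easier to inspect, the following sketch rebuilds them in one place and prints the entry each letter maps to. It only repeats the statements above; the loop at the end is an addition for illustration.

```python
import numpy as np

alph_as_num = np.array(list('ACGT')).view(np.int32)
size = alph_as_num.max() + 1          # 85 slots, only 4 of them used

# Table 1: a 4-character bytestring per letter.
# The slices of b'0001000' are b'0001', b'0010', b'0100', b'1000'.
lookup_1 = np.zeros(size, dtype='S4')
lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]

# Table 2: one bit per letter, packed into a single uint8.
lookup_2 = np.zeros(size, dtype=np.uint8)
lookup_2[alph_as_num] = 1 << np.arange(4)

# Table 3: a row of four uint8 per letter; the reversed index makes
# T map to the first row of the identity matrix and A to the last.
lookup_3 = np.zeros((size, 4), dtype=np.uint8)
lookup_3[alph_as_num[::-1]] = np.identity(4)

for letter, idx in zip('ACGT', alph_as_num):
    print(letter, lookup_1[idx], lookup_2[idx], lookup_3[idx])
```

All other slots stay at their zero value (an empty bytestring or 0), so letters outside the alphabet are not flagged as errors.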
Now let's apply this to a 100-letter sequence:
>>> seq
array(['CATTTCTCCACCATTTTGGTTTTTCATTGATCCGTTAGGTGGAGCCGGACTATGTCTACCGAAAGATGCACCTGCGCCGGGTCTGGTCTATCTCTTAATG'],
dtype='<U100')
The translation is compact and fast since it relies only on two things: NumPy's built-in advanced indexing, which gives us very fast lookup (typically much faster than looking up each letter in a Python dictionary, for example), and view casting, which is essentially free since all it does is reinterpret the data buffer (no copying or transformation whatsoever).
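The speed claim can be checked with a rough comparison like the one below. This is only a sketch: the random 100,000-letter sequence, the dictionary `as_dict` and the helpers `via_table` and `via_dict` are made up for the test, and the timings depend on the machine, so no numbers are shown.

```python
import timeit
import numpy as np

# Random test sequence stored as a single string, as in the example above.
rng = np.random.default_rng(0)
letters = rng.choice(list('ACGT'), size=100_000)
seq = np.array([''.join(letters)])

alph_as_num = np.array(list('ACGT')).view(np.int32)
lookup_2 = np.zeros(alph_as_num.max() + 1, dtype=np.uint8)
lookup_2[alph_as_num] = 1 << np.arange(4)

# Hypothetical dictionary-based alternative for comparison.
as_dict = {'A': 1, 'C': 2, 'G': 4, 'T': 8}

def via_table():
    # View cast plus advanced indexing.
    return lookup_2[seq.view(np.int32)]

def via_dict():
    # One Python-level dictionary lookup per letter.
    return np.array([as_dict[c] for c in seq[0]], dtype=np.uint8)

# Both approaches must produce the same encoding.
print(np.array_equal(via_table(), via_dict()))

print('table:', timeit.timeit(via_table, number=10))
print('dict: ', timeit.timeit(via_dict, number=10))
```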
Example one - bytestrings:
>>> lookup_1[seq.view(np.int32)]
array([b'0010', b'0001', b'1000', b'1000', b'1000', b'0010', b'1000',
b'0010', b'0010', b'0001', b'0010', b'0010', b'0001', b'1000',
b'1000', b'1000', b'1000', b'0100', b'0100', b'1000', b'1000',
b'1000', b'1000', b'1000', b'0010', b'0001', b'1000', b'1000',
b'0100', b'0001', b'1000', b'0010', b'0010', b'0100', b'1000',
b'1000', b'0001', b'0100', b'0100', b'1000', b'0100', b'0100',
b'0001', b'0100', b'0010', b'0010', b'0100', b'0100', b'0001',
b'0010', b'1000', b'0001', b'1000', b'0100', b'1000', b'0010',
b'1000', b'0001', b'0010', b'0010', b'0100', b'0001', b'0001',
b'0001', b'0100', b'0001', b'1000', b'0100', b'0010', b'0001',
b'0010', b'0010', b'1000', b'0100', b'0010', b'0100', b'0010',
b'0010', b'0100', b'0100', b'0100', b'1000', b'0010', b'1000',
b'0100', b'0100', b'1000', b'0010', b'1000', b'0001', b'1000',
b'0010', b'1000', b'0010', b'1000', b'1000', b'0001', b'0001',
b'1000', b'0100'], dtype='|S4')
If preferred, these can also be view cast into one long bytestring:
>>> lookup_1[seq.view(np.int32)].view('S400')
array([b'0010000110001000100000101000001000100001001000100001100010001000100001000100100010001000100010000010000110001000010000011000001000100100100010000001010001001000010001000001010000100010010001000001001010000001100001001000001010000001001000100100000100010001010000011000010000100001001000101000010000100100001000100100010001001000001010000100010010000010100000011000001010000010100010000001000110000100'],
dtype='|S400')
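The width in the second view cast does not have to be hard-coded. As a sketch, assuming the encoded array is contiguous (which the result of advanced indexing is), it can be computed from the array itself; the short sequence `'CATG'` and the names `encoded`, `n_bytes` and `joined` are only for illustration.

```python
import numpy as np

alph_as_num = np.array(list('ACGT')).view(np.int32)
lookup_1 = np.zeros(alph_as_num.max() + 1, dtype='S4')
lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]

seq = np.array(['CATG'])
encoded = lookup_1[seq.view(np.int32)]        # four items of dtype 'S4'

# Total number of bytes: 4 letters times 4 bytes per item.
n_bytes = encoded.size * encoded.itemsize

# Reinterpret the same buffer as one bytestring of that width.
joined = encoded.view(f'S{n_bytes}')
print(joined)
```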
Example two - uint8:
>>> lookup_2[seq.view(np.int32)]
array([2, 1, 8, 8, 8, 2, 8, 2, 2, 1, 2, 2, 1, 8, 8, 8, 8, 4, 4, 8, 8, 8,
8, 8, 2, 1, 8, 8, 4, 1, 8, 2, 2, 4, 8, 8, 1, 4, 4, 8, 4, 4, 1, 4,
2, 2, 4, 4, 1, 2, 8, 1, 8, 4, 8, 2, 8, 1, 2, 2, 4, 1, 1, 1, 4, 1,
8, 4, 2, 1, 2, 2, 8, 4, 2, 4, 2, 2, 4, 4, 4, 8, 2, 8, 4, 4, 8, 2,
8, 1, 8, 2, 8, 2, 8, 8, 1, 1, 8, 4], dtype=uint8)
Example three - four uint8 per letter; this time let's use a different seq with multiple rows:
>>> seq
array([['CCCT'],
['GCGA']], dtype='<U4')
>>> lookup_3[seq.view(np.int32)].reshape(len(seq), -1)
array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]], dtype=uint8)
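The final `reshape` is easier to follow when the intermediate shapes are spelled out. The sketch below repeats the example step by step; the names `codes`, `one_hot` and `flat` are introduced here for illustration.

```python
import numpy as np

alph_as_num = np.array(list('ACGT')).view(np.int32)
lookup_3 = np.zeros((alph_as_num.max() + 1, 4), dtype=np.uint8)
lookup_3[alph_as_num[::-1]] = np.identity(4)

seq = np.array([['CCCT'], ['GCGA']])      # shape (2, 1), 4 letters per item

# The view cast splits each 4-letter string into 4 integers.
codes = seq.view(np.int32)                # shape (2, 4)

# Indexing with a 2-D index array appends the row length of the table.
one_hot = lookup_3[codes]                 # shape (2, 4, 4)

# Flatten letters and their 4 values into one row per sequence.
flat = one_hot.reshape(len(seq), -1)      # shape (2, 16)

print(codes.shape, one_hot.shape, flat.shape)
```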