0

I am reading data from a URL and trying to convert it to digits for further analysis on jupyter. It is a gene sequence where each gene would code for 4 binary digits. A --> 0001, C --> 0010, G --> 0100 and T --> 1000. For example, I want to go from CGGT to 0010010001001000. So far, I've been able to remove the empty space and convert it to a string. However I am unable to go from string to char and char to digits. I am using numpy arrays and have made these attempts but to no avail.

charGenes = [var for var in genes if var]

and

charGenes = np.char.array(genes)

Here is the rest of the code:

import pandas as pd
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/molecular- biology/splice-junction-gene-sequences/splice.data"
file = pd.read_csv(url, delimiter=',', header=None,dtype='str')

X = file[2]
y = file[0]

myGenes = np.array(X)
stringGenes = myGenes.astype(str)

spaceGenes = stringGenes.reshape( stringGenes.size, 1)

genes = np.char.strip(spaceGenes)
genes

This is the output:

array([['CCAGCTGCATCACAGGAGGCCAGCGAGCAGGTCTGTTCCAAGGGCCTTCGAGCCAGTCTG'],
   ['AGACCCGCCGGGAGGCGGAGGACCTGCAGGGTGAGCCCCACCGCCCCTCCGTGCCCCCGC'],
   ['GAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATG'],
   ...,
   ['TCTCGGGGGCGGCCGGCGCGGCGGGGAGCGGTCCCCGGCCGCGGCCCCGACGTGTGTGTC'],
   ['ATTCTACTTAGTAAACATAATTTCTTGTGCTAGATAACCAAATTAAGAAAACCAAAACAA'],
   ['AGGCTGCCTATCAGAAGGTGGTGGCTGGTGTGGCTGCTGCTCTGGCTCACAAGTACCATT']],
  dtype='<U79')

Any help of pointers would be appreciated!

2 Answers 2

6

Here is a method using a lookup table:

>>> alphabet = np.array(list('ACGT'))
>>> alphabet
array(['A', 'C', 'G', 'T'], dtype='<U1')

To use a lookup table we need to reinterpret the letters as indices, this is done via view casting:

>>> alph_as_num = alphabet.view(np.int32)
>>> alph_as_num
array([65, 67, 71, 84], dtype=int32)

We can now build the lookup table it needs 85 slots of which we will actually only be using 4, namely 65, 67, 71 and 84. As for the output format we are free to choose what best meets our requirements:

Example one - output as bytestring:

>>> lookup_1 = np.zeros((alph_as_num.max()+1), dtype='S4')
>>> lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]

Example two - output as uint8:

>>> lookup_2 = np.zeros((alph_as_num.max()+1), dtype=np.uint8)
>>> lookup_2[alph_as_num] = 1 << np.arange(4)

Example three - output as four uint8 per letter:

>>> lookup_3 = np.zeros((alph_as_num.max()+1, 4), dtype=np.uint8)
>>> lookup_3[alph_as_num[::-1]] = np.identity(4)

Now let's apply this to a 100 letter sequence:

>>> seq
array(['CATTTCTCCACCATTTTGGTTTTTCATTGATCCGTTAGGTGGAGCCGGACTATGTCTACCGAAAGATGCACCTGCGCCGGGTCTGGTCTATCTCTTAATG'],
      dtype='<U100')

The translation is compact and fast since it relies only on

  • numpy's builtin advanced indexing which gives us very fast lookup (much faster than Python dictionaries for example)

  • view casting which is essentially free since all it does is reinterpret the data buffer (no copying or transformation whatsoever)

Example one - bytestrings:

>>> lookup_1[seq.view(np.int32)]
array([b'0010', b'0001', b'1000', b'1000', b'1000', b'0010', b'1000',
       b'0010', b'0010', b'0001', b'0010', b'0010', b'0001', b'1000',
       b'1000', b'1000', b'1000', b'0100', b'0100', b'1000', b'1000',
       b'1000', b'1000', b'1000', b'0010', b'0001', b'1000', b'1000',
       b'0100', b'0001', b'1000', b'0010', b'0010', b'0100', b'1000',
       b'1000', b'0001', b'0100', b'0100', b'1000', b'0100', b'0100',
       b'0001', b'0100', b'0010', b'0010', b'0100', b'0100', b'0001',
       b'0010', b'1000', b'0001', b'1000', b'0100', b'1000', b'0010',
       b'1000', b'0001', b'0010', b'0010', b'0100', b'0001', b'0001',
       b'0001', b'0100', b'0001', b'1000', b'0100', b'0010', b'0001',
       b'0010', b'0010', b'1000', b'0100', b'0010', b'0100', b'0010',
       b'0010', b'0100', b'0100', b'0100', b'1000', b'0010', b'1000',
       b'0100', b'0100', b'1000', b'0010', b'1000', b'0001', b'1000',
       b'0010', b'1000', b'0010', b'1000', b'1000', b'0001', b'0001',
       b'1000', b'0100'], dtype='|S4')

As a matter of preference these can also be view cast into one long sequence:

>>> lookup_1[seq.view(np.int32)].view('S400')
array([b'0010000110001000100000101000001000100001001000100001100010001000100001000100100010001000100010000010000110001000010000011000001000100100100010000001010001001000010001000001010000100010010001000001001010000001100001001000001010000001001000100100000100010001010000011000010000100001001000101000010000100100001000100100010001001000001010000100010010000010100000011000001010000010100010000001000110000100'],
      dtype='|S400')

Example two - uint8:

>>> lookup_2[seq.view(np.int32)]
array([2, 1, 8, 8, 8, 2, 8, 2, 2, 1, 2, 2, 1, 8, 8, 8, 8, 4, 4, 8, 8, 8,
       8, 8, 2, 1, 8, 8, 4, 1, 8, 2, 2, 4, 8, 8, 1, 4, 4, 8, 4, 4, 1, 4,
       2, 2, 4, 4, 1, 2, 8, 1, 8, 4, 8, 2, 8, 1, 2, 2, 4, 1, 1, 1, 4, 1,
       8, 4, 2, 1, 2, 2, 8, 4, 2, 4, 2, 2, 4, 4, 4, 8, 2, 8, 4, 4, 8, 2,
       8, 1, 8, 2, 8, 2, 8, 8, 1, 1, 8, 4], dtype=uint8)

Example 3 - four uint8 per letter; but let's use a different seq with multiple rows:

>>> seq
array([['CCCT'],
       ['GCGA']], dtype='<U4')
>>> lookup_3[seq.view(np.int32)].reshape(len(seq), -1)
array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]], dtype=uint8)
Sign up to request clarification or add additional context in comments.

12 Comments

Simple and effective solution, however, any simpler way to have the result as int or bytes compared to the previous method. Although more concise, i need the end array to be in the mentioned format to continue my analysis on the data
@ankit the beauty of the lookup table is that you can put what you like in it. What exactly are your requirements? You want numbers 1,2,4,8 of type np.uint8?
I will split them into training and test data, and apply classification methods to them such as k nearest neighbors and neural networks. These classification methods usually require int's or bytes. So yes, uint8 is preferred
@ankit added it as a third example.
You should use np.zero instead of np.empty to easily detect any unexpected characters.
|
3

Numpy has a char.replace method (see docs). All you need to do is this:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

To convert this to an int array,

genes = genes.astype(int)

you can then use bitwise operations on the arrays.


As pointed out in the comments, the resulting sequence is limited in length. A way to solve this:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

>>> genes
array([['12481248'],
       ['12481248']], dtype='|S8')

Insert a comma between the numbers

genes = np.char.join(',', genes)

>>> genes
array([['1,2,4,8,1,2,4,8'],
       ['1,2,4,8,1,2,4,8']], dtype='|S15')

Split based on the comma and convert back to a purely np.char.array

genes = np.char.array(np.char.split(genes, ','))

>>> genes
chararray([[['1', '2', '4', '8', '1', '2', '4', '8']],

           [['1', '2', '4', '8', '1', '2', '4', '8']]], dtype='|S1')

Convert to an int array:

genes = np.array(genes, dtype=int)

>>> genes
array([[[1, 2, 4, 8, 1, 2, 4, 8]],

       [[1, 2, 4, 8, 1, 2, 4, 8]]])

Remove the intermediate dimension of size 1:

genes = genes.reshape(list(genes.shape[:-2]) + [genes.shape[-1]])

>>> genes
array([[1, 2, 4, 8, 1, 2, 4, 8],
       [1, 2, 4, 8, 1, 2, 4, 8]])

6 Comments

I'm pretty sure OP wants to convert a string to a byte array.
The answer does help, as int can work for my analysis. However byte is indeed preferred. Also, astype has a length limit and since my gene is pretty long, any way to overcome that issue?
@ankit Python doesn't display bytes as 1 and 0, so you can replace ACGT with 1248 instead - you can then use bitwise operations on the NumPy arrays directly. Hopefully that also solves your length problem.
@droooze even by replacing with 1248 (single digit) per gene, my sequence still exceeds the allowed length. My sequence is 60 genes where as the limit is 32 I believe
@ankit I've updated the answer, hopefully it works better now.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.