converting numpy of string to numpy characters python

Question

I am reading data from a URL and trying to convert it to digits for further analysis on jupyter. It is a gene sequence where each gene would code for 4 binary digits. A --> 0001, C --> 0010, G --> 0100 and T --> 1000. For example, I want to go from CGGT to 0010010001001000. So far, I've been able to remove the empty space and convert it to a string. However I am unable to go from string to char and char to digits. I am using numpy arrays and have made these attempts but to no avail.

charGenes = [var for var in genes if var]

and

charGenes = np.char.array(genes)

Here is the rest of the code:

import pandas as pd
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/molecular- biology/splice-junction-gene-sequences/splice.data"
file = pd.read_csv(url, delimiter=',', header=None,dtype='str')

X = file[2]
y = file[0]

myGenes = np.array(X)
stringGenes = myGenes.astype(str)

spaceGenes = stringGenes.reshape( stringGenes.size, 1)

genes = np.char.strip(spaceGenes)
genes

This is the output:

array([['CCAGCTGCATCACAGGAGGCCAGCGAGCAGGTCTGTTCCAAGGGCCTTCGAGCCAGTCTG'],
   ['AGACCCGCCGGGAGGCGGAGGACCTGCAGGGTGAGCCCCACCGCCCCTCCGTGCCCCCGC'],
   ['GAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATG'],
   ...,
   ['TCTCGGGGGCGGCCGGCGCGGCGGGGAGCGGTCCCCGGCCGCGGCCCCGACGTGTGTGTC'],
   ['ATTCTACTTAGTAAACATAATTTCTTGTGCTAGATAACCAAATTAAGAAAACCAAAACAA'],
   ['AGGCTGCCTATCAGAAGGTGGTGGCTGGTGTGGCTGCTGCTCTGGCTCACAAGTACCATT']],
  dtype='<U79')

Any help of pointers would be appreciated!

Paul Panzer · Accepted Answer · 2018-03-30 07:04:01Z

6

Here is a method using a lookup table:

>>> alphabet = np.array(list('ACGT'))
>>> alphabet
array(['A', 'C', 'G', 'T'], dtype='<U1')

To use a lookup table we need to reinterpret the letters as indices, this is done via view casting:

>>> alph_as_num = alphabet.view(np.int32)
>>> alph_as_num
array([65, 67, 71, 84], dtype=int32)

We can now build the lookup table it needs 85 slots of which we will actually only be using 4, namely 65, 67, 71 and 84. As for the output format we are free to choose what best meets our requirements:

Example one - output as bytestring:

>>> lookup_1 = np.zeros((alph_as_num.max()+1), dtype='S4')
>>> lookup_1[alph_as_num] = [b'0001000'[i:i+4] for i in range(4)]

Example two - output as uint8:

>>> lookup_2 = np.zeros((alph_as_num.max()+1), dtype=np.uint8)
>>> lookup_2[alph_as_num] = 1 << np.arange(4)

Example three - output as four uint8 per letter:

>>> lookup_3 = np.zeros((alph_as_num.max()+1, 4), dtype=np.uint8)
>>> lookup_3[alph_as_num[::-1]] = np.identity(4)

Now let's apply this to a 100 letter sequence:

>>> seq
array(['CATTTCTCCACCATTTTGGTTTTTCATTGATCCGTTAGGTGGAGCCGGACTATGTCTACCGAAAGATGCACCTGCGCCGGGTCTGGTCTATCTCTTAATG'],
      dtype='<U100')

The translation is compact and fast since it relies only on

numpy's builtin advanced indexing which gives us very fast lookup (much faster than Python dictionaries for example)
view casting which is essentially free since all it does is reinterpret the data buffer (no copying or transformation whatsoever)

Example one - bytestrings:

>>> lookup_1[seq.view(np.int32)]
array([b'0010', b'0001', b'1000', b'1000', b'1000', b'0010', b'1000',
       b'0010', b'0010', b'0001', b'0010', b'0010', b'0001', b'1000',
       b'1000', b'1000', b'1000', b'0100', b'0100', b'1000', b'1000',
       b'1000', b'1000', b'1000', b'0010', b'0001', b'1000', b'1000',
       b'0100', b'0001', b'1000', b'0010', b'0010', b'0100', b'1000',
       b'1000', b'0001', b'0100', b'0100', b'1000', b'0100', b'0100',
       b'0001', b'0100', b'0010', b'0010', b'0100', b'0100', b'0001',
       b'0010', b'1000', b'0001', b'1000', b'0100', b'1000', b'0010',
       b'1000', b'0001', b'0010', b'0010', b'0100', b'0001', b'0001',
       b'0001', b'0100', b'0001', b'1000', b'0100', b'0010', b'0001',
       b'0010', b'0010', b'1000', b'0100', b'0010', b'0100', b'0010',
       b'0010', b'0100', b'0100', b'0100', b'1000', b'0010', b'1000',
       b'0100', b'0100', b'1000', b'0010', b'1000', b'0001', b'1000',
       b'0010', b'1000', b'0010', b'1000', b'1000', b'0001', b'0001',
       b'1000', b'0100'], dtype='|S4')

As a matter of preference these can also be view cast into one long sequence:

>>> lookup_1[seq.view(np.int32)].view('S400')
array([b'0010000110001000100000101000001000100001001000100001100010001000100001000100100010001000100010000010000110001000010000011000001000100100100010000001010001001000010001000001010000100010010001000001001010000001100001001000001010000001001000100100000100010001010000011000010000100001001000101000010000100100001000100100010001001000001010000100010010000010100000011000001010000010100010000001000110000100'],
      dtype='|S400')

Example two - uint8:

>>> lookup_2[seq.view(np.int32)]
array([2, 1, 8, 8, 8, 2, 8, 2, 2, 1, 2, 2, 1, 8, 8, 8, 8, 4, 4, 8, 8, 8,
       8, 8, 2, 1, 8, 8, 4, 1, 8, 2, 2, 4, 8, 8, 1, 4, 4, 8, 4, 4, 1, 4,
       2, 2, 4, 4, 1, 2, 8, 1, 8, 4, 8, 2, 8, 1, 2, 2, 4, 1, 1, 1, 4, 1,
       8, 4, 2, 1, 2, 2, 8, 4, 2, 4, 2, 2, 4, 4, 4, 8, 2, 8, 4, 4, 8, 2,
       8, 1, 8, 2, 8, 2, 8, 8, 1, 1, 8, 4], dtype=uint8)

Example 3 - four uint8 per letter; but let's use a different seq with multiple rows:

>>> seq
array([['CCCT'],
       ['GCGA']], dtype='<U4')
>>> lookup_3[seq.view(np.int32)].reshape(len(seq), -1)
array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1]], dtype=uint8)

edited Mar 30, 2018 at 7:04

answered Mar 30, 2018 at 0:41

Paul Panzer

53.3k3 gold badges59 silver badges103 bronze badges

Sign up to request clarification or add additional context in comments.

12 Comments

ankit Over a year ago

Simple and effective solution, however, any simpler way to have the result as int or bytes compared to the previous method. Although more concise, i need the end array to be in the mentioned format to continue my analysis on the data

Paul Panzer Over a year ago

@ankit the beauty of the lookup table is that you can put what you like in it. What exactly are your requirements? You want numbers 1,2,4,8 of type np.uint8?

ankit Over a year ago

I will split them into training and test data, and apply classification methods to them such as k nearest neighbors and neural networks. These classification methods usually require int's or bytes. So yes, uint8 is preferred

Paul Panzer Over a year ago

@ankit added it as a third example.

Daniel Over a year ago

You should use np.zero instead of np.empty to easily detect any unexpected characters.

|

dROOOze · Accepted Answer · 2018-03-30 01:04:25Z

3

Numpy has a char.replace method (see docs). All you need to do is this:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

To convert this to an int array,

genes = genes.astype(int)

you can then use bitwise operations on the arrays.

As pointed out in the comments, the resulting sequence is limited in length. A way to solve this:

genes = np.char.replace(genes, 'A', '1')
genes = np.char.replace(genes, 'C', '2')
genes = np.char.replace(genes, 'G', '4')
genes = np.char.replace(genes, 'T', '8')

>>> genes
array([['12481248'],
       ['12481248']], dtype='|S8')

Insert a comma between the numbers

genes = np.char.join(',', genes)

>>> genes
array([['1,2,4,8,1,2,4,8'],
       ['1,2,4,8,1,2,4,8']], dtype='|S15')

Split based on the comma and convert back to a purely np.char.array

genes = np.char.array(np.char.split(genes, ','))

>>> genes
chararray([[['1', '2', '4', '8', '1', '2', '4', '8']],

           [['1', '2', '4', '8', '1', '2', '4', '8']]], dtype='|S1')

Convert to an int array:

genes = np.array(genes, dtype=int)

>>> genes
array([[[1, 2, 4, 8, 1, 2, 4, 8]],

       [[1, 2, 4, 8, 1, 2, 4, 8]]])

Remove the intermediate dimension of size 1:

genes = genes.reshape(list(genes.shape[:-2]) + [genes.shape[-1]])

>>> genes
array([[1, 2, 4, 8, 1, 2, 4, 8],
       [1, 2, 4, 8, 1, 2, 4, 8]])

edited Mar 30, 2018 at 1:04

answered Mar 29, 2018 at 23:46

dROOOze

4,1941 gold badge12 silver badges23 bronze badges

6 Comments

erip Over a year ago

I'm pretty sure OP wants to convert a string to a byte array.

ankit Over a year ago

The answer does help, as int can work for my analysis. However byte is indeed preferred. Also, astype has a length limit and since my gene is pretty long, any way to overcome that issue?

dROOOze Over a year ago

@ankit Python doesn't display bytes as 1 and 0, so you can replace ACGT with 1248 instead - you can then use bitwise operations on the NumPy arrays directly. Hopefully that also solves your length problem.

ankit Over a year ago

@droooze even by replacing with 1248 (single digit) per gene, my sequence still exceeds the allowed length. My sequence is 60 genes where as the limit is 32 I believe

dROOOze Over a year ago

@ankit I've updated the answer, hopefully it works better now.

|

Collectives™ on Stack Overflow

converting numpy of string to numpy characters python

2 Answers 2

12 Comments

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

12 Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related