Combine two columns under one header in Numpy array

Question

I have two NumPy arrays that I need to combine, keeping only certain columns of A (shape (888, 1114253)) according to the rows I have in B (shape (555861, 3)).

The problem is that the header of A has only 55730 entries: each column has two values!

In other words, I want to keep only the columns of A whose header matches a row in B, but in A each column is "double".

An example will clarify:

A:

family id mum dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
     1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C 
     2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T 
     3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C 
     4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T

Since each rsxxx header in this file has two corresponding data columns, I have to find a way to put each pair together so that I can read the file as an array.

B:

1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360

The desired output is

Output:

 family id mum dad  rs1 rs2 rs5 rs8 rs12
  1      1   4   6  A T A A G G A A C C
  2      2   7   9  T A G A G A A C C T
  3      3   2   8  T T G G G G A C C C
  4      4   5   1  A A A A G A A A C T

Ideas?

On the console

B:

array([['1', 'rs3094315', '752566'],
       ['1', 'rs12562034', '768448'],
       ['1', 'rs3934834', '1005806'],
       ..., 
       ['23', 'rs2032612', '21866491'],
       ['23', 'rs2032621', '21872738'],
       ['23', 'rs2032617', '21896261']], 
      dtype='<S10')

Can you show an example of what your data look like in NumPy (the console output)? Right now we only see plain text.
– joris, Commented May 22, 2013 at 13:39
Now I have shown what the "B" file looks like. Actually, I can't even read "A", because the first row has a different number of columns from the other rows.
– Alice, Commented May 22, 2013 at 13:50
To read A you can use np.loadtxt(A_txt, skiprows=1), or create another A_txt with the proper number of columns in the first line. I still don't get what you want to do with B.
– Saullo G. P. Castro, Commented May 22, 2013 at 16:45
I have explained more clearly in the question what I want to do. As for skiprows, I can't use it because I need those names to analyze the file; rather than creating the right number of columns in the first line, I would like to "group" each "A G" pair into one string.
– Alice, Commented May 22, 2013 at 19:36

askewchan · Accepted Answer · 2013-05-22 21:46:24Z

It looks like the columns are separated by two spaces, while the two letters of each gene pair are separated by a single space. If so, you can use

delimiter='  '   #two spaces

in np.genfromtxt:

import numpy as np
from StringIO import StringIO # for example file (Python 2; on Python 3 use: from io import StringIO)

a = StringIO("""family  id  mum  dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C 
2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T 
3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C 
4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T """)


nrs = 12        # number of `rs` columns, for dtype
dt = 'int,'*4 + 'S10,'*nrs

A = np.genfromtxt(a, delimiter='  ', names=True, dtype=dt)

A:

array([ (1, 1, 4, 6, ' A T', 'A A', 'T T', 'C C', 'G G', 'A T', 'A G', 'A A', 'G A', 'T A', 'G G', 'C C'),
       (2, 2, 7, 9, ' T A', 'G A', 'C T', 'C T', 'G A', 'T T', 'A A', 'A C', 'G G', 'T A', 'C C', 'C T'),
       (3, 3, 2, 8, ' T T', 'G G', 'C T', 'C T', 'G G', 'A T', 'A G', 'A C', 'G G', 'T T', 'C C', 'C C'),
       (4, 4, 5, 1, ' A A', 'A A', 'T T', 'C C', 'G A', 'T T', 'A A', 'A A', 'G A', 'T A', 'G C', 'C T')], 
      dtype=[('family', '<i8'), ('id', '<i8'), ('mum', '<i8'), ('dad', '<i8'), ('rs1', 'S10'), ('rs2', 'S10'), ('rs3', 'S10'), ('rs4', 'S10'), ('rs5', 'S10'), ('rs6', 'S10'), ('rs7', 'S10'), ('rs8', 'S10'), ('rs9', 'S10'), ('rs10', 'S10'), ('rs11', 'S10'), ('rs12', 'S10')])

Then to access only the columns from B, do something like this:

b = StringIO("""1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360""")

B = np.genfromtxt(b, usecols=[1], dtype='S10')

Now, use A[B]:

A[B]
array([(' A T', 'A A', 'G G', 'A A', 'C C'),
       (' T A', 'G A', 'G A', 'A C', 'C T'),
       (' T T', 'G G', 'G G', 'A C', 'C C'),
       (' A A', 'A A', 'G A', 'A A', 'C T')], 
      dtype=[('rs1', 'S10'), ('rs2', 'S10'), ('rs5', 'S10'), ('rs8', 'S10'), ('rs12', 'S10')])

Or, if you want the first four columns too:

A[['family', 'id', 'mum', 'dad'] + list(B)]
array([(1, 1, 4, 6, ' A T', 'A A', 'G G', 'A A', 'C C'),
       (2, 2, 7, 9, ' T A', 'G A', 'G A', 'A C', 'C T'),
       (3, 3, 2, 8, ' T T', 'G G', 'G G', 'A C', 'C C'),
       (4, 4, 5, 1, ' A A', 'A A', 'G A', 'A A', 'C T')], 
      dtype=[('family', '<i8'), ('id', '<i8'), ('mum', '<i8'), ('dad', '<i8'), ('rs1', 'S10'), ('rs2', 'S10'), ('rs5', 'S10'), ('rs8', 'S10'), ('rs12', 'S10')])
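The snippets above were written for Python 2 and the NumPy of 2013. Here is a minimal sketch of the same idea for Python 3 and a recent NumPy, assuming the file really does use two spaces between columns. The `U3` field width, `autostrip=True` (which drops the stray leading space seen in `' A T'`), the `encoding` argument, and the names `cols` and `selected` are choices made for this sketch, not part of the original answer. The field names are passed as a plain Python list, which is the documented way to select several fields of a structured array.

```python
import numpy as np
from io import StringIO  # in-memory stand-ins for the real files

a = StringIO("""family  id  mum  dad  rs1  rs2  rs3  rs4  rs5  rs6  rs7  rs8  rs9  rs10  rs11  rs12
1  1   4   6   A T  A A  T T  C C  G G  A T  A G  A A  G A  T A  G G  C C
2  2   7   9   T A  G A  C T  C T  G A  T T  A A  A C  G G  T A  C C  C T
3  3   2   8   T T  G G  C T  C T  G G  A T  A G  A C  G G  T T  C C  C C
4  4   5   1   A A  A A  T T  C C  G A  T T  A A  A A  G A  T A  G C  C T
""")

b = StringIO("""1  rs1 2345
1  rs2 2346
2  rs5 2348
4  rs8 2351
4 rs12 2360
""")

nrs = 12  # number of `rs` columns, for the dtype
# Four integer columns followed by `nrs` three-character text columns ("A T").
dt = ",".join(["i8"] * 4 + ["U3"] * nrs)

# A two-space delimiter keeps each "A T" pair together as one field;
# autostrip removes leftover padding around each field.
A = np.genfromtxt(a, delimiter="  ", names=True, dtype=dt,
                  autostrip=True, encoding="utf-8")

# Second column of B holds the rs names to keep.
B = np.genfromtxt(b, usecols=[1], dtype="U10", encoding="utf-8")

# Select fields with a plain list of names.
cols = ["family", "id", "mum", "dad"] + B.ravel().tolist()
selected = A[cols]

print(selected.dtype.names)
print(selected)
```

On a real file the field width and the number of `rs` columns would have to be taken from the data rather than hard-coded as they are here.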

edited May 22, 2013 at 21:46

answered May 22, 2013 at 21:29

3 Comments

Alice Over a year ago

Awesome! Actually, each column is separated by one tab and not two spaces, but I couldn't type the tab character in the question. How can you write a tab character in a question? Anyway, I tried delimiter="\t" # tab and it works as well!

Alice Over a year ago

It's taking forever (I have a 2 GB file and my laptop is a Mac running OS X with a 2.66 GHz Intel Core 2 Duo). Is there a way to speed this up?

askewchan Over a year ago

If the rs numbers are sequential, you could load only those columns using usecols=[1,2,5,8,12] or usecols=4+B, for example. If only a few of the rs are used compared to the size of the file, this will help reduce the size of the loaded array, but I'm not sure how much faster it will be.
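A rough sketch of the `usecols` idea, assuming a tab-delimited file as described in the comments above. Instead of relying on the rs numbers being sequential, it reads the header line first and looks up each wanted name's position. The toy data and the names `wanted_rs`, `position`, `keep` and `A_small` are made up for illustration. `genfromtxt` probably still has to split every line, so the saving is likely to be mostly in memory rather than in time.

```python
import numpy as np
from io import StringIO  # in-memory stand-in for the real file

a = StringIO(
    "family\tid\tmum\tdad\trs1\trs2\trs3\trs4\trs5\n"
    "1\t1\t4\t6\tA T\tA A\tT T\tC C\tG G\n"
    "2\t2\t7\t9\tT A\tG A\tC T\tC T\tG A\n"
    "3\t3\t2\t8\tT T\tG G\tC T\tC T\tG G\n"
    "4\t4\t5\t1\tA A\tA A\tT T\tC C\tG A\n"
)

# In practice these would come from the second column of B.
wanted_rs = ["rs1", "rs2", "rs5"]

# Consume the header line ourselves and map each name to its column index.
header = a.readline().rstrip("\n").split("\t")
position = {name: i for i, name in enumerate(header)}

# Always keep the four leading columns, then the wanted rs columns.
keep = [0, 1, 2, 3] + [position[name] for name in wanted_rs if name in position]
keep_names = [header[i] for i in keep]

# One dtype entry per kept column: integers first, then "A T"-style pairs.
dt = [(name, "i8") for name in keep_names[:4]] + \
     [(name, "U3") for name in keep_names[4:]]

# The header is already consumed, so genfromtxt starts at the first data row.
A_small = np.genfromtxt(a, delimiter="\t", usecols=keep, dtype=dt,
                        encoding="utf-8")

print(A_small.dtype.names)
print(A_small)
```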
