
I'm trying to combine these three arrays into the one shown below: basically the equivalent of a SQL outer join, with the 'pos' field as the key/index.

import numpy as np

a1 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a2 = np.array([('3:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a3 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col3', '<f8'), ('col4', '<f8')])

Desired result:

array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695, 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605, 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001, 1.9849650921836601e-31, 0.99999999997999001),
       ('3:6506', 4.6725971801473496e-25, 0.99999999995088695, NaN, NaN),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605, NaN, NaN),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001, NaN, NaN),
        ], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8'), ('col3', '<f8'), ('col4', '<f8')])

I think this answer might be on the right track; I just can't quite see how to apply it.

Update:

I tried running unutbu's answer, but I'm getting this error:

Traceback (most recent call last):
  File "fail2.py", line 21, in <module>
    a4 = recfunctions.join_by('pos', a4, a, jointype='outer')
  File "/usr/local/msg/lib/python2.6/site-packages/numpy/lib/recfunctions.py", line 973, in join_by
    current = output[f]
  File "/usr/local/msg/lib/python2.6/site-packages/numpy/ma/core.py", line 2943, in __getitem__
    dout = ndarray.__getitem__(_data, indx)
ValueError: field named col12 not found.

Update 2:

I only got this error on NumPy 1.5.1. After upgrading to 1.8.1, it went away.
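For scripts that have to run on older installs, a minimal version guard is one way to fail fast. This is just a sketch: 1.8.1 is simply the first version observed to work here, not necessarily the exact release that fixed the bug.

import numpy as np
from distutils.version import LooseVersion

# Fail early rather than hitting the join_by error seen on NumPy 1.5.1.
if LooseVersion(np.__version__) < LooseVersion('1.8.1'):
    raise RuntimeError('NumPy >= 1.8.1 recommended for this join_by usage')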

  • Would you post the dtypes of each of the arrays?
  • Hmm, I just ran the code from your answer. Doesn't that specify the dtypes? Or do you think something is altering them later on?

1 Answer

import numpy as np
import numpy.lib.recfunctions as recfunctions

a1 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a2 = np.array([('3:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a3 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col3', '<f8'), ('col4', '<f8')])

result = a1
for a in (a2, a3):
    # Join on every column the two arrays share: for a2 that is
    # ('pos', 'col1', 'col2'); for a3, only ('pos',) is in common.
    cols = list(set(result.dtype.names).intersection(a.dtype.names))
    result = recfunctions.join_by(cols, result, a, jointype='outer')
print(result)

yields

[ ('2:21801', 1.98496509218366e-31, 0.99999999997999, 1.98496509218366e-31, 0.99999999997999)
 ('2:6506', 4.67259718014735e-25, 0.999999999950887, 4.67259718014735e-25, 0.999999999950887)
 ('2:6601', 2.24527453887999e-27, 0.999999999952706, 2.24527453887999e-27, 0.999999999952706)
 ('3:21801', 1.98496509218366e-31, 0.99999999997999, --, --)
 ('3:6506', 4.67259718014735e-25, 0.999999999950887, --, --)
 ('3:6601', 2.24527453887999e-27, 0.999999999952706, --, --)]
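Note that join_by returns a masked array, so the missing entries print as -- rather than the NaN values shown in the question. If you want a plain NaN-filled structured array, one possible approach (a sketch, not part of the original answer) is to fill the masked float fields by hand; join_by's usemask=False and defaults arguments may offer an alternative route.

# Continuing from the join_by loop above: `result` is a masked
# structured array. Convert it to a plain structured array and put
# NaN wherever a float field was masked.
filled = result.filled()                  # plain array, default fill values
for name in filled.dtype.names:
    col = result[name]
    if col.dtype.kind == 'f':             # only float fields can hold NaN
        filled[name][np.ma.getmaskarray(col)] = np.nan
print(filled)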

If you are doing SQL-like joins on NumPy arrays, you might want to consider using Pandas. Pandas is built on NumPy and provides a richer variety of functions for manipulating data:

import numpy as np
import pandas as pd
a1 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a2 = np.array([('3:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a3 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col3', '<f8'), ('col4', '<f8')])

dfs = [pd.DataFrame.from_records(a) for a in (a1, a2, a3)]

result = dfs[0]
for df in dfs[1:]:
    # Outer-join on the shared columns, mirroring the join_by loop above.
    cols = list(set(result.columns).intersection(df.columns))
    result = pd.merge(result, df, how='outer', left_on=cols, right_on=cols)

print(result)

yields

       pos          col1  col2          col3  col4
0   2:6506  4.672597e-25     1  4.672597e-25     1
1   2:6601  2.245275e-27     1  2.245275e-27     1
2  2:21801  1.984965e-31     1  1.984965e-31     1
3   3:6506  4.672597e-25     1           NaN   NaN
4   3:6601  2.245275e-27     1           NaN   NaN
5  3:21801  1.984965e-31     1           NaN   NaN

[6 rows x 5 columns]

Sometimes Pandas can be a bit slower than a pure NumPy solution, but that is often because Pandas provides a more robust solution that correctly handles corner cases such as NaNs or duplicate key values -- things an ad hoc NumPy solution may not address.
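As one concrete corner case: the join_by docstring warns that duplicate values along the key make its output unreliable, whereas pd.merge handles duplicates by emitting every matching pair. A small sketch with made-up data:

import pandas as pd

# Made-up data: two left rows share the key '2:6506'.
left = pd.DataFrame({'pos': ['2:6506', '2:6506', '2:6601'],
                     'col1': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'pos': ['2:6506'], 'col3': [9.0]})

# merge pairs each duplicated left row with its match on the right,
# and leaves NaN where there is no match.
print(pd.merge(left, right, on='pos', how='outer'))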

Also note that Pandas DataFrames have a .values attribute which returns a NumPy array of the underlying data, and a .to_records method which returns a structured array. And as you can see above, there is a DataFrame.from_records constructor which converts structured arrays to DataFrames. So you can move between DataFrames and NumPy arrays quite easily if you really need to.
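For example, a minimal round trip (reusing the result DataFrame from the code above):

# DataFrame -> structured array -> DataFrame.
rec = result.to_records(index=False)       # NumPy record (structured) array
df_again = pd.DataFrame.from_records(rec)  # back to a DataFrame

vals = result.values   # plain ndarray of the data (object dtype here,
                       # since the columns mix strings and floats)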

So I don't think there is any real speed disadvantage to using Pandas, and the convenience it provides should let you do more data analysis, much more easily.


Comments

That seems like it should work. But I'm getting an error. I updated the question with the error since I can't format it here.
Good idea about pandas. Would there be a speed penalty? Is there something fundamentally wrong with the original numpy approach?
No, there is nothing fundamentally wrong with using NumPy structured arrays, but I must say I find working with Pandas DataFrames a pleasure, while doing certain things with NumPy structured arrays feels more like a fight. For example, there is no easy way to sub-select columns.
Sorry, I don't know why you are seeing an error when running the code I posted. :( I'm using Python 2.7, not 2.6, but I couldn't locate any relevant bug report, so the version difference may not be the problem.
I upgraded to 1.8.1 and it works! Thanks for your help.
