
I'm trying to combine these three arrays into the one shown below: basically the equivalent of a SQL outer join, with the 'pos' field as the key/index.

import numpy as np

a1 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a2 = np.array([('3:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a3 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col3', '<f8'), ('col4', '<f8')])

Desired result:

array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695, 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605, 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001, 1.9849650921836601e-31, 0.99999999997999001),
       ('3:6506', 4.6725971801473496e-25, 0.99999999995088695, NaN, NaN),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605, NaN, NaN),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001, NaN, NaN),
        ], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8'), ('col3', '<f8'), ('col4', '<f8')])

I think this answer might be on the right track; I just can't quite see how to apply it.

Update:

I tried running unutbu's answer, but I'm getting this error:

Traceback (most recent call last):
  File "fail2.py", line 21, in <module>
    a4 = recfunctions.join_by('pos', a4, a, jointype='outer')
  File "/usr/local/msg/lib/python2.6/site-packages/numpy/lib/recfunctions.py", line 973, in join_by
    current = output[f]
  File "/usr/local/msg/lib/python2.6/site-packages/numpy/ma/core.py", line 2943, in __getitem__
    dout = ndarray.__getitem__(_data, indx)
ValueError: field named col12 not found.

Update 2:

I only got this error on NumPy 1.5.1. After upgrading to 1.8.1, it went away.
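For scripts that have to run on older installs, a minimal version guard is one way to fail fast. This is just a sketch: 1.8.1 is simply the first version observed to work here, not necessarily the exact release that fixed the bug.

import numpy as np
from distutils.version import LooseVersion

# Fail early rather than hitting the join_by error seen on NumPy 1.5.1.
if LooseVersion(np.__version__) < LooseVersion('1.8.1'):
    raise RuntimeError('NumPy >= 1.8.1 recommended for this join_by usage')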

  • Would you post the dtypes of each of the arrays?
  • Hmm, I just ran the code from your answer. Doesn't that specify the dtypes? Or do you think something is altering them later on?

1 Answer

import numpy as np
import numpy.lib.recfunctions as recfunctions

a1 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a2 = np.array([('3:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a3 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col3', '<f8'), ('col4', '<f8')])

result = a1
for a in (a2, a3):
    # Join on every column the two arrays share: for a2 that is
    # ('pos', 'col1', 'col2'); for a3, only ('pos',) is in common.
    cols = list(set(result.dtype.names).intersection(a.dtype.names))
    result = recfunctions.join_by(cols, result, a, jointype='outer')
print(result)

yields

[ ('2:21801', 1.98496509218366e-31, 0.99999999997999, 1.98496509218366e-31, 0.99999999997999)
 ('2:6506', 4.67259718014735e-25, 0.999999999950887, 4.67259718014735e-25, 0.999999999950887)
 ('2:6601', 2.24527453887999e-27, 0.999999999952706, 2.24527453887999e-27, 0.999999999952706)
 ('3:21801', 1.98496509218366e-31, 0.99999999997999, --, --)
 ('3:6506', 4.67259718014735e-25, 0.999999999950887, --, --)
 ('3:6601', 2.24527453887999e-27, 0.999999999952706, --, --)]
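Note that join_by returns a masked array, so the missing entries print as -- rather than the NaN values shown in the question. If you want a plain NaN-filled structured array, one possible approach (a sketch, not part of the original answer) is to fill the masked float fields by hand; join_by's usemask=False and defaults arguments may offer an alternative route.

# Continuing from the join_by loop above: `result` is a masked
# structured array. Convert it to a plain structured array and put
# NaN wherever a float field was masked.
filled = result.filled()                  # plain array, default fill values
for name in filled.dtype.names:
    col = result[name]
    if col.dtype.kind == 'f':             # only float fields can hold NaN
        filled[name][np.ma.getmaskarray(col)] = np.nan
print(filled)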

If you are doing SQL-like joins on NumPy arrays, you might want to consider using Pandas. Pandas is built on NumPy and provides a richer variety of functions for manipulating data:

import numpy as np
import pandas as pd
a1 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a2 = np.array([('3:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('3:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('3:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col1', '<f8'), ('col2', '<f8')])

a3 = np.array([('2:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2:21801', 1.9849650921836601e-31, 0.99999999997999001),], 
      dtype=[('pos', '|S100'), ('col3', '<f8'), ('col4', '<f8')])

dfs = [pd.DataFrame.from_records(a) for a in (a1, a2, a3)]

result = dfs[0]
for df in dfs[1:]:
    # Outer-join on the shared columns, mirroring the join_by loop above.
    cols = list(set(result.columns).intersection(df.columns))
    result = pd.merge(result, df, how='outer', left_on=cols, right_on=cols)

print(result)

yields

       pos          col1  col2          col3  col4
0   2:6506  4.672597e-25     1  4.672597e-25     1
1   2:6601  2.245275e-27     1  2.245275e-27     1
2  2:21801  1.984965e-31     1  1.984965e-31     1
3   3:6506  4.672597e-25     1           NaN   NaN
4   3:6601  2.245275e-27     1           NaN   NaN
5  3:21801  1.984965e-31     1           NaN   NaN

[6 rows x 5 columns]

Sometimes Pandas can be a bit slower than a pure NumPy solution, but that is often because Pandas provides a more robust solution that correctly handles corner cases such as NaNs or duplicate key values -- things an ad hoc NumPy solution may not address.
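As one concrete corner case: the join_by docstring warns that duplicate values along the key make its output unreliable, whereas pd.merge handles duplicates by emitting every matching pair. A small sketch with made-up data:

import pandas as pd

# Made-up data: two left rows share the key '2:6506'.
left = pd.DataFrame({'pos': ['2:6506', '2:6506', '2:6601'],
                     'col1': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'pos': ['2:6506'], 'col3': [9.0]})

# merge pairs each duplicated left row with its match on the right,
# and leaves NaN where there is no match.
print(pd.merge(left, right, on='pos', how='outer'))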

Also note that Pandas DataFrames have a .values attribute which returns a NumPy array of the underlying data, and a .to_records method which returns a structured array. And as you can see above, there is a DataFrame.from_records constructor which converts structured arrays to DataFrames. So you can move between DataFrames and NumPy arrays quite easily if you really need to.
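For example, a minimal round trip (reusing the result DataFrame from the code above):

# DataFrame -> structured array -> DataFrame.
rec = result.to_records(index=False)       # NumPy record (structured) array
df_again = pd.DataFrame.from_records(rec)  # back to a DataFrame

vals = result.values   # plain ndarray of the data (object dtype here,
                       # since the columns mix strings and floats)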

So I don't think there is any real speed disadvantage to using Pandas, and the convenience it provides should let you do more data analysis, much more easily.


Comments

That seems like it should work. But I'm getting an error. I updated the question with the error since I can't format it here.
Good idea about pandas. Would there be a speed penalty? Is there something fundamentally wrong with the original numpy approach?
No, there is nothing fundamentally wrong with using NumPy structured arrays, but I must say I find working with Pandas DataFrames a pleasure, while doing certain things with NumPy structured arrays feels more like a fight. For example, there is no easy way to sub-select columns.
Sorry, I don't know why you are seeing an error when running the code I posted. :( I'm using Python 2.7, not 2.6, but I couldn't locate any relevant bug report, so the version difference may not be the problem.
I upgraded to 1.8.1 and it works! Thanks for your help.
