
I have data in the following format:

['FACTOR_1','FACTOR_2','VALUE']
['A'       ,'A'       ,2.0    ]
['A'       ,'B'       ,3.0    ]
['A'       ,'C'       ,2.2    ]
['A'       ,'D'       ,2.6    ]
['B'       ,'A'       ,2.6    ]
['B'       ,'B'       ,1.0    ]
['B'       ,'C'       ,6.0    ]
['B'       ,'D'       ,7.7    ]
['C'       ,'A'       ,2.1    ]
....
['D'       ,'D'       ,2.6    ]

It is in a data frame, but I've been converting it to a numpy array anyway.

I'd like to convert it into a matrix of the two factors.

I've coded it myself, but the way I'm currently doing it is very slow and inefficient: I have a nested loop and search for the indices of the factors on every pass:

    no_of_factors = np.size(np.unique(cov_data['FACTOR_1']))
    factors = np.unique(cov_data['FACTOR_1'])

    cov_matrix = np.zeros((no_of_factors, no_of_factors))

    i = 0
    for factor_1 in factors:
        factor_indices = np.where(cov_data['FACTOR_1'] == factor_1)[0].tolist()
        j = 0
        for factor_2 in factors:
            factor_2_index = np.where(cov_data['FACTOR_2'][factor_indices] == factor_2)[0].tolist()
            if np.size(factor_2_index) > 1:
                self.log.error("Found duplicate factor")
            elif np.size(factor_2_index) == 0:
                var = 0
            else:
                factor_2_index = factor_2_index[0]
                var = cov_data['VALUE'][factor_2_index]
            cov_matrix[i][j] = var
            j += 1
        i += 1 

Annoyingly, the data also isn't perfect and there aren't values for every pair of factors. For example, factor C might only have values for A and B, and D might be missing, hence the check and the fallback to 0.

  • You should show the intended result; that makes the problem easier to understand and test. The cov_data object isn't clear either, though I might be able to create a usable copy. "matrix" is not a good description of your target, since in numpy np.matrix is just a subclass of ndarray that must be 2-D. I think you are creating a factor or feature matrix, something that's used in a package like scikit-learn. I'd suggest editing the tags accordingly. Commented Jul 5, 2016 at 19:14

1 Answer


There is an error in your code, which I corrected with the sub_data array line. I also streamlined the code in some obvious ways:

import logging
import numpy as np

log = logging.getLogger(__name__)

def foo(cov_data):
    factors = np.unique(cov_data['FACTOR_1'])
    no_of_factors = factors.shape[0]
    cov_matrix = np.zeros((no_of_factors, no_of_factors))
    for i, factor_1 in enumerate(factors):
        # Rows whose FACTOR_1 matches; later indices are relative to sub_data.
        factor_indices = np.where(cov_data['FACTOR_1'] == factor_1)[0]
        sub_data = cov_data[factor_indices]
        for j, factor_2 in enumerate(factors):
            factor_2_index = np.where(sub_data['FACTOR_2'] == factor_2)[0]
            if factor_2_index.shape[0] == 1:
                cov_matrix[i, j] = sub_data['VALUE'][factor_2_index[0]]
            elif factor_2_index.shape[0] == 0:
                pass  # missing pair: keep the 0 from np.zeros
            else:
                # The question used self.log; a module-level logger stands in here.
                log.error("Found duplicate factor")
    return cov_matrix

If I make a structured array from your lists (the data rows, held here in a list named factors):

cov_data = np.array([tuple(i) for i in factors], dtype=[('FACTOR_1','|U1'),('FACTOR_2','|U1'),('VALUE','f')])

I get this cov_matrix:

[[ 2.          3.          2.20000005  2.5999999 ]
 [ 2.5999999   1.          6.          7.69999981]
 [ 2.0999999   0.          0.          0.        ]
 [ 0.          0.          0.          2.5999999 ]]

I haven't worked with this kind of feature matrix very much, but I think it's a bread-and-butter task in machine-learning code such as scikit-learn.

Sometimes the sklearn people make sparse matrices. Here's a simple way of doing that:

import numpy as np
from scipy import sparse

features1, ind1 = np.unique(cov_data['FACTOR_1'], return_inverse=True)
features2, ind2 = np.unique(cov_data['FACTOR_2'], return_inverse=True)
values = cov_data['VALUE']
M = sparse.coo_matrix((values, (ind1, ind2)))

return_inverse gives me, for each element of the original array, its index in the array of unique values. So in effect it translates the strings into a row or column index.
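As a small illustration of what return_inverse produces, here is a sketch on a made-up label array (the labels are hypothetical and not taken from the question's data):

```python
import numpy as np

# Hypothetical labels, for illustration only.
labels = np.array(['B', 'A', 'C', 'A', 'B'])

# uniq holds the sorted distinct labels; inverse holds, for every element
# of labels, its position in uniq.
uniq, inverse = np.unique(labels, return_inverse=True)

print(uniq)      # ['A' 'B' 'C']
print(inverse)   # [1 0 2 0 1]

# Indexing uniq with inverse reconstructs the original array.
assert (uniq[inverse] == labels).all()
```

The inverse array is what gets used as the row or column index when building the matrix.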

The dense version of this matrix, M.A (equivalently M.toarray()), is the same as the cov_matrix above.

print(M) displays the index value triplets:

  (0, 0)    2.0
  (0, 1)    3.0
  (0, 2)    2.2
  (0, 3)    2.6
  (1, 0)    2.6
  (1, 1)    1.0
  (1, 2)    6.0
  (1, 3)    7.7
  (2, 0)    2.1
  (3, 3)    2.6

There are some rough edges in this calculation, such as how duplicates are handled (values are added), the order of the features (unique sorts them), and what to do if one factor column is not as complete as the other.
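One possible way to smooth those rough edges, sketched under assumptions: the rows below are made up (FACTOR_1 never contains 'D' and FACTOR_2 never contains 'C'), and the choice to use np.union1d and np.searchsorted for a shared label set is mine rather than part of the approach above.

```python
import numpy as np

# Hypothetical rows in the question's (FACTOR_1, FACTOR_2, VALUE) layout.
rows = [('A', 'A', 2.0), ('A', 'B', 3.0), ('B', 'A', 2.6),
        ('B', 'D', 7.7), ('C', 'A', 2.1)]
cov_data = np.array(rows, dtype=[('FACTOR_1', 'U1'), ('FACTOR_2', 'U1'),
                                 ('VALUE', 'f8')])

# One shared, sorted label set so rows and columns use the same ordering,
# even when a factor appears in only one of the two columns.
labels = np.union1d(cov_data['FACTOR_1'], cov_data['FACTOR_2'])
ind1 = np.searchsorted(labels, cov_data['FACTOR_1'])
ind2 = np.searchsorted(labels, cov_data['FACTOR_2'])

# Encode each (row, column) pair as a single integer; a repeated integer
# means a duplicate (FACTOR_1, FACTOR_2) pair.
flat = ind1 * len(labels) + ind2
has_duplicates = np.unique(flat).size != flat.size

# Pairs that never occur keep the 0 from np.zeros.
n = len(labels)
cov_matrix = np.zeros((n, n))
cov_matrix[ind1, ind2] = cov_data['VALUE']
print(labels)
print(cov_matrix)
```

With a shared label set the result is always square, and has_duplicates lets the caller decide whether to log, raise, or aggregate before filling the matrix.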

Constructing the dense matrix from the indices is easy too:

cov_matrix = np.zeros((len(features1), len(features2)))
cov_matrix[ind1, ind2] = values
print(cov_matrix)

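(Again, it may not handle duplicates correctly: with this kind of indexed assignment, a repeated (row, column) pair is overwritten rather than added.)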
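Since the question mentions the data starts out in a data frame, a pandas pivot may be worth considering as an alternative. This is a different technique from the NumPy indexing above, shown only as a sketch; the frame contents are hypothetical and the shared label set for both axes is an assumption about the desired output.

```python
import pandas as pd

# Hypothetical frame in the question's layout.
df = pd.DataFrame({
    'FACTOR_1': ['A', 'A', 'B', 'B', 'C'],
    'FACTOR_2': ['A', 'B', 'A', 'D', 'A'],
    'VALUE':    [2.0, 3.0, 2.6, 7.7, 2.1],
})

# pivot raises ValueError if a (FACTOR_1, FACTOR_2) pair occurs more than
# once, so duplicates surface as an error rather than being merged silently.
wide = df.pivot(index='FACTOR_1', columns='FACTOR_2', values='VALUE')

# Use one shared label set for both axes and fill missing pairs with 0.
labels = sorted(set(df['FACTOR_1']) | set(df['FACTOR_2']))
wide = wide.reindex(index=labels, columns=labels).fillna(0.0)

cov_matrix = wide.to_numpy()
print(wide)
```

Keeping the result as a labelled frame (wide) until the last step makes it easy to check which factor each row and column refers to.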


1 Comment

Perfect thank you so much @hpaulj , also some things in Python I didn't realise you could do, it's so damn neat!
