Here's the question and the example given:

You are given a 2-d array A of size NxN containing floating-point numbers. The array represents pairwise correlation between N elements with A[i,j] = A[j,i] = corr(i,j) and A[i,i] = 1.

Write a Python program using NumPy to find the index of the highest correlated element for each element and finally print the sum of all these indexes.

Example: The array A = [[1, 0.3, 0.4], [0.4,1,0.5],[0.1,0.6,1]]. Then, the indexes of the highest correlated elements for each element are [3, 3, 2]. The sum of these indexes is 8.

I'm having trouble understanding the question, and the example makes my confusion worse. A contains only three arrays of three values each, so how can any "index of the highest correlated element" be greater than 2 if NumPy is zero-indexed?

Does anyone understand the question?

  • It also says A[i, j] == A[j, i], but that's not true in the example. Commented Sep 6, 2022 at 19:16
  • The only thing that's correct is that the diagonals are all 1. Commented Sep 6, 2022 at 19:17
  • Whoever wrote that exercise should probably stop writing exercises. No wonder you are confused. Commented Sep 6, 2022 at 19:18
  • In any case, the intent of the problem is (almost) clear. For each row, find the index of the largest element that isn't 1.0. Sum up those indices. The only ambiguity is whether to add the number of rows to the final answer because they want you to use 1-indexing instead of 0-indexing, or if that was just a mistake. Commented Sep 6, 2022 at 19:20
  • @FrankYellin So you are interpreting the arrays as lists of pre-calculated correlation values, and not as data from which correlations have to be calculated to answer the question? Commented Sep 6, 2022 at 19:26

1 Answer

To reiterate, the example is wrong in multiple ways.

Correlation matrices are by definition symmetric, yet the example is not:

array([[1. , 0.3, 0.4],
       [0.4, 1. , 0.5],
       [0.1, 0.6, 1. ]])
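As a quick sanity check, here is a minimal sketch that tests both claims on the example matrix. The names is_symmetric and unit_diagonal are just illustrative.

```python
import numpy as np

# The matrix from the exercise's example.
a = np.array([[1.0, 0.3, 0.4],
              [0.4, 1.0, 0.5],
              [0.1, 0.6, 1.0]])

# A symmetric matrix equals its own transpose.
is_symmetric = np.allclose(a, a.T)

# The diagonal of a correlation matrix should be all ones.
unit_diagonal = np.allclose(np.diag(a), 1.0)

print(is_symmetric, unit_diagonal)  # prints: False True
```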

Also, you are right: NumPy arrays (like everything else I know of in Python that supports indexing) are zero-indexed, so the indices in the example are off by one.

The exercise wants you to find, for each random variable with index i, the index j of the random variable it is most strongly correlated with, excluding itself (the correlation coefficient of 1 on the diagonal).

Here is one way to do that given your numpy array a:

np.where(a != 1, a, 0).argmax(axis=1)

Here np.where produces an array identical to a except we replace the ones with zeroes. This is based on the assumption that if i != j, the correlation is always < 1. If that does not hold, the solution will obviously be wrong.

Then argmax gives the indices of the greatest values in each row. Although, in an actual correlation matrix, axis=0 would work just as well, since it would be... you know... symmetrical.

The result is array([2, 2, 1]). To get the sum, you just add a .sum() at the end.
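Put together, the whole thing might look like the sketch below. The variable names are only illustrative, and the one-based total is a guess at what the exercise's author intended, since it reproduces the 8 from the example.

```python
import numpy as np

# The matrix from the exercise's example.
a = np.array([[1.0, 0.3, 0.4],
              [0.4, 1.0, 0.5],
              [0.1, 0.6, 1.0]])

# Replace every entry equal to 1 with 0 so the diagonal cannot win,
# then take the column index of the largest value in each row.
idx = np.where(a != 1, a, 0).argmax(axis=1)

total_zero_based = int(idx.sum())        # indices counted from 0
total_one_based = int((idx + 1).sum())   # indices counted from 1

print(idx)               # prints: [2 2 1]
print(total_zero_based)  # prints: 5
print(total_one_based)   # prints: 8
```

The zero-based sum is 5; shifting every index by one gives [3, 3, 2] and the 8 quoted in the exercise.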

EDIT:

Now that I think about it, the assumption is too strong. Here is a better way:

b = a.copy()
np.fill_diagonal(b, -1)
b.argmax(axis=1)

Now we only assume that each element has at least one off-diagonal correlation greater than -1, which I think is reasonable, since correlation coefficients lie in [-1, 1]. If you don't care about mutating the original array, you could omit the copy and fill the diagonal of a with -1 instead.
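If you would rather not rely on any bound at all, one variation is to fill the diagonal with -np.inf. The sketch below wraps that in a hypothetical helper, most_correlated, and tries it on a made-up matrix c with only negative off-diagonal values, where replacing ones with zeroes would pick the diagonal.

```python
import numpy as np

def most_correlated(corr):
    """Return, for each row, the column index of the largest off-diagonal value."""
    # np.array copies by default, so the caller's matrix is left untouched.
    b = np.array(corr, dtype=float)
    # -inf is smaller than any finite value, so the diagonal can never be picked.
    np.fill_diagonal(b, -np.inf)
    return b.argmax(axis=1)

# Made-up symmetric matrix with only negative off-diagonal correlations.
c = np.array([[ 1.0, -0.2, -0.7],
              [-0.2,  1.0, -0.5],
              [-0.7, -0.5,  1.0]])

print(most_correlated(c))  # prints: [1 0 1]

# The np.where variant turns the diagonal into 0, which beats every negative value.
print(np.where(c != 1, c, 0).argmax(axis=1))  # prints: [0 1 2]
```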
