4

I have a 2D numpy array with about 12 columns and 1000+ rows and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system where 1 and 2 generate -1 point and 4 and 5 gives +1.

If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3] the point for this row should be +2, because 3*1 + 1*(-1) = 2. Next row may be [1, 2, 2, 3, 3, 3] and should be -3 points.

At first, I tried a strait forward loop solution but I realized there are 665 280 possible combinations of columns to compare and when I also need to search for the best quintuple, quadruple etc. the loop is taking forever.

Is there perhaps a smarter numpy-way of solving my problem?

5
  • 2
    Can you post your loop solution? Sometimes it's easier to optimize already working code, rather than trying to re-invent the wheel ... Commented Sep 5, 2012 at 14:43
  • Another advantage to posting your solution is that it resolves ambiguities. For example, I'm not sure if you want to find the six columns which give the maximum total if you sum over those columns for each row (which is very easy) or something else. Commented Sep 5, 2012 at 14:51
  • It might also help to know more about your dataset. For example, it sounds like you're willing to accept any six answers from one row- if each row is one observation, why can the rest be rejected? Can your data array be restructured in some way to simplify the search space? Commented Sep 5, 2012 at 14:52
  • I can't think of an interpretation of this question where the result of searching through the combinations is ever going to be different from sorting and taking the largest N. Commented Sep 5, 2012 at 16:52
  • are you looking for the combination of columns that maximizes just a single row or the whole matrix? Commented Sep 5, 2012 at 23:07

3 Answers 3

1
import numpy as np
import itertools

N_rows = 10
arr = np.random.random_integers(5, size=(N_rows,12))
x = np.array([0,-1,-1,0,1,1])
y = x[arr]

print(y)

score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))
print('''\
score: {s}
sextuple: {c}
'''.format(s = score, c = best_sextuple))

yields, for example,

score: 6
sextuple: (0, 1, 5, 8, 10, 11)

Explanation:

First, let's generate a random example, with 12 columns and 10 rows:

N_rows = 10
arr = np.random.random_integers(5, size=(N_rows,12))

Now we can use numpy indexing to convert the numbers in arr 1,2,...,5 to the values -1,0,1 (according to your scoring system):

x = np.array([0,-1,-1,0,1,1])
y = x[arr]

Next, let's use itertools.combinations to generate all possible combinations of 6 columns:

for cols in itertools.combinations(range(12),6)

and

y[:,cols].sum()

then gives the score for cols, a choice of columns (sextuple).

Finally, use max to pick off the sextuple with the best score:

score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))
Sign up to request clarification or add additional context in comments.

Comments

1
import numpy

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, 0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = map(best6, points)

bestcolumns will now contain the best 6 columns in ascending order. By similar logic, allbestcolumns will contain the best six columns in each row.

3 Comments

This is how I originally interpreted the question but other people have given an equally plausible reading. I would use .argsort()[-6:] instead, though.
I've changed it to argsort, but I'm a bit new here, so I'm not sure about the etiquette of incorporating suggestions like that in my answer. This comment serves as disclosure.
Welcome to SO, then! If it's obvious from the comments, then there's no need to credit (unless it's really super-clever, but this was just a minor variation). Often higher-karma types will make comments on the answer closest in spirit to what they would write rather than writing their own, I've noticed. Aside: rather than using best6, though, I might use best and make the 6 a parameter.
0

Extending on unutbu's longer answer above, it's possible to generate the masked array of scores automatically. Since your scores for values are consistent every pass through the loop, so the scores for each value only need to be calculated once. Here's slightly inelegant way to do it on an example 6x10 array, before and after your scores are applied.

>>> import numpy
>>> values = numpy.random.randint(6, size=(6,10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
       [2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
       [3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
       [2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
       [0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
       [2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[ b<3 ] = -1

>>> b[ b==3 ] = 0
>>> b[ b>3 ] = 1
>>> b
array([[ 1,  1, -1, -1, -1,  1, -1, -1, -1,  1],
       [-1,  1, -1, -1,  0, -1,  0,  1,  0, -1],
       [ 0,  0,  1,  1, -1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1,  1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1, -1,  0,  0,  1, -1, -1],
       [-1,  0,  0,  1, -1, -1, -1, -1,  0, -1]])

Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.